> But it doesn't matter, because LLMs that try to use a class will get an error message and rewrite their code to not use classes instead.
This is true in a sense, but every little papercut at the lower levels of abstraction degrades performance at higher levels as the LLM needs to spend its efforts on hacking around jank in the Python interpreter instead of solving the real problem.
It is a workaround, so we can assume that this will be temporary and in the future the AI will start using them once it can. Probably just like we would do.
The entire AI stack is built on a lot of "assumes" about intelligent selection.
Reminds me of the evolution debate. What's important is that just because something can learn to adapt doesn't mean it'll find an optimized adaptation, nor that it will continually refine it.
As far as I can tell, AI will only solve problems well where the problem space is properly defined. Most people won't know how to do that.
This is very cool, but I'm having some trouble understanding the use cases.
Is this mostly just for codemode where the MCP calls instead go through a Monty function call? Is it to do some quick maths or pre/post-processing to answer queries? Or maybe to implement CaMeL?
It feels like the power of terminal agents is partly because they can access the network/filesystem, and so sandboxed containers are a natural extension?
> Monty avoids the cost, latency, complexity and general faff of using full container based sandbox for running LLM generated code.
> Instead, it let's you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.
Oh I did read the README, but still have the question: while it does save on cost, latency and complexity, the tradeoff is that the agents can't run whatever they want in a sandbox, which would make them less capable too.
For extremely rapid iteration - they can run a quick script with this in under 1ms - it removes a significant bottleneck, especially for math-heavy reasoning
You're really stretching things here to classify me pointing out that LLMs can handle syntax errors caused by partial implementations of Python as "being a vapid propagandist".
(This kind of extremely weak criticism often seems to come from newly created Hacker News accounts, which makes me wonder if it's mostly the same person using sockpuppets.)
Sorry for this, Simon. But just know that this non-newly-created hacker news account does not think you are a “vapid propagandist” and appreciates your content.
Warning: another fake troll account just created for this comment. The same one left a comment last night on a new account under Simon's comment as well but was flagged.
This feels like the time I was a Mercurial user before I moved to Git.
Everyone was using git for reasons that seemed bandwagon-y to me, when Mercurial just had such a better UX and mental model.
Now, everyone is writing agent `exec`s in Python, when I think TypeScript/JS is far better suited for the job (it was always fast + secure, not to mention more reliable and information dense b/c of typing).
3 reasons why Python is much better than JS for this IMO.
1. Large built-in standard library (CSV, sqlite3, xml/json, zipfile).
2. In Python, whatever the LLM is likely to do will probably work. In JS, you have the Node / Deno split, far too many libraries that do the same thing (XMLHttpRequest / Axios / fetch), many mutually-incompatible import syntaxes (e.g. compare tsx versus Node's native ts execution), and features like top-level await (very important for small scripts, and something that an LLM is likely to use!), which only work if you pray three times on the day of the full moon.
3. Much better ecosystem for data processing (particularly csv/pandas), partially resulting from operator overloading being a thing.
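To make points 1 and 3 concrete, here is a minimal sketch of the kind of small data-wrangling script an LLM tends to write, using only the standard library (`csv`, `sqlite3`, `json`). The data, the table name and the `warm_cities` helper are made up for illustration.

```python
import csv
import io
import json
import sqlite3

# Hypothetical inline data standing in for a file the agent was asked to process.
RAW_CSV = "city,temp_c\nOslo,4\nCairo,31\nLima,19\n"


def warm_cities(raw_csv: str, threshold: int) -> str:
    """Load CSV rows into an in-memory SQLite table and return warm cities as JSON."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE weather (city TEXT, temp_c INTEGER)")
        # Named placeholders are filled from each row dict; the INTEGER column
        # affinity converts the CSV strings to integers on insert.
        conn.executemany("INSERT INTO weather VALUES (:city, :temp_c)", rows)
        cur = conn.execute(
            "SELECT city FROM weather WHERE temp_c > ? ORDER BY city",
            (threshold,),
        )
        return json.dumps([city for (city,) in cur])
    finally:
        conn.close()


print(warm_cities(RAW_CSV, 10))
```

No third-party install step is needed for any of this, which is the point being made about the standard library.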
> JSX/TSX, despite what React people might want you to believe, are not part of the language.
I think you misunderstood this. tsx in this context is/was a way to run typescript files locally without doing tsc yourself first, ie make them run like a script. You can just use Node now, but for a long time it couldn’t natively run typescript files.
The only limitation I run into using Node natively is that you need to import types as type imports, which I doubt would be an issue in practice for agents.
Yes, thank you for pointing that out. Forgot that there's another thing named "tsx" out there.
I wouldn't call it running TS natively - what they're doing is either using an external tool, or just stripping types, so several things, like most notably enums, don't work by default.
I mean, that's more than enough for my use cases and I'm happy that the feature exists, but I don't think we'll ever see a native TypeScript engine. Would have been cool, though, considering JS engines define their own internal types anyway.
> Do you think there are 22 competing package managers in python because the package/import system "just works"?
There aren't; a large fraction of tools people mention in this context aren't actually package managers and don't try to be package managers. Sometimes people even conflate standards and config files with tools. It's really amazing how much FUD there is around it.
But more importantly, there is no such thing as "the package/import system". Packaging is one thing, and the language's import system is a completely different thing.
And none of that actually bears on the LLM's ability to choose libraries and figure out language syntax and APIs. For that matter, you don't have to let it set up the environment (or change your existing setup) if you don't want to.
Having done Python for over a decade, and JavaScript as well, I would pick Python any day of the week over JavaScript. JavaScript is beautiful, and also the most horrific programming language, all at once. It still feels incomplete, and there are too many oddities I've run into over the years: checking for null, empty and undefined values is inconsistent all around because different libraries behave differently.
TBF is the Python ecosystem any different? None and dict everywhere, requirements.txt without pinned versions... I'm not complaining either, as I wouldn't expect a unified typed experience in ecosystems where multiple competing type checkers and package managers have been introduced gradually. How could any library from the python3.4 era foresee dataclasses or the typing module?
Such changes take time, and I favor an "evolution trumps revolution"-approach for such features. The JS/TS ecosystem has the advantage here, as it has already been going through its roughest time since ES2015. In hindsight, it was a very healthy choice, and a type system like TypeScript's is something many other programming languages still lack.
If it weren't for its rich standard library and uv, I would still clearly favor TS and a runtime like bun or deno. Python still suffers from spread out global state and some multi-paradigm approach when it comes to concurrency (if concurrency has even been considered by the library author). Python being the first programming language for many scientists takes its toll too: rich libraries of dubious quality in various domains. Whereas JS' origins in browser scripting contributed to the convention to treat global state as something to be frowned upon.
I wish both systems had good object schema validation built into the standard library. Python has the upper hand here with dataclasses, but it still follows some "take it or throw"-approach, rather than supporting customization for validations.
It was better because it had no silent errors, like 1+"1". It was far from perfect, but the fact that it raised exceptions and enforced the philosophy of "ask for forgiveness, not permission" makes the difference.
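A tiny sketch of that difference, assuming nothing beyond plain Python (the `add` helper is just for illustration): the mixed-type addition raises instead of silently producing a value, whereas the same expression in JavaScript evaluates to the string "11".

```python
def add(a, b):
    """Plain addition; Python refuses to mix int and str here."""
    return a + b


try:
    add(1, "1")
except TypeError as exc:
    # "Ask forgiveness": attempt the operation and handle the failure.
    result = f"refused: {exc}"
else:
    result = "silently coerced"

print(result)
```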
IMHO, whether it has a slightly better type system and runtime is totally irrelevant nowadays.
With AI doing mostly everything we should forget these past riddles. Now we all should be looking towards fail-safe systems, formal verification and domain modeling.
Conflating types in binary operations hasn't been an issue for me since I started using TS in 2016. Even before that, it was just the result of domain modeling done badly, and I think software engineers got burned enough for using dynamic type systems at scale... but that's a discussion to be had 10 years ago. We all moved on from that, or at least I hope we did.
> Now we all should be looking towards fail-safe systems, formal verification and domain modeling.
We have been looking towards these things since the term distributed computing was coined, haven't we? Building fail-safe systems has been the goal ever since long-running processes became a thing.
Despite any "past riddles", the more expressive the type system the better the domain modeling experience, and I'd guess formal methods would benefit immensely from a good type system. Is there any formal language that is usable as a general-purpose programming language that I don't know of? I only ever see formal methods used for the verification of distributed algorithms or permission logic, on the theorem proving side of things, but I have yet to see a single application written only in something like Lean[0] or LiquidHaskell[1]...
For historical reasons (FFI), Python has access to excellent vector / tensor mathematics (numpy / scipy / pandas / polars) and ML / AI libraries, from OpenCV to PyTorch. Hence the prevalence of Python in science and research. "Everybody knows Python".
I do like Typescript (not JS) better, because of its highly advanced type system, compared to Python's.
TS/JS is not inherently fast, it just has a good JIT compiler; Python still ships without one. Regarding security, each interpreter is about as permissive as the other, and both can be sealed off from the environment pretty securely.
(Pydantic AI lead here) That’s exactly what we built this for: we’re implementing Code Mode in https://github.com/pydantic/pydantic-ai/pull/4153 which will use Monty by default, with abstractions to use other runtimes / sandboxes.
Monty’s overhead is so low that, assuming we get the security / capabilities tradeoff right (Samuel can comment on this more), you could always have it enabled on your agents with basically no downsides, which can’t be said for many other code execution sandboxes which are often over-kill for the code mode use case anyway.
For those not familiar with the concept, the idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see (all of) the intermediate value. Every step that depends on results from an earlier step requires a new LLM turn, limiting parallelism and adding a lot of overhead, expensive token usage, and context window bloat.
With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.
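A rough sketch of what that looks like in practice. The two tools below (`list_orders`, `get_shipping_status`) and their data are hypothetical stand-ins for MCP tools, and this is plain Python, not Monty's or Pydantic AI's actual API. The point is only that the chaining and filtering happen in code, and just the small summary goes back to the model.

```python
def list_orders(customer_id):
    """Pretend tool: returns bulky order records (20 line items each)."""
    return [
        {
            "id": f"order-{n}",
            "customer_id": customer_id,
            "total": 10.0 * n,
            "line_items": [{"sku": f"sku-{i}"} for i in range(20)],
        }
        for n in range(1, 6)
    ]


def get_shipping_status(order_id):
    """Pretend tool: returns shipping info for a single order."""
    delayed = order_id.endswith(("2", "4"))
    return {"order_id": order_id, "status": "delayed" if delayed else "shipped"}


def llm_written_snippet(customer_id):
    """The kind of code a model might emit in code mode."""
    orders = list_orders(customer_id)  # bulky result never enters the context
    delayed_ids = [
        order["id"]
        for order in orders
        if get_shipping_status(order["id"])["status"] == "delayed"
    ]
    # Only this small dict would be sent back to the LLM.
    return {"order_count": len(orders), "delayed_order_ids": delayed_ids}


summary = llm_written_snippet("cust-42")
print(summary)
```

With traditional tool calling, the same task would take one LLM turn for the order list plus one per status lookup, with every full result passing through the context window.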
Why do you think python without access to the library ecosystem is a good approach? I think you will end up with small tool call subgraphs (i.e. more round trips) or having to generate substantially more utility code.
Just want to say Kudos to you and the team. This is a brilliantly conceived chunk of functionality that IMHO hits exactly a sweet spot I didn’t realize was missing. I’m working on a chat bot system now and definitely plan to incorporate Monty into it for all the reasons y’all foresaw.
But even my simple class project reveals this. You actually do want a simple tool wrapper layer (abstraction) over every API. It doesn't even need to be an API. It can be a calculator that doesn't reach out anywhere.
There is a ton of wheel reinvention going on right now cause everyone wants to be cool in the age of ai
Use boring tech, you'll thank me and yourself later
Which in this case means, just use regular python. Your devops team is unlikely to allow knock off python in production. TS is fine too, I mainly write Go
I remember the time when Python was the underdog and most AI/ML code was written in Matlab or Lua (torch). People would roll their eyes when you told them that you were doing deep learning with Python (theano).
Why one would drag this godforsaken abomination to the server side is beyond me.
Even effing C# nowadays can be run in a script-like manner from a single file.
—
Even the latest Codex UI app is Electron. The one that is supposed to write itself with AI wonders but couldn’t manage native SwiftUI, WinUI, and Qt or whatever is on Linux these days.
My favourite languages are F# and OCaml, and from my perspective, TypeScript is a far better language than C#.
Typescript’s types are far more adaptable and malleable, even with the latest C# 15 which is belatedly adding Sum Types. If I set TypeScript to its most strict settings, I can even make it mimic a poor man’s Haskell and write existential types or monoids.
And JS/TS have by far the best libraries and utilities for JSON and xml parsing and string manipulation this side of Perl (the difference being that the TypeScript version is actually readable), and maybe Nushell but I’ve never used Nushell in production.
Recently I wrote a Linux CLI tool for managing Podman/Quadlet containers and I wrote it in TypeScript and it was a joy to use. The Effect library gave me proper Error types and immutable data types and the Bun Shell makes writing shell commands in TS nearly as easy as Bash. And I got it to compile to a single self-contained binary which I can run on any server and which has a lower memory footprint and faster startup time than any equivalent .NET code I’ve ever written.
And yes, had I written it in Rust it would have been faster and probably even safer, but for a quick and dirty tool, development speed matters, and I can tell you that I really appreciated not having to think about ownership and fight the borrow checker the whole time.
TypeScript might not be perfect, but it is a surprisingly good language for many domains and is still undervalued IMO given what it provides.
Python has the advantage that everybody sort of knows it is bad and slow, which is an important trait for a glue language. This increases the incentive to do the right thing: call a library written in C or Fortran or something.
That used to be true, but much of modern python code I see looks nothing like pseudocode. That advantage was lost around version 3, if not even before that.
This is a really interesting take on the sandboxing problem. This reminds me of an experiment I worked on a while back (https://github.com/imfing/jsrun), which embedded V8 into Python to allow running JavaScript with tightly controlled access to the host environment. It has a similar goal: running untrusted code from Python.
I’m especially curious about where the Pydantic team wants to take Monty. The minimal-interpreter approach feels like a good starting point for AI workloads, but the long tail of Python semantics is brutal. There is a trade-off between keeping the surface area small (for security and predictability) and providing sufficient language capabilities to handle the non-trivial snippets that LLMs generate to do complex tasks.
Can't be sure where this might end, but the primary goal is to enable codemode/programmatic tool calling, using the external function call mechanism for anything more complicated.
I think in the near term we'll add support for classes, dataclasses, datetime, json. I think that should be enough for many use cases.
there’s no way around VMs for secure, untrusted workloads. everything else, like Monty, has too many tradeoffs that make it non-viable for any real workloads
As discussed on twitter, v8 shows that's not true.
But to be clear, we're not even targeting the same "computer use" use case I think e2b, daytona, cloudflare, modal, fly.io, deno, google, aws are going after - we're aiming to support programmatic tool calling with minimal latency and complexity - it's a fundamentally different offering.
V8 itself is intended to be heavily sandboxed. Not through a microvm, but otherwise it's probably the most heavily sandboxed piece of code ever, i.e. in Chrome it can make virtually no system calls and runs with every restriction an OS can possibly provide and more, and seccomp-bpf was basically invented for it.
Perhaps you're using v8 isolates, in which case you're back to the "heavily restricted environment within the process" and you lose the things you'd want your AI to be able to do, and even then you still have to sandbox the hell out of it to be safe and you have to seriously consider side channel leaks.
And even after all of that you'd better hope you're staying up to date with patches.
MicroVMs are going to just be way simpler IMO. I don't really get the appeal of using V8 for this unless you have platform/deployment limitations. Talking over Firecracker's vsock is extremely fast. Firecracker is also insanely safe - 3 CVEs ever, and IMO none are exploitable.
There's been a constant stream of v8 VM sandbox escape discoveries since its dawn of course. Considering those have mostly existed for a long time before publication it's very porous most of the time.
Then there's of course hypervisor based virtualization and the vulnerabilities and VM escapes there.
Browsers use belt-and-suspenders approaches of employing both language runtime VMs and hardware memory protection as layers to some effect, but still are the star act at pwn2own etc.
It's all layers of porous defenses. There'd definitely be room in the world for performant dynamic language implementations with provably secure foundations.
part of why rexec is "historical" is that Guido was looking at some lockdown work and asked (twitter, probably?) the community to come up with attack ideas (on a specific more-locked-down-than-default proposed version.) After a couple of hours, it was clear that "patching the problems" was entirely doomed given how flexible python is and it was better to do something else entirely and stop pretending...
Interesting trade-off: build a minimal interpreter that's "good enough" for AI-generated code rather than trying to match CPython feature-for-feature.
The security angle is probably the most compelling part. Running arbitrary AI-generated Python in a full CPython runtime is asking for trouble — the attack surface is enormous. Stripping it down to a minimal subset at least constrains what the generated code can do.
The bet here seems to be that AI-generated code can be nudged to use a restricted subset through error feedback loops, which honestly seems reasonable for most tool-use scenarios. You don't need metaclasses and dynamic imports to parse JSON or make API calls.
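A minimal sketch of that feedback loop, under loud assumptions: `fake_model` is a stand-in for an LLM call, `check_subset` only mimics an interpreter that lacks classes, and the plain `exec` is not a sandbox at all; it merely stands in for the restricted runtime. None of this is Monty's actual API.

```python
import ast
from typing import Optional


class UnsupportedFeature(Exception):
    """Raised when the code uses something the pretend interpreter lacks."""


def check_subset(source: str) -> None:
    """Reject class definitions, mimicking a partial Python implementation."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            raise UnsupportedFeature(
                f"line {node.lineno}: class definitions are not supported"
            )


def fake_model(prompt: str, error: Optional[str]) -> str:
    """Stand-in for an LLM: tries a class first, plain functions after an error."""
    if error is None:
        return (
            "class Adder:\n"
            "    def add(self, a, b):\n"
            "        return a + b\n"
            "result = Adder().add(2, 3)\n"
        )
    return "def add(a, b):\n    return a + b\nresult = add(2, 3)\n"


def run_with_feedback(prompt: str, max_attempts: int = 3):
    """Generate code, feed any error back to the model, and retry."""
    error = None
    for attempt in range(1, max_attempts + 1):
        source = fake_model(prompt, error)
        try:
            check_subset(source)
        except UnsupportedFeature as exc:
            error = str(exc)  # this text becomes part of the next prompt
            continue
        namespace = {}
        exec(source, namespace)  # NOT a sandbox; placeholder for the real runtime
        return namespace["result"], attempt
    raise RuntimeError(f"gave up after {max_attempts} attempts: {error}")


value, attempts = run_with_feedback("add 2 and 3")
print(value, attempts)
```

The extra attempt is the "papercut" cost discussed earlier in the thread: the task still succeeds, but only after one wasted generation.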
the papercut argument jstanley made is valid but there's a flip side - when you're running AI-generated code at scale, every capability you give it is also a capability that malicious prompts can exploit. the real question isn't whether restrictions slow down the model (they do), it's whether the alternative - full CPython with file I/O, network access, subprocess - is something you can safely give to code written by a language model that someone else is prompting.
that said, the class restriction feels weird. classes aren't the security boundary. file access, network, imports - that's where the risk is. restricting classes just forces the model to write uglier code for no security gain. would be curious if the restrictions map to an actual threat model or if it's more of a "start minimal and add features" approach.
My understanding is that "the class restriction" isn't trying to implement any kind of security boundary — they just haven't managed to implement support yet.
ah that makes sense - I was reading too much into it as a deliberate security trade-off. makes way more sense as a "not implemented yet" thing. thanks for clarifying.
Totally reasonable project for many reasons but fast tools for AI always makes me chuckle. Imagine your job is delivering packages and along the delivery route one of your coworkers is a literal glacier. It doesn't really matter how fast you walk, run, bike, or drive. If part of your delivery chain tops out at 30 meters per day you're going to have a slow delivery service. The ratio between the speed of code execution and AI "thinking" is worse than this analogy.
> Instead, it let's you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.
Perhaps if the interpreter is in turn embedded in the executable and runs in-process, but even a do-nothing `uv` invocation takes ~10ms on my system.
I like the idea of a minimal implementation like this, though. I hadn't even considered it from an AI sandboxing perspective; I just liked the idea of a stdlib-less alternative upon which better-thought-out "core" libraries could be stacked, with less disk footprint.
Have to say I didn't expect it to come out of Pydantic.
Yes. That's why I compare it (a compiled Rust executable) to Monty (a compiled Rust executable). The point is that loading large compiled executables into memory takes long enough to raise an objection to the "startup times measured in single digit microseconds not hundreds of milliseconds" claim.
I'm of the mind that it will be better to construct more strict/structured languages for AI use than to reuse existing ones.
My reasoning is 1) AIs can comprehend specs easily, especially if simple, 2) it is only valuable to "meet developers where they are" if really needing the developers' history/experience which I'd argue LLMs don't need as much (or only need because lang is so flexible/loose), and 3) human languages were developed to provide extreme human subjectivity which is way too much wiggle-room/flexibility (and is why people have to keep writing projects like these to reduce it).
We should be writing languages that are super-strict by default (e.g. down to the literal ordering/alphabetizing of constructs, exact spacing expectations) and only having opt-in loose modes for humans and tooling to format. I admit I am toying w/ such a lang myself, but in general we can ask more of AI code generations than we can of ourselves.
I think the hard part about that is you first have to train the model on a BUTT TON of that new language, because that's the only way they "learn" anything. They already know a lot of Python, so telling them to write restricted and sandboxed Python ("you can only call _these_ functions") is a lot easier.
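As a rough illustration of the "you can only call _these_ functions" idea, here is a sketch of a static allowlist check using the standard `ast` module. The allowlist and the tool names in it are hypothetical, and a static check like this is easy to evade (for example via `getattr` tricks), so it is a linting aid for generated code rather than a security boundary.

```python
import ast

# Hypothetical allowlist of functions the generated code may call.
ALLOWED_CALLS = {"fetch_user", "send_email", "len"}


def disallowed_calls(source: str) -> list:
    """Return the sorted names of called functions that are not allowlisted."""
    bad = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                name = node.func.id
            else:
                # Attribute access such as os.system, rendered back to text.
                name = ast.unparse(node.func)
            if name not in ALLOWED_CALLS:
                bad.add(name)
    return sorted(bad)


snippet = 'user = fetch_user(1)\nopen("/etc/passwd")\nos.system("ls")\n'
print(disallowed_calls(snippet))
```

The returned names can be put straight into an error message for the model, which already knows how to rewrite Python around such constraints.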
But I'd be interested to see what you come up with.
I think skills and other things have shown that a good bit of learning can be done on-demand, assuming good programming fundamentals and no surprise behavior. But agreed, having a large corpus at training time is important.
I have seen that, given a solid spec for a never-before-seen lang, modern models can do a great job of writing code in it. I've done no research on their ability to leverage a large stdlib/ecosystem this way though.
> But I'd be interested to see what you come up with.
Maybe a dumb question, but couldn't you use seccomp to limit/deny the amount of syscalls the Python interpreter has access to? For example, if you don't want it messing with your host filesystem, you could just deny it from using any filesystem related system calls? What is the benefit of using a completely separate interpreter?
Yours is a valid approach. But you always gotta wonder if there’s some way around it. Starting with a runtime that has ways of accessing every aspect of your system, there are a lot of ways an attacker might try to defeat the blocks you put in place. The point of starting with something super minimal is that the attack surface is tiny. Really hard to see how anything could break out.
Why do SWEs build tools in the open that are openly hostile to their own trade? Like I can understand someone selfishly building tools for themselves, but by contributing to these efforts you're basically donating free software tools to companies that will only be used to shrink their own engineering teams by making llms more capable/efficient.
While I think all LLMs are shit, they probably eventually will not be shit, and it will be because people like you contributed to their progress. Nothing good will come of it for you or your peers. The Billionaires who own everything will kick you out to the curb as soon as you train your replacement that doesn't sleep, eat or complain. Have some class solidarity.
How do you feel about software engineers who build open source libraries?
Open source has been responsible for enormous productivity boosts in our industry, because we don't all have to build duplicates of exactly the same thing time and time again.
But think of all of the jobs that were lost by people who would otherwise have been employed building the 500th version of a CSS design system, or a template engine, or code to handle website logins!
What makes AI tools different? (And I actually do agree that they feel different, but I'm interested in hearing arguments stronger than "it feels different".)
Of course comparing open source and AI is like comparing apples and oranges, but the question makes a lot of sense. Just the first thing that comes to mind: open source is about transparency whereas LLMs are opaque by nature. This is a radical shift and challenge for engineering and has consequences way beyond it.
It's about the role of technologies in evolution, responsibility versus utilitarian take, etc. It should be developed and discussed seriously, but not in a buried sub-thread.
Because beforehand engineers could be reasonably confident that their work would simply accelerate the growth of a growing pie; today, most expect that further development will be used, first and foremost, to replace labor. Most sectors do not grow indefinitely, so there's no reason to assume software has to.
To put it gently, yes it feels different: for people who haven't already saved a lifetime of SWE wages, this is the first credible threat to the sector in which they're employed since the dot com bubble. People need to work to eat.
Previously, open source software didn't contribute to automating away jobs, at least not at scale. Open Source libraries weren't potentially maintaining themselves (I know we aren't there yet, but that seems to be the goal).
You cannot compare any open source software, even as a whole, to the impact that LLMs have had on labor and are projected to have. However, I might now argue it would have been better to not have so much open source, as it's clearly being processed through these plagiarism laundering training regimes.
I don't really think LLMs, robotics and ML in general are going to increase GDP globally; they will instead just replace the inputs that maintain the status quo (the workers). If they can't successfully replace human labor, they will at minimum greatly reduce its value, which is extremely dangerous.
Jobs grew greatly during the last 30 years of open source development, but over the last 16 months we've had 350-400k SWE layoffs in the USA. Many of these layoffs have been directly correlated to AI enhanced productivity. 25% of recent college graduates are unemployed. Jobs data is super unreliable at the moment, but we will also see large swaths of the lower skilled sectors, customer service for example, hit by huge layoffs in the coming 24 months.
Despite what C-Suites say about AI giving them more free time for their hobbies or whatever, they've yet to answer how people are going to afford those hobbies. Working as a barista lol? These same mouthpieces will say that llms are going to allow the same amount of engineers to get 10x more done, but they're not reflecting that in their business decisions. They are laying people off in swaths when equities are at all time highs; it's abnormal.
I think it's more likely the ruling classes will give us something to do by making us so poor that young men will beg to go fight wars. Put us to use on behalf of their conquest for more resources; that certainly did the trick in the 20s, 30s and 40s :/
I'm an optimist on this and I remain hopeful that AI will create more and better jobs, but I'm not at all certain about that. It's possible it will play out the way you describe, and that will suck.
I'm not ready to blame the 100,000s of software layoffs on AI though - I think the more likely explanation for those is over-hiring during Covid combined with the end of ZIRP.
I think there are two use cases of open source. One is for people who need a solution to grab and use. In this case, I think LLM agents will pick up quickly and replace the grab-and-use type of engineering.
The second use case is for humans to learn from humans. Your open source projects are excellent examples, same with Django and the Python open source ecosystem.
I just hope humans will not stop learning. As long as you share your passion for learning, people will learn from you. It has nothing to do with automation.
It liberates those who have massive resources to run gigantic models at whatever scale they want.
Corporations and billionaires will get TI-Nspires; we get TI-83s.
I do not agree that inference will get more affordable in time to prevent harm. It will cause way more problems with the devaluation of labor before it starts to solve those problems, and in that period they will solidify their control over society.
We already see it in how ML is being used on a vast scale to build advanced surveillance infrastructure. Let's not build the advanced calculators for them for free in open source please, they'd like nothing better. I wrote a lot more in the comments above also.
Billionaires and corporations can hire teams of people to work for them full-time. You, likely, can hire one or two (or zero!). Not to make it personal.
Staying true to your username at least. While I hear you in principle, I don’t think shaming people into not building things is going to work out. Even if you could convince some people, you’ll never reach them all. Someone will build it. IMO energy is better spent figuring out how to best structure our society to handle the seemingly inevitable end state where superhuman AI is commonplace.
Sorry if I'm shaming. I suppose you're right, someone will probably build them. But in order to prevent bad outcomes for the average joe/worker, we can't just hand optimizations over to corporations for free in the form of open source. We know all too well how open source is exploited.
I don't know how to stop people from building this without shaming them. I think more shaming might be required, as uncomfortable as that may be. It's a society-wide prisoner's dilemma (well, if I don't build it, someone else will), except this isn't a prisoner's dilemma and we can coordinate, sort of.
It would be one thing if GPUs and tokens were cheap and everyone could take these implementations and outcompete the corporations, but those aren't the game-theoretic terms we're on here. They have the resources, and I promise they are not going to let the average joe be able to afford to outcompete them. They are the ones that are going to be able to get the most advantage from these tools. Why give them the extra leverage? It will be used to displace you. The ruling class, or those with the resources, have zero intention of letting the tide lift all boats. And if there are any in the ruling class that do have good intentions, they will be rooted out.
We see this evidence all across literature, history, and in their own actions. This year in Telluride, Colorado the Ski Patrol Union went on strike over wages. The billionaire owner who lives in California, Chuck Horning, did not want to concede to the Ski Patrollers over $66k spread out over 3 years, like 22k a year over the contract length. He shut down the ski resort during the Christmas holidays, and brought the town to its knees. This is just one example, but there are many. It is ideological to these people; it's about maintaining their control over the working class. We are at the beginning of a class struggle that Earth has never witnessed before, with way more lives at stake.
I do not think LLMs are going to lead to super intelligence btw, but I do believe they will get decent enough to uproot many lives when used as a weapon against the value of labor and to accelerate concentration of resources into the few(er). We are up against people like Chuck Horning, who'd rather destroy an entire town of workers over 22k a year than concede any power. They have zero interest in building an equitable society, or we wouldn't see this type of behavior. This will 100% get used to replace you, then what will they do with us? They aren't going to just let everyone chill, I promise you that.
I believe the devaluation (and surveillance) of labor because of LLMs and robotics (machine learning in general) is the most pressing issue of our time.
I get the draw to building cool tools with these things, but please don't do it in the open. Let someone else do it, and then we can call them out too. The slower these developments can happen the better.
This is very cool, but I'm having some trouble understanding the use cases.
Is this mostly just for codemode where the MCP calls instead go through a Monty function call? Is it to do some quick maths or pre/post-processing to answer queries? Or maybe to implement CaMeL?
It feels like the power of terminal agents is partly because they can access the network/filesystem, and so sandboxed containers are a natural extension?
I don't quite understand the purpose. Yes, it's clearly stated, but, what do you mean "a reasonable subset of Python code" while "cannot use the standard library"? 99.9% of Python I write for anything ever uses standard library and then some (requests?). What do you expect your LLM-agent to write without that? A pseudo-code sorting algorithm sketch? Why would you even want to run that?
They plan to use it for "Code Mode", which means the LLM will use this to run Python code that it writes to call tools, instead of having to load the tools up front into the LLM context window.
The idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see the intermediate value. Every step that depends on results from an earlier step also requires a new LLM turn, limiting parallelism and adding a lot of overhead.
With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.
I like your effort. Time savings and strict security are real and important. In modern orchestration flows, however, a subagent handles the extra processing of tool results, so the context of the main agent is not polluted.
It's pydantic, they're verifying types and syntax, and those don't require the stdlib. Type hints, syntax checks, likely logical issues, etc. Static type checking is good at that, but LLMs can take it to the next level, where they analyze the intended data flow and find logical bugs, or code with good syntax and typing that isn't what was intended.
For example, incorrect levels of indentation. Let me use dots instead of space because of HN formatting:
for key,val in mydict.items():
..if key == "operation":
....logging.info("Executing operation %s",val)
..if val == "drop_table":
....self.drop_table()
This uses good syntax, and the logging part depends on the stdlib, so I assume it would ignore it or replace it with dummy code? That shouldn't prevent it from analyzing that loop and determining that the second if-block was intended to be under the first, and that the way it is written now, the key check isn't applied to the drop.
In other words, if you don't want to validate proper stdlib/module usage, but proper __Python__ usage, this makes sense. Although I'm speculating on exactly what they're trying to do.
EDIT: I think my speculation was wrong, it looks like they might have developed this to write code for pydantic-ai: https://github.com/pydantic/pydantic-ai , I'll leave the comment above as-is though, since I think it would still be cool to have that capability in pydantic.
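To make the indentation point concrete, here is a small runnable version of the loop above. It is only an illustration: the function names and the sample dict are made up, and the logging call is left out so the snippet has no dependencies.

```python
def dropped_as_written(ops: dict[str, str]) -> list[str]:
    """Mirrors the loop as posted: the drop check sits outside the key check."""
    dropped = []
    for key, val in ops.items():
        if key == "operation":
            pass  # the original only logs here
        if val == "drop_table":
            dropped.append(key)  # fires for ANY key whose value is "drop_table"
    return dropped


def dropped_as_intended(ops: dict[str, str]) -> list[str]:
    """Presumed intent: only act when the key is "operation"."""
    dropped = []
    for key, val in ops.items():
        if key == "operation":
            if val == "drop_table":
                dropped.append(key)
    return dropped


# Hypothetical input: a free-text field happens to contain "drop_table".
sample = {"comment": "drop_table", "operation": "noop"}
print(dropped_as_written(sample))   # the "comment" key triggers a drop
print(dropped_as_intended(sample))  # nothing is dropped
```

Both versions parse and type-check fine, which is why this class of bug needs data-flow reasoning rather than a syntax check.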
I wish someone commanded their agent to write a Python "compiler" targeting WASM. I'm quite surprised there is still no such thing in this day and age...
Is ai running regular python really a problem? I see that in principle there is an issue. But in practice I don't know anyone who's had security issues from this. Have you?
i think there’s a confusion around what use-case Monty is solving (i was confused as well). this seems to isolate in a scope of execution like function calls, not entire Python applications
Didn’t Anthropic recently acquire some JavaScript engine, though?
I figured that that was because they want tighter integration and a safer execution environment for code written by the LLM. And sandboxing is already very common for JavaScript in browsers.
It seems that AI finally gives true pure-blood systems software the space to unleash its potential.
Pretty much all modern software tooling, once you remove the parts that aim to appeal to humans, becomes a much more reliable tool. But it's not clear if the performance will be better or not.
I like the idea a lot but it's still unclear from the docs what the hard security boundary is once you start calling LLMs - can it avoid "breaking out" into the host env in practice?
best answer is probably to have a layered approach - use this to limit what the generated code can do, wrap it in a secure VM to prevent leaking out to other tenants.
Of course not, especially when the security model is about access to resources like file systems that are outside the scope of what the Rust compiler can verify. While you won't have a data race in safe Rust you absolutely can have data races accessing the file system in any language.
Their security model, as explained in the README, is in not including the standard library and limiting all access to the environment to functions you write & control. Does that make it secure? I'll leave it to you to evaluate that in the context of your use case/threat model.
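For readers who want a feel for what "all access goes through functions you write and control" can look like, here is a toy sketch. It is NOT Monty's implementation or API and it is not a vetted sandbox; it is a tiny expression evaluator built on the stdlib `ast` module, where `evaluate`, `Rejected` and the `HOST` table are names invented for this example. Anything that is not a literal or a call to a host-registered function is refused.

```python
import ast
from typing import Any, Callable


class Rejected(Exception):
    """Raised when the script uses anything outside the tiny allowed subset."""


def evaluate(source: str, host_functions: dict[str, Callable[..., Any]]) -> Any:
    """Evaluate one expression; the only way out is via host_functions."""
    tree = ast.parse(source, mode="eval")

    def visit(node: ast.AST) -> Any:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.List):
            return [visit(element) for element in node.elts]
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.keywords:
                raise Rejected("keyword arguments are not supported")
            function = host_functions.get(node.func.id)
            if function is None:
                raise Rejected(f"unknown function {node.func.id!r}")
            return function(*[visit(argument) for argument in node.args])
        # Attribute access, imports, lambdas, etc. all end up here.
        raise Rejected(f"unsupported syntax: {type(node).__name__}")

    return visit(tree)


# Hypothetical host functions: the embedding application decides what exists.
HOST = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

print(evaluate("add(1, mul(2, 3))", HOST))
try:
    evaluate("__import__('os')", HOST)
except Rejected as exc:
    print("rejected:", exc)
```

Whether such a design is secure enough still depends on what the host functions themselves expose, which is the threat-model question raised above.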
It would appear to me that they used Rust primarily because a.) they want to deliver very fast startup times and b.) they want it to be accessible from a variety of host languages (like Python and JavaScript). Those are things Rust does well, though not to the exclusion of C or other GC-free compiled languages. They certainly do not claim that Rust is pixie dust you sprinkle on a project to make it secure. That would clearly be cargo culting.
I find this language war tiring. Don't you? Let's make 2026 the year we all agree to build cool stuff in whatever language we want without this pointless quarreling. (I've personally been saying this for three years at this point.)
Serious question: why not JUST use SELinux on generated scripts?
It would have access to the original runtimes and ecosystems, it can’t be tampered with, it’s well tested, and no amount of forks and tricky indirections will bypass syscalls.
Such runtimes come with a bill of technical debt, no support, their own specific documentation, and a lack of support for the ecosystem and features. And let’s hope it isn’t abandoned in two years.
The same could be said for docker or nix Linux, or isolated containers, etc… the level of security only needs to be good enough for LLMs, it doesn't even need to be secure against human-directed threats (specialist hackers)
I don't get what "the complexity of a sandbox" is. You don't have to use Docker. I've been running agents in bubblewrap sandboxes since they first came out.[0]
If the agent can only use the Python interpreter you choose then you could just sandbox regular Python, assuming you trust the agent. But I don't trust any of them because they've probably been vibe coded, so I'll continue to just sandbox the agent using bubblewrap.
It is absurd for any user to use a half baked Python interpreter, also one that will always majorly lag behind CPython in its support. I advise sandboxing CPython instead using OS features.
(Genuine question, I've been trying to find reliable, well documented, robust patterns for doing this for years! I need it across macOS and Linux and ideally Windows too. Preferably without having to run anything as root.)
One might have different profiles with different permissions. A network service usually wouldn't need your home directory while a personal utility might not need networking.
Also, that concept could be mixed with subprocess-style sandboxing. The two processes, main and sandboxed, might have different policies. The sandboxed one can only talk to main process over a specific channel. Nothing else. People usually also meter their CPU, RAM, etc.
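A minimal sketch of the metering half of that idea, assuming a POSIX system and only the Python standard library (`resource`, `subprocess`). The function name `run_metered` and its defaults are made up for illustration. It caps CPU and wall-clock time and talks to the child only over pipes; it does NOT restrict filesystem or network access, which would need an OS-level policy (bubblewrap, SELinux, a VM, etc.) layered on top.

```python
import resource
import subprocess
import sys


def run_metered(code: str, cpu_seconds: int = 2,
                wall_seconds: float = 10.0) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with a CPU-time cap (POSIX only)."""

    def apply_limits() -> None:
        # Runs in the child just before exec; the kernel signals the
        # process once it has used this much CPU time.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site
        input="",                 # the only channels are stdin/stdout/stderr pipes
        capture_output=True,
        text=True,
        timeout=wall_seconds,     # wall-clock backstop enforced by the parent
        preexec_fn=apply_limits,
    )


ok = run_metered("print(1 + 1)")
print(ok.returncode, ok.stdout.strip())

runaway = run_metered("while True: pass", cpu_seconds=1)
print(runaway.returncode)  # non-zero: the child was stopped by the CPU limit
```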
INTEGRITY RTOS had language-specific runtimes, esp Ada and Java, that ran directly on the microkernel. A POSIX app or Linux VM could run side by side with it. Then, some middleware for inter-process communication let them talk to each other.
Every time I use Docker as a sandbox people warn me to watch out for "container escapes".
I trust Firecracker more because it was built by AWS specifically to sandbox Lambdas, but it doesn't work on macOS and is pretty fiddly to run on Linux.
True, but while CPython does have a reputation for slow startup, completely re-implementing it isn't the only way to work around that - e.g. with eryx [1] I've managed to pre-initialize and snapshot the Wasm and pre-compile it, to get real CPython starting in ~15ms, without compromising on language features. It's doable!
Not having parity is a property they want, similar to Starlark. They explicitly want a less capable language for sandboxing.
Think of it as a language for their use case with Python's syntax and not a Python implementation. I don't know if it's a good idea or not, I'm just an intrigued onlooker, but I think lifting a familiar syntax is a legitimate strategy for writing DSLs.
Compile times, I can live with. You can run previous models on the gpu while your new model is compiling. Or switch from cargo to bazel if it is that bad.
this is pretty performant for short scripts if you measure time "from code to rust" which can be as low as 1us.
Of course it's slow for complex numerical calculations, but that's not the primary use case.
I think the consensus is that LLMs are very good at writing python and ts/js, generally not quite as good at writing other languages, at least in one shot. So there's an advantage to using python/js/ts.
I think the next big breakthrough will be cost effective model specialization, maybe through modular models. The monolithic nature of today’s models is a major weakness.
But most real world code needs to use (standard/3rd party) library, no? Or is this for AI's own feedback loop?
It doesn't have class support yet!
But it doesn't matter, because LLMs that try to use a class will get an error message and rewrite their code to not use classes instead.
Notes on how I got the WASM build working here: https://simonwillison.net/2026/Feb/6/pydantic-monty/
This is true in a sense, but every little papercut at the lower levels of abstraction degrades performance at higher levels as the LLM needs to spend its efforts on hacking around jank in the Python interpreter instead of solving the real problem.
Reminds me of the evolutionary debate. What's important is that just because something can learn to adapt doesn't mean it'll find an optimized adaptation, nor will it continually refine it.
As far as I can tell, AI will only solve problems well where the problem space is properly defined. Most people won't know how to do that.
> Monty avoids the cost, latency, complexity and general faff of using full container based sandbox for running LLM generated code.
> Instead, it let's you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.
My models are writing code all day in 3/4 different languages, why would I want to:
a) Restrict them to Python
b) Restrict them to a cutdown, less-useful version of Python?
My models write me Typescript and C# and Python all day with zero issues. Why do I need this?
Only if the training data has enough Python code that doesn't use classes.
(We're in luck that these things are trained on Stackoverflow code snippets.)
(This kind of extremely weak criticism often seems to come from newly created Hacker News accounts, which makes me wonder if it's mostly the same person using sockpuppets.)
Everyone was using git for reasons to me that seemed bandwagon-y, when Mercurial just had such a better UX and mental model to me.
Now, everyone is writing agent `exec`s in Python, when I think TypeScript/JS is far better suited for the job (it was always fast + secure, not to mention more reliable and information dense b/c of typing).
But I think I'm gonna lose this one too.
1. Large built-in standard library (CSV, sqlite3, xml/json, zipfile).
2. In Python, whatever the LLM is likely to do will probably work. In JS, you have the Node / Deno split, far too many libraries that do the same thing (XMLHTTPRequest / Axios / fetch), many mutually-incompatible import syntaxes (E.G. compare tsx versus Node's native ts execution), and features like top-level await (very important for small scripts, and something that an LLM is likely to use!), which only work if you pray three times on the day of the full moon.
3. Much better ecosystem for data processing (particularly csv/pandas), partially resulting from operator overloading being a thing.
You do? Deno is maybe a single digit percentage of the market, just hyped tremendously.
> E.G. compare tsx versus Node's native ts execution
JSX/TSX, despite what React people might want you to believe, are not part of the language.
> which only work if you pray three times on the day of the full moon.
It only doesn't work in some contexts due to legacy reasons. Otherwise it's just elaborate syntax sugar for `Promise`.
I think you misunderstood this. tsx in this context is/was a way to run typescript files locally without doing tsc yourself first, ie make them run like a script. You can just use Node now, but for a long time it couldn’t natively run typescript files.
The only limitation I run into using Node natively is you need to do import types as type imports, which I doubt would be an issue in practice for agents.
I wouldn't call it running TS natively - what they're doing is either using an external tool, or just stripping types, so several things, like most notably enums, don't work by default.
I mean, that's more than enough for my use cases and I'm happy that the feature exists, but I don't think we'll ever see a native TypeScript engine. Would have been cool, though, considering JS engines define their own internal types anyway.
Similarly: TypeScript, despite what Node people might want you to believe, is not part of the JavaScript language.
I've always used ts-node, so I forgot about tsx's existence, but still those are just tools used for convenience.
Nothing currently actually runs TypeScript natively and the blessed way was always to compile it to JS and run that.
In fact, the team has back pedaled into trying to make its own thing like in the early days.
Do you not realize how this sounds?
>many mutually-incompatible import syntaxes
Do you think there are 22 competing package managers in python because the package/import system "just works"?
There aren't; a large fraction of tools people mention in this context aren't actually package managers and don't try to be package managers. Sometimes people even conflate standards and config files with tools. It's really amazing how much FUD there is around it.
But more importantly, there is no such thing as "the package/import system". Packaging is one thing, and the language's import system is a completely different thing.
And none of that actually bears on the LLM's ability to choose libraries and figure out language syntax and APIs. For that matter, you don't have to let it set up the environment (or change your existing setup) if you don't want to.
Such changes take time, and I favor an "evolution trumps revolution" approach for such features. The JS/TS ecosystem has the advantage here, as it has already been through its roughest time since es2015. In hindsight, it was a very healthy choice, and a type system like the one TS provides is something many programming languages leave to be desired.
If it weren't for its rich standard library and uv, I would still clearly favor TS and a runtime like bun or deno. Python still suffers from spread out global state and some multi-paradigm approach when it comes to concurrency (if concurrency has even been considered by the library author). Python being the first programming language for many scientists shows its toll too: rich libraries of dubious quality in various domains. Whereas JS' origins in browser scripting contributed to the convention to treat global state as something to be frowned upon.
I wish both systems had good object schema validation built into the standard library. Python has the upper hand here with dataclasses, but it still follows a "take it or throw" approach, rather than supporting customization of validations.
IMHO it has a slightly better type system and runtime, but that’s totally irrelevant nowadays.
With AI doing mostly everything we should forget these past riddles. Now we all should be looking towards fail-safe systems, formal verification and domain modeling.
> Now we all should be looking towards fail-safe systems, formal verification and domain modeling.
We have been looking towards these things since the term distributed computing was coined, haven't we? Building fail-safe systems has been the goal ever since long-running processes became a thing.
Despite any "past riddles", the more expressive the type system the better the domain modeling experience, and I'd guess formal methods would benefit immensely from a good type system. Is there any formal language that is usable as general-purpose programming language I don't know of? I only ever see formal methods used for the verification of distributed algorithms or permission logic, on the theorem proving side of things, but I have yet to see a single application written only in something like Lean[0] or LiquidHaskell[1]...
[0]: https://lean-lang.org/
[1]: https://ucsd-progsys.github.io/liquidhaskell/
I do like Typescript (not JS) better, because of its highly advanced type system, compared to Python's.
TS/JS is not inherently fast, it just has a good JIT compiler; Python still ships without one. Regarding security, each interpreter is about as permissive as the other, and both can be sealed off from environment pretty securely.
LLMs are really good at writing python for data processing. I would suspect it's due to Python having a really good ecosystem around this niche
And the type safety/security issues can hopefully be mitigated by ty and pyodide (already used by cf’s python workers)
https://pyodide.org/en/stable/
https://github.com/astral-sh/ty
Monty’s overhead is so low that, assuming we get the security / capabilities tradeoff right (Samuel can comment on this more), you could always have it enabled on your agents with basically no downsides, which can’t be said for many other code execution sandboxes which are often over-kill for the code mode use case anyway.
For those not familiar with the concept, the idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see (all of) the intermediate value. Every step that depends on results from an earlier step requires a new LLM turn, limiting parallelism and adding a lot of overhead, expensive token usage, and context window bloat.
With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.
These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.
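A rough sketch of the difference, using two made-up tools (`list_orders`, `get_shipping_status`) that stand in for MCP calls; nothing here is pydantic-ai's or Monty's actual API. `code_mode_script` plays the role of the code the LLM would write, and character counts are used as a crude proxy for tokens sent back to the model.

```python
import json


# Hypothetical tools returning bulky payloads, as real APIs often do.
def list_orders(customer_id: str) -> list[dict]:
    return [
        {"id": f"ord-{i}", "customer_id": customer_id,
         "total_cents": 1000 * i, "notes": "x" * 200}
        for i in range(1, 4)
    ]


def get_shipping_status(order_id: str) -> dict:
    status = "delayed" if order_id == "ord-2" else "shipped"
    return {"order_id": order_id, "status": status, "carrier_payload": "y" * 200}


def traditional_context_cost(customer_id: str) -> int:
    """Every tool result is serialized back into the model's context."""
    orders = list_orders(customer_id)
    seen_by_model = json.dumps(orders)
    for order in orders:  # each of these would also be a separate LLM turn
        seen_by_model += json.dumps(get_shipping_status(order["id"]))
    return len(seen_by_model)


def code_mode_script(customer_id: str) -> dict:
    """What the LLM might write: chain the tools, return only what it needs."""
    delayed = [
        order["id"]
        for order in list_orders(customer_id)
        if get_shipping_status(order["id"])["status"] == "delayed"
    ]
    return {"delayed_order_ids": delayed}


result = code_mode_script("c1")
print(result)
print(len(json.dumps(result)), "chars vs", traditional_context_cost("c1"))
```

The intermediate payloads never reach the model in the second version; only the final dict (or an error) does.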
You guys and astral are my favorite groups in the python ecosystem
Thank you!!
Yes, I was also thinking.. y MCP den
But even my simple class project reveals this. You actually do want a simple tool wrapper layer (abstraction) over every API. It doesn't even need to be an API. It can be a calculator that doesn't reach out anywhere.
as the article puts it: "MCP makes tools uniform"
In hindsight, it's pretty funny and obvious
And on GPU side, the existing libraries provide DSL based JITs, thus for many scenarios the performance is not much different from C++.
Now NVidia is also on the game with the new tile based architecture, with first party support to write kernels in Python even.
Yep, still using good old hg for personal repos - interop for outside projects defaults to git since almost all the hg hosts withered.
There is a ton of wheel reinvention going on right now cause everyone wants to be cool in the age of ai
Use boring tech, you'll thank me and yourself later
Which in this case means, just use regular python. Your devops team is unlikely to allow knock off python in production. TS is fine too, I mainly write Go
Really tired of every AI-related tool released as of late being a half-GB node behemoth with hundreds of library dependencies.
Or alternatively some cryptic academic Rust codebase.
Why would one drag this god forsaken abomination on server-side is beyond me.
Even effing C# nowadays can be run in a script-like manner from a single file.
—
Even the latest Codex UI app is Electron. The one that is supposed to write itself with AI wonders but couldn’t manage native swiftui, winui, and qt or whatever is on linux these days.
Typescript’s types are far more adaptable and malleable, even with the latest C# 15 which is belatedly adding Sum Types. If I set TypeScript to its most strict settings, I can even make it mimic a poor man’s Haskell and write existential types or monoids.
And JS/TS have by far the best libraries and utilities for JSON and xml parsing and string manipulation this side of Perl (the difference being that the TypeScript version is actually readable), and maybe Nushell but I’ve never used Nushell in production.
Recently I wrote a Linux CLI tool for managing podman/Quadlet containers and I wrote it in TypeScript and it was a joy to use. The Effect library gave me proper Error types and immutable data types and the Bun Shell makes writing shell commands in TS nearly as easy as Bash. And I got it to compile to a single self-contained binary which I can run on any server and which has a lower memory footprint and faster startup time than any equivalent .NET code I’ve ever written.
And yes, had I written it in Rust it would have been faster and probably even safer, but for a quick and dirty tool, development speed matters, and I can tell you that I really appreciated not having to think about ownership and fighting the borrow checker the whole time.
TypeScript might not be perfect, but it is a surprisingly good language for many domains and is still undervalued IMO given what it provides.
I’m especially curious about where the Pydantic team wants to take Monty. The minimal-interpreter approach feels like a good starting point for AI workloads, but the long tail of Python semantics is brutal. There is a trade-off between keeping the surface area small (for security and predictability) and providing sufficient language capabilities to handle the non-trivial snippets that LLMs generate to do complex tasks.
I think in the near term we'll add support for classes, dataclasses, datetime, json. I think that should be enough for many use cases.
disclaimer: i work at E2B, opinions my own
But to be clear, we're not even targeting the same "computer use" use case I think e2b, daytona, cloudflare, modal, fly.io, deno, google, aws are going after - we're aiming to support programmatic tool calling with minimal latency and complexity - it's a fundamentally different offering.
Chill, e2b has its use case, at least for now.
Perhaps you're using v8 isolates, which then you're back into the "heavily restricted environment within the process" and you lose the things you'd want your AI to be able to do, and even then you still have to sandbox the hell out of it to be safe and you have to seriously consider side channel leaks.
And even after all of that you'd better hope you're staying up to date with patches.
MicroVMs are going to just be way simpler IMO. I don't really get the appeal of using V8 for this unless you have platform/ deployment limitations. Talking over Firecracker's vsock is extremely fast. Firecracker is also insanely safe - 3 CVEs ever, and IMO none are exploitable.
And Python VM had/has its sandboxing features too, previously rexec and still https://github.com/zopefoundation/RestrictedPython - in the same category I'd argue.
Then there's of course hypervisor based virtualization and the vulnerabilities and VM escapes there.
Browsers use belt-and-suspenders approaches of employing both language runtime VMs and hardware memory protection as layers to some effect, but still are the star act at pwn2own etc.
It's all layers of porous defenses. There'd definitely be room in the world for performant dynamic language implementations with provably secure foundations.
Also known as the "swiss cheese model" in risk management.
although you’d still need another boundary to run your app in to prevent breaking out to other tenants.
The security angle is probably the most compelling part. Running arbitrary AI-generated Python in a full CPython runtime is asking for trouble — the attack surface is enormous. Stripping it down to a minimal subset at least constrains what the generated code can do.
The bet here seems to be that AI-generated code can be nudged to use a restricted subset through error feedback loops, which honestly seems reasonable for most tool-use scenarios. You don't need metaclasses and dynamic imports to parse JSON or make API calls.
that said, the class restriction feels weird. classes aren't the security boundary. file access, network, imports - that's where the risk is. restricting classes just forces the model to write uglier code for no security gain. would be curious if the restrictions map to an actual threat model or if it's more of a "start minimal and add features" approach.
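A toy sketch of that error-feedback loop, under loud assumptions: `check_subset` is a stand-in for a restricted interpreter (it only rejects `class` statements via the stdlib `ast` module), and `fake_model` is a stub in place of a real LLM call. Neither reflects how Monty or any agent framework actually does this.

```python
import ast
from typing import Callable, Optional


class UnsupportedFeature(Exception):
    """Raised when generated code uses something outside the allowed subset."""


def check_subset(source: str) -> None:
    # Stand-in for a restricted runtime: refuse class definitions.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            raise UnsupportedFeature(
                f"line {node.lineno}: class definitions are not supported"
            )


def generate_until_accepted(
    generate: Callable[[Optional[str]], str], max_attempts: int = 3
) -> tuple[str, int]:
    """Feed each rejection message back to the generator until code passes."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = generate(error)
        try:
            check_subset(source)
        except UnsupportedFeature as exc:
            error = str(exc)  # this is what the model would see next turn
            continue
        return source, attempt
    raise RuntimeError(f"no acceptable code after {max_attempts} attempts")


def fake_model(error: Optional[str]) -> str:
    # Hypothetical model: tries a class first, falls back to a function.
    if error is None:
        return "class Point:\n    pass\n"
    return "def make_point(x, y):\n    return {'x': x, 'y': y}\n"


source, attempts = generate_until_accepted(fake_model)
print(attempts)
print(source)
```

Each rejected attempt costs an extra model turn, which is the "papercut" cost other commenters point out.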
Perhaps if the interpreter is in turn embedded in the executable and runs in-process, but even a do-nothing `uv` invocation takes ~10ms on my system.
I like the idea of a minimal implementation like this, though. I hadn't even considered it from an AI sandboxing perspective; I just liked the idea of a stdlib-less alternative upon which better-thought-out "core" libraries could be stacked, with less disk footprint.
Have to say I didn't expect it to come out of Pydantic.
My reasoning is 1) AIs can comprehend specs easily, especially if simple, 2) it is only valuable to "meet developers where they are" if really needing the developers' history/experience which I'd argue LLMs don't need as much (or only need because lang is so flexible/loose), and 3) human languages were developed to provide extreme human subjectivity which is way too much wiggle-room/flexibility (and is why people have to keep writing projects like these to reduce it).
We should be writing languages that are super-strict by default (e.g. down to the literal ordering/alphabetizing of constructs, exact spacing expectations) and only having opt-in loose modes for humans and tooling to format. I admit I am toying w/ such a lang myself, but in general we can ask more of AI code generations than we can of ourselves.
But I'd be interested to see what you come up with.
I think skills and other things have shown that a good bit of learning can be done on-demand, assuming good programming fundamentals and no surprise behavior. But agreed, having a large corpus at training time is important.
I have seen, given a solid lang spec to a never-before-seen lang, modern models can do a great job of writing code in it. I've done no research on ability to leverage large stdlib/ecosystem this way though.
> But I'd be interested to see what you come up with.
Under active dev at https://github.com/cretz/duralade, super POC level atm (work continues in a branch)
Tokenization joke?
everything that you don’t want your agent to access should live outside of the sandbox.
Just beware of panics!
While I think all LLMs are shit, they probably eventually will not be shit, and it will be because people like you contributed to their progress. Nothing good will come of it for you or your peers. The Billionaires who own everything will kick you out to the curb as soon as you train your replacement that doesn't sleep, eat or complain. Have some class solidarity.
Open source has been responsible for enormous productivity boosts in our industry, because we don't all have to build duplicates of exactly the same thing time and time again.
But think of all of the jobs that were lost by people who would otherwise have been employed building the 500th version of a CSS design system, or a template engine, or code to handle website logins!
What makes AI tools different? (And I actually do agree that they feel different, but I'm interested in hearing arguments stronger than "it feels different".)
It's about the role of technologies in evolution, responsibility versus utilitarian take, etc. It should be developed and discussed seriously, but not in a buried sub-thread.
To put it gently, yes it feels different: for people who haven't already saved a lifetime of SWE wages, this is the first credible threat to the sector in which they're employed since the dot com bubble. People need to work to eat.
You cannot compare any open source software, even as a whole, to the impact that LLMs have had on labor and are projected to have. However, I might now argue it would have been better to not have so much open source, as it's clearly being processed through these plagiarism-laundering training regimes.
I don't really think LLMs, robotics and ML in general are going to increase GDP globally; they will instead just replace the inputs that maintain the status quo (the workers). If they can't successfully replace human labor, they will at minimum greatly reduce its value, which is extremely dangerous.
Jobs grew greatly during the last 30 years of open source development, but over the last 16 months we've had 350-400k SWE layoffs in the USA. Many of these layoffs have been directly correlated to AI-enhanced productivity. 25% of recent college graduates are unemployed. Jobs data is super unreliable at the moment, but we will also see large swaths of the lower-skilled sectors, customer service for example, face huge layoffs in the coming 24 months.
Despite what C-Suites say about AI giving them more free time for their hobbies or whatever, they've yet to answer how people are going to afford those hobbies. Working as a barista lol? These same mouthpieces will say that LLMs are going to allow the same number of engineers to get 10x more done, but they're not reflecting that in their business decisions. They are laying people off in swaths when equities are at all-time highs, it's abnormal.
I think it's more likely the ruling classes will give us something to do by making us so poor that young men will beg to go fight wars. Put us to use on behalf of their conquest for more resources, that certainly did the trick in the 20s, 30s and 40s :/
I'm an optimist on this and I remain hopeful that AI will create more and better jobs, but I'm not at all certain about that. It's possible it will play out the way you describe, and that will suck.
I'm not ready to blame the 100,000s of software layoffs on AI though - I think the more likely explanation for those is over-hiring during Covid combined with the end of ZIRP.
The second use case is for humans to learn from humans. Your open source projects are excellent examples, same with the Django and Python open source ecosystem.
I just hope humans will not stop learning. As long as you share your passion for learning, people will learn from you. It has nothing to do with automation.
The invention of the digital calculator turned human calculators into accountants, and that's great! We're contributing to the same process now
Corporations and billionaires will get TI-Nspires; we get TI-83s.
I do not agree that inference will get more affordable in time to prevent harm. It will cause way more problems with the devaluation of labor before it starts to solve those problems, and in that period they will solidify their control over society.
We already see it in how ML is being used on a vast scale to build advanced surveillance infrastructure. Let's not build the advanced calculators for them for free in open source please, they'd like nothing better. I wrote a lot more in the comments above also.
If anyone has time, this is required reading imho: https://archive.nytimes.com/www.nytimes.com/books/97/05/18/r...
These inequalities already exist
I don't know how to prevent people from stopping this without shaming them. I think more shaming might be required, as uncomfortable as that may be. It's a society-wide prisoner's dilemma (well, if I don't build it, someone else will), except this isn't a prisoner's dilemma and we can coordinate, sort of.
It would be one thing if GPUs and tokens were cheap and everyone could take these implementations and outcompete the corporations, but those aren't the game-theoretical terms we're on here. They have the resources, and I promise they are not going to let the average joe be able to afford to outcompete them. They are the ones that are going to be able to get the most advantage from these tools. Why give them the extra leverage? It will be used to displace you. The ruling class, or those with the resources, have zero intention of letting the tide lift all boats. And if there are any in the ruling class that do have good intentions, they will be rooted out.
We see evidence of this all across literature, history, and in their own actions. This year in Telluride, Colorado, the Ski Patrol Union went on strike over wages. The billionaire owner who lives in California, Chuck Horning, did not want to concede to the ski patrollers over $66k spread out over 3 years, like 22k a year over the contract length. He shut down the ski resort during the Christmas holidays and brought the town to its knees. This is just one example, but there are many. It is ideological to these people; it's about maintaining their control over the working class. We are at the beginning of a class struggle that Earth has never witnessed before, with way more lives at stake.
I do not think LLMs are going to lead to super intelligence btw. I do believe it will get decent enough to uproot many lives when it's used as a weapon against the value of labor and to accelerate concentration of resources into the few(er). We are up against people like Chuck Horning, who'd rather destroy an entire town of workers over 22k a year than concede any power. They have zero interest in building an equitable society, or we wouldn't see this type of behavior. This will 100% get used to replace you, then what will they do with us? They aren't going to just let everyone chill, I promise you that.
I believe the devaluation (and surveillance) of labor because of LLMs and robotics (machine learning in general) is the most pressing issue of our time.
I get the draw to building cool tools with these things, but please don't do it in the open. Let someone else do it, and then we can call them out too. The slower these developments can happen the better.
Is this mostly just for codemode where the MCP calls instead go through a Monty function call? Is it to do some quick maths or pre/post-processing to answer queries? Or maybe to implement CaMeL?
It feels like the power of terminal agents is partly because they can access the network/filesystem, and so sandboxed containers are a natural extension?
The idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see the intermediate value. Every step that depends on results from an earlier step also requires a new LLM turn, limiting parallelism and adding a lot of overhead.
With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.
These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.
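To make the difference concrete, here is a minimal sketch. The tools `search_orders` and `get_customer` and their payloads are invented stand-ins for MCP tools, not any real API. The function `model_written_script` plays the role of the code an LLM might emit in code mode: it chains two tools and hands back only a small summary, whereas traditional tool calling would send every full tool result back through the model.

```python
import json

# Hypothetical data and tools standing in for MCP tools (invented for illustration).
ORDERS = [
    {"id": 1, "customer_id": 10, "total": 40.0, "status": "open", "notes": "x" * 500},
    {"id": 2, "customer_id": 11, "total": 15.5, "status": "open", "notes": "y" * 500},
    {"id": 3, "customer_id": 10, "total": 99.0, "status": "closed", "notes": "z" * 500},
]
CUSTOMERS = {10: {"name": "Ada", "address": "..."}, 11: {"name": "Bob", "address": "..."}}


def search_orders(status):
    """Tool 1: return full order records with the given status."""
    return [order for order in ORDERS if order["status"] == status]


def get_customer(customer_id):
    """Tool 2: return the full customer record."""
    return CUSTOMERS[customer_id]


def model_written_script():
    """Code-mode style: chain tool calls and keep only the fields that matter."""
    totals = {}
    for order in search_orders("open"):
        name = get_customer(order["customer_id"])["name"]
        totals[name] = totals.get(name, 0.0) + order["total"]
    return totals


result = model_written_script()

# Compare what would go back to the model in each style.
raw_size = len(json.dumps(search_orders("open")))  # traditional: full tool result
result_size = len(json.dumps(result))              # code mode: just the summary
print(result, raw_size, result_size)
```

The intermediate order and customer records never reach the model, and the whole chain runs in one turn instead of one turn per dependent call.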
For example, incorrect levels of indentation. Let me use dots instead of space because of HN formatting:
for key,val in mydict.items():
..if key == "operation":
....logging.info("Executing operation %s",val)
..if val == "drop_table":
....self.drop_table()
This uses good syntax, and the logging part is not in the stdlib, so I assume it would ignore it or replace it with dummy code? That shouldn't prevent it from analyzing that loop and determining that the second if-block was intended to be under the first, and the way it is written now, the key check isn't done.
In other words, if you don't want to validate proper stdlib/module usage, but proper __Python__ usage, this makes sense. Although I'm speculating on exactly what they're trying to do.
EDIT: I think my speculation was wrong, it looks like they might have developed this to write code for pydantic-ai: https://github.com/pydantic/pydantic-ai , I'll leave the comment above as-is though, since I think it would still be cool to have that capability in pydantic.
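Here is a runnable version of the loop above that shows why the indentation matters. The `run` helper and the `config` dict are made up for illustration, and the log and drop calls are replaced by entries in a list so nothing real is touched.

```python
def run(config, nested):
    """Return the actions the loop would take for a config dict.

    nested=False mirrors the snippet as written above;
    nested=True is the presumably intended version.
    """
    actions = []
    for key, val in config.items():
        if nested:
            if key == "operation":
                actions.append(f"log {val}")
                if val == "drop_table":  # only checked for the "operation" key
                    actions.append("drop_table")
        else:
            if key == "operation":
                actions.append(f"log {val}")
            if val == "drop_table":  # checked for every key
                actions.append("drop_table")
    return actions


# A harmless operation, plus an unrelated key whose value happens to be "drop_table".
config = {"operation": "select", "comment": "drop_table"}

as_written = run(config, nested=False)
as_intended = run(config, nested=True)
print(as_written)
print(as_intended)
```

As written, the table is dropped because of the unrelated `comment` key. Both versions are syntactically valid, which is why catching this needs analysis of intent and not just a parser.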
Web demo: https://pyodide.org/en/stable/console.html
My current security model is to give it a separate Linux user.
So it can blow itself up and... I think that's about it?
You don't have to give it bash, depending on your tools at least.
> So it can blow itself up and... I think that's about it?
And exfiltrate data via the Internet, fill up disk space...
Any human or AI want to take the challenge?
https://play.rust-lang.org
https://github.com/rust-lang/rust-playground
Will explore this for https://toolkami.com/, which allows plug and play advanced “code mode” for AI agents.
And now for something completely different.
Claude Code always resorts to running small python scripts to test ideas when it gets stuck.
Something like this would mean I don't need to approve every single experiment it performs.
I figured that that was because they want tighter integration and a safer execution environment for code written by the LLM. And sandboxing is already very common for JavaScript in browsers.
Pretty much all modern software tooling, once you remove the parts that aim to appeal to humans, becomes a much more reliable tool. But it's not clear if the performance will be better or not.
Or is all Rust code secure unquestionably?
Their security model, as explained in the README, is in not including the standard library and limiting all access to the environment to functions you write & control. Does that make it secure? I'll leave it to you to evaluate that in the context of your use case/threat model.
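As a rough illustration of that idea (this is not Monty's API or implementation, just a toy built on Python's `ast` module), here is an evaluator where the only way code can reach the outside world is through functions the host explicitly passes in. The names `evaluate`, `get_price` and `host` are invented for the example.

```python
import ast
import operator

_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(source, host_functions):
    """Evaluate one expression.

    Only number/string literals, + - * / and direct calls to names in
    host_functions are allowed; every other syntax node is rejected.
    Toy only: it does not limit CPU time or memory.
    """
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in host_functions
            and not node.keywords
        ):
            return host_functions[node.func.id](*[walk(arg) for arg in node.args])
        raise ValueError(f"not allowed: {type(node).__name__}")

    return walk(ast.parse(source, mode="eval"))


def get_price(sku):
    """A host function the embedding application chooses to expose."""
    return {"apple": 1.5, "pear": 2.0}[sku]


host = {"get_price": get_price}
total = evaluate('get_price("apple") * 4 + get_price("pear")', host)

try:
    evaluate('__import__("os").system("id")', host)
    blocked = False
except ValueError:
    blocked = True

print(total, blocked)
```

Whether an approach like this is secure enough still comes down to what the exposed host functions can do, which is the point about evaluating it against your own threat model.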
It would appear to me that they used Rust primarily because a.) they want to deliver very fast startup times and b.) they want it to be accessible from a variety of host languages (like Python and JavaScript). Those are things Rust does well, though not to the exclusion of C or other GC-free compiled languages. They certainly do not claim that Rust is pixie dust you sprinkle on a project to make it secure. That would clearly be cargo culting.
I find this language war tiring. Don't you? Let's make 2026 the year we all agree to build cool stuff in whatever language we want without this pointless quarreling. (I've personally been saying this for three years at this point.)
It will have access to the original runtimes and ecosystems, and it can't be tampered with; it's well tested, and no amount of forks and tricky indirections will bypass syscalls.
Such runtimes come with a bill of technical debt, no support, specific documentation, and lack of support for ecosystem and features. And let's hope it isn't abandoned in two years.
The same could be applied to Docker or Nix on Linux, or isolated containers, etc. The level of security should be good enough for LLMs, even if not secure against human-directed (specialist hacker) threats.
I also want my models to be able to write typescript, python, c# etc, or any language and run it.
Having the model have access to a completely minimal version of python just seems like a waste of time.
If the agent can only use the Python interpreter you choose then you could just sandbox regular Python, assuming you trust the agent. But I don't trust any of them because they've probably been vibe coded, so I'll continue to just sandbox the agent using bubblewrap.
[0] https://blog.gpkb.org/posts/ai-agent-sandbox/
https://en.wikipedia.org/wiki/List_of_Python_software#Python...
(Genuine question, I've been trying to find reliable, well documented, robust patterns for doing this for years! I need it across macOS and Linux and ideally Windows too. Preferably without having to run anything as root.)
https://danwalsh.livejournal.com/28545.html
One might have different profiles with different permissions. A network service usually wouldn't need your home directory while a personal utility might not need networking.
Also, that concept could be mixed with subprocess-style sandboxing. The two processes, main and sandboxed, might have different policies. The sandboxed one can only talk to main process over a specific channel. Nothing else. People usually also meter their CPU, RAM, etc.
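A minimal sketch of the two-process shape, assuming a POSIX system: the main process starts a child, talks to it only over stdin/stdout with JSON messages, and caps its CPU time with `resource.setrlimit`. The `CHILD` program and the `run_sandboxed` helper are invented for the example. This shows only the channel and metering parts; the child still has normal filesystem and network access unless a policy layer (SELinux, seccomp, a container and so on) is added on top.

```python
import json
import resource
import subprocess
import sys

# The "sandboxed" program: reads one JSON request from stdin, writes one JSON reply.
CHILD = r"""
import json, sys
request = json.loads(sys.stdin.readline())
print(json.dumps({"sum": sum(request["numbers"])}))
"""


def _limit_resources():
    # Runs in the child just before exec (POSIX only): cap CPU time at 2 seconds.
    # Memory could be capped in a similar way with resource.RLIMIT_AS on Linux.
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))


def run_sandboxed(numbers):
    """Send one request to a fresh child process and return its reply."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", CHILD],  # -I: isolated mode, ignores user site/env
        input=json.dumps({"numbers": numbers}) + "\n",
        capture_output=True,
        text=True,
        timeout=10,  # wall-clock limit enforced by the main process
        preexec_fn=_limit_resources,
        check=True,
    )
    return json.loads(proc.stdout)


reply = run_sandboxed([1, 2, 3])
print(reply)
```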
INTEGRITY RTOS had language-specific runtimes, esp Ada and Java, that ran directly on the microkernel. A POSIX app or Linux VM could run side by side with it. Then, some middleware for inter-process communication let them talk to each other.
https://github.com/microsoft/litebox might somehow allow it too if a tool can be built on top of it, but there is no documentation.
I trust Firecracker more because it was built by AWS specifically to sandbox Lambdas, but it doesn't work on macOS and is pretty fiddly to run on Linux.
It's not industrial-grade safety for public use, but it'll do for personal use. Other tools for it are also mentioned.
[1] https://github.com/eryx-org/eryx
Think of it as a language for their use case with Python's syntax and not a Python implementation. I don't know if it's a good idea or not, I'm just an intrigued onlooker, but I think lifting a familiar syntax is a legitimate strategy for writing DSLs.
Of course it's slow for complex numerical calculations, but that's the primary usecase.
I think the consensus is that LLMs are very good at writing python and ts/js, generally not quite as good at writing other languages, at least in one shot. So there's an advantage to using python/js/ts.