zlacker

Monty: A minimal, secure Python interpreter written in Rust for use by AI

submitted by dmpetr+(OP) on 2026-02-06 21:16:36 | 311 points 154 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only [show all posts]
4. simonw+fj[view] [source] 2026-02-06 23:13:00
>>dmpetr+(OP)
I got a WebAssembly build of this working and fired up a web playground for trying it out: https://simonw.github.io/research/monty-wasm-pyodide/demo.ht...

It doesn't have class support yet!

But it doesn't matter, because LLMs that try to use a class will get an error message and rewrite their code without classes instead.

Notes on how I got the WASM build working here: https://simonwillison.net/2026/Feb/6/pydantic-monty/

◧◩◪
13. sd2k+Hp[view] [source] [discussion] 2026-02-07 00:00:54
>>avaer+en
True, but while CPython does have a reputation for slow startup, completely re-implementing it isn't the only way to work around that - e.g. with eryx [1] I've managed to pre-initialize and snapshot the Wasm and pre-compile it, to get real CPython starting in ~15ms without compromising on language features. It's doable!

[1] https://github.com/eryx-org/eryx
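
For a concrete sense of the pre-compile half of that, here's a simplified sketch using the wasmtime Python bindings (illustration only, not eryx's actual code; the .wasm/.cwasm paths are placeholders):

    # Build step: compile CPython-on-Wasm to native code once and cache it.
    from wasmtime import Engine, Module

    engine = Engine()
    module = Module.from_file(engine, "python.wasm")
    with open("python.cwasm", "wb") as f:
        f.write(module.serialize())

    # Startup path: deserializing the pre-compiled blob skips compilation
    # entirely, which is what gets interpreter start-up down to milliseconds.
    with open("python.cwasm", "rb") as f:
        fast_module = Module.deserialize(engine, f.read())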

◧◩
15. shoeb0+Fr[view] [source] [discussion] 2026-02-07 00:20:25
>>avaer+Mm
A big benefit of letting agents run code is they can process data without bloating their context.

LLMs are really good at writing Python for data processing. I suspect it's due to Python having a really good ecosystem around this niche.

And the type safety/security issues can hopefully be mitigated by ty and Pyodide (already used by Cloudflare's Python Workers)

https://pyodide.org/en/stable/

https://github.com/astral-sh/ty

◧◩
23. bityar+4u[view] [source] [discussion] 2026-02-07 00:43:26
>>OutOfH+0n
Python already has a lot of half-baked (all the way up to nearly-fully-baked) interpreters, what's one more?

https://en.wikipedia.org/wiki/List_of_Python_software#Python...

◧◩
27. notepa+7v[view] [source] [discussion] 2026-02-07 00:53:33
>>krick+0u
It's Pydantic; they're verifying types and syntax, and those don't require the stdlib. Type hints, syntax checks, likely logical issues, etc. Static type checking is good at that, but LLMs can take it to the next level, analyzing the intended data flow and finding logical bugs, or code with valid syntax and types that still isn't what was intended.

For example, incorrect levels of indentation:

    for key, val in mydict.items():
        if key == "operation":
            logging.info("Executing operation %s", val)
        if val == "drop_table":
            self.drop_table()

This uses valid syntax, and the logging part isn't available without the stdlib, so I assume it would ignore it or replace it with dummy code? That shouldn't prevent it from analyzing that loop and determining that the second if-block was intended to be nested under the first; as written, the key check never guards the drop_table call.
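
For comparison, the nesting the above was presumably meant to have (same fragment, with the second if moved under the first):

    for key, val in mydict.items():
        if key == "operation":
            logging.info("Executing operation %s", val)
            if val == "drop_table":
                self.drop_table()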

In other words, if you don't want to validate proper stdlib/module usage, but proper __Python__ usage, this makes sense. Although I'm speculating on exactly what they're trying to do.

EDIT: I think my speculation was wrong; it looks like they might have developed this to write code for pydantic-ai: https://github.com/pydantic/pydantic-ai . I'll leave the comment above as-is though, since I think it would still be cool to have that capability in pydantic.

28. imfing+Dv[view] [source] 2026-02-07 01:00:44
>>dmpetr+(OP)
This is a really interesting take on the sandboxing problem. It reminds me of an experiment I worked on a while back (https://github.com/imfing/jsrun), which embedded V8 into Python to allow running JavaScript with tightly controlled access to the host environment - a similar goal of running untrusted code from Python.

I’m especially curious about where the Pydantic team wants to take Monty. The minimal-interpreter approach feels like a good starting point for AI workloads, but the long tail of Python semantics is brutal. There is a trade-off between keeping the surface area small (for security and predictability) and providing sufficient language capabilities to handle the non-trivial snippets LLMs generate for complex tasks.

◧◩
30. JoshPu+Mv[view] [source] [discussion] 2026-02-07 01:02:02
>>JoshPu+Jv
rlm-rs: https://crates.io/crates/rlm-rs src: https://github.com/synth-laboratories/Horizons
◧◩◪
32. DouweM+Ax[view] [source] [discussion] 2026-02-07 01:19:44
>>impuls+Gu
(Pydantic AI lead here) We’re implementing Code Mode in https://github.com/pydantic/pydantic-ai/pull/4153 with support for Monty and abstractions to use other runtimes / sandboxes.

The idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see the intermediate value. Every step that depends on results from an earlier step also requires a new LLM turn, limiting parallelism and adding a lot of overhead.

With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.

These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.
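
To make that concrete, here's a hypothetical code-mode snippet - the tool functions are made-up stand-ins rather than the actual Pydantic AI API:

    # Stub tools standing in for whatever the sandbox would expose:
    def list_orders(customer_id):
        return [{"id": 1, "status": "overdue"}, {"id": 2, "status": "paid"}]

    def send_reminder(order_id):
        return {"sent": True}

    # The kind of code the LLM writes in code mode:
    orders = list_orders(customer_id="c_123")        # full result stays in the sandbox
    overdue = [o for o in orders if o["status"] == "overdue"]
    for o in overdue:
        send_reminder(order_id=o["id"])              # chained call, no extra LLM turn

    # Only this summary goes back into the model's context:
    result = {"overdue_count": len(overdue)}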

◧◩◪
35. DouweM+qy[view] [source] [discussion] 2026-02-07 01:31:03
>>shoeb0+Fr
(Pydantic AI lead here) That’s exactly what we built this for: we’re implementing Code Mode in https://github.com/pydantic/pydantic-ai/pull/4153 which will use Monty by default, with abstractions to use other runtimes / sandboxes.

Monty’s overhead is so low that, assuming we get the security / capabilities tradeoff right (Samuel can comment on this more), you could always have it enabled on your agents with basically no downsides - which can’t be said for many other code-execution sandboxes, which are often overkill for the code mode use case anyway.

For those not familiar with the concept, the idea is that in “traditional” LLM tool calling, the entire (MCP) tool result is sent back to the LLM, even if it just needs a few fields, or is going to pass the return value into another tool without needing to see (all of) the intermediate value. Every step that depends on results from an earlier step requires a new LLM turn, limiting parallelism and adding a lot of overhead, expensive token usage, and context window bloat.

With code mode, the LLM can chain tool calls, pull out specific fields, and run entire algorithms using tools with only the necessary parts of the result (or errors) going back to the LLM.

These posts by Cloudflare: https://blog.cloudflare.com/code-mode/ and Anthropic: https://platform.claude.com/docs/en/agents-and-tools/tool-us... explain the concept and its advantages in more detail.

44. SafeDu+hA[view] [source] 2026-02-07 01:46:53
>>dmpetr+(OP)
Sandboxing is going to be of growing interest as more agents go “code mode”.

Will explore this for https://toolkami.com/, which allows plug-and-play advanced “code mode” for AI agents.

◧◩◪
51. OutOfH+HG[view] [source] [discussion] 2026-02-07 03:00:58
>>simonw+1z
Docker and other container runners allow it. https://containers.dev/ allows it too.

https://github.com/microsoft/litebox might somehow allow it too if a tool can be built on top of it, but there is no documentation.

◧◩◪
58. nickps+fM[view] [source] [discussion] 2026-02-07 04:14:10
>>simonw+1z
It could be difficult. My first thought would be an SELinux policy like the one this article attempted:

https://danwalsh.livejournal.com/28545.html

One might have different profiles with different permissions. A network service usually wouldn't need your home directory, while a personal utility might not need networking.

Also, that concept could be mixed with subprocess-style sandboxing. The two processes, main and sandboxed, might have different policies. The sandboxed one can only talk to the main process over a specific channel - nothing else. People usually also meter its CPU, RAM, etc.

INTEGRITY RTOS had language-specific runtimes, especially Ada and Java, that ran directly on the microkernel. A POSIX app or Linux VM could run side by side with them. Then, some middleware for inter-process communication let them talk to each other.
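
A rough sketch of that subprocess idea (my own illustration, not from the article): the main process caps the child's CPU and memory with setrlimit and talks to it only over stdin/stdout.

    import resource
    import subprocess
    import sys

    def apply_limits():
        # Runs in the child just before exec: cap CPU seconds and address space.
        resource.setrlimit(resource.RLIMIT_CPU, (5, 5))               # 5 seconds of CPU
        resource.setrlimit(resource.RLIMIT_AS, (256 * 1024**2,) * 2)  # 256 MiB of memory

    # The sandboxed code only sees the pipe; here it just echoes back upper-cased input.
    child_code = "import sys; print(sys.stdin.read().upper())"

    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        preexec_fn=apply_limits,  # POSIX only
    )
    out, _ = proc.communicate("hello from the main process", timeout=10)
    print(out)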

◧◩◪◨⬒⬓
66. IhateA+NP[view] [source] [discussion] 2026-02-07 05:02:42
>>JoshPu+XM
It liberates those who have massive resources to run gigantic models at whatever scale they want.

Corporations and billionaires will get TI-Nspires; we get TI-83s.

I do not agree that inference will get more affordable in time to prevent harm. It will cause way more problems with the devaluation of labor before it starts to solve those problems, and in that period they will solidify their control over society.

We already see it in how ML is being used on a vast scale to build advanced surveillance infrastructure. Let's not build the advanced calculators for them for free in open source, please - they'd like nothing better. I wrote a lot more in the comments above as well.

If anyone has time, this is required reading imho: https://archive.nytimes.com/www.nytimes.com/books/97/05/18/r...

◧◩
67. thunde+SP[view] [source] [discussion] 2026-02-07 05:03:52
>>c2xlZX+1u
https://github.com/butter-dot-dev/bvisor is pushing in that direction
◧◩◪◨
71. fulafe+6S[view] [source] [discussion] 2026-02-07 05:38:07
>>scolvi+Xy
There's been a constant stream of V8 sandbox-escape discoveries since its dawn, of course. Considering those bugs have mostly existed for a long time before publication, it's very porous most of the time.

And the Python VM has had its sandboxing features too - previously rexec, and still https://github.com/zopefoundation/RestrictedPython - in the same category, I'd argue.

Then there's of course hypervisor based virtualization and the vulnerabilities and VM escapes there.

Browsers use a belt-and-suspenders approach, employing both language-runtime VMs and hardware memory protection as layers, to some effect - but they're still the star act at Pwn2Own etc.

It's all layers of porous defenses. There'd definitely be room in the world for performant dynamic language implementations with provably secure foundations.

◧◩◪
72. kodabl+lS[view] [source] [discussion] 2026-02-07 05:42:13
>>bityar+yt
> that's the only way they "learn" anything

I think skills and other things have shown that a good bit of learning can be done on-demand, assuming good programming fundamentals and no surprise behavior. But agreed, having a large corpus at training time is important.

I have seen that, given a solid spec for a never-before-seen language, modern models can do a great job of writing code in it. I've done no research on their ability to leverage a large stdlib/ecosystem this way, though.

> But I'd be interested to see what you come up with.

Under active dev at https://github.com/cretz/duralade, super POC level atm (work continues in a branch)

80. globul+zY[view] [source] 2026-02-07 07:29:42
>>dmpetr+(OP)
I don't get what "the complexity of a sandbox" is. You don't have to use Docker. I've been running agents in bubblewrap sandboxes since they first came out.[0]

If the agent can only use the Python interpreter you choose, then you could just sandbox regular Python, assuming you trust the agent. But I don't trust any of them because they've probably been vibe-coded, so I'll continue to just sandbox the agent using bubblewrap.

[0] https://blog.gpkb.org/posts/ai-agent-sandbox/
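
Roughly the shape of it, as a simplified sketch (not my exact setup): read-only root, throwaway /home and /tmp, fresh namespaces with no network, and the child dies with the parent.

    import os
    import subprocess

    def run_sandboxed(script_path: str) -> str:
        # Expose only the script's own directory inside the sandbox, read-only.
        workdir = os.path.dirname(os.path.abspath(script_path))
        cmd = [
            "bwrap",
            "--ro-bind", "/", "/",          # whole filesystem read-only
            "--tmpfs", "/home",             # hide real home directories
            "--tmpfs", "/tmp",
            "--dev", "/dev",
            "--proc", "/proc",
            "--ro-bind", workdir, "/work",
            "--unshare-all",                # new namespaces, including no network
            "--die-with-parent",
            "python3", os.path.join("/work", os.path.basename(script_path)),
        ]
        return subprocess.run(cmd, capture_output=True, text=True, timeout=60).stdout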

◧◩
105. johndo+td1[view] [source] [discussion] 2026-02-07 11:15:36
>>theano+M31
Not sure if this is what you are looking for, but here is Python compiled to WASM: https://pyodide.org/en/stable/

Web demo: https://pyodide.org/en/stable/console.html

◧◩◪◨⬒
106. whilen+Gd1[view] [source] [discussion] 2026-02-07 11:17:30
>>nudpie+381
Conflating types in binary operations hasn't been an issue for me since I started using TS in 2016. Even before that, it was just the result of domain modeling done badly, and I think software engineers got burned enough by using dynamic type systems at scale... but that's a discussion that should have been had 10 years ago. We all moved on from that, or at least I hope we did.

> Now we all should be looking towards fail-safe systems, formal verification and domain modeling.

We've been looking towards these things since the term "distributed computing" was coined, haven't we? Building fail-safe systems has been the goal for as long as long-running processes have been a thing.

Despite any "past riddles", the more expressive the type system, the better the domain-modeling experience, and I'd guess formal methods would benefit immensely from a good type system. Is there any formal language usable as a general-purpose programming language that I don't know of? I only ever see formal methods used for the verification of distributed algorithms or permission logic, on the theorem-proving side of things, but I have yet to see a single application written only in something like Lean[0] or LiquidHaskell[1]...

[0]: https://lean-lang.org/

[1]: https://ucsd-progsys.github.io/liquidhaskell/

◧◩
118. ontouc+pk1[view] [source] [discussion] 2026-02-07 12:36:01
>>ontouc+fk1
We already have a starting point:

https://play.rust-lang.org

https://github.com/rust-lang/rust-playground

152. hypert+Cm2[view] [source] 2026-02-07 20:07:52
>>dmpetr+(OP)
Potentially unrelated tangent thought:

The Man Who Listens to Horses (1997) is an excellent book by Monty Roberts about learning the language of horses and observing and listening to animals: https://www.biblio.com/search.php?stage=1&title=The+Man+Who+...

Video demonstration of the above: https://www.youtube.com/watch?v=vYtTz9GtAT4

◧◩◪◨⬒
153. its-su+5X2[view] [source] [discussion] 2026-02-07 23:47:22
>>simonw+EL
Outside of VM usage, the answer seems to be (on top of containerization and SELinux) writing a tight seccomp filter.

Gleaned from https://github.com/containers/bubblewrap/blob/0c408e156b12dd... and https://github.com/containers/bubblewrap/tree/0c408e156b12dd...

[go to top]