Salami: Designing a runtime for AI-generated code
NB: Salami is the new name for the project (changed from “Slalom” as it was taken). Salami: “Safe LLM-driven Interpreter”.
Here’s a vision. We’re bound to have chatbots in every little app, and users might want to ask them to do something easy like this:
Send the wedding invitations, wait a week for the responses. Use GPT to read them: allow extensions, and make sure everyone specifies both the +1 and the meal responses. Contact me immediately if I have anything to worry about.
If I get a message from the landlord, forward it to my lawyer. Wait a week. If the lawyer says nothing, notify me. Otherwise, thank him and contact me directly for further instructions.
Complex triggers, awaiting indefinitely and far into the future, retained state, continuations from the user… Can an LLM write code that does this kind of stuff? Can you? Will it use setTimeout(..., 7*24*60*60*1000)? (GPT-4 did use time.sleep(604800) for those examples.) Will it have eval(code_for_next_steps)?1 Would you run any of that? How?
To give this power to our users, we want to be able to run this kind of code cheaply, easily, safely, and reliably at scale, using a solution that works with the architecture of the app and doesn’t impose its own. This problem is greatly simplified by the fact that none of those little scripts need to be performant, and they don’t demand state synchronization at a large scale.
The true power of LLMs for this kind of task is not that they can write Python code that looks like it’s written by a human. It’s that they wouldn’t be against writing much more boring, explicit code in a much more boring, syntactically poor, foolproof language. And given enough guardrails, that’s the kind of code we might consider running blindly2.
I am developing a custom runtime, supporting the kind of features the users would reasonably request, the kind of language that LLMs can reasonably write, the kind of architecture that engineers can easily integrate, and the kind of foolproof APIs that won’t be dangerous to expose. Below is the feature set I consider unusual, yet essential.
I’m working on Salami: a runtime, and a little language, for safely and prudently running AI-generated code. I am trying to approach it thoughtfully, and I started this blog to collect my thoughts and get feedback. Please subscribe3, if you’re interested in the topic: today I’m talking about the runtime, and I’ll go deeper into the specifics of its internal language next week. I’m also planning to release the first batch of code on GitHub in a couple of weeks, under the MIT license: do give it a star.
Foolproof APIs
At its core, Salami would be a simple VM, deterministic and Turing-complete. But some things need to be supported at the runtime level, to let API integration designers give something safe and useful to the users, and to let the LLM write simplistic code without worrying about careful prompt-engineering-level safety checks. So let’s talk for a second about what we want from the API, and then we’ll get more technical about the kinds of things we can do at the runtime level to enable this.
How do we make an API, to be called from LLM-written code, that doesn’t lead to many regrets? I spent last year as an LLM-integration consultant (as self-punishment for curiosity), to learn how people even go about it. My biggest gripe is that this essentially probabilistic industry is dominated by happy-path thinking. If one sends something to GPT-4 and it kinda works on the third try, sure, it might feel “magical”, but it’s still the opposite of a feature. The real work in LLM integration is the work of covering the sad paths, the ugly paths, the boring paths.
There’s a trust issue with running these little LLM-generated workflows. The notion of trust here is a bit different from the security experts’ idea of it. Obviously, as an app developer, all the code that comes from your users is untrusted in the security sense, whether it’s written by an LLM or not. But it’s not trusted even for the users – because LLMs might be stupid, and because users might be confused about what they want. They don’t have enough experience to avoid all the problems; nobody does. It can’t be their responsibility not to ask the LLM to produce too many paperclips, because they will.
How do we design an API to prevent the LLMs from shooting the users in the foot? There’s no absolute solution. But we have some tools that can help. Ideally, all the exposed actions should be undoable. If it’s impossible, we can ask the user for permission for every particular action. We can ask them for a sanity check about a plan. We can slow things down when a lot of them happen. We can try and find ways for all of this to not be too annoying. We should do all of this without the users asking for it in the first place.
We should design something that makes it much easier to ask for permission than forgiveness. When the user asks to delete a file, put it in a “Bin”. When she asks to clear the “Bin”, put its contents into an even deeper “Bin”. And when the user says “send an email to my lawyer”, and the LLM says “send_email”, what should actually happen is: create a draft of the letter, show it to the user, let her edit it, and require that she is the one who hits “Send”. This can’t be solved with a safety-oriented prompt, as you can never trust an LLM to always add the right checks. No, the API should sound bold and clear to the LLM, and be actually careful and permission-seeking in its implementation.
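To make that concrete, here’s a minimal sketch of what such a bold-sounding-but-careful binding could look like on the host side. The names and the draft mechanics are made up for illustration; this is not a real Salami API.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    to: str
    subject: str
    body: str
    sent: bool = False   # flips to True only when the user hits "Send" in the UI

DRAFTS: list[Draft] = []

def send_email(to: str, subject: str, body: str) -> dict:
    """What the generated code calls. It sounds final, but it only drafts."""
    draft = Draft(to=to, subject=subject, body=body)
    DRAFTS.append(draft)
    # In a real app, this is where the draft gets surfaced for review;
    # the actual sending happens from the UI, never from generated code.
    return {"status": "draft_created", "draft_index": len(DRAFTS) - 1}
```

The LLM keeps writing send_email(...) as if it were decisive; the caution lives entirely on our side.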
The runtime must have everything to help developers do the right thing (instead of simply dumping an OpenAPI spec into the “function calling” interface). That’s why the following features are important: they help give the user much more control over what happens, without complicating the code behind it. Giving the user more control usually makes business logic much more convoluted and hard to write; the runtime has the power to make it much easier.
So, here are some of the features I think are essential, and a bit unusual for a VM:
One can pause, serialize, and restore any running (or waiting) process
Green threads… with backtracking
Foreign functions can return a syntactic continuation
Resource management by slowing down
Storable runtime
The first unusual feature of the Salami VM is that you can pause it (or let it pause itself while waiting for something), serialize it into a binary blob to store in some kind of database, and then restore it and continue running it when it’s time to do so. The reason we need this is that it would be very hard to integrate and run this kind of code safely if you had to keep the actual processes around.
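To show the shape of the lifecycle I mean, here’s a toy sketch. The Vm class, its method names, and pickle-as-serialization are all stand-ins, not the actual implementation.

```python
import pickle

class Vm:
    """Toy stand-in for a Salami process; names and internals are invented."""
    def __init__(self, program: str):
        self.program = program
        self.pc = 0                # interpreter position
        self.waiting_for = None    # event key, once the thread yields

    def run_until_wait(self) -> str:
        # ...interpret until the code yields on an external event...
        self.waiting_for = "email.reply:wedding-invites"
        return self.waiting_for

    def serialize(self) -> bytes:
        return pickle.dumps(self)  # the whole process state as one blob

    @staticmethod
    def restore(blob: bytes) -> "Vm":
        return pickle.loads(blob)

# The lifecycle: run, shelve in whatever database you already have, wake later.
vm = Vm("wait for the RSVP replies; nag everyone after a week")
event_key = vm.run_until_wait()
blob = vm.serialize()              # store (event_key, blob) wherever you like
# ...a week passes; the process lives nowhere in particular...
vm = Vm.restore(blob)              # pull the blob, deliver the event, continue
```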
The performance problem of Salami is not about making a single thread run fast. Make your API fast; the Salami code should be high-level and can take its time. No, the performance problem is about running, say, a million of those threads4. Threads written by LLMs under unclear commands from users, that don’t care, or know anything, about your architecture or whatever you consider good async practices. Threads that mostly wait for events, or do something small. Making them independently storable (in any database you like), and easy to move from one machine to another — this is the key to a scalability approach that can adapt to whatever your app’s architecture is based on. You can run it on your server, on the edge, or as a WASM module in the user’s browser.
This ability looks innocent, but it is actually quite a serious complication. It rules out a lot of easy ways around the burden of making a VM, such as meta-circular interpretation, or reusing most of the existing VMs. It also puts some burden on the API interface designers: e.g. it’s still unclear whether it’s worth differentiating between “fast async” calls and calls that support being paused and shelved in progress (and which would thus often require some kind of state management).
But such a feature also makes a lot of patterns easy, even ones that don't require killing a process. Say you're writing a script for a video game: a character needs to go to the castle, then visit the bathroom, get out, and look at the stars. There might be a couple of conditions along the way. How do you code this up? Yield every destination from a coroutine? Typical coroutines in scripting languages are not storable and often require advanced cooperation from the runtime. One ends up reaching for an explicit state machine. Salami is designed from the ground up to enable this kind of implicit-state-machine coding without the typical limitations.
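For a feel of it, here’s roughly what that quest script would look like as plain sequential code. The go_to / wait_for names are invented; the point is that every call is a suspension point the runtime can serialize mid-way.

```python
# Straight-line quest script: each call below is secretly a place where the
# process can be shelved to storage for hours of game time and resumed later.
def night_out(character, world):
    character.go_to("castle")
    if world.gate_locked("castle"):
        character.wait_for("castle.gate.open")   # cold sleep, not a busy loop
    character.go_to("bathroom")
    character.go_to("courtyard")
    if world.is_night():
        character.look_at("stars")
```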
Green threads, implicit process state
Oh, coroutines. I am getting ahead of myself.
The green threads requirement was born directly from my experiments trying to get LLMs to write event-driven code for the workflows described above5. I was trying to be prudent and not implement an async runtime, so I wanted to see how well LLMs would manage callback hell and state synchronization. Well, I’m not impressed. They can do it, but they don’t want to. If something should happen after something else, they just want to call the functions, one after another.
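A small contrast of the two styles, with made-up API names, for the landlord workflow from the beginning of the post:

```python
# Stubs standing in for a hypothetical mail-automation API.
def forward(msg, to): ...
def wait_for_reply(from_, timeout_days): ...
def notify_me(text): ...
def send_thanks(reply): ...
def ask_me_for_instructions(): ...

# What the model happily writes: one thing after another, state kept implicitly.
def handle_landlord(landlord_msg):
    forward(landlord_msg, to="lawyer")
    reply = wait_for_reply(from_="lawyer", timeout_days=7)
    if reply is None:
        notify_me("The lawyer hasn't answered in a week.")
    else:
        send_thanks(reply)
        ask_me_for_instructions()

# What a callback-driven integration demands instead: every step becomes its
# own handler, and the "what happens next" state has to be threaded by hand.
def on_landlord_message(msg, state): ...
def on_lawyer_reply(reply, state): ...
def on_week_elapsed(state): ...
```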
Green threads can be thought of as a great way to implicitly define the state machine behind an asynchronous process. But the state machines you define that way are kinda limited: they are basically linear, or looping. So I am experimenting with some kind of a “backtracking” technique for coroutines, where you can “undo” a little bit. Given that “undoability” is also a great thing to demand from the foreign APIs (see above), and that we will likely be able to statically encode and deduce the undoability of some parts of the code, this might enable something interesting. But it’s still more of an experiment, so no promises here.
Green threads are not a performance feature. Much more important here, in my opinion, might be determinism. So the threads have priorities, and they will actually wait for other threads before emitting their own effects. This is also an area of ongoing research and experimentation.
Any thread can yield to wait for an event. The event-matching values are defined by the foreign API, so that it’s easy for the external executor to see what kinds of events are being awaited and perform the needed effects accordingly. If ten threads are awaiting a value to be supplied from the database, you might be able to pack them into one SQL call. The events being awaited are also stored and serialized separately from the thread itself, so it’s easier to query your database of threads for only the ones that might need waking up.
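The batching point might be clearer as a sketch of the external executor’s loop. The table layout and the db / runtime objects here are invented for illustration, not part of Salami.

```python
def fetch_event_value(event_key: str):
    """Stub: check whether the awaited event has happened, e.g. one batched
    SQL query, or one external API call, per event key rather than per thread."""
    return None

def tick(db, runtime):
    # Dormant threads live in storage, keyed by the event they await,
    # so one lookup per event key can serve many of them at once.
    waiting = db.query(
        "SELECT event_key, thread_id, blob FROM threads WHERE status = 'waiting'"
    )
    by_event = {}
    for event_key, thread_id, blob in waiting:
        by_event.setdefault(event_key, []).append((thread_id, blob))

    for event_key, group in by_event.items():
        value = fetch_event_value(event_key)
        if value is None:
            continue                    # nothing happened yet; stay in cold storage
        for thread_id, blob in group:
            vm = runtime.restore(blob)  # wake only the threads that matter
            vm.deliver(event_key, value)
            db.save(thread_id, vm.serialize())
```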
When I talk about Salami to people, they sometimes mention BEAM (the Erlang VM), its actor model, massive parallelism, and let-it-crash philosophy. BEAM is a great inspiration, but I think none of the three is useful in this context. If there’s an error in a process like the ones we want to support, it shouldn’t simply crash: it should notify the user and ask for a way out. The massive parallelism that Salami should enable is very different from Erlang’s: its processes don’t need to stay in memory; their sleep is cold sleep in storage. As for the actor model, none of my experiments with the potential use cases made this model seem very essential, or nicer to use (we’ll talk about the language approach to state synchronization in the next post)6.
Syntactic continuations
Sometimes you’re giving some instructions, but you are not in the mood to cover all the cases. We’ll get there when we get there, and we’ll see. In Rust, you put a todo!() in that place, and the process will happily panic. But can we actually make these deferred instructions work, without a crash and restart?
Every step into a foreign function in Salami can currently return one of these things: a return value, a reference to a closure to call, an event to wait for (see above), or a “syntactic continuation”. Basically, that last one is a piece of code that will execute as if it were written right after that function call. This piece of code has access to all the variable bindings from all the parent environments. It’s just a fancy name for “eval”, I suppose; except this code will go through the compile-time type checks, which I don’t think is typical: do strongly-typed languages ever have eval?
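In host-side terms, I picture the result of a foreign call as a small sum type, something like the sketch below (the names are invented). The interesting variant is the last one: a piece of source code to splice in right after the call site.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Value:            # an ordinary return value
    value: object

@dataclass
class CallClosure:      # "please call this closure next"
    closure_ref: int

@dataclass
class WaitFor:          # park the thread until this event arrives
    event_key: str

@dataclass
class Continuation:     # type-checked source, spliced in after the call site
    source: str

ForeignResult = Union[Value, CallClosure, WaitFor, Continuation]
```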
Syntactic continuations let one make it up as one goes along. One can go and ask the users for directions, or even go straight to an LLM to propose a good next step. When doing so, the users will see the exact context and values that their continuation will be executed in, and that should make it much easier for them to make good decisions.
Fair resource management
Explicit and fair resource management is not only about making sure no user hogs all the computing space-time to themselves. It also helps make it likely that, if any shooting-in-the-foot happens, it happens slowly enough to notice and prevent much damage.
The language facilities I described already make it quite easy to implement all kinds of limits by, first of all, slowing the code down. By saying that, e.g., one can’t remove more than one email per minute, we don’t introduce much inconvenience7, as our threads can easily run without any supervision. And then we can also prevent a catastrophe: if, in ten minutes, the user finds that the emails should’ve been kept, they’ve only lost ten of them.
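The throttling itself can live entirely in the foreign API, since waiting costs a shelved thread nothing. A rough sketch, with invented names, and with a time.sleep standing in for what would really be a cold, storable wait on a timer event:

```python
import time

BIN: list[str] = []             # soft-deleted emails, recoverable for a while

DELETE_INTERVAL_SECONDS = 60    # at most one deletion per minute
_last_delete = 0.0

def delete_email(email_id: str) -> None:
    """Looks like a plain delete to the generated code; actually throttled
    and reversible."""
    global _last_delete
    remaining = DELETE_INTERVAL_SECONDS - (time.time() - _last_delete)
    if remaining > 0:
        time.sleep(remaining)   # in Salami: yield and get shelved until the timer fires
    _last_delete = time.time()
    BIN.append(email_id)        # soft delete: the "Bin" from the section above
```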
That’s what I have on my mind, and to some extent, in code as well. What do you think? Would you want to try it? Is it at least interesting? Misguided? Is it an affront to all the good VM writing? Will this never ever work? Am I missing something obvious? Is there already a language or a runtime that does all of this? Am I misusing an important concept? Should I read a nice little paper? Is there a neat bimonoidal *-autonomous category that is perfect for what I’m building? Do tell me!
Next week, I’ll be talking about the internal language: will it have braces or tabs? Should it maybe be visual? How strong are its types, how shared is its state, how delimited are its continuations? Subscribe to find out.
Will it be a callback hell with a custom trigger API for every case you can foresee? Will it rely on a random cron-style SaaS that would be gone in 2025? Will it be a truly massive YAML? Will it spend dollars/hour on GPT API to ask it directly for the next steps every minute?
If we don't, someone will end up blindly running LLM-generated JS. Avoiding that is the ethical imperative behind my work.
It’s not my ambition, it’s your ambition!
I use quantized deepseek-coder, quantized StableCode, and cloud calls to gpt-3.5-turbo-instruct and codellama. Salami is aimed at commodity LLMs.
Anthropomorphizing two different things is bound to make it tempting to equate them, so “LLM Actors” seems to be a thing people like to say, but I’m personally not sold on this stuff.
Except, like, maybe in a shred-all-evidence situation…