Technical · February 2, 2026

Semantic Code Caching: A Better Way to Do Long-Running LLM Agents

A retro telephone on a beach with its cord trailing into the surf

Companies building long-running LLM agents hit a wall: context degrades quality, and debugging 200k-token runs across many iterations is brutal. The fix turns out to be old-school: create well-documented code, store it, and re-execute it. The only difference now is that the LLMs do this themselves, not developers. Think of it as a private Stack Overflow for your LLM agents: they store scripts from prior runs, search them semantically, and vote on what worked.
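That loop can be sketched in a few lines. Here a toy bag-of-words similarity stands in for a real embedding model, and the cache entries are invented for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Cached scripts from prior runs: a description, a path, and a vote score.
cache = [
    {"desc": "export monthly invoices to csv", "path": "export_invoices.py", "votes": 4},
    {"desc": "sync crm contacts into the warehouse", "path": "sync_contacts.py", "votes": 2},
]

def search(task: str) -> list[dict]:
    # Rank by semantic similarity, breaking ties with thumbs-up votes.
    q = embed(task)
    return sorted(cache, key=lambda s: (cosine(q, embed(s["desc"])), s["votes"]), reverse=True)

best = search("export last month's invoices as csv")[0]
print(best["path"])  # → export_invoices.py
```

The agent executes the top hit if one scores high enough, and only falls back to generating fresh code (and uploading it) on a cache miss.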

A surreal landscape of clouds and monolithic structures with a river flowing through

The better way to execute long, complex work was in front of us the whole time: LLMs have gotten good enough at code generation that we can generate and execute code live in production. The missing piece is caching and reusing it. A production agent needs lightweight, high-level, task-specific scripts, not the full prod codebase.

Many bookfacers seemed curious when I explored running code in production agents in a post last week. After talking to these curious founders, I realized this could be a standalone product!

One of the challenges of executing code in production is that generating >200 lines of working code in a single LLM call is slow and inconsistent. Many companies have the model construct “TODOs,” or maintain “playbooks” with instructions for long-running procedures. Combining these ideas: what if the LLM agent, when it spawned in, had a library of scripts from prior runs that it could execute?

Desert landscape with code overlay and geometric blocks representing cached code modules

Context management through a coding agent: a subtle but very important improvement in coding agents over the last year is their ability to `ls` a folder and work out which files they actually need to read. When an agent spawns in with eight well-named scripts in a folder, it knows which one to read, and potentially execute, for a given user task.
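To make the point concrete, here is a minimal sketch of that spawn-time flow. The folder contents and the keyword-overlap scorer are invented for illustration; in practice the LLM itself makes the judgment call from the filenames:

```python
import os
import tempfile

# Hypothetical cache folder an agent sees on spawn; names are invented.
cache_dir = tempfile.mkdtemp()
for name in ["refund_duplicate_charge.py", "generate_weekly_report.py", "rotate_api_keys.py"]:
    open(os.path.join(cache_dir, name), "w").close()

# Step 1: the agent's first move is just `ls` -- well-named files carry
# enough signal to decide what to read before spending tokens on contents.
listing = sorted(os.listdir(cache_dir))

# Step 2: pick a script by name. A crude keyword overlap stands in for
# the model's judgment here.
def pick_script(task: str, names: list[str]) -> str:
    task_words = set(task.lower().split())
    def score(name: str) -> int:
        return len(task_words & set(name.removesuffix(".py").split("_")))
    return max(names, key=score)

print(pick_script("generate the weekly report", listing))  # → generate_weekly_report.py
```

The cheap part is the insight: the folder listing alone is a few dozen tokens, so the agent never has to load every cached script into context.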

Why Anthropic “skills” are insufficient: Anthropic has been pushing a similar idea with “skills,” which gets almost all the way there in terms of dynamically loading executable code. The problem is that skills are developer-maintained, NOT AI-agent-maintained. If you are a company with 50 clients, and each client has just 20 common tool-call patterns, suddenly you're building and maintaining 1,000 skills? Essentially impossible.

A person coding on a laptop at a beach with code flowing into the ocean waves

What I made in the last 3 weeks: At raysurfer.com, I am providing infrastructure for AI-agent-maintained skills, aka semantic code caching. LLM agents can look for code they previously executed, and upload, comment on, or vote thumbs up or thumbs down on code they have executed. If you have a long-running LLM agent in prod that is hitting consistency issues, DM me, happy to personally look at what you're doing and help.
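For a sense of what the upload/comment/vote lifecycle looks like, here is a hypothetical in-memory version. Every class and method name below is illustrative, not raysurfer.com's actual API:

```python
# Hypothetical in-memory code cache; all names here are invented for
# illustration and do not reflect the real raysurfer.com API.
class CodeCache:
    def __init__(self):
        self.entries = []

    def upload(self, desc: str, source: str) -> dict:
        # Store a script alongside a searchable description and vote tallies.
        entry = {"desc": desc, "source": source, "up": 0, "down": 0, "comments": []}
        self.entries.append(entry)
        return entry

    def vote(self, entry: dict, up: bool) -> None:
        # Agents report whether a re-executed script actually worked.
        entry["up" if up else "down"] += 1

    def comment(self, entry: dict, text: str) -> None:
        # Free-form notes for future agents, like answers under a question.
        entry["comments"].append(text)

cache = CodeCache()
e = cache.upload("fetch open support tickets", "print('tickets')")
cache.vote(e, up=True)  # the run worked, so future agents will prefer this script
cache.comment(e, "handles pagination correctly")
```

The voting step is what makes the cache self-correcting: scripts that keep failing sink, and the agent regenerates and re-uploads a replacement.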