Code as Agent Harness

or, How Chatbots Became Agents

Tim Dingman

Jun 03, 2026

Originally presented as a live talk on June 3, 2026

Paper · Source papers

Background

To start off, I want to take us back. All the way back, to before the dawn of GenAI.

When I first started at Scale back in early 2021, the first project I was ever on used ML. Specifically, we used classifiers - models that take whatever input and spit out a class, like taking in a drawing of furniture and giving a confidence or probability for each class. For my project, we took as input a description of a grocery item, and the model predicted what category that item was: Canned Tomatoes, Frozen Fish, Packaged Cookies, Thank You Cards, and so on.

When you train a classifier, you give it the whole set of classes it could predict in advance. Like for our grocery classifier, we had to say in advance what all the classes were. You couldn’t just pass in a description and say “What item is this?” the way you can now with LLMs. That’s what generative AI means, it can generate new stuff on its own.

So you have this vocabulary of classes, and you make training data by saying, this description is Precut Melon, this description is Wet Cat Food, whatever.

The architecture, the way those classifiers worked, is broadly the same as the architecture of most LLMs now: a transformer, using attention and feed-forward layers. If you don’t know what those are, no problem - just being concrete for those who do.

Now let’s extrapolate. Let’s say that instead of these grocery items as classes, I had every word in the English dictionary as my vocabulary. And instead of grocery item descriptions, I had just any old text. Then my “classifier” could “classify” each input by what word came next.

Now you have a generative model. It’s still just text in, text out, but now the output text can be anything.

That oversimplifies somewhat, about how generative models are typically decoders rather than encoder-decoders and about how it’s tokens rather than words the model is predicting. But it really is the same core idea, that if you can just predict one thing at a time really well, you can eventually just keep predicting basically forever, on anything.

Your training data is just whatever text you can find, above some minimum quality bar anyway. You can take a bit of text, train your model to predict the next token, see if it was right, give it feedback, over and over and over again for every next token you can find. That is how you train - or more specifically, “pretrain” - an LLM.

After enough pretraining, like tens of trillions of tokens worth, you now have a base model, i.e. a model that can predict the next token for some input text. It can’t chat, can’t engage in dialog yet, but it can autocomplete like nobody’s business. And as with our classifier, each next token prediction comes with a probability.

This is what the first GPT did, back when it came out in 2018.

Now since then LLMs have gotten a lot better of course. For one, they’ve gotten a lot smarter; they know much more, they are better at reasoning, they are more perceptive and so on.

For another, they’re no longer just autocompletes! Through the magic of post-training, for which Scale provides data in our GenAI line of business, base models become instruct models. Instruct models converse, they don’t simply continue the train of thought you provided.

This is what OpenAI did with GPT-3 to turn it into InstructGPT - which Scale contributed training data to - and then GPT-3.5, the model that powered ChatGPT at launch. But it wasn’t just the model that changed - it was the model’s environment too. Whereas GPT-1, GPT-2, and GPT-3 were almost exclusively API services, ChatGPT was a consumer experience: on the web, with a UI, engaging in dialog.

Now remember, the model itself was not different between the API of GPT-3.5 and the website of ChatGPT. Ultimately it received text input and produced text output, like all its predecessors. But exactly what text it got was different. For example, a user on ChatGPT has a location and a timezone, which the model might use to adapt its slang or know to say “Good night” instead of “Good morning”. So where the model interacts is already starting to cause diverging behavior a bit.

As many of us will recall, the earliest version of ChatGPT was really just a chatbot: you put in a message, it sent a message back. No web search, no running code it wrote, no memory outside of the current conversation. And by the way, the conversation could only be 4096 tokens - about 3k words.

As time went on, the barebones chatbot experience changed in a few important ways.

For one, that context window got longer. GPT-4 doubled it to 8192, then quadrupled it to 32k. By late 2023, only a year after ChatGPT came out, GPT-4 Turbo bumped it all the way up to 128k. That ceiling generally stuck around for a while, although now 1M is becoming the standard.

Of course the intelligence improved dramatically too. A fun toy of a chatbot became a productive tool for many people, by API or by GUI.

Crucially for us, the software around the model improved too. Web search, code execution, and memory - what we broadly call “tool use” now - all enhanced the model’s performance without altering the model itself. And they provided a source of information and interaction aside from the user.

So the software grew alongside the model. And new software capabilities influenced model training.

For one, training models to use tools at all takes work, since that type of interaction didn’t exist before LLMs and so isn’t going to be in the pretraining data you scrape from the internet. But more specifically, the LLMs need new tools, tools with different affordances and abilities.

That’s the whole reason behind Model Context Protocol, or MCP as it’s more commonly known. Models have different strengths and weaknesses than either humans or regular old software, so your tools have to lean into those. So for instance, models like structured text, like regular software does, but it can’t handle a flood of output like regular software does. So tools should return a large-but-still-limited amount of structured text.

Or think about memory, which in most cases is just a set of well structured folders and Markdown files. A normal software program would use a database and would have clear rules for storing and fetching data. An LLM needs to work memory in naturally and quickly. It’s a different sort of tool, and models need training on when and how to best use it.

So we see that the initially clean split between model and software has started to heal, where the optimal model performance comes only within particular software. Kind of like a marriage, where two separate people come to share one mind in many ways.

So the marriage between model and software gets us some pretty slick chatbot capabilities. But the fundamental framing of chat is still around a conversation, not about getting work done “out there” so to speak.

One capability in ChatGPT and other chat portals that started to break that pattern was Deep Research, which came out for ChatGPT in February 2025. In a regular chat session, you could start out with one question but typically would need to keep directing the model on research steps, then ask it to compile its work, etc.

With Deep Research though, the model was autonomous: you gave it a question, it might request some initial clarification, then it would go off and work. It felt like the model was “out there” on the internet, not there in your conversation, and that it would “come back” when it was done, report in hand.

That was the first in-production peek at agentic workflows. The framing had shifted; instead of having an AI assistant, you now had an AI worker, returning you substantially complete work in one go.

But some problems require more than just a chat interface in the browser, or even a chat interface with a secret room on the server for the model to go Deep Researching in. You might need a different environment for some work, and really a different paradigm for working.

Anthropic realized that. And a few weeks after ChatGPT Deep Research came out, they released Claude Code.

If you’re not familiar, Claude Code is a program you install on your computer. That program lets Claude do things on your computer, basically anything that can be done with a command line, which is most things. You ask it to do stuff, it goes off and does it until it needs you again.

That software is called a scaffold or a harness, people use both terms but they mean the same thing: a program that turns a model into an agent. The harness provides prompts and tools and guardrails that the model follows and uses and obeys when working in the environment of your computer.

And while the form factor and capabilities seem radically different, it really is a gentle spectrum all the way from pure LLM endpoint to simple chatbot to modern chatbot to agent. Yes, the points further along the spectrum require model capabilities, like long context for those long agent trajectories and better tool use for the increasing array of tool options - that’s why ChatGPT didn’t start out as an agent - but the actual variable on that LLM-to-agent axis is the software. And that’s what today’s paper is really about.

The Paper

So this is a survey paper, which means they basically found and read all the papers they could on a certain topic and then composed their findings into something coherent, with a bit of original analysis on top. Surveys are by nature broad, whereas the papers we typically cover are narrow but deep.

This graphic from the start of the paper is our roadmap. After defining terms at the top, we’ll go through how the harness looks, what the harness does, how agents can work together using shared harness infra, and what the leading edge looks like.

Let’s isolate the top part first. An agent is a model plus a harness. The harness lets the model act in an environment, like on your computer in the case of Claude Code.

One of the main points of original analysis in this paper is how vital code is as a working medium for agents, why code is “agent harness infrastructure” in their jargon.

In fact it’s literally a medium in their graphic; code is in between the agent and the environment. It’s what the agent writes and executes in order to act on, to make changes to the environment. And then through code it receives feedback, signals like a printout or a confirmation or an error.

That’s quite different from chat, where the model interacts with us through natural language. If we think of ourselves as the environment that the agent is acting on during a chat, the agent uses words to act on us and we use words to send signals back to it.

And as the graphic points out, code has some nice properties that make it better as a working medium for agents:

Code has hard rules and is basically machinery for reasoning. Natural language has soft rules and can work for reasoning, but provides no guarantees or checks
Code provides feedback, like detailed errors and stack traces. Natural language does not
Code can store state, basically it has memory. Again, natural language does not - it is not machinery

I would also add that model makers, the big labs like OpenAI and Anthropic, want agents to be good at code so they can achieve Recursive Self Improvement (RSI). The better the model gets at code, the more it can contribute to research to make itself more capable. That’s what Anthropic hired Andrej Karpathy to do, and OpenAI says they will have an intern-level autonomous ML researcher agent by September 2026.

So code composes the harness, yes, but code is also what a model does via the harness, when you combine the model and the harness to form an agent.

This is not a new realization, and basically since the dawn of the GenAI era people have been working on code from AIs.

The top timeline, Code for Reasoning, looks at code as machinery for logic. The very first one, Program of Thought (PoT), is one I remember reading. Back when LLMs couldn’t reliably do even simple math like two-digit multiplication, people realized you could get math right if you just got your model to write the formulas as code instead of just numbers. So if your model can write a quick Python block that just says the two numbers you want to multiply, some software can watch for that Python block, run it, and return the result.

LLMs have gotten a lot better at calculation now, but they still know to invoke a calculator like that instead of relying on mental math.

The middle timeline is basically for writing and then using scripts as tools. Like if your model wants to navigate a browser, what it often does is actually write little code snippets for different action and then stitch those together depending on the task. We’ll get to it later, but there are many ways code acts as a medium instead of a model visually perceiving a screen and just saying what pixel it wants to click.

The bottom timeline is for code as an end, not just as a means. Many of you will recognize SWE-Bench, which is the OG agentic coding benchmark. CWM, which stands for Code World Model, is an interesting one; they taught the model to predict how code will run in order to improve its coding skills.

Now in terms of how the harness operates, we’re gonna tick through five different skills: planning, memory, tool use, control, and optimization.

Planning has evolved significantly from the early days. Kind of inherited from the chatbot days, you start with something linear and direct. Then people tried getting clever by putting those lines into different shapes, different workflows. But that’s not very Bitter-pilled, and agent intelligence quickly advanced to not need so much prescriptive planning from humans, and more open-ended planning methods like search took over. Now the leading edge is in orchestrating between multiple agents, which we’ll come back to later on.

Memory has also gotten significantly more sophisticated, and has different forms as we know from human cognition. So now people work on different types, like storing the outcome of experiences or storing the meaning of data as metadata in addition to storing the data raw. And again, on the leading edge we see multi-agent cropping up.

Tool use they call out explicitly but is a bit of a catch-all, since technically any action a model takes other than simply returning text is tool use. But they do show the general trend from input-based (“do this”) to output-based (“achieve this”), and they again end on multi-agent.

Control is a less intuitive one if you’re not coming from a programming background, but it’s basically about how much autonomy you grant the agent while maintaining safety. The narrative here is that software infrastructure is starting to change to accommodate and fit agents, just as it changed to fit different types of users when computers got popular.

The “plan, execute, and verify loop” from the top is also a crucial point. One reason code is so great for agents is that the feedback is automatic for the most part. Like until the user needs to make a judgment call, the agent can just iterate freely. The more free iteration you can enable, the more autonomous the agent can be. Part of that is safety, part of that is verification.

Now harness optimization is where things get really interesting. If agents are so good at code, shouldn’t they be able to to improve their own harnesses? It’s just code after all.

In fact that’s starting to happen, and it’s a big part of the RSI story. I don’t want to minimize the raw model capabilities of course, which Scale’s training data is critical for, but harnesses have gone from nice add-ons to integral parts of the product. In fact, I wouldn’t be surprised the big model makers stopped generally offering API access to their models and only offered them as part of their products. Like you wouldn’t be able to use Claude via API anymore, only on claude.ai or in Claude Code or in Claude Cowork etc. Or maybe it would only be available to enterprises for use in their products on certain terms.

Note the timelines on each of these areas. Memory in particular is a longstanding and thorny problem, and I’m sure anyone who has used AI products with memory enabled has had recent terrible experiences. I think it will take some time before memory as fluid as ours is available.

I’ll also call out optimization as a recent hot area. Again, models have only recently become capable enough for harnesses to become critical, but now that harnesses are first-class players, expect a lot more effort to go into self-improvement.

Now as I’ve mentioned a few times, multi-agent is also a hot area. Multi-agent swarms are valuable because they parallelize work and they break up the context window amongst the different agents. So you can get more work done faster.

I personally think multi-agent is still in its infancy, and that it will take a long time to work out organizational principles and sociology of machine intelligences, but there has already been some progress.

One immediate split people have imported from human organizations is division of labor, specialization basically. Usually that’s just a harness change on the same model, but sometimes for cost and speed reasons you might use different models for different jobs.

Another import from human organizations is ways of working, like interaction and topologies as the top-right shows. The topologies one in particular is interesting as the bottom-right shows; instead of humans encoding a topology from the start, emergent or task-dependent reorganization has become prominent.

Of course code as the working medium recurs as a theme, as the green and red boxes towards the bottom show. Anyone who has used git to coordinate software projects knows how important it is to have everyone working off the same codebase rather than individual copies or going through one person who has control. For the non-technical folks, the equivalent would be using a shared Google Doc to work instead of everyone having their own local files.

As we continue looking to the future, there are areas besides code we want our agents to be good at, like graphically using the computer or doing science or steering robots.

In keeping with the theme, the authors recommend framing all these new capabilities in terms of code. The more code-like you can make it, the better agents are going to be.

Obviously coding agents are going to be good at coding. One non-obvious detail they do note though is how trajectories from the coding agents end up as training data. So the model now has training in its particular harness, not just on harness-agnostic capabilities. That means that Claude will always work better in Claude Code than in any other coding harness. Same with GPT in Codex. That’s yet another way the model and the harness are becoming inseparable, and why pretty much every model maker has their own harness now. There are still model-agnostic ones like OpenCode, but to me it seems they’re at a structural disadvantage.

Similarly, generalist agents like OpenClaw and Hermes are also seeing proprietary competition. Google and Microsoft both announced their own versions. Meta has an internal one I suspect they will release. A lot of generalist agent stuff like personalization is on a different axis than the raw performance you need with coding, so I do see a world where model-agnostic options are alive and well

So What?

Since we’ve been mentally drawing trendlines and now extrapolating, I wanted to make the trendline clear.

This is the famous METR chart, which shows how long of tasks models can accomplish, in terms of how long it would take a competent human to do them.

We’re at a point now where the best model on the planet, Claude Mythos, is beyond our current capabilities to measure accurately. I’m sure we’ll catch up in this case, but the gap will emerge again and one day we won’t be able to catch up.

In terms of the trend, depending on how you draw the line, it predicts a 10x annual increase in task duration. So if Opus 4.6 was at a 10-hour horizon in February 2026, then in February 2027 we’ll have a model that can do the equivalent of 100 hours of human work autonomously - at 50% success rate anyway. For 80% success rate the respective numbers are 1 and 10 hours.

My Takeaways

RSI will be a system, not a model
- How many more optimizations loops can we add?
  - Innermost = model
  - Outer = harness/agent
  - Outer outer = optimizer? swarm?
- Will it require optimizing further-flung aspects, e.g. the CLI or the OS or the hardware? The org chart or society of agents?
- Will we Goodhart intelligence to just be coding, or is intelligence a package that includes e.g. creative writing?
The optimal arrangement of swarms will depend on the task, but will often look different from how humans self-organize
- The more Bitter the better

Discussion about this post

Ready for more?