Although my model of choice for most internal workflows remains GPT-4.1
for its predictable speed and high adherence to instructions, even its 1,047,576-token context window can run out of space.
When that happens, your agent either needs to give up, or it needs to compact that large context
window into a smaller one. Here are our notes on implementing compaction.
This is part of the Building an internal agent series.
Why compaction matters
Long-running workflows with many tool calls or user messages, along with any workflow
dealing with large files, often run out of space in their context window.
Although context window exhaustion isn't a concern for most internal agent use cases,
you ultimately can't build a robust, reliable agent without solving
this problem, and compaction is a straightforward solution.
How we implemented it
Initially, in the beautiful moment where we assumed compaction wouldn't be a relevant concern for
our internal workflows, we implemented an extremely naive solution:
if we ever ran out of tokens, we discarded older tool responses until we had enough space,
then continued.
Because we rarely ran into compaction, the fact that this worked poorly wasn't a major
issue, but eventually the inelegance began to weigh on me as we started dealing with
more workflows involving large files.
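In rough pseudocode, that naive approach amounted to something like the following; the names here are illustrative, not our actual code.

```python
def naive_truncate(history: list[dict], max_tokens: int,
                   count_tokens) -> list[dict]:
    """Drop the oldest tool responses until the history fits the budget again."""
    trimmed = list(history)
    used = sum(count_tokens(m["content"]) for m in trimmed)
    while used > max_tokens:
        # Find the oldest tool response still present and discard it.
        idx = next((i for i, m in enumerate(trimmed) if m["role"] == "tool"), None)
        if idx is None:
            break  # nothing left to drop
        used -= count_tokens(trimmed[idx]["content"])
        del trimmed[idx]
    return trimmed
```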
When we brainstormed our second iteration of compaction, I initially got anchored
on the beautiful idea that compaction should be sequenced after
implementing support for sub-agents, but ultimately that didn't prove
to be accurate.
The gist of our approach to compaction is:
- After every user message (including tool responses), add a system message with the consumed and available tokens
in the context window. That system message also includes the updated list of available “virtual files” that can
be read from.
- User messages, again including tool responses, greater than 10,000 tokens are exposed as a new “virtual file”, with only
their first 1,000 tokens included in the context window. The agent must use file manipulation tools to read beyond
those first 1,000 tokens (there's a rough sketch of this mechanism after the list).
- Add a set of “base tools” that are always available to agents, specifically including the virtual file manipulation tools,
as we'd finally reached a point where most agents simply could not operate without a large number of mostly invisible internal
tools.
- If a message pushes us over 80% (a configurable value) of the model's available context window,
use the compaction prompt that Reddit claims Claude Code uses.
The prompt isn't particularly special; it just already exists and seems pretty good.
- After compacting, add the prior context window as a virtual file so the agent can retrieve pieces of context
that it might have lost.
- Add a new tool, file_regex, to allow the agent to perform regex searches against files, including the prior context window.
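To make the first two steps concrete, here's a minimal Python sketch of the virtual-file and token-budget bookkeeping. The VirtualFile and ContextState structures, the count_tokens stand-in, and the exact thresholds are illustrative assumptions, not our actual implementation.

```python
from dataclasses import dataclass, field

# Illustrative thresholds; the real values are configurable.
VIRTUAL_FILE_THRESHOLD = 10_000  # messages above this many tokens become virtual files
PREVIEW_TOKENS = 1_000           # portion of a large message kept inline


@dataclass
class VirtualFile:
    name: str
    content: str


@dataclass
class ContextState:
    max_tokens: int
    used_tokens: int = 0
    virtual_files: list[VirtualFile] = field(default_factory=list)


def count_tokens(text: str) -> int:
    # Stand-in tokenizer; a real implementation would use the model's tokenizer.
    return max(1, len(text) // 4)


def add_user_message(state: ContextState, message: str, history: list[dict]) -> None:
    """Append a user or tool message, spilling large ones into a virtual file."""
    tokens = count_tokens(message)
    if tokens > VIRTUAL_FILE_THRESHOLD:
        name = f"vfile_{len(state.virtual_files):03d}"
        state.virtual_files.append(VirtualFile(name, message))
        # Keep only the preview inline; the agent reads the rest via file tools.
        preview = message[: PREVIEW_TOKENS * 4]
        history.append({
            "role": "user",
            "content": f"[truncated; full text available as {name}]\n{preview}",
        })
        state.used_tokens += PREVIEW_TOKENS
    else:
        history.append({"role": "user", "content": message})
        state.used_tokens += tokens

    # Follow every message with a system message reporting the token budget
    # and the currently available virtual files.
    history.append({
        "role": "system",
        "content": (
            f"Context: {state.used_tokens}/{state.max_tokens} tokens used. "
            f"Virtual files: {[f.name for f in state.virtual_files]}"
        ),
    })
```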
Each of these steps is quite simple, but in combination they really do provide a fair amount of power
for handling complex, prolonged workflows.
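Continuing that sketch, the compaction trigger, the prior-context virtual file, and the file_regex tool might fit together roughly as follows. The summarize callable stands in for the compaction prompt, and the 80% threshold is the configurable value mentioned above; none of these names are our actual API.

```python
import re
from typing import Callable

COMPACTION_THRESHOLD = 0.8  # configurable fraction of the model's context window


def maybe_compact(state: ContextState, history: list[dict],
                  summarize: Callable[[str], str]) -> list[dict]:
    """If usage crosses the threshold, compact the history and keep the prior
    context window around as a virtual file the agent can still read."""
    if state.used_tokens < COMPACTION_THRESHOLD * state.max_tokens:
        return history

    prior = "\n\n".join(m["content"] for m in history)
    name = f"prior_context_{len(state.virtual_files):03d}"
    state.virtual_files.append(VirtualFile(name, prior))

    summary = summarize(prior)  # e.g. the compaction prompt described above
    state.used_tokens = count_tokens(summary)
    return [{
        "role": "system",
        "content": f"{summary}\n\nThe full prior context is available as {name}.",
    }]


def file_regex(state: ContextState, name: str, pattern: str) -> list[str]:
    """Regex search against a virtual file, including prior context windows."""
    for vfile in state.virtual_files:
        if vfile.name == name:
            return re.findall(pattern, vfile.content)
    return []
```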
Admittedly, we still have a configurable cap on the number of tool calls in a workflow
(to avoid agents spinning out), but with compaction in place, agents dealing with large or complex data are much
more likely to succeed usefully.
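The cap itself can be as simple as a counter around the agent loop; agent_step and MAX_TOOL_CALLS below are hypothetical names used for illustration.

```python
from typing import Callable

MAX_TOOL_CALLS = 50  # illustrative default; the real cap is configurable per workflow


def run_workflow(agent_step: Callable[[ContextState, list[dict]], bool],
                 state: ContextState, history: list[dict]) -> bool:
    """Drive the agent until it finishes or exhausts its tool-call budget."""
    for _ in range(MAX_TOOL_CALLS):
        if agent_step(state, history):  # returns True when the workflow is done
            return True
    # Cap reached: stop rather than letting the agent spin out indefinitely.
    history.append({"role": "system",
                    "content": "Tool-call limit reached; stopping workflow."})
    return False
```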
How is it working? / What’s next?
Whereas most of our new internal agent features have obvious problems or obvious
next iterations, this one feels
good enough to forget about for a long, long time.
There are two reasons for this:
first, most of our workflows don't require large context windows,
and second, honestly, this seems to work quite well.
If context windows get significantly larger in the future, and right now I don't see
much evidence that they will, we will simply
increase some of the default values to use more tokens, but the core algorithm
here seems good enough.