zlacker

Why not replace the context tokens on the GPU during inference when they become no longer relevant? i.e. some tool reads a 50k token document, LLM processes it, so then just flush those document tokens out of active context, rebuild QKV caches and store just some log entry in the context as "I already did this ... with this result"?

replies(2): >>zozbot+i1 >>killer+3f

>>storus+(OP)
This is what agent calls do under the hood, yes.

replies(1): >>storus+kl

>>storus+(OP)
Anthropic added features like this into 4.5 release:

https://claude.com/blog/context-management

> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.

> The memory tool enables Claude to store and consult information outside the context window through a file-based system.

But it looks like nobody has it as a part of an inference loop yet: I guess it's hard to train (i.e. you need a training set which is a good match for what people use context in practice) and make inference more complicated. I guess more high-level context management is just easier to implement - and it's one of things which "GPT wrapper" companies can do, so why bother?

>>zozbot+i1
I don't think so, those things happen when agent yields the control back at the end of its inference call, not during the active agent inference with multiple tool calls ongoing. These days an agent can finish the whole task with 1000s tool calls during a single inference call without yielding control back to whatever called it to do some housekeeping.

replies(1): >>vidarh+0L

>>storus+kl
For agent, read sub-agent. E.g. the contents of your .claude/agents directory. When Claude Code spins up an agent, it provides the sub-agent with a prompt that combines the agents prompt and information composed by Claude from the outer context based on what Claude thinks needs to be communicated to the agent. Claude Code can either continue, with the sub-agent running in the background, or wait until it is complete. In either case, by default, Claude Code effectively gets to "check in" on messages from the sub-agent without seeing the whole thing (e.g. tool call results etc.), so only a small proportion of what the agent does will make it into the main agents context.

So if you want to do this, the current workaround is basically to have a sub-agent carry out tasks you don't want to pollute the main context.

I have lots of workflows that gets farmed out to sub-agents that then write reports to disk, and produce a summary to the main agent, who will then selectively read parts of the report instead of having to process the full source material or even the whole report.

replies(1): >>storus+9U1

>>vidarh+0L
OK, so you are essentially using sub-agents as summarizing tools of the main agent, something you could implement by specialized tools that wrap independent LLM calls with the prompts of your sub-agents.

replies(1): >>vidarh+B33

>>storus+9U1
That is effectively how sub-agents are implemented at least conceptually, and yes, if you build your own coding agent, you can trivially implement sub-agents by having your coding agent recursively spawn itself.

Claude Code and others have some extras, such as the ability for the main agent to put them in the background, spawn them in parallel, and use tool calls to check on the status of them (so basic job control), but "poor mans sub-agents" only requires the ability for the coding agent to run an executable the equivalent of e.g. "claude --print <someprompt" (the --print option is real, and enables headless use; in practise you'd also want --stream-json, set allowed tools, and specify a conversation id so you can resume the sub-agents conversation).

And calling it all "summarising" understates it. It is delegation, and a large part of the value of delegation in a software system is abstraction and information hiding. The party that does the delegation does not need to care about all of the inner detail of the delegated task.

The value is not the summary. The value is the work done that the summary describes without unnecessary detail.