It’s also related to attention: invoking a skill “now” means the model has all the relevant information fresh in context, so you’ll get much better results.
What I’m doing myself is writing skills that invoke Python scripts that “inject” prompts. This way you can set up multi-turn workflows for e.g. codebase analysis, deep thinking, root cause analysis, etc.
Works very well.
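A minimal sketch of what such a prompt-injecting script could look like (the steps and wording here are made up for illustration; the skill just tells the model to run it and follow whatever prompt it prints):

```python
#!/usr/bin/env python3
"""Hypothetical prompt-injector a skill could call, e.g. `python workflow.py --step 2`.

Each step prints the next prompt for the model to follow, so one skill can
drive a multi-turn workflow (analysis -> hypotheses -> root cause -> fix plan).
"""
import argparse

# Assumed workflow steps; adapt to your own codebase-analysis process.
STEPS = [
    "Step 1: List the modules touched by the failing test and read their entry points.",
    "Step 2: For each module, note its responsibilities and how data flows between them.",
    "Step 3: Propose three candidate root causes, citing specific files and lines.",
    "Step 4: Pick the most likely cause and outline a minimal fix with a test.",
]

def main() -> None:
    parser = argparse.ArgumentParser(description="Emit the prompt for a given workflow step.")
    parser.add_argument("--step", type=int, default=1, help="1-based step number")
    args = parser.parse_args()

    idx = max(1, min(args.step, len(STEPS))) - 1
    print(STEPS[idx])
    if idx + 1 < len(STEPS):
        # Nudge the model to come back for the next prompt in the sequence.
        print(f"\nWhen done, run this script again with --step {idx + 2}.")

if __name__ == "__main__":
    main()
```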
I'm very curious to know the size & state of a codebase where skills are beneficial over just having a good information hierarchy for your documentation.
https://claude.com/blog/context-management
> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.
> The memory tool enables Claude to store and consult information outside the context window through a file-based system.
But it looks like nobody has it as part of the inference loop yet: I guess it's hard to train (i.e. you need a training set that's a good match for how people actually use context in practice) and it makes inference more complicated. I guess higher-level context management is just easier to implement, and it's one of the things "GPT wrapper" companies can do, so why bother?
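To make that concrete: the memory tool from the blog post lives on the client side, i.e. the model emits tool calls and your harness does the file I/O, which is exactly the kind of high-level context management a wrapper can do. A rough sketch of such a handler (the command names here are assumptions for illustration, not the exact API schema):

```python
from pathlib import Path

# Hypothetical client-side handler for a file-based memory tool.
# The command set (view/create/str_replace/delete) is assumed for the sketch;
# the real tool schema is whatever the API docs define.
MEMORY_ROOT = Path("./memory")

def handle_memory_call(command: str, path: str, **kwargs) -> str:
    target = MEMORY_ROOT / path.lstrip("/")  # keep everything under one root dir
    if command == "view":
        return target.read_text() if target.exists() else "(empty)"
    if command == "create":
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(kwargs["file_text"])
        return f"created {target}"
    if command == "str_replace":
        text = target.read_text().replace(kwargs["old_str"], kwargs["new_str"], 1)
        target.write_text(text)
        return f"updated {target}"
    if command == "delete":
        target.unlink(missing_ok=True)
        return f"deleted {target}"
    return f"unknown command: {command}"
```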
Splitting the docs into neat modules is a good idea (for both human readers and current AIs) and will continue to be a good idea for a while at least. Getting pedantic about filenames, documentation schemas and so on is just bikeshedding.
So if you want to do this, the current workaround is basically to have a sub-agent carry out the tasks you don't want polluting the main context.
I have lots of workflows that get farmed out to sub-agents, which write full reports to disk and return a summary to the main agent; the main agent can then selectively read parts of the report instead of having to process the full source material, or even the whole report.
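Schematically, the contract between the main agent and a sub-agent looks something like this (the names, paths, and crude summarisation step are placeholders; pass in however you actually invoke the sub-agent):

```python
from pathlib import Path
from typing import Callable

REPORT_DIR = Path("reports")

def run_subagent(task_name: str, prompt: str, call_subagent: Callable[[str], str]) -> dict:
    """Delegate `prompt` to a sub-agent, persist its full report to disk,
    and hand only a short summary back to the main agent."""
    REPORT_DIR.mkdir(exist_ok=True)
    full_report = call_subagent(prompt)          # however the sub-agent is actually run
    report_path = REPORT_DIR / f"{task_name}.md"
    report_path.write_text(full_report)

    # The main agent only ever sees this small payload; it can read sections
    # of report_path later if it decides it needs more detail.
    summary = full_report[:500]                  # stand-in for a real summarisation step
    return {"report": str(report_path), "summary": summary}
```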
In other words, if you run an identical prompt, one with the skill and one without, on a test task that requires discovering in depth how your codebase works, which one performs better on the following metrics, and by how much?
1. Accuracy / completion of the task
2. Wall clock time to execute the task
3. Token consumption of the task
Claude Code and others have some extras, such as the ability for the main agent to put sub-agents in the background, spawn them in parallel, and use tool calls to check their status (so basic job control). But "poor man's sub-agents" only requires that the coding agent can run an executable, the equivalent of e.g. "claude --print <someprompt>" (the --print option is real and enables headless use; in practice you'd also want --stream-json, set allowed tools, and specify a conversation id so you can resume the sub-agent's conversation).
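For illustration, a "poor man's" spawn via subprocess, using only the --print behaviour described above (the prompt text and timeout are made up; add the streaming/tool/session flags as your setup needs):

```python
import subprocess

def spawn_subagent(prompt: str, timeout: int = 600) -> str:
    """Run a headless sub-agent via `claude --print` and return its stdout.
    The main agent never sees the sub-agent's intermediate tool calls, only
    whatever the sub-agent ends up printing."""
    result = subprocess.run(
        ["claude", "--print", prompt],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    result.check_returncode()
    return result.stdout

# Example: farm out a report-reading task and keep only the answer in context.
if __name__ == "__main__":
    answer = spawn_subagent(
        "Read reports/auth-analysis.md and summarise the three most likely "
        "root causes in under ten bullet points."
    )
    print(answer)
```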
And calling it all "summarising" understates it. It is delegation, and a large part of the value of delegation in a software system is abstraction and information hiding. The party that does the delegation does not need to care about all of the inner detail of the delegated task.
The value is not the summary. The value is the work done that the summary describes without unnecessary detail.
I think the main conflict in this thread is whether skills are anything more than just the structured documentation your repo was lacking anyway, regardless of whether it's Claude or Steve starting from scratch.
That difference alone likely accounts for some not insignificant discrepancies. But without numbers, it's hard to say.