It’s also related to attention: invoking a skill “now” means the model has all the relevant information fresh in context, so you’ll get much better results.
What I’m doing myself is writing skills that invoke Python scripts that “inject” prompts. This way you can set up multi-turn workflows for e.g. codebase analysis, deep thinking, root cause analysis, etc.
Works very well.
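A minimal sketch of what such a prompt-injecting script could look like (the steps and wording here are made up for illustration; the skill just tells the model to run it and follow whatever prompt it prints):

```python
#!/usr/bin/env python3
"""Hypothetical prompt-injector a skill could call, e.g. `python workflow.py --step 2`.

Each step prints the next prompt for the model to follow, so one skill can
drive a multi-turn workflow (analysis -> hypotheses -> root cause -> fix plan).
"""
import argparse

# Assumed workflow steps; adapt to your own codebase-analysis process.
STEPS = [
    "Step 1: List the modules touched by the failing test and read their entry points.",
    "Step 2: For each module, note its responsibilities and how data flows between them.",
    "Step 3: Propose three candidate root causes, citing specific files and lines.",
    "Step 4: Pick the most likely cause and outline a minimal fix with a test.",
]

def main() -> None:
    parser = argparse.ArgumentParser(description="Emit the prompt for a given workflow step.")
    parser.add_argument("--step", type=int, default=1, help="1-based step number")
    args = parser.parse_args()

    idx = max(1, min(args.step, len(STEPS))) - 1
    print(STEPS[idx])
    if idx + 1 < len(STEPS):
        # Nudge the model to come back for the next prompt in the sequence.
        print(f"\nWhen done, run this script again with --step {idx + 2}.")

if __name__ == "__main__":
    main()
```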
I'm very curious to know the size & state of a codebase where skills are beneficial over just having a good information hierarchy for your documentation.
https://claude.com/blog/context-management
> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.
> The memory tool enables Claude to store and consult information outside the context window through a file-based system.
But it looks like nobody has it as part of the inference loop yet: I guess it's hard to train (i.e. you need a training set that's a good match for how people actually use context in practice) and it makes inference more complicated. I guess higher-level context management is just easier to implement, and it's one of the things "GPT wrapper" companies can do, so why bother?
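To make that concrete: the memory tool from the blog post lives on the client side, i.e. the model emits tool calls and your harness does the file I/O, which is exactly the kind of high-level context management a wrapper can do. A rough sketch of such a handler (the command names here are assumptions for illustration, not the exact API schema):

```python
from pathlib import Path

# Hypothetical client-side handler for a file-based memory tool.
# The command set (view/create/str_replace/delete) is assumed for the sketch;
# the real tool schema is whatever the API docs define.
MEMORY_ROOT = Path("./memory")

def handle_memory_call(command: str, path: str, **kwargs) -> str:
    target = MEMORY_ROOT / path.lstrip("/")  # keep everything under one root dir
    if command == "view":
        return target.read_text() if target.exists() else "(empty)"
    if command == "create":
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(kwargs["file_text"])
        return f"created {target}"
    if command == "str_replace":
        text = target.read_text().replace(kwargs["old_str"], kwargs["new_str"], 1)
        target.write_text(text)
        return f"updated {target}"
    if command == "delete":
        target.unlink(missing_ok=True)
        return f"deleted {target}"
    return f"unknown command: {command}"
```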
Splitting the docs into neat modules is a good idea (for both human readers and current AIs) and will continue to be a good idea for a while at least. Getting pedantic about filenames, documentation schemas and so on is just bikeshedding.
So if you want to do this, the current workaround is basically to have a sub-agent carry out the tasks you don't want polluting the main context.
I have lots of workflows that get farmed out to sub-agents, which write full reports to disk and return a summary to the main agent; the main agent can then selectively read parts of the report instead of having to process the full source material, or even the whole report.
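Schematically, the contract between the main agent and a sub-agent looks something like this (the names, paths, and crude summarisation step are placeholders; pass in however you actually invoke the sub-agent):

```python
from pathlib import Path
from typing import Callable

REPORT_DIR = Path("reports")

def run_subagent(task_name: str, prompt: str, call_subagent: Callable[[str], str]) -> dict:
    """Delegate `prompt` to a sub-agent, persist its full report to disk,
    and hand only a short summary back to the main agent."""
    REPORT_DIR.mkdir(exist_ok=True)
    full_report = call_subagent(prompt)          # however the sub-agent is actually run
    report_path = REPORT_DIR / f"{task_name}.md"
    report_path.write_text(full_report)

    # The main agent only ever sees this small payload; it can read sections
    # of report_path later if it decides it needs more detail.
    summary = full_report[:500]                  # stand-in for a real summarisation step
    return {"report": str(report_path), "summary": summary}
```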
In other words, if you run an identical prompt, one with the skill and one without, on a test task that requires discovering in depth how your codebase works, which one performs better on the following metrics, and by how much?
1. Accuracy / completion of the task
2. Wall clock time to execute the task
3. Token consumption of the task
Claude Code and others have some extras, such as the ability for the main agent to put sub-agents in the background, spawn them in parallel, and use tool calls to check their status (so basic job control). But "poor man's sub-agents" only requires that the coding agent can run an executable, the equivalent of e.g. "claude --print <someprompt>" (the --print option is real and enables headless use; in practice you'd also want --stream-json, set allowed tools, and specify a conversation id so you can resume the sub-agent's conversation).
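For illustration, a "poor man's" spawn via subprocess, using only the --print behaviour described above (the prompt text and timeout are made up; add the streaming/tool/session flags as your setup needs):

```python
import subprocess

def spawn_subagent(prompt: str, timeout: int = 600) -> str:
    """Run a headless sub-agent via `claude --print` and return its stdout.
    The main agent never sees the sub-agent's intermediate tool calls, only
    whatever the sub-agent ends up printing."""
    result = subprocess.run(
        ["claude", "--print", prompt],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    result.check_returncode()
    return result.stdout

# Example: farm out a report-reading task and keep only the answer in context.
if __name__ == "__main__":
    answer = spawn_subagent(
        "Read reports/auth-analysis.md and summarise the three most likely "
        "root causes in under ten bullet points."
    )
    print(answer)
```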
And calling it all "summarising" understates it. It is delegation, and a large part of the value of delegation in a software system is abstraction and information hiding. The party that does the delegation does not need to care about all of the inner detail of the delegated task.
The value is not the summary. The value is the work done that the summary describes without unnecessary detail.
I think the main conflict in this thread is whether skills are anything more than just the structured documentation your repo was lacking anyway, regardless of whether it's Claude or Steve starting from scratch.
That difference alone likely accounts for some not insignificant discrepancies. But without numbers, it's hard to say.