zlacker

This stuff smells like maybe the bitter lesson isn't fully appreciated.

You might as well just write instructions in English in any old format, as long as it's comprehensible. Exactly as you'd do for human readers! Nothing has really changed about what constitutes good documentation. (Edit to add: my parochialism is showing there, it doesn't have to be English)

Is any of this standardization really needed? Who does it benefit, except the people who enjoy writing specs and establishing standards like this? If it really is a productivity win, it ought to be possible to run a comparison study and prove it. Even then, it might not be worthwhile in the longer run.

replies(29): >>smithk+s1 >>postal+V5 >>storus+g6 >>idopms+vb >>tcdent+Og >>fassss+Lh >>zby+Gi >>killer+3j >>avaer+zo >>3371+Yp >>mhalle+xu >>Lerc+0F >>runjak+HF >>ashdks+XS >>apsurd+IU >>ianbut+I81 >>JohnMa+ia1 >>MattRo+Hc1 >>0dayma+ck1 >>theshr+3y1 >>whh+dN1 >>nxobje+fQ1 >>brooks+702 >>davnic+M52 >>miki12+jx2 >>sbinne+LI2 >>peepee+zj3 >>losved+vy3 >>canuck+624

>>iainme+(OP)
It's all about managing context. The bitter lesson applies over the long haul - and yes, over the long haul, as context windows get larger or go away entirely with different architectures, this sort of thing won't be needed. But we've defined enough skills in the last month or two that if we were to put them all in CLAUDE.md, we wouldn't have any context left for coding. I can only imagine that this will be a temporary standard, but given the current state of the art, it's a helpful one.

replies(5): >>ledaup+62 >>stingr+h5 >>OtherS+E5 >>storus+M6 >>iainme+ZL

>>smithk+s1
how is it different or better than maintaining an index page for your docs? Or a folder full of docs and giving Claude an instruction to `ls` the folder on startup?

replies(2): >>d1sxey+53 >>Aviceb+U3

>>ledaup+62
Vercel think it isn’t:

https://vercel.com/blog/agents-md-outperforms-skills-in-our-...

>>ledaup+62
It's hard to tell unless they give some hard data comparing the approaches systematically. This feels like a grift or, more charitably, an attempt to build a presence/market around nothing. But who knows anymore; apparently saying "tell the agent to write its own docs for reference and context continuity" is considered a revelation.

>>smithk+s1
Not sure why you’re being downvoted so much, it’s a valid point.

It’s also related to attention: invoking a skill “now” means that the model has all the relevant information fresh in context, so you’ll get much better results.

What I’m doing myself is writing skills that invoke Python scripts that “inject” prompts. This way you can set up multi-turn workflows for, e.g., codebase analysis, deep thinking, root cause analysis, etc.

Works very well.
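
A minimal sketch of the prompt-injecting script pattern described above, assuming the harness simply reads whatever the script returns or prints as the agent's next instruction. The step texts and the `next_prompt` name are hypothetical, not taken from any real skill:

```python
# Hypothetical multi-turn workflow: the skill calls this once per turn and
# feeds the returned text back to the model as its next instruction.
STEPS = [
    "List the modules involved in the failure and summarise each in one line.",
    "For each module, state one hypothesis for the root cause.",
    "Pick the most likely hypothesis and propose a minimal fix.",
]


def next_prompt(step: int) -> str:
    """Return the prompt for a 0-based step, or a completion marker."""
    if 0 <= step < len(STEPS):
        return f"[step {step + 1}/{len(STEPS)}] {STEPS[step]}"
    return "[done] Workflow complete; report your findings."
```

The SKILL.md would then only need to say "run the script with the current step number and follow what it prints", which keeps the long prompts out of context until they are needed.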

>>smithk+s1
I use Claude pretty extensively on a 2.5m loc codebase, and it's pretty decent at just reading the relevant readme docs & docstrings to figure out what's what. Those docs were written for human audiences years (sometimes decades) ago.

I'm very curious to know the size & state of a codebase where skills are beneficial over just having good information hierarchy for your documentation.

replies(2): >>pertym+BA >>SOLAR_+762

>>iainme+(OP)
Folks have run comparisons. From a huggingface employee:

  codex + skills finetunes Qwen3-0.6B to +6 on humaneval and beats the base score on the first run.

  I reran the experiment from this week, but used codex's new skills integration. Like claude code, codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats claude code.

https://xcancel.com/ben_burtenshaw/status/200023306951767675...

That said, it's not a perfect comparison because of the Codex model mismatch between runs.

The author seems to be doing a lot of work on skills evaluation.

https://github.com/huggingface/upskill

replies(6): >>xrd+u7 >>pton_x+q8 >>8cvor6+Z9 >>iainme+pb >>bburte+FN2 >>oofbey+iU3

>>iainme+(OP)
This is pushed by Anthropic; OpenAI doesn't seem to care much about "skills". Maybe Anthropic is doing some extra training to better follow sections of text marked as a skill, who knows? Or you can just store what worked as a skill and share it with others, so they don't need to write their own prompt for common tasks?

replies(1): >>jonath+M9

>>smithk+s1
Why not replace the context tokens on the GPU during inference when they become no longer relevant? i.e. some tool reads a 50k token document, LLM processes it, so then just flush those document tokens out of active context, rebuild QKV caches and store just some log entry in the context as "I already did this ... with this result"?
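
A rough sketch of the idea in the question above, done at the message level rather than inside the GPU's KV cache (so this is an approximation of the proposal, not an implementation of it). The `Message` type, the thresholds and the stub wording are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def compact(history: list[Message], keep_last: int = 1, max_chars: int = 200) -> list[Message]:
    """Replace large, older tool results with a one-line log entry."""
    tool_idx = [i for i, m in enumerate(history) if m.role == "tool"]
    # Guard keep_last == 0, because tool_idx[-0:] would select every index.
    recent = set(tool_idx[-keep_last:]) if keep_last > 0 else set()
    out = []
    for i, m in enumerate(history):
        if m.role == "tool" and i not in recent and len(m.content) > max_chars:
            stub = f"[tool result elided: {len(m.content)} chars, already processed]"
            out.append(Message("tool", stub))
        else:
            out.append(m)
    return out
```

The assistant's own summary of the document stays in the history, so the model keeps the "I already did this, with this result" record while the 50k-token payload is dropped before the next call.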

replies(2): >>zozbot+48 >>killer+Pl

>>postal+V5
Does this indicate running locally with a very small (quantized?) model?

I am very interested in finding ways to combine skills + local models + MCP + aider-ish tools to avoid using commercial LLM providers.

Is this a path to follow? Or, something different?

replies(1): >>postal+Ld

>>storus+M6
This is what agent calls do under the hood, yes.

replies(1): >>storus+6s

>>postal+V5
I think the point is it smells like a hack, just like "think extra hard and I'll tip you $200" was a few years ago. It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?

replies(3): >>mbesto+nn >>9dev+Ir >>dragon+DB5

>>storus+g6
OpenAI has already adopted Agent Skills:

- https://community.openai.com/t/skills-for-codex-experimental...

- https://developers.openai.com/codex/skills/

- https://github.com/openai/skills

- https://x.com/embirico/status/2018415923930206718

replies(1): >>storus+2A

>>postal+V5
Sounds like the benchmark matrix just got a lot bigger, model * skill combinations.

>>postal+V5
I can't quite tell what's being compared there -- just looks like several different LLMs?

To be clear, I'm suggesting that any specific format for "skills.md" is a red herring, and all you need to do is provide the LLM with good clear documentation.

A useful comparison would be between: a) make a carefully organised .skills/ folder, b) put the same info anywhere and just link to it from your top-level doc, c) just dump everything directly in the top-level doc.

My guess is that it's probably a good idea to break stuff out into separate sections, to avoid polluting the context with stuff you don't need; but the specific way you do that very likely isn't important at all. So (a) and (b) would perform about the same.

replies(3): >>postal+Kf >>anupam+IP >>dragon+pq5

>>iainme+(OP)
I have been using Claude Code to automate a bunch of my business tasks, and I set up slash commands for each of them. Each slash command starts by reading from a .md file of instructions. I asked Claude how this is different from skills and the only substantive thing it could come up with was that Claude wouldn't be able to use these on its own, without me invoking the slash command (which is fine; I wouldn't want it to go off and start checking my inventory of its own volition).

So yeah, I agree that it's all just documentation. I know there's been some evidence shown that skills work better, but my feeling is that in the long run it'll fall by the wayside, like prompt engineering, for a couple of reasons. First, many skills will just become unnecessary - models will be able to make slide decks or do frontend design without specific skills (Gemini's already excellent at design without anything beyond the base model, imho). Second, increased context windows and overall intelligence will obviate the need for the specific skills paradigm. You can just throw all the stuff you want Claude to know in your claude.md and call it a day.

replies(5): >>kurthr+Ze >>stevek+Qo >>vidarh+wL >>mordym+4Y >>chrisr+bi3

>>xrd+u7
Check out the guy's work. He's doing a lot of work on precisely what you're talking about.

https://xcancel.com/ben_burtenshaw

https://huggingface.co/blog/upskill

https://github.com/huggingface/upskill

>>idopms+vb
So how is this slash command limit enforced? Is it part of the Claude API/PostTraining etc? It seems like a useful tool if it is!

I'd like a user writeable, LLM readable, LLM non-writable character/sequence. That would make it a lot easier to know at a glance that a command/file/directory/username/password wasn't going to end up in context and being used by a rogue agent.

It wouldn't be fool proof, since it could probably find some other tool out there to generate it (eg write-me some unicode python), but it's something I haven't heard of that sounds useful. If it could be made fool/tool proof (fools and tools are so resourceful) that would be even better.

replies(1): >>idopms+Zg

>>iainme+pb
Your skepticism is valid. Vercel ran a study where they said that skills underperform putting a docs index in AGENTS.md[0].

My guess is that the standardization is going to make its way into how the models are trained and Skills are eventually going to pull out ahead.

0: https://vercel.com/blog/agents-md-outperforms-skills-in-our-...

replies(1): >>vidarh+MK

>>iainme+(OP)
Skills are for the most part already generated by LLMs. And, if you're implementing them in your own workflow, they're tailored to real-world problems you've encountered.

Having a super repo of everyone else's slop is backwards thinking; you are now in the era where creating written content and verifying its effectiveness is easier than ever.

>>kurthr+Ze
It's part of the Claude Code harness. I honestly haven't thought at all about security related to it; it's just a nice convenience to trigger a commonly run process.

>>iainme+(OP)
Post training can make known formats more reliable.

>>iainme+(OP)
The instructions are standard documents - but this is not all. What the system adds is an index of all skills, built from their descriptions, that is passed to the llm in each conversation. The idea is to let the llm read the skill when it is needed and not load it into context upfront. Humans use indexes too - but not in this way. But there are some analogies with GUIs and how they enhance discoverability of features for humans.

I wish they arranged it around READMEs. I have a directory with my tasks and I have a README.md there - before codex had skills it already understood that it needs to read the readme when it was dealing with tasks. The skills system is less directory dependent so is a bit more universal - but I am not sure if this is really needed.
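
To make the index mechanism described above concrete, here is a small sketch. It assumes each skill file starts with a `---` fenced block of `key: value` lines containing `name` and `description`; the parser is deliberately simplistic (not real YAML) and the function names are made up for the example:

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse simple `key: value` lines between leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_index(skills: dict[str, str]) -> str:
    """Build the short index that goes into context; skill bodies stay on disk."""
    rows = []
    for path, text in sorted(skills.items()):
        meta = parse_frontmatter(text)
        if "name" in meta and "description" in meta:
            rows.append(f"- {meta['name']}: {meta['description']} (read {path} when needed)")
    return "\n".join(rows)
```

Only the one-line descriptions are paid for on every conversation; the full instructions cost tokens only when the model decides to open the file.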

replies(3): >>iainme+7r >>gianca+KW >>ethbr1+Li1

>>iainme+(OP)
> Is any of this standardization really needed?

This standardization, basically, makes a list of docs easier to scan.

As a human, you have a permanent memory. LLMs don't have it, they have to load it into the context, and doing it only as necessary can help.

E.g. if you had anterograde amnesia, you'd want everything to be optimally organized, labeled, etc, right? Perhaps an app which keeps all information handy.

replies(1): >>iainme+BS

>>storus+M6
Anthropic added features like this into 4.5 release:

https://claude.com/blog/context-management

> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.

> The memory tool enables Claude to store and consult information outside the context window through a file-based system.

But it looks like nobody has it as a part of an inference loop yet: I guess it's hard to train (i.e. you need a training set which is a good match for how people use context in practice) and it makes inference more complicated. I guess more high-level context management is just easier to implement - and it's one of the things which "GPT wrapper" companies can do, so why bother?

>>pton_x+q8
I think this tweet sums it up correctly, doesn't it?

   A +6 jump on a 0.6B model is actually more impressive than a +2 jump on a 100B model. It proves that 'intelligence' isn't just parameter count; it is context relevance. You are proving that a lightweight model with a cheat sheet beats a giant with amnesia. This is the death of the 'bigger is better' dogma

Which is essentially the bitter lesson that Richard Sutton talks about?

replies(1): >>Der_Ei+Ng1

>>iainme+(OP)
It's not about instructions, it's about discoverability and data.

Yeah, WWW is really just text but that doesn't mean you don't need HTTP + HTML and a browser/search engine. Skills is just that, but for agent capabilities.

Long term you're right though, agents will fetch this all themselves. And at some point they will not be our agents at all.

replies(1): >>iainme+Bs

>>idopms+vb
Claude Code recently deprecated slash commands in favor of skills because they were so similar. Or another way of looking at it is, they added the ability to invoke a skill via /skill-name.

replies(1): >>idopms+fx

>>iainme+(OP)
You are right that it's just natural language, but standardization is very important, because it's never just about the model itself. The so-called harness is a big factor in LLM performance, and standardization allows every harness to index all skills.

>>zby+Gi
> Humans use indexes too - but not in this way.

What's different?

replies(1): >>zby+LG

>>pton_x+q8
Standards have to start somewhere to gain traction and proliferate themselves for longer than that.

Plus, as has been mentioned multiple times here, standard skills are a lot more about different harnesses being able to consistently load skills into the context window in a programmatic way. Not every AI workload is a local coding agent.

>>zozbot+48
I don't think so. Those things happen when the agent yields control back at the end of its inference call, not during active agent inference with multiple tool calls ongoing. These days an agent can finish the whole task with 1000s of tool calls during a single inference call without yielding control back to whatever called it to do some housekeeping.

replies(1): >>vidarh+MR

>>avaer+zo
I guess what I mean is that standardizing this bit of the problem right now feels sort of like XHTML. Many people thought that was a big deal back in the day, but it turned out to be a pointless digression.

> Long term you're right though, agents will fetch this all themselves

It's not "long term", it's right now. If your docs are well-written and well-organised, agents can already use them. The most you might need to do is copy your README.md into CLAUDE.md.

>>iainme+(OP)
Skills are not just documentation. They include computability (programs/scripts), data (assets), and the documentation (resources) to use everything effectively.

Programs and data are the basis of deterministic results that are accessible to the llm.

Embed an SQLite database with interesting information (bus schedules, dietary info, or a thousand other things) and a Python program run by the skill can access it.

For Claude at least, it does it in a VM and can be used from your phone.

Sure, skills are more convention than a standard right now. Skills lack versioning, distribution, updates, unique naming, selective network access. But they are incredibly useful and accessible.
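
As an illustration of the embedded-database idea above, here is a sketch of the kind of helper script a skill could ship. The table layout, the sample rows and the function names are invented for the example; a real skill would open a database file bundled with its assets instead of building one in memory:

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for a bundled bus-schedule database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE departures (route TEXT, stop TEXT, time TEXT)")
    conn.executemany(
        "INSERT INTO departures VALUES (?, ?, ?)",
        [("12", "Main St", "08:05"), ("12", "Main St", "08:35"), ("7", "Main St", "08:20")],
    )
    return conn


def next_departure(conn: sqlite3.Connection, route: str, stop: str, after: str):
    """Return the earliest HH:MM departure strictly after `after`, or None."""
    row = conn.execute(
        "SELECT MIN(time) FROM departures WHERE route = ? AND stop = ? AND time > ?",
        (route, stop, after),
    ).fetchone()
    return row[0]
```

The model only has to decide to call the script with the right arguments; the lookup itself is deterministic, which is the point being made about programs and data.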

replies(1): >>Spivak+Fv

>>mhalle+xu
Am I missing something? Because what you describe as the pack of stuff sounds like S-tier documentation. I get full working examples and a pre-populated database it works on?

>>stevek+Qo
Yeah, I saw that announcement but still can't figure out what the actual impact is - doesn't change anything for me (my non-skill slash commands still work).

replies(1): >>stevek+cX

>>jonath+M9
Yeah but this seems like a bolt-on and not something they train their model to understand at the token level like how they do tool calls. Maybe Anthropic has a token-level skills support (e.g. <SKILL_START>skill prompt<SKILL_END>).

>>OtherS+E5
Skills are more than code documentation. They can apply to anything that the model has to do, outside of coding.

>>iainme+(OP)
The main thing here that would need standardisation is the environment in which the skill operates. The skill instructions are interpreted by the AI; any support scripts are interpreted by the environment.

You don't want to give an English description of how to compress LZMA and then let the AI do it token by token. Although that would be a pretty good arduous methodical benchmark task for an AI.
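
For the compression example above, the support script can be a few lines around Python's standard `lzma` module; the wrapper names are just for illustration. Note that `lzma.compress` defaults to the `.xz` container, and `format=lzma.FORMAT_ALONE` selects the legacy `.lzma` format:

```python
import lzma


def compress_bytes(data: bytes, legacy: bool = False) -> bytes:
    """Compress with the stdlib; `legacy=True` emits the old .lzma container."""
    fmt = lzma.FORMAT_ALONE if legacy else lzma.FORMAT_XZ
    return lzma.compress(data, format=fmt)


def decompress_bytes(blob: bytes) -> bytes:
    """Decompress either container; the default format is auto-detected."""
    return lzma.decompress(blob)
```

The skill's instructions then shrink to "call this script", and the environment does the deterministic part.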

>>iainme+(OP)
You may be right, but I find myself writing English differently depending on the audience: people vs AI.

I haven't done a formal study, so I can't prove it, but it seems like I get better output from agents if I tailor my English more towards the LLM way of "thinking".

>>iainme+7r
Hmm - maybe I should not call it index - people lookup stuff in the index when needed. Here the whole index is inserted in the conversation - it is as if when starting a task human read the whole table of contents of the manual for that task.

>>postal+Kf
Agents add a docs index in context for skills, so this is an issue of finding that the current specific implementation of skills in Claude Code is suboptimal.

Their reasoning about it is also flawed. E.g. "No decision point. With AGENTS.md, there's no moment where the agent must decide "should I look this up?" The information is already present." - but this is exactly the case for skills too. The difference is just where in the context the information is, and how it is structured.

Having looked at their article, ironically I think the reason it works is that they likely force more information into context by giving the agent less information to work with:

Instead of having a description, which might convince the agent a given skill isn't relevant, their index is basically a list of vague filenames, forcing the agent to make a guess and potentially read the wrong thing.

This is basically exactly what skills were added to avoid. But it will break if the description isn't precise enough. And it's perfectly possible that current tooling isn't aggressive enough about pruning detail that might tempt the agent to ignore relevant files.

replies(1): >>SOLAR_+L52

>>idopms+vb
A bit of caution: it's perfectly able to look up and read the slash-command, so while it may be true it technically can't "invoke" a slash-command via TaskTool, it most certainly can execute all of the steps in it if the slash-command is somewhere you grant it read access, and will tend to try to do so if you tell it to invoke a slash command.

>>smithk+s1
To clarify, when I mentioned the bitter lesson I meant putting effort into organising the "skills" documentation in a very specific way (headlines, descriptions, etc).

Splitting the docs into neat modules is a good idea (for both human readers and current AIs) and will continue to be a good idea for a while at least. Getting pedantic about filenames, documentation schemas and so on is just bikeshedding.

>>iainme+pb
> If you want a clean comparison, I’d test three conditions under equal context budgets: (A) monolithic AGENTS.md, (B) README index that links to docs, (C) skills with progressive disclosure. Measure task success, latency, and doc‑fetch count across 10–20 repo tasks. My hunch: (B)≈(C) on quality, but (C) wins on token efficiency when the index is strong. Also, format alone isn’t magic: skills that reference real tools/assets via the backing MCP are qualitatively different from docs‑only skills, so I’d separate those in the comparison. Have you seen any benchmarks that control for discovery overhead?

>>storus+6s
For agent, read sub-agent. E.g. the contents of your .claude/agents directory. When Claude Code spins up an agent, it provides the sub-agent with a prompt that combines the agent's prompt and information composed by Claude from the outer context based on what Claude thinks needs to be communicated to the agent. Claude Code can either continue, with the sub-agent running in the background, or wait until it is complete. In either case, by default, Claude Code effectively gets to "check in" on messages from the sub-agent without seeing the whole thing (e.g. tool call results etc.), so only a small proportion of what the agent does will make it into the main agent's context.

So if you want to do this, the current workaround is basically to have a sub-agent carry out tasks you don't want to pollute the main context.

I have lots of workflows that get farmed out to sub-agents that then write reports to disk, and produce a summary to the main agent, who will then selectively read parts of the report instead of having to process the full source material or even the whole report.
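
A sketch of the report-to-disk workflow described above, with the sub-agent replaced by a plain function so the example runs without any model. `run_subagent`, `read_section` and the report layout are hypothetical stand-ins, not part of Claude Code:

```python
import tempfile
from pathlib import Path


def run_subagent(task: str, source: str, report_dir: Path) -> dict:
    """Stand-in for a sub-agent: writes a full report, returns a short summary."""
    lines = source.splitlines()
    report = f"# Report: {task}\n\n" + "\n".join(
        f"{n}: {line}" for n, line in enumerate(lines, 1)
    )
    path = report_dir / "report.md"
    path.write_text(report, encoding="utf-8")
    # Only this small dict would enter the main agent's context.
    return {"summary": f"{task}: {len(lines)} lines reviewed", "report_path": str(path)}


def read_section(report_path: str, start: int, end: int) -> str:
    """Let the main agent pull just lines start..end (1-based) of the report."""
    lines = Path(report_path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1:end])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        result = run_subagent("audit", "a\nb\nc", Path(d))
        print(result["summary"])
        print(read_section(result["report_path"], 3, 4))
```

The main context sees the summary and whichever slices it asks for, never the full source material.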

replies(1): >>storus+V02

>>killer+3j
Everybody wants that, though, no? At least some of the time?

For example, if you've just joined a new team or a new project, wouldn't you like to have extensive, well-organised documentation to help get you started?

This reminds me of the "curb-cut effect", where accommodations for disabilities can be beneficial for everybody: https://front-end.social/@stephaniewalter/115841555015911839

>>iainme+(OP)
We’re working with the models that are available now, not theoretical future models with infinite context.

Claude is programmed to stop reading after it gets through the skill’s description. That means we don’t consume more tokens in the context until Claude decides it will be useful. This makes a big difference in practice. Working in a large repo, it’s an obvious step change between me needing to tell Claude to go read a particular readme that I know solves the problem vs Claude just knowing it exists because it already read the description.

Sure, if your project happened to already have a perfect index file with a one-sentence description of each other documentation file, that could serve as a similar purpose (if Claude knew about it). It’s worthwhile to spread knowledge about how effective this pattern is. Also, Claude is probably trained to handle this format specifically.

replies(1): >>iainme+KV

>>iainme+(OP)
yeah the boon of LLM is how it gives a masked incentive for every jane and joe to be intentional communicators.

>>ashdks+XS
To clarify, the bit where I think the bitter lesson applies is trying to standardize the directory names, the permitted headings and paragraph lengths, etc. It's pointless bikeshedding.

Making your docs nice and modular, and having a high-level overview that tells you where to find more detailed info on specific topics, is definitely a good idea. We already know that when we're writing docs for human readers. The LLMs are already trained on a big corpus written by and for humans. There's no compelling reason why we need to do anything radically different to help them out. To the contrary, it's better not to do anything radically different, so that new LLM-assisted code and docs can be accessible to humans too.

Well-written docs already play nicely with LLM context.

replies(1): >>ashdks+262

>>zby+Gi
Claude reads from .claude/instructions.md whenever you make a new convo as a default thing. I usually have Claude add things like project layout info and summaries, preferred tooling to use, etc. So there's a reasonable expectation of how it should run. If it starts 'forgetting' I tell it to re-read it.

replies(1): >>krinch+du2

>>idopms+fx
The actual impact is that there should be less confusion in the future about "what's the difference between these two" because there isn't really.

To overly programmer-brain it, a slash command is just a skill with a null frontmatter. This means that it doesn't participate in progressive disclosure, aka Claude won't consider invoking it automatically.
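
To spell that mental model out in code (an illustration of the analogy only, not how Claude Code is actually implemented), treat each entry as a name plus optional frontmatter, and let only entries with a description take part in automatic selection. The names here are made up:

```python
from typing import Optional


def auto_invocable(frontmatter: Optional[dict]) -> bool:
    """True if the entry has a description the model could use to pick it."""
    return bool(frontmatter) and bool(frontmatter.get("description"))


def split_entries(entries: dict[str, Optional[dict]]) -> tuple[list[str], list[str]]:
    """Split entries into (skills the model may pick, commands only the user runs)."""
    skills = sorted(n for n, fm in entries.items() if auto_invocable(fm))
    manual = sorted(n for n, fm in entries.items() if not auto_invocable(fm))
    return skills, manual
```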

>>idopms+vb
Workflow-wise, the important distinction for me has been that I can refine a Skill by telling Claude Code to use it for related tasks until it does exactly what I want, correctly, the first time. Having a solid, iteratively perfected Skill really cuts down on subsequent iteration.

>>iainme+(OP)
I'd argue we jumped that shark since the shift in focus to post training. Labs focus on getting good at specific formats and tasks. The generalization argument was ceded (not in the long term but in the short term) to the need to produce immediate value.

Now if a format dominates it will be post trained for and then it is in fact better.

replies(1): >>Der_Ei+ug1

>>iainme+(OP)
I agree with this and it's a conversation I've struggled to have with coworkers about using these -

IMO it's great if a plugin wants to have their own conventions for how to name and where to put these files and their general structure. I get the sense it doesn't matter to agents much (talking mostly claude here) and the way I use it I essentially give it its own "skills" based on my own convention. It's very flexible and seems to work. I don't use the slash commands, I just script with prompts into claude CLI mostly, so if that's the only thing I gain from it, meh. I do see other comments speculating these skills work more efficiently but I'm not sure I have seen any evidence for that? Like a sibling comment noted I can just re-feed the skill knowledge back into the prompt.

>>iainme+(OP)
On the one hand, I agree.

The whole point of LLM-based code execution is, well, I can just type in any old language it understands and it ought to figure out what I mean!

A "skill" for searching a pdf could be :

* "You can search PDFs. The code is in /lib/pdf.py"

or it could be:

* "Here's a pile of libraries, figure out which you want to use for stuff"

or it could be:

* "Feel free to generate code (in any executable programming language) on the fly when you want to search a PDF."

or it could be:

* "Solve this problem <x>" and the LLM sees a pile of PDFs in the problem and decides to invent a parser.

or any other nearly infinite way of trying to get a non-deterministic LLM to do a thing you want it to do.

At some level, this is all the same. At least, it rounds to the same in a sort of kinda "Big O" order-of-magnitude comparison.

On the other hand, I also agree, but I can definitely see present value in trying to standardize it because humans want to see what is going on (see: JSON - it's more desirable for programmers to be able to look at a string representation of data than to send opaque binary over the wire, even though to a computer binary is gonna be a lot faster).

There is probably an argument, too, for optimization of context windows and tokens burned and all that kinda jazz. `O(n)` is the same as `O(10*n)` (where n is tokens burned or $$$ per second or context window size) and that doesn't matter in theory but certainly does in practice when you're the one paying the bill or you fill up the context window and get nonsense.

So if this is a _thoughtful_ standard that takes that kinda stuff into account then, well, great! It gives a benchmark we can improve and iterate upon.

With some hypothetical super LLM that has a nearly infinite context window and a cost/tok of nearly zero and throughput nearing infinity, you can just say "solve my problem" and it will (eventually) do it. But for now, I can squint and see how this might be helpful.

>>ianbut+I81
Anthropic and Gemini still release new pre-training checkpoints regularly. It's just OpenAI who got stupid on that. RIP GPT-4.5

replies(1): >>ianbut+ek1

>>mbesto+nn
Nice ChatGPT generated response in that tweet. Anyone too lazy to deslop their tweet shouldn't be listened to.

>>zby+Gi
> What the system adds is an index of all skills, built from their descriptions, that is passed to the llm in each conversation. The idea is to let the llm read the skill when it is needed and not load it into context upfront.

This is different from swagger / OpenAPI how?

I get cross trained web front-end devs set a new low bar for professional amnesia and not-invented-here-ism, but maybe we could not do that yet another time?

replies(2): >>gbaldu+fh4 >>dragon+kB4

>>iainme+(OP)
what a great comment

>>Der_Ei+ug1
All models released from those providers go through stages of post training too, none of the models you interact with go from pre-training to release. An example of the post training pipeline is tool calling, that is to my understanding a part of post training and not pre training in general.

I can't speak to what the exact split is or what is a part of post training versus pre training at various labs but I am exceedingly confident all labs post train for effectiveness in specific domains.

replies(1): >>Der_Ei+Nk1

>>ianbut+ek1
I did not claim that post training doesn't happen on these models, and you are being extremely patronizing (I publish quite a bit of research on LLMs at top conferences).

I claimed that OpenAI overindexed on getting away with aggressive post-training on old pre-training checkpoints. Gemini / Anthropic correctly realized that new pre-training checkpoints need to happen to get the best gains in their latest model releases (which get post-trained too).

replies(1): >>ianbut+WS1

>>iainme+(OP)
Skills can contain scripts, making them a lot more versatile than just a document.

Of course any LLM can write any script based on a document, but that's not very deterministic.

A good example is Anthropic's PDF creator skill. It has the basic English instructions as well as actual Python code to generate PDFs.

replies(3): >>rfw300+LD1 >>joe_th+9E1 >>gitgud+rH1

>>theshr+3y1
This strikes me as entirely logical in the short run, and an insane way of packaging software that we will certainly regret in the long run.

>>theshr+3y1
"Just a document" can certainly contain a script or code or whatever.

replies(1): >>theshr+lO2

>>theshr+3y1
How is this different from a README.md with a code block?

replies(1): >>samusi+8U1

>>iainme+(OP)
I’ve been scratching my head on this one too. You’re probably right about the bitter lesson... at the end of the day, plain English instructions in the context window are what do the heavy lifting.

That said, I reckon that’s actually what this project is trying to lean into. It looks like it's just standardising where those instructions live (the SKILL.md format) so tools can find them, rather than trying to force a new schema.

Fair play to them for trying to herd the cats. I think there's an xkcd comic for this one somewhere.

>>iainme+(OP)
I'm a little sad, in this case, that ongoing integration via fine-tuning hasn't taken off (not that I have enough expertise to know why). It would be nice, dammit, if I could give explicit guidance for new skills by day, and have my models consolidate them by night!

>>Der_Ei+Nk1
If you read that as patronizing that says more about you than me personally, I have no idea who you are so your own insecurity at what is a rather unloaded explanation perplexes me.

>>gitgud+rH1
The code block isn't an executable script?

>>iainme+(OP)
In addition to the points others make, standardization also opens up opportunities for training and RL that benefit from the standardization.

>>vidarh+MR
OK, so you are essentially using sub-agents as summarizing tools of the main agent, something you could implement by specialized tools that wrap independent LLM calls with the prompts of your sub-agents.

replies(1): >>vidarh+na3

>>vidarh+MK
The current tooling isn't aggressive enough in that it's not the first thing that the agent checks for when it is prompted, at least for Claude Code. Way more often than not, I remind the agent that the skill exists before it does anything. It's very rare that it will pick a skill unprompted. Which to me kind of defeats the purpose of skills; I mean, if I have to tell the thing to go look somewhere, I'll just make any old document folder in any format and tell it to look there.

replies(3): >>vidarh+Ba3 >>richar+A14 >>powers+wh4

>>iainme+(OP)
I share your skepticism and think it's the classic pattern playing out, where people map practices of the previous paradigm to the new one and expect it to work.

Aspects of it will be similar but it trends to disruption as it becomes clear the new paradigm just works differently (for both better and worse) and practices need to be rethought accordingly.

I actually suspect the same is true of the entire 'agent' concept, in truth. It seems like a regression in mental model about what is really going on.

We started out with what I think is a more correct one which is simply 'feed tasks to the singular amorphous engine'.

I believe the thrust of agents is anthropomorphism: trying to map the way we think about AI doing tasks to existing structures we comprehend like 'manager' and 'team' and 'specialisation' etc.

Not that it's not effective in cases, but just probably not the right way to think about what is going on, and probably overall counterproductive. Just a limiting abstraction.

When I see, for example, large consultancies talking about things they are doing in terms of X thousands of agents, I really question what meaning that has in reality and whether it's rather just a mechanism to make the idea fundamentally digestible and attractive to consulting service buyers. Billable hours to concrete entities etc.

replies(2): >>mhalle+p82 >>tehjok+Zn4

>>iainme+KV
Is your view that this doesn’t work based on conjecture or direct experience? It’s my understanding Anthropic and OpenAI have optimized their products to use skills more efficiently and it seems obviously true when I add skills to my repo (even when the info I put there is already in existing documentation).

replies(1): >>iainme+aR2

>>OtherS+E5
Claude can always self discover its own context. The question becomes whether it's way more efficient to have it grepping and lsing and whatever else it needs to do randomly poking around to build a half-baked context, or whether having a tailor made context injection that is dynamic can speed that up.

In other words, if you run an identical prompt, one with skill and one without, on a test task that requires discovering deeply how your codebase works, which one performs better on the following metrics, and how much better?

1. Accuracy / completion of the task

2. Wall clock time to execute the task

3. Token consumption of the task
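A rough sketch of how those three metrics could be recorded and compared per condition. The `TrialResult` fields and the condition labels are hypothetical names for illustration; how you actually measure success, time, and tokens depends on your harness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrialResult:
    condition: str        # hypothetical label, e.g. "with_skill" or "without_skill"
    succeeded: bool       # 1. accuracy / completion of the task
    wall_seconds: float   # 2. wall clock time to execute the task
    tokens_used: int      # 3. token consumption of the task


def summarize(results: list[TrialResult]) -> dict[str, dict[str, float]]:
    """Average the three metrics separately for each condition."""
    by_condition: dict[str, list[TrialResult]] = {}
    for r in results:
        by_condition.setdefault(r.condition, []).append(r)
    return {
        condition: {
            "success_rate": mean(1.0 if r.succeeded else 0.0 for r in trials),
            "mean_wall_seconds": mean(r.wall_seconds for r in trials),
            "mean_tokens": mean(r.tokens_used for r in trials),
        }
        for condition, trials in by_condition.items()
    }
```

The "how much better?" part is then just the difference between the two summaries for the same prompt and task.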

replies(1): >>croon+8w3

>>davnic+M52
On the other hand, LLMs are trained on enormous collections of human-authored documents, many that look like "how to" documents. Perhaps the current generation of LLMs are naturally wired for skill-like human language instructions.

>>gianca+KW
No, Claude Code reads the CLAUDE.md in the root of your project. It's case sensitive so it has to be exactly that, too. GitHub Copilot reads from .github/copilot-instructions.md and supposedly AGENTS.md. Antigravity reads AGENTS.md and pulls subagents and the like from a .agents directory. This is probably why you have to remind it to re-read it so much: the harness isn't loading it for you.

>>iainme+(OP)
Skills are "just instructions in English" in any old format (as opposed to MCPs, which have a lot more weirdness behind them).

A skill is essentially just a markdown file, containing whatever instructions you want, possibly linking to other markdown files and/or scripts to avoid context pollution.

What skills give you is autodiscovery. You need to somehow tell the agent that documentation exists and when it should be looked at, and that's exactly what the skills standard does. It's a standardized format for documentation that harnesses can automatically detect and inform agents about, without them having to do many useless calls on every single turn to see if there are any skills present.
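To make the autodiscovery part concrete, here's a rough sketch of what a harness might do at startup: walk a skills directory, read only the frontmatter of each SKILL.md, and build a short index for the prompt. The one-folder-per-skill layout, the `name`/`description` keys, and the function names are assumptions for illustration, not any particular harness's implementation.

```python
from pathlib import Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse a simple `key: value` frontmatter block delimited by `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as having no frontmatter


def build_skill_index(skills_root: Path) -> str:
    """Return one line per skill: name, description, and where the body lives."""
    entries = []
    for skill_file in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_file.read_text(encoding="utf-8"))
        name = meta.get("name", skill_file.parent.name)
        description = meta.get("description", "(no description)")
        entries.append(f"- {name}: {description} [{skill_file}]")
    return "\n".join(entries)
```

Only the index goes into context up front; the agent reads the full file when a description matches what it's working on.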

>>iainme+(OP)
There could be a market if it is standardized, and it seems there is already one [1]. I don't know exactly what they are selling because the website is just too confusing to me to understand a thing.

[1] https://skillsmp.com/

replies(1): >>rrvsh+773

>>postal+V5
thanks for sharing the work. correct, we're currently working on evals for skills so you can compare skills between models and harnesses.

we wrote a blog on getting agents to write CUDA kernels and evaluating them: https://huggingface.co/blog/upskill

>>joe_th+9E1
Of course, but the agent can't run a code block in a readme.

It _can_ run a PEP723 script without any specific setup (as long as uv and python are installed). It will automatically create a virtual environment AND install all dependencies. All with a single command without polluting the context with tons of setup.
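For anyone who hasn't seen PEP 723: the dependencies live in a comment block at the top of the script, which is what lets a runner like uv set everything up. Below is a minimal sketch of what that header looks like and how a runner could pull it out. The example script and its `httpx` dependency are made up for illustration, and this simplified parser is not uv's actual code.

```python
from typing import Optional

# A hypothetical script with PEP 723 inline metadata at the top.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "httpx",
# ]
# ///
import httpx

print(httpx.get("https://example.com").status_code)
'''


def extract_inline_metadata(script: str) -> Optional[str]:
    """Return the TOML text of the `script` metadata block, or None if absent."""
    collected: list[str] = []
    inside = False
    for line in script.splitlines():
        if not inside:
            inside = line == "# /// script"
            continue
        if line == "# ///":  # closing marker
            return "\n".join(collected)
        if not line.startswith("#"):
            return None  # block was never closed properly
        # Drop the "# " prefix (or a bare "#" for blank metadata lines).
        collected.append(line[2:] if line.startswith("# ") else line[1:])
    return None
```

The returned text is plain TOML, so a real runner would hand it to a TOML parser and install whatever is listed under `dependencies` before executing the script.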

>>ashdks+262
Hmm, that’s a good question! I think a bit of both.

In terms of experience, I’ve noticed that agents don’t always use skills the way you want; and I’ve noticed that they’re pretty good at browsing existing code and docs and figuring things out for themselves.

Is this an example of “the bitter lesson”? That’s conjecture, but I think pretty well-founded.

It could well be that specific formats for skills work better because the agents are trained on those specific formats. But if so, I think it’s just a local maximum.

replies(1): >>ashdks+WN4

>>sbinne+LI2
Thank u holy shit I hate that website

>>storus+V02
That is effectively how sub-agents are implemented at least conceptually, and yes, if you build your own coding agent, you can trivially implement sub-agents by having your coding agent recursively spawn itself.

Claude Code and others have some extras, such as the ability for the main agent to put them in the background, spawn them in parallel, and use tool calls to check on the status of them (so basic job control), but "poor man's sub-agents" only requires the ability for the coding agent to run an executable the equivalent of e.g. "claude --print <someprompt" (the --print option is real, and enables headless use; in practise you'd also want --stream-json, set allowed tools, and specify a conversation id so you can resume the sub-agent's conversation).

And calling it all "summarising" understates it. It is delegation, and a large part of the value of delegation in a software system is abstraction and information hiding. The party that does the delegation does not need to care about all of the inner detail of the delegated task.

The value is not the summary. The value is the work done that the summary describes without unnecessary detail.
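The "poor man's sub-agent" described above can be sketched in a few lines. This assumes a `claude` executable on the PATH that accepts the prompt as an argument after `--print`; the function name and the injectable `base_command` are hypothetical, and the extra options mentioned above are left out.

```python
import subprocess
from typing import Sequence


def run_sub_agent(
    prompt: str,
    base_command: Sequence[str] = ("claude", "--print"),  # assumed CLI invocation
) -> str:
    """Run one delegated task in a fresh process and return only its final output.

    The parent agent never sees the sub-agent's intermediate steps, which is
    the information-hiding benefit described above.
    """
    result = subprocess.run(
        [*base_command, prompt],
        capture_output=True,
        text=True,
        check=True,  # raise if the sub-agent exits with an error
    )
    return result.stdout.strip()
```

Exposing something like `run_sub_agent` as a tool is all the main agent needs in order to delegate; job control and parallelism are extras on top.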

>>SOLAR_+L52
I agree, but this is at least partly down to how well the descriptions are composed, because that is pretty much the only difference between a skill and what Vercel does. It might well be there's a need for changes to surrounding prompting from the tools as well, of course.

replies(1): >>mirekr+Dh3

>>vidarh+Ba3
Exactly, many people seem to not understand that the frontmatter's description field needs to be a longer "when?" instead of a shorter "what?". It is the only entry point into the skill.

>>idopms+vb
Skills are chainable e.g. skill A can invoke skill B and then decide to invoke skill C etc… I don’t believe your slash commands can do this?

>>iainme+(OP)
Standardization is needed for agentic coding harnesses to be able to parse the files and inject them into the context in a way that takes the least effort for the user.

This is true for MCP as well. You could just describe a bunch of command line tools in AGENTS.md and tell the LLM when and how to call them. It would simply take more effort to set up, at least for some tools.

This is where a comparison in productivity would return a meaningful result: how much easier does it make setting up things like that?

>>SOLAR_+762
It's not about one with skill and one without, but about one with skill vs one with regular old human documentation for stuff you need to know to work on a repo/project. Or, for an even more accurate comparison, take the skill, don't load it as a skill, and just put it as context in the repo.

I think the main conflict in this thread is whether skills are anything more than just structuring documentation you were lacking in your repo, regardless if it was for Claude or Steve starting from scratch.

replies(1): >>SOLAR_+6a7

>>iainme+(OP)
The bitter lesson is different, and applies to the learning process. It's not directly relevant here.

If we're just pattern matching to adjacent memes that might provide insight, I'd also throw "sufficiently smart compiler" into the mix. Like, yes, in theory as the compiler gets better you shouldn't have to worry about implementing random optimizations yourself, but in practice you do.

In theory, you just need normal docs and a sufficiently smart LLM and agent harness can use them, but in practice there's still benefit in organizing them a certain way to more directly manage the context window yourself.

>>postal+V5
This is a neat idea for a test. But the test is badly executed. A single comparison could just be a fluke. Compare it on a dozen tasks, trying each task a dozen times. Then you get data which is believable.
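Something like the following would do, as a sketch: run every task several times under each condition and compare success rates. `run_trial` is a hypothetical callable standing in for whatever launches the agent and checks the result; the default of a dozen trials just mirrors the numbers above.

```python
from typing import Callable


def success_rates(
    run_trial: Callable[[str, str], bool],  # (task, condition) -> did it succeed?
    tasks: list[str],
    conditions: list[str],
    trials_per_task: int = 12,
) -> dict[str, float]:
    """Run every task several times under each condition and return success rates."""
    rates: dict[str, float] = {}
    for condition in conditions:
        outcomes = [
            run_trial(task, condition)
            for task in tasks
            for _ in range(trials_per_task)
        ]
        rates[condition] = sum(outcomes) / len(outcomes)
    return rates
```

With a dozen tasks that is 144 runs per condition, which is enough to see whether a gap is bigger than the run-to-run noise.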

>>SOLAR_+L52
(not an expert)

I see this with Cursor all the time with tools. Cursor will stop editing files in the editor and use the command line to echo edits into a file. It's so frustrating.

>>iainme+(OP)
It’s about the agent. Not the model or the format.

The "bitter lesson" only applies if the model makes the agent redundant. We aren't there yet. Agentic loops are just software engineering on top of CS constructs; they help current models produce better results.

Could models eventually internalize the logic used in Claude Code / Codex / OpenCode / Aider? Maybe. But for now, keeping that complexity in the agent is more energy-efficient. Even if complex agents eventually get replaced by simple loops, these standards save tokens and time today. That’s worth something.

>>ethbr1+Li1
> This is different from swagger / OpenAPI how?

In the way that Swagger / OpenAPI is for API endpoints, but most of the "skills" you need for your agents are not based on API endpoints

replies(1): >>ethbr1+nA4

>>SOLAR_+L52
I had good success with this after tuning my triggers + main agent prompt.

I explicitly tell it about the skills and that it should load them when the context feels correct.

```prompt.md

Company Name codebase ...

# Skills

Use the company specific skills that look like `company-*`. Load them once per conversation if they seem relevant to what you are working on.

```

```SKILL.md
---
description: Company TypeScript libraries and conventions
trigger: Writing or reading TypeScript in Company services
---

# company-ts
```

>>davnic+M52
I can see what you're getting at, but think about how humans are a general intelligence and we still ask them to perform specialized jobs. That said, they acquire knowledge in that position rather than being pre-loaded with everything they will ever know (outside of working memory).

replies(1): >>davnic+lP4

>>gbaldu+fh4
I mean conceptually.

Why not just extend the OpenAPI specification to skills? Instead of recreating something that's essentially communicating the same information?

T minus a couple years before someone declares that down-mapping skills into a known verb enumeration promotes better skill organization...

replies(1): >>dragon+8C4

>>ethbr1+Li1
> This is different from swagger / OpenAPI how?

Because the descriptions aren't API specs and the things described aren't APIs.

It's more like a structure for human-readable descriptions in an annotated table of contents for a recipe book than it is like OpenAPI.

>>ethbr1+nA4
> Why not just extend the OpenAPI specification to skills?

Because approximately none of what exists in the existing OpenAPI specification is relevant to the task, and nothing needed for the tasks is relevant to the current OpenAPI use case, so trying to jam one use case into a tool designed for the other would be pure nonsense.

It’s like needing to drive nails and asking why grab a hammer when you already have a screwdriver.

replies(1): >>ethbr1+ed8

>>iainme+aR2
I had a kind of visceral distaste for all of this rules, skills etc stuff when I first heard about it for similar reasons. This generalized text model can speak base64 encoded Klingon but a readme.md isn’t good enough? However given the reality of limited context windows, current models can’t consider everything in a big repo all at once and keep coherent. Attaching some metadata to the information that tells the model when and how to consider it (and assisting the models with tooling written in code to provide the context at the right time) seems to make a big difference in practice.

>>tehjok+Zn4
yeah I think this is exactly how the analogy breaks down.

As humans we need to specialise. Even though we're generalists and have the a priori potential to learn and do all manner of things we have to pick just a few to focus on to be effective (the beautiful dilemma etc).

I think the basic reason being we're limited by learning time and, relatedly, execution bandwidth of how many things we can reasonably do in a given time period.

LLMs don't have these constraints in the same way. As you say they come preloaded with absolutely everything all at once. There's no or very little marginal time investment per se in learning anything. As for output bandwidth, it also scales horizontally with compute supplied.

So I just think the inherent limitations that make us organise human work around this individual unit working in teams and whatnot don't apply and are counterproductive to apply. There's a real cost to all that stuff that LLMs can just sidestep around, and that's part of the power of the new paradigm that shouldn't be left on the table.

replies(1): >>tehjok+l95

>>davnic+lP4
I suppose the AI can wear many "hats" simultaneously, but it does have to be competent enough at at least one or two of them for that to be viable. I think one way to think about that is roles can be consolidated.

>>iainme+pb
> To be clear, I'm suggesting that any specific format for "skills.md" is a red herring, and all you need to do is provide the LLM with good clear documentation.

Agent Skills isn't a spec for how information is presented to the model; it's a spec whose consumer is the model harness. Different harnesses might present that information to the model in different ways, and even the same harness might vary it for different models or tasks, considering things like the number and size of the skill(s) available, the size of the model context, the purpose of the harness (is it for a narrow purpose agent where some of the skills are central to that purpose?), and user preference settings.

The site itself describes two main styles of integration for harnesses ("tool based" and "filesystem based"), but those are more of a starting point for implementers than an exhaustive listing.

The idea is that skill authors don't need to know or care how the harness is presenting the information to the model.

>>pton_x+q8
The standardization is for presentation of how the information is made available to the harness. Optimizations in how the information is presented to the model can be iterated on without impacting the presentation to the harness. Initially, agent skills have already been provided by:

(1) providing a bash tool with direct access to the filesystem storing the skills to the model,

(2) providing read_file and related tools to the model,

(3) providing specialized tools to access skills to the model,

(4) processing the filesystem structure and providing a structure that includes the full content of the skills up front to the model.

And probably some other ways or hybrids.
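As a toy illustration of why the harness, not the skill author, gets to pick between these: the same skills can be inlined in full or reduced to an index depending on how much room there is. The `Skill` class, `present_skills`, and the character budget (a crude stand-in for tokens) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    body: str


def present_skills(skills: list[Skill], budget_chars: int) -> str:
    """Inline full skill bodies if they fit the budget, else list descriptions only."""
    full = "\n\n".join(f"# {s.name}\n{s.body}" for s in skills)
    if len(full) <= budget_chars:
        return full
    # Too large: expose an index and let the model fetch bodies via a tool.
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)
```

The skill files are identical in both cases; only the harness-side presentation changes.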

> It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?

Standardizing how skills are presented to LLM harnesses lets the harnesses incorporate findings on optimization (which may be specific to models, or at least to model features like context size, and to use cases), with existing skills getting the benefit of that for free.

replies(1): >>0thgen+8K7

>>croon+8w3
well, the key difference is that one is auto-injected into your context for dynamic lookup and the other is loaded on-demand as needed and is contingent upon the llm discovering it.

That difference alone likely accounts for some not insignificant discrepancies. But without numbers, it's hard to say.

>>dragon+DB5
How much of a standard is it though, really? To me it just looks like "Call your docs SKILLS and organize it like this".

And if you're just making docs and letting your models go buck wild in your shell, doesn't an overspecified docs structure ruin the point of general purpose agents?

Like, a good dev should be able to walk into a codebase, look at the structure, and figure out how to proceed. If "hey your docs aren't where I was expecting" breaks the developer, you shouldn't have hired them.

Feels like a weird thing to take "this is how we organize our repos as this company" and turn that into "this is an 'open standard' that you should build your workflows around".

>>dragon+8C4
You think indexing skills in increasingly structured, parameterized formats has nothing to do with documenting REST API endpoints?