You might as well just write instructions in English in any old format, as long as it's comprehensible. Exactly as you'd do for human readers! Nothing has really changed about what constitutes good documentation. (Edit to add: my parochialism is showing there, it doesn't have to be English)
Is any of this standardization really needed? Who does it benefit, except the people who enjoy writing specs and establishing standards like this? If it really is a productivity win, it ought to be possible to run a comparison study and prove it. Even then, it might not be worthwhile in the longer run.
codex + skills finetunes Qwen3-0.6B to +6 on humaneval and beats the base score on the first run.
I reran the experiment from this week, but used codex's new skills integration. Like claude code, codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats claude code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...
That said, it's not a perfect comparison because of the Codex model mismatch between runs.
The author seems to be doing a lot of work on skills evaluation.
(1) providing the model with a bash tool that has direct access to the filesystem storing the skills,
(2) providing the model with read_file and related tools,
(3) providing the model with specialized tools to access skills,
(4) processing the filesystem structure and providing the model with a structure that includes the full content of the skills up front.
And probably some other ways or hybrids.
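To make the difference concrete, here is a rough sketch of options (3) and (4) reading the same on-disk skills. Everything in it is an assumption for illustration: the one-directory-per-skill layout with a `SKILL.md` file, and the names `load_skills`, `upfront_context`, `make_skill_tools`, and `write_demo_skills` are hypothetical, not any harness's real API.

```python
from pathlib import Path


def load_skills(root: Path) -> dict[str, str]:
    """Map each skill name (its directory name) to the text of its SKILL.md.

    Assumed layout: <root>/<skill-name>/SKILL.md
    """
    return {
        path.parent.name: path.read_text(encoding="utf-8")
        for path in sorted(root.glob("*/SKILL.md"))
    }


def upfront_context(skills: dict[str, str]) -> str:
    """Option (4): inline the full content of every skill into the prompt."""
    return "\n\n".join(
        f"## Skill: {name}\n{body}" for name, body in sorted(skills.items())
    )


def make_skill_tools(skills: dict[str, str]) -> dict:
    """Option (3): specialized tools the model can call on demand."""

    def list_skills() -> list[str]:
        return sorted(skills)

    def read_skill(name: str) -> str:
        # Return an error string rather than raising, so the model can recover.
        return skills.get(name, f"unknown skill: {name}")

    return {"list_skills": list_skills, "read_skill": read_skill}


def write_demo_skills(root: Path) -> None:
    """Hypothetical fixture: create two tiny skills for trying the sketch."""
    for name, body in [("finetune", "Run the trainer."), ("plot", "Draw a chart.")]:
        (root / name).mkdir(parents=True, exist_ok=True)
        (root / name / "SKILL.md").write_text(body, encoding="utf-8")
```

The skill files are identical in both cases; only the harness decides whether the model sees everything up front or pulls skills in through tool calls.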
> It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?
Standardizing how skills are presented to LLM harnesses lets the harnesses incorporate findings on optimization (which may be specific to models, or at least to model features like context size, and to use cases), and existing skills get the benefit of that for free.
And if you're just making docs and letting your models go buck wild in your shell, doesn't an overspecified docs structure ruin the point of general purpose agents?
Like, a good dev should be able to walk into a codebase, look at the structure, and figure out how to proceed. If "hey your docs aren't where I was expecting" breaks the developer, you shouldn't have hired them.
Feels like a weird thing to take "this is how we organize our repos as this company" and turn that into "this is an 'open standard' that you should build your workflows around".