1. iainme+Qb 2026-02-03 15:09:04
>>moored+(OP)
This stuff smells like maybe the bitter lesson isn't fully appreciated.

You might as well just write instructions in English in any old format, as long as it's comprehensible. Exactly as you'd do for human readers! Nothing has really changed about what constitutes good documentation. (Edit to add: my parochialism is showing there; it doesn't have to be English.)

Is any of this standardization really needed? Who does it benefit, except the people who enjoy writing specs and establishing standards like this? If it really is a productivity win, it ought to be possible to run a comparison study and prove it. Even then, it might not be worthwhile in the longer run.

2. postal+Lh 2026-02-03 15:34:10
>>iainme+Qb
Folks have run comparisons. From a Hugging Face employee:

  codex + skills finetunes Qwen3-0.6B to +6 on humaneval and beats the base score on the first run.

  I reran the experiment from this week, but used codex's new skills integration. Like claude code, codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats claude code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...

That said, it's not a perfect comparison because of the Codex model mismatch between runs.

The author seems to be doing a lot of work on skills evaluation.

https://github.com/huggingface/upskill

3. pton_x+gk 2026-02-03 15:44:48
>>postal+Lh
I think the point is it smells like a hack, just like "think extra hard and I'll tip you $200" was a few years ago. It raises benchmark scores by a few points now, but what's the point in standardizing all this if it'll be obsolete next year?