zlacker

[return to "Agent Skills"]
1. iainme+Qb[view] [source] 2026-02-03 15:09:04
>>moored+(OP)
This stuff smells like maybe the bitter lesson isn't fully appreciated.

You might as well just write instructions in English in any old format, as long as it's comprehensible. Exactly as you'd do for human readers! Nothing has really changed about what constitutes good documentation. (Edit to add: my parochialism is showing there, it doesn't have to be English)

Is any of this standardization really needed? Who does it benefit, except the people who enjoy writing specs and establishing standards like this? If it really is a productivity win, it ought to be possible to run a comparison study and prove it. Even then, it might not be worthwhile in the longer run.

◧◩
2. postal+Lh[view] [source] 2026-02-03 15:34:10
>>iainme+Qb
Folks have run comparisons. From a huggingface employee:

  codex + skills finetunes Qwen3-0.6B to +6 on humaneval and beats the base score on the first run.

  I reran the experiment from this week, but used codex's new skills integration. Like claude code, codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats claude code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...

That said, it's not a perfect comparison because of the Codex model mismatch between runs.

The author seems to be doing a lot of work on skills evaluation.

https://github.com/huggingface/upskill

◧◩◪
3. pton_x+gk[view] [source] 2026-02-03 15:44:48
>>postal+Lh
I think the point is it smells like a hack, just like "think extra hard and I'll tip you $200" was a few years ago. It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?
◧◩◪◨
4. mbesto+dz[view] [source] 2026-02-03 16:42:44
>>pton_x+gk
I think this tweet sums it up correctly, doesn't it?

   A +6 jump on a 0.6B model is actually more impressive than a +2 jump on a 100B model. It proves that 'intelligence' isn't just parameter count; it is context relevance. You are proving that a lightweight model with a cheat sheet beats a giant with amnesia. This is the death of the 'bigger is better' dogma
Which is essentially the bitter lesson that Richard Sutton talks about?
◧◩◪◨⬒
5. Der_Ei+Ds1[view] [source] 2026-02-03 20:26:20
>>mbesto+dz
Nice ChatGPT-generated response in that tweet. Anyone too lazy to deslop their tweet shouldn't be listened to.