zlacker

[parent] [thread] 21 comments
1. postal+(OP)[view] [source] 2026-02-03 15:34:10
Folks have run comparisons. From a huggingface employee:

  codex + skills finetunes Qwen3-0.6B to +6 on humaneval and beats the base score on the first run.

  I reran the experiment from this week, but used codex's new skills integration. Like claude code, codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats claude code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...

That said, it's not a perfect comparison because of the Codex model mismatch between runs.

The author seems to be doing a lot of work on skills evaluation.

https://github.com/huggingface/upskill

replies(6): >>xrd+z1 >>pton_x+v2 >>8cvor6+44 >>iainme+u5 >>bburte+KH2 >>oofbey+nO3
2. xrd+z1[view] [source] 2026-02-03 15:41:35
>>postal+(OP)
Does this indicate running locally with a very small (quantized?) model?

I am very interested in finding ways to combine skills + local models + MCP + aider-ish tools to avoid using commercial LLM providers.

Is this a path to follow? Or, something different?

replies(1): >>postal+Q7
3. pton_x+v2[view] [source] 2026-02-03 15:44:48
>>postal+(OP)
I think the point is it smells like a hack, just like "think extra hard and I'll tip you $200" was a few years ago. It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?
replies(3): >>mbesto+sh >>9dev+Nl >>dragon+Iv5
4. 8cvor6+44[view] [source] 2026-02-03 15:50:45
>>postal+(OP)
Sounds like the benchmark matrix just got a lot bigger, model * skill combinations.
5. iainme+u5[view] [source] 2026-02-03 15:56:11
>>postal+(OP)
I can't quite tell what's being compared there -- just looks like several different LLMs?

To be clear, I'm suggesting that any specific format for "skills.md" is a red herring, and all you need to do is provide the LLM with good clear documentation.

A useful comparison would be between: a) make a carefully organised .skills/ folder, b) put the same info anywhere and just link to it from your top-level doc, c) just dump everything directly in the top-level doc.

My guess is that it's probably a good idea to break stuff out into separate sections, to avoid polluting the context with stuff you don't need; but the specific way you do that very likely isn't important at all. So (a) and (b) would perform about the same.
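
For concreteness, condition (b) might be nothing more than a top-level doc with a short link index, while (a) would hold the same content in .skills/*/SKILL.md files instead. A rough sketch, with file names invented for illustration:

```AGENTS.md
# myproject agent guide

Build with `make build`, run tests with `make test`.

## Further docs (read only what's relevant to the task)
- docs/release-process.md - how to cut and hotfix a release
- docs/db-migrations.md - adding and running database migrations
```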

replies(3): >>postal+P9 >>anupam+NJ >>dragon+uk5
6. postal+Q7[view] [source] [discussion] 2026-02-03 16:05:42
>>xrd+z1
Check out his work; he's focused on precisely what you're talking about.

https://xcancel.com/ben_burtenshaw

https://huggingface.co/blog/upskill

https://github.com/huggingface/upskill

7. postal+P9[view] [source] [discussion] 2026-02-03 16:13:34
>>iainme+u5
Your skepticism is valid. Vercel ran a study claiming that skills underperform a docs index in AGENTS.md[0].

My guess is that the standardization is going to make its way into how the models are trained and Skills are eventually going to pull out ahead.

0: https://vercel.com/blog/agents-md-outperforms-skills-in-our-...

replies(1): >>vidarh+RE
8. mbesto+sh[view] [source] [discussion] 2026-02-03 16:42:44
>>pton_x+v2
I think this tweet sums it up correctly, doesn't it?

   A +6 jump on a 0.6B model is actually more impressive than a +2 jump on a 100B model. It proves that 'intelligence' isn't just parameter count; it is context relevance. You are proving that a lightweight model with a cheat sheet beats a giant with amnesia. This is the death of the 'bigger is better' dogma
Which is essentially the bitter lesson that Richard Sutton talks about?
replies(1): >>Der_Ei+Sa1
9. 9dev+Nl[view] [source] [discussion] 2026-02-03 17:00:03
>>pton_x+v2
Standards have to start somewhere to gain traction, and stick around for longer than that.

Plus, as has been mentioned multiple times here, standard skills are a lot more about different harnesses being able to consistently load skills into the context window in a programmatic way. Not every AI workload is a local coding agent.

10. vidarh+RE[view] [source] [discussion] 2026-02-03 18:15:35
>>postal+P9
Agents already add a docs index to context for skills, so this is really a finding that the current specific implementation of skills in Claude Code is suboptimal.

Their reasoning about it is also flawed. E.g. "No decision point. With AGENTS.md, there's no moment where the agent must decide "should I look this up?" The information is already present." - but this is exactly the case for skills too. The difference is just where in the context the information is, and how it is structured.

Having looked at their article, ironically I think the reason it works is that they likely force more information into context by giving the agent less information to work with:

Instead of having a description, which might convince the agent a given skill isn't relevant, their index is basically a list of vague filenames, forcing the agent to make a guess and potentially read the wrong thing.

This is basically exactly what skills were added to avoid. But it will break if the description isn't precise enough. And it's perfectly possible that current tooling isn't aggressive enough about pruning detail that might tempt the agent to ignore relevant files.
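
To make that concrete (my paraphrase of the two styles, not Vercel's actual index or any real skill), the difference in what lands in the agent's context is roughly:

```context.md
<!-- skills-style index: the description gives the agent a decision point -->
- pdf-reports: Use when generating or editing customer-facing PDF reports

<!-- filename-style index: the agent has to guess what's behind each entry -->
- docs/reports.md
- docs/pdf.md
```

With the first, a precise description can also talk the agent out of loading something irrelevant; with the second, it has nothing to go on but the path.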

replies(1): >>SOLAR_+QZ1
11. anupam+NJ[view] [source] [discussion] 2026-02-03 18:33:22
>>iainme+u5
> If you want a clean comparison, I'd test three conditions under equal context budgets: (A) monolithic AGENTS.md, (B) README index that links to docs, (C) skills with progressive disclosure. Measure task success, latency, and doc-fetch count across 10–20 repo tasks. My hunch: (B)≈(C) on quality, but (C) wins on token efficiency when the index is strong. Also, format alone isn't magic: skills that reference real tools/assets via the backing MCP are qualitatively different from docs-only skills, so I'd separate those in the comparison. Have you seen any benchmarks that control for discovery overhead?
12. Der_Ei+Sa1[view] [source] [discussion] 2026-02-03 20:26:20
>>mbesto+sh
Nice ChatGPT-generated response in that tweet. Anyone too lazy to deslop their tweet shouldn't be listened to.
13. SOLAR_+QZ1[view] [source] [discussion] 2026-02-04 01:03:41
>>vidarh+RE
The current tooling isn't aggressive enough in that it's not the first thing the agent checks for when it is prompted, at least for Claude Code. Way more often than not, I remind the agent that the skill exists before it does anything; it's very rare that it will pick a skill unprompted. Which to me kind of defeats the purpose of skills: if I have to tell the thing to go look somewhere, I'll just make any old document folder in any format and tell it to look there.
replies(3): >>vidarh+G43 >>richar+FV3 >>powers+Bb4
14. bburte+KH2[view] [source] 2026-02-04 07:36:34
>>postal+(OP)
Thanks for sharing the work. Correct, we're currently working on evals for skills so you can compare skills across models and harnesses.

We wrote a blog post on getting agents to write CUDA kernels and evaluating them: https://huggingface.co/blog/upskill

15. vidarh+G43[view] [source] [discussion] 2026-02-04 10:36:16
>>SOLAR_+QZ1
I agree, but this is at least partly down to how well the descriptions are composed, because that is pretty much the only difference between a skill and what Vercel does. It might well be that there's a need for changes to the surrounding prompting from the tools as well, of course.
replies(1): >>mirekr+Ib3
16. mirekr+Ib3[view] [source] [discussion] 2026-02-04 11:28:44
>>vidarh+G43
Exactly. Many people seem not to understand that the frontmatter's description field needs to be a longer "when?" instead of a shorter "what" - it is the only entry point into the skill.
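
A made-up example of the difference, with only the description field shown:

```SKILL.md
---
# "what" description - says what the skill contains, gives the agent little reason to load it:
#   description: Documentation for the company payments library
# "when" description - names the situations that should trigger loading it:
description: Use when writing or debugging payment flows, refunds, or billing webhooks in company services
---
```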
17. oofbey+nO3[view] [source] 2026-02-04 15:22:48
>>postal+(OP)
This is a neat idea for a test. But the test is badly executed. A single comparison could just be a fluke. Compare it on a dozen tasks, trying each task a dozen times. Then you get data which is believable.
18. richar+FV3[view] [source] [discussion] 2026-02-04 15:55:22
>>SOLAR_+QZ1
(not an expert)

I see this with Cursor all the time with tools. Cursor will stop editing files in the editor and use the command line to echo edits into a file. It's so frustrating.

19. powers+Bb4[view] [source] [discussion] 2026-02-04 17:03:51
>>SOLAR_+QZ1
I had good success with this after tuning my triggers + main agent prompt.

I explicitly tell it about the skills and that it should load them when the context feels correct.

```prompt.md

Company Name codebase ...

# Skills

Use the company-specific skills that look like `company-*`. Load them once per conversation if they seem relevant to what you are working on.

```

```SKILL.md
---
description: Company TypeScript libraries and conventions
trigger: Writing or reading TypeScript in Company services
---

# company-ts
```

20. dragon+uk5[view] [source] [discussion] 2026-02-04 22:26:24
>>iainme+u5
> To be clear, I'm suggesting that any specific format for "skills.md" is a red herring, and all you need to do is provide the LLM with good clear documentation.

Agent Skills isn't a spec for how information is presented to the model; it's a spec whose consumer is the harness. The harness might present the information made available to it in that format to the model in different ways: differently across harnesses, or even within the same harness for different models or tasks, considering things like the number and size of the skill(s) available, the size of the model context, the purpose of the harness (is it a narrow-purpose agent where some of the skills are central to that purpose?), and user preference settings.

The site itself describes two main styles of integration for harnesses ("tool based" and "filesystem based"), but those are more of a starting point for implementers than an exhaustive listing.

The idea is that skill authors don't need to know or care how the harness is presenting the information to the model.

21. dragon+Iv5[view] [source] [discussion] 2026-02-04 23:31:39
>>pton_x+v2
The standardization is for how the information is made available to the harness. Optimizations in how the information is presented to the model can be iterated on without impacting that interface. So far, agent skills have already been presented to models by:

(1) providing the model a bash tool with direct access to the filesystem storing the skills,

(2) providing the model read_file and related tools,

(3) providing the model specialized tools for accessing skills,

(4) processing the filesystem structure and providing the model a structure that includes the full content of the skills up front.

And probably some other ways or hybrids.
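
As a purely illustrative sketch (not any particular harness), (4) might inline something like this into the context up front, where (1)-(3) would only surface names and descriptions and let the model fetch the bodies on demand:

```context.md
## Skill: release-process
description: Use when cutting or hotfixing a release
(full SKILL.md body inlined here by the harness)

## Skill: db-migrations
description: Use when adding or running database migrations
(full SKILL.md body inlined here by the harness)
```

The skill files on disk stay identical either way, which is the point.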

> It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?

Standardizing how skills are presented to LLM harnesses lets the harnesses incorporate findings on optimization (which may be specific to models, or at least to model features like context size, and to use cases), with existing skills getting the benefit of that for free.

replies(1): >>0thgen+dE7
22. 0thgen+dE7[view] [source] [discussion] 2026-02-05 16:52:54
>>dragon+Iv5
How much of a standard is it though, really? To me it just looks like "Call your docs SKILLS and organize it like this".

And if you're just making docs and letting your models go buck wild in your shell, doesn't an overspecified docs structure ruin the point of general purpose agents?

Like, a good dev should be able to walk into a codebase, look at the structure, and figure out how to proceed. If "hey your docs aren't where I was expecting" breaks the developer, you shouldn't have hired them.

Feels like a weird thing to take "this is how we organize our repos at this company" and turn it into "this is an 'open standard' that you should build your workflows around".
