The design of https://www.skillcreator.ai/explore is more useful to me. At least I can search by category, framework, and language, and I also see much more information about what a skill does at a glance. I don't know why Vercel really wanted to make theirs completely black and white - colors used with taste give useful context and information.
> In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it. Adding the skill produced no improvement over baseline.
> …
> Skills aren't useless. The AGENTS.md approach provides broad, horizontal improvements to how agents work with Next.js across all tasks. Skills work better for vertical, action-specific workflows that users explicitly trigger,
https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
[1]: https://code.claude.com/docs/en/skills#control-who-invokes-a... [2]: https://opencode.ai/docs/skills/#disable-the-skill-tool [3]: https://developers.openai.com/codex/skills/#enable-or-disabl...
Why do I want to throw away my dependency management system and shared libraries folder for putting scripts in skills?
What tools do they have access to, can I define this so it's dynamic? Do skills even have a concept for sub tools or sub agents? Why do I want to put references in a folder instead of a search engine? Does frontmatter even make sense, why not something closer to a package.json in a file next to it?
Does it even make sense to have skills in the repo? How do I use them across projects? How do we build an ecosystem and dependency management system for skills (which are themselves versioned)?
I treat my skills the same way I treated the tiny bash scripts and fish functions I wrote in days gone by to simplify my life by writing 2 words instead of 2 sentences. A tiny improvement that only makes sense for a programmer at heart.
Codex + skills fine-tunes Qwen3-0.6B to +6 on HumanEval and beats the base score on the first run.
I reran the experiment from this week, but used Codex's new skills integration. Like Claude Code, Codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats Claude Code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...
That said, it's not a perfect comparison because of the Codex model mismatch between runs.
The author seems to be doing a lot of work on skills evaluation.
Codex started this and OpenCode followed suit within the hour.
.opencode/skills
[1]: https://opencode.ai/docs/skills/#place-files
https://xcancel.com/ben_burtenshaw
I wrote a rant about skills a while ago that's still relevant in some ways: https://sibylline.dev/articles/2025-10-20-claude-skills-cons...
My guess is that the standardization is going to make its way into how the models are trained and Skills are eventually going to pull out ahead.
0: https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
https://claude.com/blog/context-management
> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.
> The memory tool enables Claude to store and consult information outside the context window through a file-based system.
But it looks like nobody has it as part of the inference loop yet: I guess it's hard to train (i.e. you need a training set that is a good match for how people use context in practice) and it makes inference more complicated. I guess more high-level context management is just easier to implement - and it's one of the things that "GPT wrapper" companies can do, so why bother?
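To make the file-based idea from the quote concrete, here is a minimal sketch of what such a memory layer could look like. Everything in it (the MemoryStore class, the memory/ directory, the method names) is hypothetical and not Anthropic's actual tool schema:

# Hypothetical sketch of a file-based memory tool: the agent reads and writes
# small named notes on disk instead of keeping everything in its context window.
from pathlib import Path

class MemoryStore:
    def __init__(self, root="memory"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def write(self, key: str, text: str) -> None:
        # Persist a note outside the context window.
        (self.root / f"{key}.md").write_text(text)

    def read(self, key: str) -> str:
        path = self.root / f"{key}.md"
        return path.read_text() if path.exists() else ""

    def list_keys(self) -> list[str]:
        # Let the agent consult what it has stored before deciding what to load.
        return [p.stem for p in self.root.glob("*.md")]

# Usage: the agent writes a summary before context editing clears old tool results,
# then reads it back in a later turn instead of re-deriving it.
store = MemoryStore()
store.write("eval-setup", "Harness lives in eval/run.py; use --limit 20 for smoke tests.")
print(store.list_keys())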
Yes, treating the "front matter" of a skill and the "function definition" of a tool call as kind of an equivalence class.
This understanding helped me create an LLM-agnostic (and sandboxed) open-skills[1] way before this standardization was proposed.
1. Open-skills: https://github.com/instavm/open-skills
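To make that equivalence concrete, here is a minimal sketch assuming a SKILL.md with name/description frontmatter; the field names and the resulting schema are illustrative, not any vendor's exact format:

# Sketch: treat a skill's frontmatter as the "function definition" of a tool call.
# The frontmatter fields and the resulting schema are illustrative only.
import re

SKILL_MD = """---
name: pdf-extractor
description: Extract text and tables from PDF files when the user provides a PDF.
---
# Instructions
...
"""

def frontmatter(skill_md: str) -> dict:
    match = re.search(r"^---\n(.*?)\n---", skill_md, re.S)
    fields = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

def as_tool_definition(skill_md: str) -> dict:
    fm = frontmatter(skill_md)
    # Same role as a function definition: a name plus a description the model
    # uses to decide whether to "call" (i.e. load) the skill.
    return {
        "name": fm["name"],
        "description": fm["description"],
        "parameters": {"type": "object", "properties": {}},
    }

print(as_tool_definition(SKILL_MD))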
It doesn't look like slop at all to me. GP claimed that this was written by AI without evidence, which I assumed to be based in bias, given GP's comment history: https://news.ycombinator.com/threads?id=jondwillis The complaint they have about the writing style is not about the style that is emblematic of AI slop. And considering the depth of analysis and breadth of connection, this is not something current AI is up to producing.
Are you also assuming the article was written by an AI?
For example, if you've just joined a new team or a new project, wouldn't you like to have extensive, well-organised documentation to help get you started?
This reminds me of the "curb-cut effect", where accommodations for disabilities can be beneficial for everybody: https://front-end.social/@stephaniewalter/115841555015911839
[1] https://skills.sh/vercel-labs/agent-skills/web-design-guidel... [2] https://github.com/vercel-labs/agent-skills/blob/main/skills...
All of these SKILLS.md/AGENTS.md/COMMANDS.md are just simple prompts, maybe even some with context links.
And quite dangerous.
Skills can be MASSIVELY more efficient and powerful than MCP, if designed and used right.
Leela MOOLLM Demo Transcript: https://github.com/SimHacker/moollm/blob/main/designs/LEELA-...
2. Architecture: Skills as Knowledge Units
A skill is a modular unit of knowledge that an LLM can load, understand, and apply.
Skills self-describe their capabilities, advertise when to use them, and compose with other skills.
Why Skills, Not Just MCP Tool Calls?
MCP (Model Context Protocol) tool calls are powerful, but each call requires a full round-trip:
MCP Tool Call Overhead (per call):
┌─────────────────────────────────────────────────────────┐
│ 1. Tokenize prompt │
│ 2. LLM complete → generates tool call │
│ 3. Stop generation, universe destroyed │
│ 4. Async wait for tool execution │
│ 5. Tool returns result │
│ 6. New LLM complete call with result │
│ 7. Detokenize response │
└─────────────────────────────────────────────────────────┘
× N calls = N round-trips = latency, cost, context churn
Skills operate differently. Once loaded into context, skills can:
Iterate:
MCP: One call per iteration
Skills: Loop within single context
Recurse:
MCP: Stack of tool calls
Skills: Recursive reasoning in-context
Compose:
MCP: Chain of separate calls
Skills: Compose within single generation
Parallel characters:
MCP: Separate sessions
Skills: Multiple characters in one call
Replicate:
MCP: N calls for N instances
Skills: Grid of instances in one pass
I call this "speed of light" as opposed to "carrier pigeon". In my experiments I ran 33 game turns with 10 characters playing Fluxx — dialogue, game mechanics, emotional reactions — in a single context window and completion call. Try that with MCP and you're making hundreds of round-trips, each suffering from token quantization, noise, and cost. Skills can compose and iterate at the speed of light without any detokenization/tokenization cost and distortion, while MCP forces serialization and waiting for carrier pigeons.
speed-of-light skill: https://github.com/SimHacker/moollm/tree/main/skills/speed-o...
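As a back-of-the-envelope sketch of that difference, here is a toy cost model; all the constants are made-up placeholders, only the shape of the arithmetic matters:

# Toy cost model: overhead of N MCP round-trips vs one in-context pass.
# All constants are made-up placeholders for illustration.
TURNS, CHARACTERS = 33, 10
actions = TURNS * CHARACTERS                 # ~330 character actions in the Fluxx run

per_round_trip_overhead_tokens = 800         # re-sent context, tool schemas, serialized results

mcp_round_trips = actions                    # one tool call per action
mcp_overhead_tokens = mcp_round_trips * per_round_trip_overhead_tokens

skills_round_trips = 1                       # everything happens inside one completion
skills_overhead_tokens = per_round_trip_overhead_tokens

print(f"MCP:    {mcp_round_trips} round-trips, ~{mcp_overhead_tokens:,} overhead tokens")
print(f"Skills: {skills_round_trips} round-trip,  ~{skills_overhead_tokens:,} overhead tokens")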
Skills also compose. MOOLLM's cursor-mirror skill introspects Cursor's internals via a sister Python script that reads cursor's chat history and sqlite databases — tool calls, context assembly, thinking blocks, chat history. Everything, for all time, even after Cursor's chat has summarized and forgotten: it's still all there and searchable!
cursor-mirror skill: https://github.com/SimHacker/moollm/tree/main/skills/cursor-...
MOOLLM's skill-snitch skill composes with cursor-mirror for security monitoring of untrusted skills, also performance testing and optimization of trusted ones. Like Little Snitch watches your network, skill-snitch watches skill behavior — comparing declared tools and documentation against observed runtime behavior.
skill-snitch skill: https://github.com/SimHacker/moollm/tree/main/skills/skill-s...
You can even use skill-snitch like a virus scanner to review and monitor untrusted skills. I have more than 100 skills and had skill-snitch review each one including itself -- you can find them in the skill-snitch-report.md file of each skill in MOOLLM. Here is skill-snitch analyzing and reporting on itself, for example:
skill-snitch's skill-snitch-report.md: https://github.com/SimHacker/moollm/blob/main/skills/skill-s...
MOOLLM's thoughtful-commitment skill also composes with cursor-mirror to trace the reasoning behind git commits.
thoughtful-commit skill: https://github.com/SimHacker/moollm/tree/main/skills/thought...
MCP is still valuable for connecting to external systems. But for reasoning, simulation, and skills calling skills? In-context beats tool-call round-trips by orders of magnitude.
GLANCE.yml is the smallest, 5-70 lines. Just enough to answer "is this relevant?" You can inject all glances into every prompt because they're tiny. The LLM scans them like a table of contents.
CARD.yml is the interface layer, 50-200 lines. No implementation, just what the skill offers: capability advertisements, activation conditions, scoring, what it composes with. Think of it like The Sims "advertisement" system or CLOS generic function dispatch. The LLM sniffs this to decide whether to load the full SKILL.md implementation.
SKILL.md is the Anthropic-style skill file, 200-1000 lines. The actual instructions, the how. Only loaded when the skill is activated.
README.md is the largest, 500+ lines, and it's for humans. History, design rationale, examples. The LLM can dive in when developing the skill or when curious, but it's not burned on every invocation.
Reading rule: never load a lower level without first loading the level above. Start with GLANCE, sniff the CARD, load SKILL only if needed.
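Here is a sketch of that reading rule as a loader loop, using the file names and line budgets described above; the code itself is illustrative, not MOOLLM's implementation:

# Illustrative progressive-disclosure loader: GLANCE -> CARD -> SKILL.md,
# never loading a lower level without the one above.
from pathlib import Path

def load_relevant_skills(skills_root: str, is_relevant, should_activate) -> dict:
    loaded = {}
    for skill_dir in Path(skills_root).iterdir():
        glance = skill_dir / "GLANCE.yml"
        if not glance.exists():
            continue
        glance_text = glance.read_text()                   # 5-70 lines: cheap enough to always read
        if not is_relevant(glance_text):
            continue
        card_text = (skill_dir / "CARD.yml").read_text()   # 50-200 lines: interface only
        if not should_activate(card_text):
            continue
        loaded[skill_dir.name] = (skill_dir / "SKILL.md").read_text()  # full instructions
        # README.md (500+ lines) stays on disk; it's for humans, not for every invocation.
    return loaded

# is_relevant / should_activate stand in for the LLM's "sniffing" of glances and cards.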
Even more compact than concatenating all glances:
We also found that INDEX.md beats INDEX.yml for the skill catalog. YAML repeats the same keys for every entry; Markdown allows a narrative explanation of how skills relate and which clusters matter for which tasks, making it both more compact and more useful.
INDEX.yml: 711 lines, 2061 words, 17509 chars, ~4380 tokens, machine readable structure
https://github.com/SimHacker/moollm/blob/main/skills/INDEX.y...
INDEX.md: 124 lines, 1134 words, 9487 chars, ~2370 tokens, human readable prose
https://github.com/SimHacker/moollm/blob/main/skills/INDEX.m...
INDEX.md has 83% fewer lines, 45% fewer words, and 46% fewer chars for 121 skills. YAML repeats keys like id, tagline, and why for every entry; Markdown uses headers and prose, compresses better, and allows narrative grouping of related skills.
And it's simply more meaningful to both LLMs and humans, telling a coherent story instead of representing raw data!
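For what it's worth, the comparison is easy to reproduce with a few lines of Python; the paths are the repo's skills/INDEX.* files, and the ~4 chars per token ratio is a rough heuristic, not a real tokenizer:

# Sketch of how the line/word/char/token comparison above could be measured.
from pathlib import Path

def stats(path: str) -> str:
    text = Path(path).read_text()
    lines = text.count("\n") + 1
    words = len(text.split())
    chars = len(text)
    approx_tokens = chars // 4          # rough heuristic, not a real tokenizer
    return f"{path}: {lines} lines, {words} words, {chars} chars, ~{approx_tokens} tokens"

for catalog in ("skills/INDEX.yml", "skills/INDEX.md"):
    print(stats(catalog))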
The Semantic Image Pyramid:
https://github.com/SimHacker/moollm/blob/main/designs/LEELA-...
Same principle applies to code. A skill can wrap a sister-script that IS the documentation. A Python script with argparse defines the CLI once, readable by both humans (--help) and LLMs (sniff the top of the python file). No separate docs to maintain, no drift between what the code does and what the docs claim.
sister-script: https://github.com/SimHacker/moollm/blob/main/skills/sister-...
Sniffable-python structures code so the API is visible in the first 50 lines. Imports, constants, CLI structure up front. Implementation below the fold. The LLM can decide relevance and understand the API without reading the whole file. Single source of truth, progressive disclosure, don't repeat yourself.
sniffable-python README.md: https://github.com/SimHacker/moollm/blob/main/skills/sniffab...
sniffable-python SKILL.md: https://github.com/SimHacker/moollm/blob/main/skills/sniffab...
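Here is a small sketch of what such a sister-script top could look like; argparse is the single source of truth, so --help serves humans and the first lines of the file serve the LLM. The script name and flags are made up for illustration:

#!/usr/bin/env python3
"""extract_tables.py - hypothetical sister-script: the CLI definition below IS the docs."""
# --- sniffable header: imports, constants, CLI structure, all above the fold ---
import argparse

DEFAULT_FORMAT = "csv"
SUPPORTED_FORMATS = ("csv", "json", "markdown")

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--pages", default="all", help="page range, e.g. 1-5")
    parser.add_argument("--format", choices=SUPPORTED_FORMATS, default=DEFAULT_FORMAT,
                        help="output format (default: %(default)s)")
    return parser

# --- implementation below the fold: the LLM only reads this if it needs to ---
def main() -> None:
    args = build_parser().parse_args()
    print(f"extracting tables from {args.pdf} ({args.pages}) as {args.format}")

if __name__ == "__main__":
    main()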
We wrote a blog post on getting agents to write CUDA kernels and evaluating them: https://huggingface.co/blog/upskill
https://github.com/SimHacker/moollm/blob/main/designs/SPEED-...