I'm very curious to know at what size and state of codebase skills become more beneficial than simply having a good information hierarchy in your documentation.
In other words, if you run an identical prompt twice, once with a skill and once without, on a test task that requires deep discovery of how your codebase works, which run performs better on the following metrics, and by how much?
1. Accuracy / completion of the task
2. Wall clock time to execute the task
3. Token consumption of the task
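For what it's worth, here is a minimal sketch of how I'd tabulate that comparison, assuming you've already collected several runs of each variant yourself. The names `RunResult`, `summarize`, and `compare` are hypothetical, made up for illustration, and nothing here calls a real agent or API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One execution of the test task (hypothetical record shape)."""
    passed: bool          # did the run complete the task correctly?
    wall_seconds: float   # wall clock time for the run
    tokens: int           # total tokens consumed by the run


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Average the three metrics over repeated runs of one variant."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "accuracy": mean(1.0 if r.passed else 0.0 for r in runs),
        "wall_seconds": mean(r.wall_seconds for r in runs),
        "tokens": mean(r.tokens for r in runs),
    }


def compare(with_skill: list[RunResult],
            without_skill: list[RunResult]) -> dict[str, float]:
    """Return (with skill) minus (without skill) for each metric."""
    a = summarize(with_skill)
    b = summarize(without_skill)
    return {metric: a[metric] - b[metric] for metric in a}
```

A positive `accuracy` delta and negative `wall_seconds` and `tokens` deltas would favor the skill. Since agent runs are noisy, a single pair of runs probably isn't enough; you'd want several runs per variant before reading anything into the deltas.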
I think the main conflict in this thread is whether skills are anything more than a way of structuring documentation your repo was lacking, regardless of whether the reader is Claude or Steve starting from scratch.
That difference alone likely accounts for some non-trivial discrepancies, but without numbers it's hard to say.