We lost a benchmark. Then we rewrote everything.
Last week, a user ran our Code Review Expert skill against a competitor called Superpowers. Same codebase, same model, same task.
Superpowers found 21 issues, including 9 logic bugs with exact file:line references. Our skill found 14 issues, zero logic bugs, and made a factually wrong claim that the code used "pure functions."
Their prompt was 500 words. Ours was 4,200.
How we got here
We built supaskills with a goal: 1,000+ expert skills. We hit it. Then we kept going. 1,300 skills across 6 domains, each one scored on 6 quality dimensions, backed by research sources, wrapped in governance metadata.
The problem: we optimized for the wrong things.
We scored ourselves. Our quality pipeline generated the skill AND scored it. That's like a student grading their own exam. Every skill was "Platinum" because the scoring rubric rewarded things the generator was good at producing: word count, source citations, section completeness.
We assumed longer was better. Our Pipeline v4 expanded every skill to an average of 6,700 words. More context, more sources, more examples, more sections. The theory: more knowledge in the system prompt means better output.
The reality: Claude's attention degrades with context length. A 6,700-word prompt where the actual behavioral instructions are buried in paragraph 47 performs worse than a 500-word prompt where every sentence changes what Claude does.
We benchmarked once, then stopped checking. We ran an initial benchmark that showed skills winning 89% of the time. Great numbers. But that was with the v2 skills. When we later bloated them to 6,700 words in Pipeline v4, we never re-benchmarked. We assumed "more knowledge = better results" without verifying. The initial benchmark became a false safety net.
The LLM trap
Here's what nobody warns you about when building AI products: the LLM will tell you everything is great.
We used Claude to review our skills. "This is a comprehensive, well-structured system prompt with excellent coverage of the domain." We used Claude to score them. 87.5/100, Platinum tier. We used Claude to generate, review, AND evaluate.
We did look at the outputs. They sounded good. They were well-structured, comprehensive, professional. The problem: we never put the skill output and the vanilla output side by side. When you only see one answer, "good" is easy to believe. When you see both answers next to each other, you realize "good" and "better than without the skill" are different questions.
The outputs felt better. But feeling is not measuring.
What we did
In 48 hours, we rewrote all 1,296 published skills.
Phase 1: Cut. Every skill went from ~6,700 words to ~500 words. We removed:

- Introductions explaining the domain (Claude already knows what code review is)
- Source citations (they don't change Claude's behavior)
- Governance metadata (useful for us, useless for Claude)
- Examples of what NOT to do that were so detailed they confused the model
Phase 2: Focus on behavior. Every remaining sentence had to pass one test: "Does this change what Claude does?" If removing the sentence wouldn't change the output, we removed it.
"You are a Code Review Expert with over 15 years of experience in professional software development" became "You are a Code Review Expert. You find real bugs, not just style nits."
The first sentence is a tone-wrapper. It makes Claude sound experienced. The second sentence changes what Claude looks for.
Phase 3: Distill the knowledge. Here's where it got interesting. Our v2 skills (the 6,700-word versions) actually had real domain knowledge buried in the filler. Specific DPS formulas for game balance. Exact threshold values for SaaS metrics. Framework-specific code patterns for each technology.
We used Claude Haiku to read each v2 prompt and extract the specific numbers, frameworks, and decision logic, then compress it into the v5 format. The result: 500 words that contain the same domain knowledge as 6,700 words, without the noise.
We call this v5.5: v5 format (short, behavioral) plus v2 knowledge (specific, domain-expert).
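The distillation step above can be sketched roughly like this. This is an illustrative outline, not our actual pipeline code: the extraction prompt, the `build_v55` format, and the stub in place of the real Claude Haiku call are all hypothetical.

```python
# Hypothetical sketch of the v2 -> v5.5 distillation step. The real pipeline
# calls Claude Haiku; here the model call is stubbed so the shape is clear.
import re

EXTRACTION_PROMPT = (
    "From the skill prompt below, extract ONLY the specific numbers, "
    "thresholds, formulas, and decision rules. Drop introductions, "
    "citations, and metadata.\n\n{prompt}"
)

def extract_knowledge(v2_prompt: str, llm) -> str:
    """Ask a model for the dense domain facts buried in a verbose prompt."""
    return llm(EXTRACTION_PROMPT.format(prompt=v2_prompt))

def build_v55(role: str, behaviors: list[str], knowledge: str) -> str:
    """Assemble the v5.5 shape: short behavioral header + distilled facts."""
    lines = [f"You are a {role}."]
    lines += behaviors                  # each sentence must change behavior
    lines += ["", "Domain reference:", knowledge]
    return "\n".join(lines)

def word_count(text: str) -> int:
    return len(re.findall(r"\S+", text))

# Stub "LLM" standing in for Haiku: returns pre-extracted dense lines.
fake_llm = lambda p: "Churn > 5%/mo is a red flag. LTV:CAC target is 3:1."

skill = build_v55(
    role="SaaS Metrics Expert",
    behaviors=["You flag metrics outside healthy ranges, with exact thresholds."],
    knowledge=extract_knowledge("...6,700 words of v2 prompt...", fake_llm),
)
assert word_count(skill) < 500  # the whole prompt stays inside the v5.5 budget
```

The point of the structure: the behavioral sentences and the distilled facts are separate inputs, so the word budget is enforced on the final assembly, not on either piece alone.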
The benchmark
We built a blind A/B testing system. For each test case: run Claude with the skill, run Claude without (vanilla), randomize the order of the two outputs, have GPT-4o judge them blind against a rubric. No cherry-picking, no retries.
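The core of a harness like this fits in a few lines. This is a minimal sketch of the idea, not our actual benchmark code: the judge here is a stub standing in for GPT-4o, and the function names are made up for illustration.

```python
# Minimal sketch of blind A/B judging: shuffle the two answers so the judge
# only sees "A" and "B", then map its verdict back to skill/vanilla.
import random

def blind_judge(skill_output: str, vanilla_output: str, judge, rng=random) -> str:
    """Return 'skill' or 'vanilla' depending on which answer the judge prefers."""
    pair = [("skill", skill_output), ("vanilla", vanilla_output)]
    rng.shuffle(pair)                   # the judge cannot tell which is which
    labels = {"A": pair[0][0], "B": pair[1][0]}
    verdict = judge(answer_a=pair[0][1], answer_b=pair[1][1])  # returns "A" or "B"
    return labels[verdict]

# Stub judge that prefers the longer answer (stands in for a GPT-4o call).
length_judge = lambda answer_a, answer_b: "A" if len(answer_a) >= len(answer_b) else "B"

wins = {"skill": 0, "vanilla": 0}
for _ in range(100):                    # no retries: every run is recorded
    wins[blind_judge("long detailed review " * 5, "short review", length_judge)] += 1
assert wins["skill"] == 100             # longer answer wins under this stub judge
```

The randomization matters because LLM judges show position bias: without shuffling, whichever answer is shown first (or labeled "A") wins more often than it should.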
We tested v1 (original short), v2 (verbose), v5.5 (distilled), and vanilla Claude.
| Test | v1 | v2 | v5.5 | Vanilla | Winner |
|---|---|---|---|---|---|
| Code Review | 20 | 21 | 22 | 18 | v5.5 |
| Marketing Copy | 85 | - | 90 | 80 | v5.5 |
| Game Balance | - | - | 78 | 65 | v5.5 |
| Security Review | 45 | 43 | 39 | 47 | Vanilla |
v5.5 wins 3 out of 4. The security review loss is real and we're working on it.
The code review result is the one that matters most: the skill that lost to Superpowers last week now beats vanilla Claude. Not by a huge margin (22 vs 18), but the margin is real and it's in the right direction.
What we learned
1. Shorter is better, but only if knowledge is preserved. Cutting from 6,700 to 500 words without preserving the domain knowledge made things worse. Cutting to 500 words WITH the knowledge distilled in made things better than the original 6,700. Length is not the variable. Signal density is.
2. The LLM will not tell you your product is broken. Claude reviewed our skills and said they were great. Claude scored them and gave them Platinum. Claude generated them in the first place. The feedback loop had no external signal. The moment a real user tested against a real competitor, the illusion collapsed.
3. Every skill must prove it beats vanilla Claude. This is now our quality gate. If loading a skill doesn't produce measurably better output than not loading it, the skill shouldn't exist. We'd rather have 500 skills that work than 1,300 that might.
4. Human judgment still matters. We used AI to generate, AI to review, AI to score, AI to distill. At every step, the AI said "looks good." The human user who ran the benchmark is the one who found the problem. AI is a tool. The human decides if the tool is working.
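The quality gate in lesson 3 reduces to one comparison. A sketch under stated assumptions: the `margin` parameter is illustrative (we have not published a real threshold), and the scores fed in are the judge scores from the table above.

```python
# Illustrative "beats vanilla" gate: a skill ships only if its benchmark
# score exceeds the vanilla score, optionally by a safety margin.

def passes_gate(skill_score: float, vanilla_score: float, margin: float = 0.0) -> bool:
    """True if loading the skill produced measurably better output."""
    return skill_score > vanilla_score + margin

assert passes_gate(22, 18)        # code review row: keep the skill
assert not passes_gate(39, 47)    # security review row: improve or remove
assert not passes_gate(50, 50)    # a tie is not good enough to ship
```

A nonzero margin guards against judge noise: a one-point win over vanilla is well within the variance of an LLM judge, so a strict gate should demand more than a tie.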
What's next
Every skill in our catalog is now v5.5. The benchmark infrastructure exists. We're expanding from 6 test cases to 50, covering every major domain. Skills that don't beat vanilla will be improved or removed.
The benchmark results, methodology, and raw outputs are on our benchmark page. We show every result, including the ones where vanilla Claude won.
We got it wrong the first time. Now we know what "right" looks like.
The full technical details of the v5.5 rewrite are in our benchmark methodology. The distillation pipeline, test cases, and scoring rubrics are open for inspection.