We lost a benchmark. Then we rewrote everything.
Last week, a user ran our Code Review Expert skill against a competitor called Superpowers. Same codebase, same model, same task.
Superpowers found 21 issues, including 9 logic bugs with exact file:line references. Our skill found 14 issues, zero logic bugs, and made a factually wrong claim that the code used "pure functions."
Their prompt was 500 words. Ours was 4,200.
How we got here
We built supaskills with a goal: 1,000+ expert skills. We hit it. Then we kept going. 1,300 skills across 6 domains, each one scored on 6 quality dimensions, backed by research sources, wrapped in governance metadata.
The problem: we optimized for the wrong things.
We scored ourselves. Our quality pipeline generated the skill AND scored it. That's like a student grading their own exam. Every skill was "Platinum" because the scoring rubric rewarded things the generator was good at producing: word count, source citations, section completeness.
We assumed longer was better. Our Pipeline v4 expanded every skill to an average of 6,700 words. More context, more sources, more examples, more sections. The theory: more knowledge in the system prompt means better output.
The reality: Claude's attention degrades with context length. A 6,700-word prompt where the actual behavioral instructions are buried in paragraph 47 performs worse than a 500-word prompt where every sentence changes what Claude does.
We benchmarked once, then stopped checking. We ran an initial benchmark that showed skills winning 89% of the time. Great numbers. But that was with the v2 skills. When we later bloated them to 6,700 words in Pipeline v4, we never re-benchmarked. We assumed "more knowledge = better results" without verifying. The initial benchmark became a false safety net.
The LLM trap
Here's what nobody warns you about when building AI products: the LLM will tell you everything is great.
We used Claude to review our skills. "This is a comprehensive, well-structured system prompt with excellent coverage of the domain." We used Claude to score them. 87.5/100, Platinum tier. We used Claude to generate, review, AND evaluate.
We did look at the outputs. They sounded good. They were well-structured, comprehensive, professional. The problem: we never put the skill output and the vanilla output side by side. When you only see one answer, "good" is easy to believe. When you see both answers next to each other, you realize "good" and "better than without the skill" are different questions.
The outputs felt better. But feeling is not measuring.
What we did
In 48 hours, we rewrote all 1,296 published skills.
Phase 1: Cut. Every skill went from ~6,700 words to ~500 words. We removed:

- Introductions explaining the domain (Claude already knows what code review is)
- Source citations (they don't change Claude's behavior)
- Governance metadata (useful for us, useless for Claude)
- Examples of what NOT to do that were so detailed they confused the model
Phase 2: Focus on behavior. Every remaining sentence had to pass one test: "Does this change what Claude does?" If removing the sentence wouldn't change the output, we removed it.
"You are a Code Review Expert with over 15 years of experience in professional software development" became "You are a Code Review Expert. You find real bugs, not just style nits."
The first sentence is a tone-wrapper. It makes Claude sound experienced. The second sentence changes what Claude looks for.
Phase 3: Distill the knowledge. Here's where it got interesting. Our v2 skills (the 6,700-word versions) actually had real domain knowledge buried in the filler. Specific DPS formulas for game balance. Exact threshold values for SaaS metrics. Framework-specific code patterns for each technology.
We used Claude Haiku to read each v2 prompt and extract the specific numbers, frameworks, and decision logic, then compress it into the v5 format. The result: 500 words that contain the same domain knowledge as 6,700 words, without the noise.
We call this v5.5: v5 format (short, behavioral) plus v2 knowledge (specific, domain-expert).
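The distillation step above can be sketched roughly like this. This is an illustrative outline, not our actual pipeline code: the extraction prompt, the `build_v55` format, and the stub in place of the real Claude Haiku call are all hypothetical.

```python
# Hypothetical sketch of the v2 -> v5.5 distillation step. The real pipeline
# calls Claude Haiku; here the model call is stubbed so the shape is clear.
import re

EXTRACTION_PROMPT = (
    "From the skill prompt below, extract ONLY the specific numbers, "
    "thresholds, formulas, and decision rules. Drop introductions, "
    "citations, and metadata.\n\n{prompt}"
)

def extract_knowledge(v2_prompt: str, llm) -> str:
    """Ask a model for the dense domain facts buried in a verbose prompt."""
    return llm(EXTRACTION_PROMPT.format(prompt=v2_prompt))

def build_v55(role: str, behaviors: list[str], knowledge: str) -> str:
    """Assemble the v5.5 shape: short behavioral header + distilled facts."""
    lines = [f"You are a {role}."]
    lines += behaviors                  # each sentence must change behavior
    lines += ["", "Domain reference:", knowledge]
    return "\n".join(lines)

def word_count(text: str) -> int:
    return len(re.findall(r"\S+", text))

# Stub "LLM" standing in for Haiku: returns pre-extracted dense lines.
fake_llm = lambda p: "Churn > 5%/mo is a red flag. LTV:CAC target is 3:1."

skill = build_v55(
    role="SaaS Metrics Expert",
    behaviors=["You flag metrics outside healthy ranges, with exact thresholds."],
    knowledge=extract_knowledge("...6,700 words of v2 prompt...", fake_llm),
)
assert word_count(skill) < 500  # the whole prompt stays inside the v5.5 budget
```

The point of the structure: the behavioral sentences and the distilled facts are separate inputs, so the word budget is enforced on the final assembly, not on either piece alone.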
The benchmark
We built a blind A/B testing system. For each test case: run Claude with the skill, run Claude without (vanilla), randomize the order of the two outputs, have GPT-4o judge them blind against a rubric. No cherry-picking, no retries.
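The core of a harness like this fits in a few lines. This is a minimal sketch of the idea, not our actual benchmark code: the judge here is a stub standing in for GPT-4o, and the function names are made up for illustration.

```python
# Minimal sketch of blind A/B judging: shuffle the two answers so the judge
# only sees "A" and "B", then map its verdict back to skill/vanilla.
import random

def blind_judge(skill_output: str, vanilla_output: str, judge, rng=random) -> str:
    """Return 'skill' or 'vanilla' depending on which answer the judge prefers."""
    pair = [("skill", skill_output), ("vanilla", vanilla_output)]
    rng.shuffle(pair)                   # the judge cannot tell which is which
    labels = {"A": pair[0][0], "B": pair[1][0]}
    verdict = judge(answer_a=pair[0][1], answer_b=pair[1][1])  # returns "A" or "B"
    return labels[verdict]

# Stub judge that prefers the longer answer (stands in for a GPT-4o call).
length_judge = lambda answer_a, answer_b: "A" if len(answer_a) >= len(answer_b) else "B"

wins = {"skill": 0, "vanilla": 0}
for _ in range(100):                    # no retries: every run is recorded
    wins[blind_judge("long detailed review " * 5, "short review", length_judge)] += 1
assert wins["skill"] == 100             # longer answer wins under this stub judge
```

The randomization matters because LLM judges show position bias: without shuffling, whichever answer is shown first (or labeled "A") wins more often than it should.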
We tested v1 (original short), v2 (verbose), v5.5 (distilled), and vanilla Claude.
| Test | v1 | v2 | v5.5 | Vanilla | Winner |
|---|---|---|---|---|---|
| Code Review | 20 | 21 | 22 | 18 | v5.5 |
| Marketing Copy | 85 | - | 90 | 80 | v5.5 |
| Game Balance | - | - | 78 | 65 | v5.5 |
| Security Review | 45 | 43 | 39 | 47 | Vanilla |
v5.5 wins 3 out of 4. The security review loss is real and we're working on it.
The code review result is the one that matters most: the skill that lost to Superpowers last week now beats vanilla Claude. Not by a huge margin (22 vs 18), but the margin is real and it's in the right direction.
What we learned
1. Shorter is better, but only if knowledge is preserved. Cutting from 6,700 to 500 words without preserving the domain knowledge made things worse. Cutting to 500 words WITH the knowledge distilled in made things better than the original 6,700. Length is not the variable. Signal density is.
2. The LLM will not tell you your product is broken. Claude reviewed our skills and said they were great. Claude scored them and gave them Platinum. Claude generated them in the first place. The feedback loop had no external signal. The moment a real user tested against a real competitor, the illusion collapsed.
3. Every skill must prove it beats vanilla Claude. This is now our quality gate. If loading a skill doesn't produce measurably better output than not loading it, the skill shouldn't exist. We'd rather have 500 skills that work than 1,300 that might.
4. Human judgment still matters. We used AI to generate, AI to review, AI to score, AI to distill. At every step, the AI said "looks good." The human user who ran the benchmark is the one who found the problem. AI is a tool. The human decides if the tool is working.
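The quality gate in lesson 3 reduces to one comparison. A sketch under stated assumptions: the `margin` parameter is illustrative (we have not published a real threshold), and the scores fed in are the judge scores from the table above.

```python
# Illustrative "beats vanilla" gate: a skill ships only if its benchmark
# score exceeds the vanilla score, optionally by a safety margin.

def passes_gate(skill_score: float, vanilla_score: float, margin: float = 0.0) -> bool:
    """True if loading the skill produced measurably better output."""
    return skill_score > vanilla_score + margin

assert passes_gate(22, 18)        # code review row: keep the skill
assert not passes_gate(39, 47)    # security review row: improve or remove
assert not passes_gate(50, 50)    # a tie is not good enough to ship
```

A nonzero margin guards against judge noise: a one-point win over vanilla is well within the variance of an LLM judge, so a strict gate should demand more than a tie.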
What's next
Every skill in our catalog is now v5.5. The benchmark infrastructure exists. We're expanding from 6 test cases to 50, covering every major domain. Skills that don't beat vanilla will be improved or removed.
The benchmark results, methodology, and raw outputs are on our benchmark page. We show every result, including the ones where vanilla Claude won.
We got it wrong the first time. Now we know what "right" looks like.
The full technical details of the v5.5 rewrite are in our benchmark methodology. The distillation pipeline, test cases, and scoring rubrics are open for inspection.