Two weeks ago, we published results from a 10-skill pilot. The takeaway: the framework matters more than the model. Loading expert skills before generation gave 5x more improvement than swapping to a stronger model.
That was 10 skills. This is what happens when you run it on all 1,078.
The Problem
Our v1 skills were good. Average score: 84.1, all above the 80-point quality gate. But "good" isn't the bar when you're selling expert-level AI skills. We knew the prompts could be sharper, the structure tighter, the domain coverage deeper.
The pilot showed us the path: rebuild every skill using our own platform. Let SupaSkills improve SupaSkills.
The Pipeline
Every skill went through 7 steps:
- Load the existing v1 skill, its sources, and its score
- Select 3 helper skills: 2 universal (Prompt Engineering Strategist + Technical Writing Expert) plus 1 category-specific expert
- Rebuild the system prompt with Claude Sonnet 4.5, informed by all 3 helpers
- Score with Claude (Model A)
- Score with GPT-4o (Model B)
- Council review with Gemini 2.0 Flash, independent adjustment of up to +/-2 points
- Publish v2 if the new score meets quality gate (>=80) AND beats the v1 score
That last condition is key. If the rebuild scored lower than v1, we kept v1. No regressions allowed.
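In code, that publish gate is a one-liner. A minimal sketch (the function name is ours, not the production pipeline's):

```python
QUALITY_GATE = 80.0

def should_publish_v2(v1_score: float, v2_score: float) -> bool:
    """Publish the rebuilt skill only if it clears the quality gate
    AND strictly beats the existing v1 score (no regressions)."""
    return v2_score >= QUALITY_GATE and v2_score > v1_score
```

A rebuild that lands at 83.9 against a v1 of 84.1 passes the gate but fails the regression check, so v1 ships.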
Helper Skill Selection
Each of the 12 categories has a designated domain expert:
| Category | Helper Skill |
|---|---|
| Software Engineering | Code Review Expert |
| DevOps & Infrastructure | CI/CD Pipeline Designer |
| Security | Security Code Reviewer |
| AI & Machine Learning | ML Model Evaluation Expert |
| Data & Analytics | Data Pipeline Architect |
| Design & UX | Design System Architect |
| Product & Strategy | Product Requirements Architect |
| Marketing & Growth | Growth Experiment Designer |
| Finance & Business | Competitive Analysis Strategist |
| Copywriting | B2B Content Strategist |
| Communication | Crisis Communication Manager |
| Legal & Compliance | Data Privacy Compliance Advisor |
One rule: a skill can never help rebuild itself. If the Prompt Engineering Strategist is being rebuilt, a substitute takes its place.
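Helper selection, including the self-exclusion rule, boils down to something like this. A sketch, not the real code: the substitute mapping and function names are assumptions, and only two category entries are shown.

```python
# Two universal helpers are always loaded; the third comes from the
# category table above.
UNIVERSAL_HELPERS = ["Prompt Engineering Strategist", "Technical Writing Expert"]

CATEGORY_EXPERTS = {
    "Software Engineering": "Code Review Expert",
    "Security": "Security Code Reviewer",
    # ...remaining categories as listed in the table above
}

def select_helpers(skill_name: str, category: str, substitutes: dict) -> list:
    """Pick 2 universal helpers plus 1 category expert.
    A skill never helps rebuild itself: if it appears in its own
    helper list, a designated substitute takes its place."""
    helpers = UNIVERSAL_HELPERS + [CATEGORY_EXPERTS[category]]
    return [substitutes[h] if h == skill_name else h for h in helpers]
```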
The Numbers
6 Runs, 143 Hours, 1,070 Rebuilt
We couldn't run all 1,078 in one shot. API credit limits, connection timeouts, and rate limits forced us into 6 separate runs over 10 days:
| Run | Skills Published | Runtime | How It Ended |
|---|---|---|---|
| 1 | 152 | ~19h | Anthropic credit exhaustion |
| 2 | 197 | ~19h | Anthropic credit exhaustion |
| 3 | 240 | ~32h | Anthropic credit exhaustion |
| 4 | 393 | ~63h | Connection errors at tail |
| 5 | 30 | ~5h | Anthropic credit exhaustion |
| 6 | 47 | ~5h | Clean finish, 0 errors |
Total: 1,070 skills rebuilt to v2. The remaining 8 stayed at v1 as holdouts (their v2 rebuilds scored lower). The script auto-skips any skill already at v2+, so reruns were safe.
Average processing time: ~5 minutes per skill (~320 seconds).
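The rerun safety comes from an idempotent filter at the top of the script. Sketched here with assumed field names:

```python
def skills_to_process(catalog: list) -> list:
    """Reruns are safe: skip anything already rebuilt (version >= 2),
    so an interrupted run can simply be restarted."""
    return [skill for skill in catalog if skill["version"] < 2]
```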
Before and After
| Metric | v1 | v2 | Change |
|---|---|---|---|
| Average score | 84.1 | 88.3 | +4.2 |
| Score range | 80.0 – 89.8 | 83.8 – 91.9 | Raised floor and ceiling |
| Platinum tier (85+) | ~35% | 97% | +62 percentage points |
| Gold tier (70-84) | ~65% | 3% | Nearly eliminated |
| Below Gold | 0% | 0% | Still zero |
The average improvement was +3.9 points for rebuilt skills. The highest v2 score: 91.93. The lowest: 83.78, which lands in Gold tier but still comfortably clears the quality gate.
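For reference, the tier cutoffs in the table map to a trivial classifier (a sketch using the thresholds above):

```python
def tier(score: float) -> str:
    """Map a skill score to its quality tier: Platinum (85+),
    Gold (70-84), Below Gold (<70)."""
    if score >= 85:
        return "Platinum"
    if score >= 70:
        return "Gold"
    return "Below Gold"
```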
The 8 Holdouts
8 skills scored lower on v2 than v1. The pipeline correctly kept them at v1. These weren't failures: they're skills where the v1 prompt was already tight, and the rebuild either over-generalized or added length without improving substance.
Their average v1 score: 84.1. Still above our quality gate, still shipping.
What Changed in the Prompts
The rebuilt prompts are structurally different. Here's what the helper skills consistently improved:
1. Sharper Role Definitions
v1 prompts opened with generic role descriptions. v2 prompts open with specific expertise boundaries: what the skill covers, what it doesn't, and when to defer to other skills.
2. Structured Output Formats
The Technical Writing Expert helper pushed every skill toward consistent output structure: clear sections, numbered steps where appropriate, decision frameworks with concrete criteria.
3. Domain-Specific Guardrails
Category helpers added field-specific guardrails. Legal skills now flag jurisdiction dependencies. Finance skills now include confidence intervals. Security skills now distinguish between informational guidance and actionable hardening steps.
4. Better Edge Case Handling
v1 prompts handled the happy path well. v2 prompts explicitly address what to do when inputs are ambiguous, incomplete, or contradictory, because that's what happens in real-world usage.
The Scoring System
Every skill is scored across 6 dimensions by 3 independent models:
| Dimension | Weight | What It Measures |
|---|---|---|
| Research Quality | 15% | Accuracy of domain knowledge, source alignment |
| Prompt Engineering | 25% | Structure, clarity, instruction precision |
| Practical Utility | 15% | Actionability of outputs, real-world applicability |
| Completeness | 10% | Coverage of topic scope, edge cases |
| User Satisfaction | 20% | Output readability, tone, user experience |
| Decision Usefulness | 15% | Whether outputs support actual decisions |
The final score comes from compute_supa_score(): it blends the Claude and GPT-4o dimension scores 50/50, applies the weights above, then adds the Gemini council adjustment to produce the authoritative number.
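compute_supa_score() itself is server-side and not public, but the description above implies something like this sketch (the +/-2 clamp mirrors the council rule; dimension key names are assumptions):

```python
# Dimension weights from the scoring table (sum to 1.0).
WEIGHTS = {
    "research_quality": 0.15,
    "prompt_engineering": 0.25,
    "practical_utility": 0.15,
    "completeness": 0.10,
    "user_satisfaction": 0.20,
    "decision_usefulness": 0.15,
}

def compute_supa_score(claude: dict, gpt4o: dict, council_adjustment: float) -> float:
    """Blend Claude and GPT-4o scores 50/50 per dimension, apply the
    weights, then add the Gemini council adjustment clamped to +/-2."""
    blended = sum(
        WEIGHTS[dim] * (claude[dim] + gpt4o[dim]) / 2
        for dim in WEIGHTS
    )
    return blended + max(-2.0, min(2.0, council_adjustment))
```

With Claude at 90 and GPT-4o at 82 on every dimension, the blend is 86; a -0.5 council adjustment yields 85.5.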
Model Bias: Claude Scores Itself Higher
One pattern was consistent across all 1,070 rebuilds: Claude scored its own outputs 5-13 points higher than GPT-4o did. Average gap: ~8 points.
This is why multi-model scoring exists. A single model scoring its own work produces inflated numbers. The 50/50 blend and Gemini council check keep scores honest.
Gemini consistently applied small negative adjustments (-0.5 to -1.0), acting as a stabilizer rather than an amplifier.
Benchmark: v1 vs v2 in Production
Numbers on paper are one thing. We ran a head-to-head benchmark on 3 real-world tasks:
| Task | v1 Output | v2 Output | Winner |
|---|---|---|---|
| REST API security audit | Covered 4/7 OWASP categories | Covered 7/7 with remediation steps | v2 |
| SaaS pricing strategy | Generic framework | Market-specific with competitor analysis | v2 |
| Employment contract review | Flagged 3 risk areas | Flagged 6 risk areas with jurisdiction notes | v2 |
v2 won all 3, with an average improvement of +9.4% on task-specific rubrics. The gains aren't subtle. v2 outputs are measurably more complete, more specific, and more actionable.
What It Cost
Let's be transparent about the economics:
| Resource | Cost |
|---|---|
| Claude Sonnet 4.5 (rebuild + score) | ~$380 |
| GPT-4o (cross-score) | ~$85 |
| Gemini 2.0 Flash (council) | ~$12 |
| Total | ~$477 |
For 1,070 skill rebuilds, that's roughly $0.45 per skill. The 10-skill pilot cost $18.50 ($1.85/skill), so the per-unit cost dropped 75% at scale due to shorter prompts on simpler skills and batch efficiency.
143 hours of compute time. 10 days wall-clock. One engineer monitoring.
What We Learned
1. Self-improvement works at scale.
Using your own product to improve your own product isn't just a nice story. It's measurably effective. The helper skills encode domain expertise that a raw model doesn't have. At 1,070 skills, the pattern held consistently.
2. Multi-model scoring catches what single-model doesn't.
If we'd only used Claude to score Claude's output, our average score would be ~92. The actual average is 88.3. That ~4-point gap is self-evaluation bias, and it would have shipped as a false quality signal.
3. Regression protection matters.
8 skills didn't improve. The pipeline caught all 8 and preserved their v1 versions. Without that safety net, we'd have 8 degraded skills in production.
4. The bottleneck is API credits, not quality.
We were interrupted 5 times, 4 of them by credit exhaustion. The pipeline itself ran cleanly; Run 6 finished with zero errors. The limiting factor at scale isn't the system, it's the billing.
What's Next
The entire catalog is now at v2. 97% Platinum. But we're not done:
- Model-aware routing: Route rebuilds to different models based on domain (insight from the pilot)
- Continuous scoring: Re-evaluate skills quarterly as models improve
- User signal integration: Incorporate load counts and user ratings into the scoring loop
Every skill is available through our MCP connector, REST API, and the skill catalog. The scores are real, the methodology is documented, and the data backs every number.
Raw Data
| Metric | Value |
|---|---|
| Skills rebuilt | 1,070 / 1,078 (99.3%) |
| Holdouts (kept v1) | 8 |
| Pipeline runs | 6 |
| Total runtime | ~143 hours |
| Avg time per skill | ~5 minutes |
| v1 average score | 84.1 |
| v2 average score | 88.3 |
| Average improvement | +3.9 points |
| Score range (v2) | 83.78 – 91.93 |
| Platinum tier (85+) | 97% |
| Gold tier (70-84) | 3% |
| Scoring models | Claude Sonnet 4.5, GPT-4o, Gemini 2.0 Flash |
| Total cost | ~$477 |
| Cost per skill | ~$0.45 |
Every SupaSkills score is computed by compute_supa_score(), a server-side function, not a self-reported number. Learn how the scoring works.