Performance · Pipeline · Quality · Benchmark

We Rebuilt All 1,078 Skills. Here's What 143 Hours of AI Told Us.

Max Jürschik·March 10, 2026·8 min read

Two weeks ago, we published results from a 10-skill pilot. The takeaway: the framework matters more than the model. Loading expert skills before generation gave 5x more improvement than swapping to a stronger model.

That was 10 skills. This is what happens when you run it on all 1,078.


The Problem

Our v1 skills were good. Average score: 84.1, all above the 80-point quality gate. But "good" isn't the bar when you're selling expert-level AI skills. We knew the prompts could be sharper, the structure tighter, the domain coverage deeper.

The pilot showed us the path: rebuild every skill using our own platform. Let SupaSkills improve SupaSkills.


The Pipeline

Every skill went through 7 steps:

  1. Load the existing v1 skill, its sources, and its score
  2. Select 3 helper skills: 2 universal (Prompt Engineering Strategist + Technical Writing Expert) plus 1 category-specific expert
  3. Rebuild the system prompt with Claude Sonnet 4.5, informed by all 3 helpers
  4. Score with Claude (Model A)
  5. Score with GPT-4o (Model B)
  6. Council review with Gemini 2.0 Flash, independent adjustment of up to +/-2 points
  7. Publish v2 if the new score meets quality gate (>=80) AND beats the v1 score

That last condition is key. If the rebuild scored lower than v1, we kept v1. No regressions allowed.
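
The publish decision in steps 4–7 reduces to a small predicate. A minimal sketch, with a hypothetical function name (the production pipeline's internals aren't public):

```python
QUALITY_GATE = 80.0  # minimum score to ship

def should_publish_v2(v1_score: float, v2_score: float) -> bool:
    """Publish v2 only if it clears the quality gate AND beats v1.

    If the rebuild scores lower than v1, the v1 prompt is kept:
    no regressions allowed.
    """
    return v2_score >= QUALITY_GATE and v2_score > v1_score
```

Note that both conditions can fail independently: a rebuild scoring 83.8 against a v1 of 84.1 is rejected even though it comfortably clears the gate.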

Helper Skill Selection

Each of the 12 categories has a designated domain expert:

| Category | Helper Skill |
| --- | --- |
| Software Engineering | Code Review Expert |
| DevOps & Infrastructure | CI/CD Pipeline Designer |
| Security | Security Code Reviewer |
| AI & Machine Learning | ML Model Evaluation Expert |
| Data & Analytics | Data Pipeline Architect |
| Design & UX | Design System Architect |
| Product & Strategy | Product Requirements Architect |
| Marketing & Growth | Growth Experiment Designer |
| Finance & Business | Competitive Analysis Strategist |
| Copywriting | B2B Content Strategist |
| Communication | Crisis Communication Manager |
| Legal & Compliance | Data Privacy Compliance Advisor |

One rule: a skill can never help rebuild itself. If the Prompt Engineering Strategist is being rebuilt, a substitute takes its place.
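
The selection rule can be sketched in a few lines. This is an illustration, not the production code: the substitute skill name is hypothetical, and only a few rows of the category table are shown.

```python
UNIVERSAL = ["Prompt Engineering Strategist", "Technical Writing Expert"]

CATEGORY_EXPERT = {
    "Software Engineering": "Code Review Expert",
    "Security": "Security Code Reviewer",
    "Legal & Compliance": "Data Privacy Compliance Advisor",
    # ...remaining categories follow the table above
}

def select_helpers(skill: str, category: str,
                   substitute: str = "Clear Communication Coach") -> list[str]:
    """2 universal helpers + 1 category expert; a skill never rebuilds itself."""
    helpers = UNIVERSAL + [CATEGORY_EXPERT[category]]
    # Swap in the substitute if the skill being rebuilt is one of its own helpers.
    return [substitute if h == skill else h for h in helpers]
```

The swap applies to the category expert too: rebuilding the Code Review Expert in Software Engineering triggers the same substitution.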


The Numbers

6 Runs, 143 Hours, 1,070 Rebuilt

We couldn't run all 1,078 in one shot. API credit limits, connection timeouts, and rate limits forced us into 6 separate runs over 10 days:

| Run | Skills Published | Runtime | How It Ended |
| --- | --- | --- | --- |
| 1 | 152 | ~19h | Anthropic credit exhaustion |
| 2 | 197 | ~19h | Anthropic credit exhaustion |
| 3 | 240 | ~32h | Anthropic credit exhaustion |
| 4 | 393 | ~63h | Connection errors at tail |
| 5 | 30 | ~5h | Anthropic credit exhaustion |
| 6 | 47 | ~5h | Clean finish, 0 errors |

Total: 1,070 skills rebuilt to v2. The remaining 8 skills stayed at v1 as holdouts (their v2 rebuilds scored lower). The script auto-skips any skill already at v2+, so reruns were safe.
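
The auto-skip that makes reruns safe is essentially a version filter. A sketch under assumed field names (`version` is hypothetical):

```python
def pending_skills(catalog: list[dict]) -> list[dict]:
    """Reruns are idempotent: anything already at v2 or later is skipped,
    so a crashed run can simply be relaunched."""
    return [s for s in catalog if s.get("version", 1) < 2]
```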

Average processing time: ~320 seconds per skill (about 5.3 minutes).

Before and After

| Metric | v1 | v2 | Change |
| --- | --- | --- | --- |
| Average score | 84.1 | 88.3 | +4.2 |
| Score range | 80.0 – 89.8 | 83.8 – 91.9 | Raised floor and ceiling |
| Platinum tier (85+) | ~35% | 97% | +62 percentage points |
| Gold tier (70–84) | ~65% | 3% | Nearly eliminated |
| Below Gold | 0% | 0% | Still zero |

The average improvement was +3.9 points for rebuilt skills. The highest v2 score: 91.93. The lowest: 83.78 (still comfortably Platinum).

The 8 Holdouts

Eight skills scored lower on v2 than v1. The pipeline correctly kept them at v1. These weren't failures; they're skills where the v1 prompt was already tight, and the rebuild either over-generalized or added length without improving substance.

Their average v1 score: 84.1. Still above our quality gate, still shipping.


What Changed in the Prompts

The rebuilt prompts are structurally different. Here's what the helper skills consistently improved:

1. Sharper Role Definitions

v1 prompts opened with generic role descriptions. v2 prompts open with specific expertise boundaries: what the skill covers, what it doesn't, and when to defer to other skills.

2. Structured Output Formats

The Technical Writing Expert helper pushed every skill toward consistent output structure: clear sections, numbered steps where appropriate, decision frameworks with concrete criteria.

3. Domain-Specific Guardrails

Category helpers added field-specific guardrails. Legal skills now flag jurisdiction dependencies. Finance skills now include confidence intervals. Security skills now distinguish between informational guidance and actionable hardening steps.

4. Better Edge Case Handling

v1 prompts handled the happy path well. v2 prompts explicitly address what to do when inputs are ambiguous, incomplete, or contradictory, because that's what happens in real-world usage.


The Scoring System

Every skill is scored across 6 dimensions by 3 independent models:

| Dimension | Weight | What It Measures |
| --- | --- | --- |
| Research Quality | 15% | Accuracy of domain knowledge, source alignment |
| Prompt Engineering | 25% | Structure, clarity, instruction precision |
| Practical Utility | 15% | Actionability of outputs, real-world applicability |
| Completeness | 10% | Coverage of topic scope, edge cases |
| User Satisfaction | 20% | Output readability, tone, user experience |
| Decision Usefulness | 15% | Whether outputs support actual decisions |

The final score blends Claude and GPT-4o dimension scores 50/50, then applies the Gemini council adjustment. After that, compute_supa_score() produces the authoritative number.
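
Given the weights above, the blend can be sketched as follows. This illustrates the described 50/50 blend plus the capped council adjustment; it is not the actual compute_supa_score() implementation:

```python
WEIGHTS = {
    "research_quality": 0.15,
    "prompt_engineering": 0.25,
    "practical_utility": 0.15,
    "completeness": 0.10,
    "user_satisfaction": 0.20,
    "decision_usefulness": 0.15,
}

def blend_score(claude: dict, gpt4o: dict, council_adj: float = 0.0) -> float:
    """50/50 per-dimension blend of the two scorers, then the Gemini
    council adjustment, capped at +/-2 points."""
    blended = sum(w * (claude[d] + gpt4o[d]) / 2 for d, w in WEIGHTS.items())
    return round(blended + max(-2.0, min(2.0, council_adj)), 2)
```

With Claude at 92 and GPT-4o at 84 on every dimension and a -0.5 council adjustment, this yields 87.5: exactly the kind of scorer gap the bias numbers describe.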

Model Bias: Claude Scores Itself Higher

One pattern was consistent across all 1,070 rebuilds: Claude scored its own outputs 5-13 points higher than GPT-4o did. Average gap: ~8 points.

This is why multi-model scoring exists. A single model scoring its own work produces inflated numbers. The 50/50 blend and Gemini council check keep scores honest.

Gemini consistently applied small negative adjustments (-0.5 to -1.0), acting as a stabilizer rather than an amplifier.


Benchmark: v1 vs v2 in Production

Numbers on paper are one thing. We ran a head-to-head benchmark on 3 real-world tasks:

| Task | v1 Output | v2 Output | Winner |
| --- | --- | --- | --- |
| REST API security audit | Covered 4/7 OWASP categories | Covered 7/7 with remediation steps | v2 |
| SaaS pricing strategy | Generic framework | Market-specific with competitor analysis | v2 |
| Employment contract review | Flagged 3 risk areas | Flagged 6 risk areas with jurisdiction notes | v2 |

v2 won all 3, with an average improvement of +9.4% on task-specific rubrics. The gains aren't subtle. v2 outputs are measurably more complete, more specific, and more actionable.


What It Cost

Let's be transparent about the economics:

| Resource | Cost |
| --- | --- |
| Claude Sonnet 4.5 (rebuild + score) | ~$380 |
| GPT-4o (cross-score) | ~$85 |
| Gemini 2.0 Flash (council) | ~$12 |
| Total | ~$477 |

For 1,070 skill rebuilds, that's roughly $0.45 per skill. The 10-skill pilot cost $18.50 ($1.85/skill), so the per-unit cost dropped 75% at scale due to shorter prompts on simpler skills and batch efficiency.

143 hours of compute time. 10 days wall-clock. One engineer monitoring.


What We Learned

1. Self-improvement works at scale.

Using your own product to improve your own product isn't just a nice story. It's measurably effective. The helper skills encode domain expertise that a raw model doesn't have. At 1,070 skills, the pattern held consistently.

2. Multi-model scoring catches what single-model doesn't.

If we'd only used Claude to score Claude's output, our average score would be ~92. The actual average is 88.3. That 4-point gap is self-evaluation bias, and it would have shipped as false quality signal.

3. Regression protection matters.

Eight skills didn't improve. The pipeline caught all 8 and preserved their v1 versions. Without that safety net, we'd have 8 degraded skills in production.

4. The bottleneck is API credits, not quality.

We stopped 5 times due to credit exhaustion. The pipeline itself ran cleanly. Run 6 finished with zero errors. The limiting factor at scale isn't the system, it's the billing.
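
The run-and-resume pattern behind the 6-run table can be sketched as a driver loop. Names and error handling here are hypothetical; the real pipeline's error types aren't public:

```python
def run_until_blocked(pending, rebuild_one):
    """Process pending skills until an API failure (credit exhaustion,
    rate limit, connection error) ends the run. Relaunching resumes
    safely because already-published v2 skills are filtered out first."""
    published = 0
    for skill in pending:
        try:
            rebuild_one(skill)
            published += 1
        except RuntimeError as exc:  # stand-in for provider errors
            print(f"run ended after {published} skills: {exc}")
            break
    return published
```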


What's Next

The entire catalog is now at v2. 97% Platinum. But we're not done:

  • Model-aware routing: Route rebuilds to different models based on domain (insight from the pilot)
  • Continuous scoring: Re-evaluate skills quarterly as models improve
  • User signal integration: Incorporate load counts and user ratings into the scoring loop

Every skill is available through our MCP connector, REST API, and the skill catalog. The scores are real, the methodology is documented, and the data backs every number.


Raw Data

| Metric | Value |
| --- | --- |
| Skills rebuilt | 1,070 / 1,078 (99.3%) |
| Holdouts (kept v1) | 8 |
| Pipeline runs | 6 |
| Total runtime | ~143 hours |
| Avg time per skill | ~5.3 minutes |
| v1 average score | 84.1 |
| v2 average score | 88.3 |
| Average improvement | +3.9 points |
| Score range (v2) | 83.78 – 91.93 |
| Platinum tier | 97% (1,043 skills) |
| Gold tier | 3% (35 skills) |
| Scoring models | Claude Sonnet 4.5, GPT-4o, Gemini 2.0 Flash |
| Total cost | ~$477 |
| Cost per skill | ~$0.45 |

Every SupaSkills score is computed by compute_supa_score(), a server-side function, not a self-reported number. Learn how the scoring works.