Two weeks ago, we published results from a 10-skill pilot. The takeaway: the framework matters more than the model. Loading expert skills before generation gave 5x more improvement than swapping to a stronger model.
That was 10 skills. This is what happens when you run it on all 1,078.
The Problem
Our v1 skills were good. Average score: 84.1, all above the 80-point quality gate. But "good" isn't the bar when you're selling expert-level AI skills. We knew the prompts could be sharper, the structure tighter, the domain coverage deeper.
The pilot showed us the path: rebuild every skill using our own platform. Let SupaSkills improve SupaSkills.
The Pipeline
Every skill went through 7 steps:
- Load the existing v1 skill, its sources, and its score
- Select 3 helper skills: 2 universal (Prompt Engineering Strategist + Technical Writing Expert) plus 1 category-specific expert
- Rebuild the system prompt with Claude Sonnet 4.5, informed by all 3 helpers
- Score with Claude (Model A)
- Score with GPT-4o (Model B)
- Council review with Gemini 2.0 Flash, independent adjustment of up to +/-2 points
- Publish v2 if the new score meets quality gate (>=80) AND beats the v1 score
That last condition is key. If the rebuild scored lower than v1, we kept v1. No regressions allowed.
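In code, that publish gate is a one-liner. A minimal sketch (the function name is ours, not the production pipeline's):

```python
QUALITY_GATE = 80.0

def should_publish_v2(v1_score: float, v2_score: float) -> bool:
    """Publish the rebuilt skill only if it clears the quality gate
    AND strictly beats the existing v1 score (no regressions)."""
    return v2_score >= QUALITY_GATE and v2_score > v1_score
```

A rebuild that lands at 83.9 against a v1 of 84.1 passes the gate but fails the regression check, so v1 ships.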
Helper Skill Selection
Each of the 12 categories has a designated domain expert:
| Category | Helper Skill |
|---|---|
| Software Engineering | Code Review Expert |
| DevOps & Infrastructure | CI/CD Pipeline Designer |
| Security | Security Code Reviewer |
| AI & Machine Learning | ML Model Evaluation Expert |
| Data & Analytics | Data Pipeline Architect |
| Design & UX | Design System Architect |
| Product & Strategy | Product Requirements Architect |
| Marketing & Growth | Growth Experiment Designer |
| Finance & Business | Competitive Analysis Strategist |
| Copywriting | B2B Content Strategist |
| Communication | Crisis Communication Manager |
| Legal & Compliance | Data Privacy Compliance Advisor |
One rule: a skill can never help rebuild itself. If the Prompt Engineering Strategist is being rebuilt, a substitute takes its place.
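Helper selection, including the self-exclusion rule, boils down to something like this. A sketch, not the real code: the substitute mapping and function names are assumptions, and only two category entries are shown.

```python
# Two universal helpers are always loaded; the third comes from the
# category table above.
UNIVERSAL_HELPERS = ["Prompt Engineering Strategist", "Technical Writing Expert"]

CATEGORY_EXPERTS = {
    "Software Engineering": "Code Review Expert",
    "Security": "Security Code Reviewer",
    # ...remaining categories as listed in the table above
}

def select_helpers(skill_name: str, category: str, substitutes: dict) -> list:
    """Pick 2 universal helpers plus 1 category expert.
    A skill never helps rebuild itself: if it appears in its own
    helper list, a designated substitute takes its place."""
    helpers = UNIVERSAL_HELPERS + [CATEGORY_EXPERTS[category]]
    return [substitutes[h] if h == skill_name else h for h in helpers]
```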
The Numbers
6 Runs, 143 Hours, 1,070 Rebuilt
We couldn't run all 1,078 in one shot. API credit limits, connection timeouts, and rate limits forced us into 6 separate runs over 10 days:
| Run | Skills Published | Runtime | How It Ended |
|---|---|---|---|
| 1 | 152 | ~19h | Anthropic credit exhaustion |
| 2 | 197 | ~19h | Anthropic credit exhaustion |
| 3 | 240 | ~32h | Anthropic credit exhaustion |
| 4 | 393 | ~63h | Connection errors at tail |
| 5 | 30 | ~5h | Anthropic credit exhaustion |
| 6 | 47 | ~5h | Clean finish, 0 errors |
Total: 1,070 skills rebuilt to v2. The remaining 8 stayed at v1 as holdouts (their v2 rebuilds scored lower). The script auto-skips any skill already at v2+, so reruns were safe.
Average processing time: ~5 minutes per skill (~320 seconds).
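The rerun safety comes from an idempotent filter at the top of the script. Sketched here with assumed field names:

```python
def skills_to_process(catalog: list) -> list:
    """Reruns are safe: skip anything already rebuilt (version >= 2),
    so an interrupted run can simply be restarted."""
    return [skill for skill in catalog if skill["version"] < 2]
```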
Before and After
| Metric | v1 | v2 | Change |
|---|---|---|---|
| Average score | 84.1 | 88.3 | +4.2 |
| Score range | 80.0 – 89.8 | 83.8 – 91.9 | Raised floor and ceiling |
| Platinum tier (85+) | ~35% | 97% | +62 percentage points |
| Gold tier (70-84) | ~65% | 3% | Nearly eliminated |
| Below Gold | 0% | 0% | Still zero |
The average improvement was +3.9 points for rebuilt skills. The highest v2 score: 91.93. The lowest: 83.78, which lands in Gold tier but still comfortably clears the quality gate.
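For reference, the tier cutoffs in the table map to a trivial classifier (a sketch using the thresholds above):

```python
def tier(score: float) -> str:
    """Map a skill score to its quality tier: Platinum (85+),
    Gold (70-84), Below Gold (<70)."""
    if score >= 85:
        return "Platinum"
    if score >= 70:
        return "Gold"
    return "Below Gold"
```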
The 8 Holdouts
8 skills scored lower on v2 than v1. The pipeline correctly kept them at v1. These weren't failures: they're skills where the v1 prompt was already tight, and the rebuild either over-generalized or added length without improving substance.
Their average v1 score: 84.1. Still above our quality gate, still shipping.
What Changed in the Prompts
The rebuilt prompts are structurally different. Here's what the helper skills consistently improved:
1. Sharper Role Definitions
v1 prompts opened with generic role descriptions. v2 prompts open with specific expertise boundaries: what the skill covers, what it doesn't, and when to defer to other skills.
2. Structured Output Formats
The Technical Writing Expert helper pushed every skill toward consistent output structure: clear sections, numbered steps where appropriate, decision frameworks with concrete criteria.
3. Domain-Specific Guardrails
Category helpers added field-specific guardrails. Legal skills now flag jurisdiction dependencies. Finance skills now include confidence intervals. Security skills now distinguish between informational guidance and actionable hardening steps.
4. Better Edge Case Handling
v1 prompts handled the happy path well. v2 prompts explicitly address what to do when inputs are ambiguous, incomplete, or contradictory, because that's what happens in real-world usage.
The Scoring System
Every skill is scored across 6 dimensions by 3 independent models:
| Dimension | Weight | What It Measures |
|---|---|---|
| Research Quality | 15% | Accuracy of domain knowledge, source alignment |
| Prompt Engineering | 25% | Structure, clarity, instruction precision |
| Practical Utility | 15% | Actionability of outputs, real-world applicability |
| Completeness | 10% | Coverage of topic scope, edge cases |
| User Satisfaction | 20% | Output readability, tone, user experience |
| Decision Usefulness | 15% | Whether outputs support actual decisions |
The final score comes from compute_supa_score(): it blends the Claude and GPT-4o dimension scores 50/50, applies the weights above, then adds the Gemini council adjustment to produce the authoritative number.
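compute_supa_score() itself is server-side and not public, but the description above implies something like this sketch (the +/-2 clamp mirrors the council rule; dimension key names are assumptions):

```python
# Dimension weights from the scoring table (sum to 1.0).
WEIGHTS = {
    "research_quality": 0.15,
    "prompt_engineering": 0.25,
    "practical_utility": 0.15,
    "completeness": 0.10,
    "user_satisfaction": 0.20,
    "decision_usefulness": 0.15,
}

def compute_supa_score(claude: dict, gpt4o: dict, council_adjustment: float) -> float:
    """Blend Claude and GPT-4o scores 50/50 per dimension, apply the
    weights, then add the Gemini council adjustment clamped to +/-2."""
    blended = sum(
        WEIGHTS[dim] * (claude[dim] + gpt4o[dim]) / 2
        for dim in WEIGHTS
    )
    return blended + max(-2.0, min(2.0, council_adjustment))
```

With Claude at 90 and GPT-4o at 82 on every dimension, the blend is 86; a -0.5 council adjustment yields 85.5.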
Model Bias: Claude Scores Itself Higher
One pattern was consistent across all 1,070 rebuilds: Claude scored its own outputs 5-13 points higher than GPT-4o did. Average gap: ~8 points.
This is why multi-model scoring exists. A single model scoring its own work produces inflated numbers. The 50/50 blend and Gemini council check keep scores honest.
Gemini consistently applied small negative adjustments (-0.5 to -1.0), acting as a stabilizer rather than an amplifier.
Benchmark: v1 vs v2 in Production
Numbers on paper are one thing. We ran a head-to-head benchmark on 3 real-world tasks:
| Task | v1 Output | v2 Output | Winner |
|---|---|---|---|
| REST API security audit | Covered 4/7 OWASP categories | Covered 7/7 with remediation steps | v2 |
| SaaS pricing strategy | Generic framework | Market-specific with competitor analysis | v2 |
| Employment contract review | Flagged 3 risk areas | Flagged 6 risk areas with jurisdiction notes | v2 |
v2 won all 3, with an average improvement of +9.4% on task-specific rubrics. The gains aren't subtle. v2 outputs are measurably more complete, more specific, and more actionable.
What It Cost
Let's be transparent about the economics:
| Resource | Cost |
|---|---|
| Claude Sonnet 4.5 (rebuild + score) | ~$380 |
| GPT-4o (cross-score) | ~$85 |
| Gemini 2.0 Flash (council) | ~$12 |
| Total | ~$477 |
For 1,070 skill rebuilds, that's roughly $0.45 per skill. The 10-skill pilot cost $18.50 ($1.85/skill), so the per-unit cost dropped 75% at scale due to shorter prompts on simpler skills and batch efficiency.
143 hours of compute time. 10 days wall-clock. One engineer monitoring.
What We Learned
1. Self-improvement works at scale.
Using your own product to improve your own product isn't just a nice story. It's measurably effective. The helper skills encode domain expertise that a raw model doesn't have. At 1,070 skills, the pattern held consistently.
2. Multi-model scoring catches what single-model doesn't.
If we'd only used Claude to score Claude's output, our average score would be ~92. The actual average is 88.3. That ~4-point gap is self-evaluation bias, and it would have shipped as a false quality signal.
3. Regression protection matters.
8 skills didn't improve. The pipeline caught all 8 and preserved their v1 versions. Without that safety net, we'd have 8 degraded skills in production.
4. The bottleneck is API credits, not quality.
We were interrupted 5 times, 4 of them by credit exhaustion. The pipeline itself ran cleanly; Run 6 finished with zero errors. The limiting factor at scale isn't the system, it's the billing.
What's Next
The entire catalog is now at v2. 97% Platinum. But we're not done:
- Model-aware routing: Route rebuilds to different models based on domain (insight from the pilot)
- Continuous scoring: Re-evaluate skills quarterly as models improve
- User signal integration: Incorporate load counts and user ratings into the scoring loop
Every skill is available through our MCP connector, REST API, and the skill catalog. The scores are real, the methodology is documented, and the data backs every number.
Raw Data
| Metric | Value |
|---|---|
| Skills rebuilt | 1,070 / 1,078 (99.3%) |
| Holdouts (kept v1) | 8 |
| Pipeline runs | 6 |
| Total runtime | ~143 hours |
| Avg time per skill | ~5 minutes |
| v1 average score | 84.1 |
| v2 average score | 88.3 |
| Average improvement | +3.9 points |
| Score range (v2) | 83.78 – 91.93 |
| Platinum tier (85+) | 97% |
| Gold tier (70-84) | 3% |
| Scoring models | Claude Sonnet 4.5, GPT-4o, Gemini 2.0 Flash |
| Total cost | ~$477 |
| Cost per skill | ~$0.45 |
Every SupaSkills score is computed by compute_supa_score(), a server-side function, not a self-reported number. Learn how the scoring works.