SupaSkills runs a quality-scored skill pipeline. Every skill passes through a 6-dimension scoring system, cross-validated by multiple AI models. We've rebuilt over 550 of our 1,078 skills this way.
Last week, we asked a simple question: does the builder model matter?
We took 10 skills across 5 domains. Rebuilt each one with 3 new approaches. Scored everything with the same triple-model system. Here's what happened.
## The Setup
Our current pipeline (v2) uses Claude Sonnet as the builder, enhanced by loading relevant SupaSkills during generation. The AI literally uses our own skills to build better skills.
For this test, we added three new builder configurations:
| Version | Builder | Method |
|---|---|---|
| v1 | Claude Sonnet | Raw, no skill assistance |
| v2 | Claude Sonnet + SupaSkills | Current pipeline |
| v3a | Gemini 3.1 Pro + SupaSkills | Google's latest reasoning model |
| v3b | Claude Opus 4.6 + SupaSkills | Anthropic's most capable model |
| v3c | Gemini 3.1 Pro → Opus 4.6 | Tag-team: Gemini drafts, Opus refines |
Same 10 skills. Same helper skills loaded. Same scoring rubric. Same triple-model evaluation (Claude Sonnet 40%, GPT-4o 40%, Gemini 20%).
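The triple-model evaluation is a plain weighted blend. A minimal sketch, assuming the stated 40/40/20 weights (the function name and signature are illustrative, not the actual pipeline code):

```python
def blended_score(claude: float, gpt4o: float, gemini: float) -> float:
    """Blend the three evaluators' scores with the stated weights.

    Illustrative sketch only; the real pipeline's interface is not published.
    """
    return 0.40 * claude + 0.40 * gpt4o + 0.20 * gemini

# If all three evaluators agree, the blend is just that score.
print(blended_score(90.0, 90.0, 90.0))  # prints 90.0
```

Because the Gemini evaluator carries half the weight of the other two, a single dissenting score moves the blend less than a disagreement between Claude and GPT-4o.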
Total cost: $18.50.
## The Results
| Skill | Domain | v1 | v2 | v3a | v3b | v3c |
|---|---|---|---|---|---|---|
| Code Review Expert | Engineering | 81.5 | 90.2 | 91.4 | 86.6 | 90.9 |
| Logging Strategy Designer | Engineering | 80.8 | 88.7 | 86.0 | 90.8 | 90.5 |
| Competitive Analysis Strategist | Business | 81.0 | 88.7 | 88.6 | 89.1 | 90.8 |
| Cap Table Scenario Modeler | Business | 83.5 | 89.6 | 90.7 | 91.5 | 90.2 |
| Cookie Policy Writer | Legal | 83.0 | 90.0 | 90.5 | 90.0 | 90.0 |
| Digital Markets Act Guide | Legal | 80.9 | 88.5 | 90.5 | 87.9 | 92.5 |
| B2C Marketing Copywriting | Content | 80.0 | 89.5 | 89.3 | 89.8 | 89.3 |
| Voice UI Designer | Content | 80.0 | 88.9 | 86.9 | 88.1 | 92.4 |
| Graph Database Architect | Technology | 83.3 | 89.7 | 92.8 | 90.5 | 91.1 |
| Feature Store Engineer | Technology | 83.5 | 89.9 | 91.5 | 91.1 | 90.5 |
| **Average** | | 81.8 | 89.4 | 89.8 | 89.5 | 90.8 |
Best version per skill: v3a won 4, v3b won 3, v3c won 3. v1 and v2 won 0.
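The winner tally can be reproduced directly from the table above (scores transcribed into a dict for the sketch):

```python
from collections import Counter

# Per-skill scores from the results table, in (v1, v2, v3a, v3b, v3c) order.
scores = {
    "Code Review Expert":              (81.5, 90.2, 91.4, 86.6, 90.9),
    "Logging Strategy Designer":       (80.8, 88.7, 86.0, 90.8, 90.5),
    "Competitive Analysis Strategist": (81.0, 88.7, 88.6, 89.1, 90.8),
    "Cap Table Scenario Modeler":      (83.5, 89.6, 90.7, 91.5, 90.2),
    "Cookie Policy Writer":            (83.0, 90.0, 90.5, 90.0, 90.0),
    "Digital Markets Act Guide":       (80.9, 88.5, 90.5, 87.9, 92.5),
    "B2C Marketing Copywriting":       (80.0, 89.5, 89.3, 89.8, 89.3),
    "Voice UI Designer":               (80.0, 88.9, 86.9, 88.1, 92.4),
    "Graph Database Architect":        (83.3, 89.7, 92.8, 90.5, 91.1),
    "Feature Store Engineer":          (83.5, 89.9, 91.5, 91.1, 90.5),
}
versions = ("v1", "v2", "v3a", "v3b", "v3c")

# For each skill, pick the version with the highest score, then tally wins.
wins = Counter(versions[row.index(max(row))] for row in scores.values())
print(wins)  # tally: v3a 4, v3b 3, v3c 3
```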
## Three Things We Learned
**1. The framework matters more than the model.**
The biggest jump was v1 → v2: +7.6 points average. That's Claude Sonnet without SupaSkills vs. Claude Sonnet with SupaSkills.
The model swap (v2 → best v3): +1.4 points average.
Loading the right expert skills before generation delivered roughly five times the improvement of switching to a more expensive model. The scaffolding (what the AI knows before it starts writing) dominates output quality.
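The arithmetic, straight from the average row of the results table:

```python
# Averages from the results table; v3c (90.8) is the best v3 configuration.
v1_avg, v2_avg, best_v3_avg = 81.8, 89.4, 90.8

framework_effect = v2_avg - v1_avg    # same model, SupaSkills loaded
model_effect = best_v3_avg - v2_avg   # better model, same scaffolding

print(f"framework: +{framework_effect:.1f}, model: +{model_effect:.1f}, "
      f"ratio: {framework_effect / model_effect:.1f}x")
```

The exact ratio is about 5.4x, which the prose rounds to five times.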
**2. No single model wins everywhere.**
Gemini 3.1 Pro excelled on technical skills (Graph Database Architect: 92.8, its highest single score). Opus 4.6 was strongest on business and finance skills (Cap Table: 91.5). The tag-team approach dominated on regulatory and creative skills (Digital Markets Act: 92.5, Voice UI: 92.4).
Different models have different strengths. A quality framework that works across all of them is more valuable than betting on one.
**3. The tag-team is interesting, but not obviously better.**
v3c (Gemini drafts, Opus refines) had the highest average (90.8) but didn't win the most individual matchups. It smoothed out weaknesses — no score below 89.3 — but rarely hit the peaks that single-model approaches reached. It's the most consistent, not the most impressive.
The real question: is +1.4 points worth 2x the API cost? For a quality-focused platform, probably. For most use cases, a single strong model with the right context gets you 95% of the way.
## Under the Hood: Dimension Breakdown
We score every skill across 6 weighted dimensions:
| Dimension | Weight | v3a (Gemini) | v3b (Opus) | v3c (Tag-team) |
|---|---|---|---|---|
| Research Quality | 15% | 8.91 | 9.13 | 8.95 |
| Prompt Engineering | 25% | 9.03 | 8.90 | 9.13 |
| Practical Utility | 15% | 8.67 | 8.52 | 8.66 |
| Completeness | 10% | 8.97 | 9.21 | 9.14 |
| User Satisfaction | 20% | 8.82 | 8.71 | 8.88 |
| Decision Usefulness | 15% | 8.75 | 8.69 | 8.85 |
Opus produces the most thoroughly researched, complete outputs — it writes more, covers more edge cases, misses fewer details. Gemini produces more practically useful outputs — tighter, more action-oriented, easier to apply. The tag-team combines both: Gemini's practical structure refined by Opus's thoroughness.
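The overall number behind each column is just a weighted sum of the six dimension scores. A minimal sketch (dict keys are my own naming, not the actual schema), fed with the v3a column:

```python
# Dimension weights from the table above; keys are illustrative names.
WEIGHTS = {
    "research_quality":    0.15,
    "prompt_engineering":  0.25,
    "practical_utility":   0.15,
    "completeness":        0.10,
    "user_satisfaction":   0.20,
    "decision_usefulness": 0.15,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must cover 100%

def composite(dims: dict) -> float:
    """Weighted sum of the six dimension scores (each on a 0-10 scale)."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

v3a = {"research_quality": 8.91, "prompt_engineering": 9.03,
       "practical_utility": 8.67, "completeness": 8.97,
       "user_satisfaction": 8.82, "decision_usefulness": 8.75}
print(round(composite(v3a), 2))  # prints 8.87
```

Note the composite here is on the 0-10 dimension scale; the leaderboard numbers are per-skill scores on a 0-100 scale, so the two won't match exactly.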
## What This Means for SupaSkills
We're not switching our entire pipeline to a single new model. The data says that's the wrong move.
What we are doing:
**Model-aware routing.** Technical skills go to Gemini 3.1 Pro. Business and finance skills go to Opus 4.6. Regulatory and creative skills get the tag-team treatment. The pipeline picks the best builder based on domain.
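A hypothetical sketch of that routing. The builder identifiers and the exact domain-to-builder mapping are illustrative, not our actual pipeline config:

```python
# Domain -> builder mapping per the plan above; tuples are tag-teams (drafter, refiner).
# Assumption: "technical" covers both the Technology and Engineering domains.
ROUTES = {
    "technology":  "gemini-3.1-pro",
    "engineering": "gemini-3.1-pro",
    "business":    "claude-opus-4.6",
    "legal":       ("gemini-3.1-pro", "claude-opus-4.6"),
    "content":     ("gemini-3.1-pro", "claude-opus-4.6"),
}

def pick_builder(domain: str):
    # Unknown domains fall back to the current v2 builder.
    return ROUTES.get(domain.lower(), "claude-sonnet")
```

For example, `pick_builder("Legal")` returns the tag-team pair, while an unrecognized domain falls back to the v2 default.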
**SupaSkills stays model-agnostic.** Our scoring system evaluates output quality regardless of which model produced it. That's the point: we're a quality layer, not a model wrapper.
**The v2 pipeline keeps running.** 550 skills rebuilt and counting. The v3 insights will inform the next generation, not interrupt the current one.
## Try It Yourself
All 1,078 skills are available through the SupaSkills API and as a Claude MCP connector. The scoring framework behind this experiment powers every skill in our catalog.
```shell
# Claude Desktop / Claude Code
clawhub install supaskills

# Or via REST API (quote the URL so the shell doesn't interpret ? and &)
curl "https://www.supaskills.ai/api/v1/skills?q=code+review" \
  -H "Authorization: Bearer sk_supa_YOUR_KEY"
```
Browse the catalog at supaskills.ai/skills.
## Raw Data
The complete results including per-skill breakdowns, dimension scores, and prompt lengths are available in our pilot results JSON.
| Metric | Value |
|---|---|
| Skills tested | 10 (2 per domain) |
| Versions compared | 5 (v1, v2, v3a, v3b, v3c) |
| Scoring runs | 90 (10 skills × 3 new versions × 3 scoring models) |
| Scoring models | Claude Sonnet 4.5, GPT-4o, Gemini 3.1 Pro |
| Total cost | $18.50 |
| v1 → v2 improvement | +7.6 points (framework effect) |
| v2 → best v3 improvement | +1.4 points (model effect) |
SupaSkills scores every AI skill across 6 dimensions. No hype, no self-reported benchmarks. Just measured quality. Learn more about our scoring methodology.