SupaSkills runs a quality-scored skill pipeline. Every skill passes through a 6-dimension scoring system, cross-validated by multiple AI models. We've rebuilt over 550 of our 1,078 skills this way.
Last week, we asked a simple question: does the builder model matter?
We took 10 skills across 5 domains. Rebuilt each one with 3 new approaches. Scored everything with the same triple-model system. Here's what happened.
## The Setup
Our current pipeline (v2) uses Claude Sonnet as the builder, enhanced by loading relevant SupaSkills during generation. The AI literally uses our own skills to build better skills.
For this test, we added three new builder configurations:
| Version | Builder | Method |
|---|---|---|
| v1 | Claude Sonnet | Raw, no skill assistance |
| v2 | Claude Sonnet + SupaSkills | Current pipeline |
| v3a | Gemini 3.1 Pro + SupaSkills | Google's latest reasoning model |
| v3b | Claude Opus 4.6 + SupaSkills | Anthropic's most capable model |
| v3c | Gemini 3.1 Pro → Opus 4.6 | Tag-team: Gemini drafts, Opus refines |
Same 10 skills. Same helper skills loaded. Same scoring rubric. Same triple-model evaluation (Claude Sonnet 40%, GPT-4o 40%, Gemini 20%).
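The triple-model evaluation is a plain weighted blend. A minimal sketch, assuming the stated 40/40/20 weights (the function name and signature are illustrative, not the actual pipeline code):

```python
def blended_score(claude: float, gpt4o: float, gemini: float) -> float:
    """Blend the three evaluators' scores with the stated weights.

    Illustrative sketch only; the real pipeline's interface is not published.
    """
    return 0.40 * claude + 0.40 * gpt4o + 0.20 * gemini

# If all three evaluators agree, the blend is just that score.
print(blended_score(90.0, 90.0, 90.0))  # prints 90.0
```

Because the Gemini evaluator carries half the weight of the other two, a single dissenting score moves the blend less than a disagreement between Claude and GPT-4o.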
Total cost: $18.50.
## The Results
| Skill | Domain | v1 | v2 | v3a | v3b | v3c |
|---|---|---|---|---|---|---|
| Code Review Expert | Engineering | 81.5 | 90.2 | 91.4 | 86.6 | 90.9 |
| Logging Strategy Designer | Engineering | 80.8 | 88.7 | 86.0 | 90.8 | 90.5 |
| Competitive Analysis Strategist | Business | 81.0 | 88.7 | 88.6 | 89.1 | 90.8 |
| Cap Table Scenario Modeler | Business | 83.5 | 89.6 | 90.7 | 91.5 | 90.2 |
| Cookie Policy Writer | Legal | 83.0 | 90.0 | 90.5 | 90.0 | 90.0 |
| Digital Markets Act Guide | Legal | 80.9 | 88.5 | 90.5 | 87.9 | 92.5 |
| B2C Marketing Copywriting | Content | 80.0 | 89.5 | 89.3 | 89.8 | 89.3 |
| Voice UI Designer | Content | 80.0 | 88.9 | 86.9 | 88.1 | 92.4 |
| Graph Database Architect | Technology | 83.3 | 89.7 | 92.8 | 90.5 | 91.1 |
| Feature Store Engineer | Technology | 83.5 | 89.9 | 91.5 | 91.1 | 90.5 |
| **Average** | | 81.8 | 89.4 | 89.8 | 89.5 | 90.8 |
Best version per skill: v3a won 4, v3b won 3, v3c won 3. v1 and v2 won 0.
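The winner tally can be reproduced directly from the table above (scores transcribed into a dict for the sketch):

```python
from collections import Counter

# Per-skill scores from the results table, in (v1, v2, v3a, v3b, v3c) order.
scores = {
    "Code Review Expert":              (81.5, 90.2, 91.4, 86.6, 90.9),
    "Logging Strategy Designer":       (80.8, 88.7, 86.0, 90.8, 90.5),
    "Competitive Analysis Strategist": (81.0, 88.7, 88.6, 89.1, 90.8),
    "Cap Table Scenario Modeler":      (83.5, 89.6, 90.7, 91.5, 90.2),
    "Cookie Policy Writer":            (83.0, 90.0, 90.5, 90.0, 90.0),
    "Digital Markets Act Guide":       (80.9, 88.5, 90.5, 87.9, 92.5),
    "B2C Marketing Copywriting":       (80.0, 89.5, 89.3, 89.8, 89.3),
    "Voice UI Designer":               (80.0, 88.9, 86.9, 88.1, 92.4),
    "Graph Database Architect":        (83.3, 89.7, 92.8, 90.5, 91.1),
    "Feature Store Engineer":          (83.5, 89.9, 91.5, 91.1, 90.5),
}
versions = ("v1", "v2", "v3a", "v3b", "v3c")

# For each skill, pick the version with the highest score, then tally wins.
wins = Counter(versions[row.index(max(row))] for row in scores.values())
print(wins)  # tally: v3a 4, v3b 3, v3c 3
```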
## Three Things We Learned
**1. The framework matters more than the model.**
The biggest jump was v1 → v2: +7.6 points average. That's Claude Sonnet without SupaSkills vs. Claude Sonnet with SupaSkills.
The model swap (v2 → best v3): +1.4 points average.
Loading the right expert skills before generation delivered roughly five times the improvement of switching to a more expensive model. The scaffolding (what the AI knows before it starts writing) dominates output quality.
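The arithmetic, straight from the average row of the results table:

```python
# Averages from the results table; v3c (90.8) is the best v3 configuration.
v1_avg, v2_avg, best_v3_avg = 81.8, 89.4, 90.8

framework_effect = v2_avg - v1_avg    # same model, SupaSkills loaded
model_effect = best_v3_avg - v2_avg   # better model, same scaffolding

print(f"framework: +{framework_effect:.1f}, model: +{model_effect:.1f}, "
      f"ratio: {framework_effect / model_effect:.1f}x")
```

The exact ratio is about 5.4x, which the prose rounds to five times.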
**2. No single model wins everywhere.**
Gemini 3.1 Pro excelled on technical skills (Graph Database Architect: 92.8, its highest single score). Opus 4.6 was strongest on business and finance skills (Cap Table: 91.5). The tag-team approach dominated on regulatory and creative skills (Digital Markets Act: 92.5, Voice UI: 92.4).
Different models have different strengths. A quality framework that works across all of them is more valuable than betting on one.
**3. The tag-team is interesting, but not obviously better.**
v3c (Gemini drafts, Opus refines) had the highest average (90.8) but didn't win the most individual matchups. It smoothed out weaknesses — no score below 89.3 — but rarely hit the peaks that single-model approaches reached. It's the most consistent, not the most impressive.
The real question: is +1.4 points worth 2x the API cost? For a quality-focused platform, probably. For most use cases, a single strong model with the right context gets you 95% of the way.
## Under the Hood: Dimension Breakdown
We score every skill across 6 weighted dimensions:
| Dimension | Weight | v3a (Gemini) | v3b (Opus) | v3c (Tag-team) |
|---|---|---|---|---|
| Research Quality | 15% | 8.91 | 9.13 | 8.95 |
| Prompt Engineering | 25% | 9.03 | 8.90 | 9.13 |
| Practical Utility | 15% | 8.67 | 8.52 | 8.66 |
| Completeness | 10% | 8.97 | 9.21 | 9.14 |
| User Satisfaction | 20% | 8.82 | 8.71 | 8.88 |
| Decision Usefulness | 15% | 8.75 | 8.69 | 8.85 |
Opus produces the most thoroughly researched, complete outputs — it writes more, covers more edge cases, misses fewer details. Gemini produces more practically useful outputs — tighter, more action-oriented, easier to apply. The tag-team combines both: Gemini's practical structure refined by Opus's thoroughness.
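The overall number behind each column is just a weighted sum of the six dimension scores. A minimal sketch (dict keys are my own naming, not the actual schema), fed with the v3a column:

```python
# Dimension weights from the table above; keys are illustrative names.
WEIGHTS = {
    "research_quality":    0.15,
    "prompt_engineering":  0.25,
    "practical_utility":   0.15,
    "completeness":        0.10,
    "user_satisfaction":   0.20,
    "decision_usefulness": 0.15,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must cover 100%

def composite(dims: dict) -> float:
    """Weighted sum of the six dimension scores (each on a 0-10 scale)."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

v3a = {"research_quality": 8.91, "prompt_engineering": 9.03,
       "practical_utility": 8.67, "completeness": 8.97,
       "user_satisfaction": 8.82, "decision_usefulness": 8.75}
print(round(composite(v3a), 2))  # prints 8.87
```

Note the composite here is on the 0-10 dimension scale; the leaderboard numbers are per-skill scores on a 0-100 scale, so the two won't match exactly.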
## What This Means for SupaSkills
We're not switching our entire pipeline to a single new model. The data says that's the wrong move.
What we are doing:
**Model-aware routing.** Technical skills go to Gemini 3.1 Pro. Business and finance skills go to Opus 4.6. Regulatory and creative skills get the tag-team treatment. The pipeline picks the best builder based on domain.
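A hypothetical sketch of that routing. The builder identifiers and the exact domain-to-builder mapping are illustrative, not our actual pipeline config:

```python
# Domain -> builder mapping per the plan above; tuples are tag-teams (drafter, refiner).
# Assumption: "technical" covers both the Technology and Engineering domains.
ROUTES = {
    "technology":  "gemini-3.1-pro",
    "engineering": "gemini-3.1-pro",
    "business":    "claude-opus-4.6",
    "legal":       ("gemini-3.1-pro", "claude-opus-4.6"),
    "content":     ("gemini-3.1-pro", "claude-opus-4.6"),
}

def pick_builder(domain: str):
    # Unknown domains fall back to the current v2 builder.
    return ROUTES.get(domain.lower(), "claude-sonnet")
```

For example, `pick_builder("Legal")` returns the tag-team pair, while an unrecognized domain falls back to the v2 default.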
**SupaSkills stays model-agnostic.** Our scoring system evaluates output quality regardless of which model produced it. That's the point: we're a quality layer, not a model wrapper.
**The v2 pipeline keeps running.** 550 skills rebuilt and counting. The v3 insights will inform the next generation, not interrupt the current one.
## Try It Yourself
All 1,078 skills are available through the SupaSkills API and as a Claude MCP connector. The scoring framework behind this experiment powers every skill in our catalog.
```shell
# Claude Desktop / Claude Code
clawhub install supaskills

# Or via REST API (quote the URL so the shell doesn't interpret ? and &)
curl "https://www.supaskills.ai/api/v1/skills?q=code+review" \
  -H "Authorization: Bearer sk_supa_YOUR_KEY"
```
Browse the catalog at supaskills.ai/skills.
## Raw Data
The complete results including per-skill breakdowns, dimension scores, and prompt lengths are available in our pilot results JSON.
| Metric | Value |
|---|---|
| Skills tested | 10 (2 per domain) |
| Versions compared | 5 (v1, v2, v3a, v3b, v3c) |
| Scoring runs | 90 (10 skills × 3 new versions × 3 scoring models) |
| Scoring models | Claude Sonnet 4.5, GPT-4o, Gemini 3.1 Pro |
| Total cost | $18.50 |
| v1 → v2 improvement | +7.6 points (framework effect) |
| v2 → best v3 improvement | +1.4 points (model effect) |
SupaSkills scores every AI skill across 6 dimensions. No hype, no self-reported benchmarks. Just measured quality. Learn more about our scoring methodology.