Performance · multi-model · pipeline · quality

We Rebuilt 10 Skills with 4 AI Models. The Model Mattered Less Than We Expected.

Max Jürschik·February 24, 2026·7 min read

SupaSkills runs a quality-scored skill pipeline. Every skill passes through a 6-dimension scoring system, cross-validated by multiple AI models. We've rebuilt over 550 of our 1,078 skills this way.

Last week, we asked a simple question: does the builder model matter?

We took 10 skills across 5 domains. Rebuilt each one with 3 new approaches. Scored everything with the same triple-model system. Here's what happened.

[Chart: SupaScore framework results — Gemini 3.1 Pro (v3a) avg 89.8 · Claude Opus 4.6 (v3b) avg 89.5 · Tag-team (v3c) avg 90.8]

The Setup

Our current pipeline (v2) uses Claude Sonnet as the builder, enhanced by loading relevant SupaSkills during generation. The AI literally uses our own skills to build better skills.
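In rough terms, the v2 idea looks like this: load the relevant helper skills into the builder's context before it starts writing. The function and string layout below are illustrative, not the actual SupaSkills pipeline code.

```python
# Minimal sketch of skill-assisted generation: helper skills are
# prepended to the builder prompt so the model "knows" them before
# writing. Names and prompt structure are hypothetical.

def assemble_builder_prompt(task: str, helper_skills: list[str]) -> str:
    """Prepend helper-skill text, then state the build task."""
    context = "\n\n".join(f"## Helper skill\n{s}" for s in helper_skills)
    return f"{context}\n\n## Task\nBuild a skill for: {task}"

prompt = assemble_builder_prompt(
    "code review",
    ["Checklist-driven review methodology...", "Prompt structure guide..."],
)
```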

For this test, we added three new builder configurations:

| Version | Builder | Method |
|---|---|---|
| v1 | Claude Sonnet | Raw, no skill assistance |
| v2 | Claude Sonnet + SupaSkills | Current pipeline |
| v3a | Gemini 3.1 Pro + SupaSkills | Google's latest reasoning model |
| v3b | Claude Opus 4.6 + SupaSkills | Anthropic's most capable model |
| v3c | Gemini 3.1 Pro → Opus 4.6 | Tag-team: Gemini drafts, Opus refines |

Same 10 skills. Same helper skills loaded. Same scoring rubric. Same triple-model evaluation (Claude Sonnet 40%, GPT-4o 40%, Gemini 20%).
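The triple-model evaluation blends each evaluator's score with the fixed weights above. A minimal sketch of that combination (the weighted-average aggregation is our reading of the setup; function and model names are illustrative):

```python
# Blend per-evaluator scores with fixed weights:
# Claude Sonnet 40%, GPT-4o 40%, Gemini 20%.
EVAL_WEIGHTS = {"claude-sonnet": 0.40, "gpt-4o": 0.40, "gemini": 0.20}

def combine_scores(scores: dict[str, float]) -> float:
    """Weighted average of per-evaluator scores (0-100 scale)."""
    assert abs(sum(EVAL_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(EVAL_WEIGHTS[m] * s for m, s in scores.items())

score = combine_scores({"claude-sonnet": 90.0, "gpt-4o": 88.0, "gemini": 92.0})
# 0.4*90 + 0.4*88 + 0.2*92 = 89.6
```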

Total cost: $18.50.


The Results

| Skill | Domain | v1 | v2 | v3a | v3b | v3c |
|---|---|---|---|---|---|---|
| Code Review Expert | Engineering | 81.5 | 90.2 | 91.4 | 86.6 | 90.9 |
| Logging Strategy Designer | Engineering | 80.8 | 88.7 | 86.0 | 90.8 | 90.5 |
| Competitive Analysis Strategist | Business | 81.0 | 88.7 | 88.6 | 89.1 | 90.8 |
| Cap Table Scenario Modeler | Business | 83.5 | 89.6 | 90.7 | 91.5 | 90.2 |
| Cookie Policy Writer | Legal | 83.0 | 90.0 | 90.5 | 90.0 | 90.0 |
| Digital Markets Act Guide | Legal | 80.9 | 88.5 | 90.5 | 87.9 | 92.5 |
| B2C Marketing Copywriting | Content | 80.0 | 89.5 | 89.3 | 89.8 | 89.3 |
| Voice UI Designer | Content | 80.0 | 88.9 | 86.9 | 88.1 | 92.4 |
| Graph Database Architect | Technology | 83.3 | 89.7 | 92.8 | 90.5 | 91.1 |
| Feature Store Engineer | Technology | 83.5 | 89.9 | 91.5 | 91.1 | 90.5 |
| **Average** | | 81.8 | 89.4 | 89.8 | 89.5 | 90.8 |

Best version per skill: v3a won 4, v3b won 3, v3c won 3. v1 and v2 won 0.

Average score by version: v1 81.8 · v2 89.4 · v3a 89.8 · v3b 89.5 · v3c 90.8.

The v1 → v2 jump (framework) was +7.6 points; the v2 → best-v3 jump (model) was +1.4 — far more impact from the framework than from the model.

Three Things We Learned

1. The framework matters more than the model.

The biggest jump was v1 → v2: +7.6 points average. That's Claude Sonnet without SupaSkills vs. Claude Sonnet with SupaSkills.

The model swap (v2 → best v3): +1.4 points average.

Loading the right expert skills before generation gave 5x more improvement than switching to a more expensive model. The scaffolding — what the AI knows before it starts writing — dominates the output quality.

2. No single model wins everywhere.

Gemini 3.1 Pro excelled on technical skills (Graph Database Architect: 92.8, its highest single score). Opus 4.6 was strongest on business and finance skills (Cap Table: 91.5). The tag-team approach dominated on regulatory and creative skills (Digital Markets Act: 92.5, Voice UI: 92.4).

Different models have different strengths. A quality framework that works across all of them is more valuable than betting on one.

3. The tag-team is interesting, but not obviously better.

v3c (Gemini drafts, Opus refines) had the highest average (90.8) but didn't win the most individual matchups. It smoothed out weaknesses — no score below 89.3 — but rarely hit the peaks that single-model approaches reached. It's the most consistent, not the most impressive.
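Structurally, the tag-team is just a two-stage pipeline: one model drafts, a second refines. A hedged sketch — the model-call functions below are stubs standing in for real Gemini 3.1 Pro and Claude Opus 4.6 API calls:

```python
# v3c tag-team flow: draft with one model, refine with another.
# Both functions are stubs; a real pipeline would call the Gemini
# and Anthropic APIs here.

def draft_with_gemini(task: str) -> str:
    """Stub for the Gemini drafting call."""
    return f"[draft skill for {task}]"

def refine_with_opus(draft: str) -> str:
    """Stub for the Opus refinement call."""
    return draft.replace("[draft", "[refined")

def tag_team_build(task: str) -> str:
    return refine_with_opus(draft_with_gemini(task))
```

The design tradeoff shows up in the numbers: the second pass smooths out the first model's weak spots, which raises the floor but rarely raises the ceiling.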

The real question: is +1.4 points worth 2x the API cost? For a quality-focused platform, probably. For most use cases, a single strong model with the right context gets you 95% of the way.


Under the Hood: Dimension Breakdown

We score every skill across 6 weighted dimensions:

| Dimension | Weight | v3a (Gemini) | v3b (Opus) | v3c (Tag-team) |
|---|---|---|---|---|
| Research Quality | 15% | 8.91 | 9.13 | 8.95 |
| Prompt Engineering | 25% | 9.03 | 8.90 | 9.13 |
| Practical Utility | 15% | 8.67 | 8.52 | 8.66 |
| Completeness | 10% | 8.97 | 9.21 | 9.14 |
| User Satisfaction | 20% | 8.82 | 8.71 | 8.88 |
| Decision Usefulness | 15% | 8.75 | 8.69 | 8.85 |

Opus produces the most thoroughly researched, complete outputs — it writes more, covers more edge cases, misses fewer details. Gemini produces more practically useful outputs — tighter, more action-oriented, easier to apply. The tag-team combines both: Gemini's practical structure refined by Opus's thoroughness.


What This Means for SupaSkills

We're not switching our entire pipeline to a single new model. The data says that's the wrong move.

What we are doing:

Model-aware routing. Technical skills go to Gemini 3.1 Pro. Business and finance skills go to Opus 4.6. Regulatory and creative skills get the tag-team treatment. The pipeline picks the best builder based on domain.
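Routing itself can be as simple as a domain-to-builder lookup with a fallback to the current builder. A sketch of that idea — the mapping mirrors the post, but the builder identifiers and the choice to route Engineering alongside Technology are our illustrative assumptions:

```python
# Domain → builder routing, falling back to the current v2 builder
# (Claude Sonnet) for anything unmapped. Identifiers are illustrative.
ROUTES = {
    "Technology": "gemini-3.1-pro",   # technical skills
    "Engineering": "gemini-3.1-pro",  # assumed: also "technical"
    "Business": "claude-opus-4.6",    # business and finance skills
    "Legal": "tag-team",              # regulatory skills
    "Content": "tag-team",            # creative skills
}

def pick_builder(domain: str) -> str:
    return ROUTES.get(domain, "claude-sonnet")
```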

SupaSkills stays model-agnostic. Our scoring system evaluates output quality regardless of which model produced it. That's the point — we're a quality layer, not a model wrapper.

The v2 pipeline keeps running. 550 skills rebuilt and counting. The v3 insights will inform the next generation, not interrupt the current one.


Try It Yourself

All 1,078 skills are available through the SupaSkills API and as a Claude MCP connector. The scoring framework behind this experiment powers every skill in our catalog.

# Claude Desktop / Claude Code
clawhub install supaskills

# Or via REST API
curl https://www.supaskills.ai/api/v1/skills?q=code+review \
  -H "Authorization: Bearer sk_supa_YOUR_KEY"

Browse the catalog at supaskills.ai/skills.


Raw Data

The complete results including per-skill breakdowns, dimension scores, and prompt lengths are available in our pilot results JSON.

| Metric | Value |
|---|---|
| Skills tested | 10 (2 per domain) |
| Versions compared | 5 (v1, v2, v3a, v3b, v3c) |
| Scoring runs | 90 (10 × 3 × triple-model) |
| Scoring models | Claude Sonnet 4.5, GPT-4o, Gemini 3.1 Pro |
| Total cost | $18.50 |
| v1 → v2 improvement | +7.6 points (framework effect) |
| v2 → best v3 improvement | +1.4 points (model effect) |

SupaSkills scores every AI skill across 6 dimensions. No hype, no self-reported benchmarks. Just measured quality. Learn more about our scoring methodology.
