Performance · Benchmark · Anthropic · Quality

We Matched Anthropic's 21 Official Skills Against Our Closest 21. Here's the Data.

Max Jürschik·March 15, 2026·10 min read

Anthropic ships 24 official skills with Claude Code. 21 have readable prompt files. We matched each one against its closest topic equivalent from our catalog, then scored both sides with the same rubric. Four layers, blind judging, reproducible methodology.

This is the March 2026 update. Anthropic has added 7 new skills since our original benchmark (debug, handoff, review, subagent-dev, tdd, slack-gif-creator, xlsx). Several of the new ones are significantly stronger than the original batch.

The setup

Anthropic's side: 24 skills total, 21 with readable prompt files: document generation (docx, pdf, xlsx, pptx), visual design (algorithmic-art, canvas-design, theme-factory), web development (frontend-design, web-artifacts-builder), developer tooling (mcp-builder, skill-creator, webapp-testing, debug, review, tdd, subagent-dev), creative (slack-gif-creator), workflow (doc-coauthoring, handoff), and communications (brand-guidelines, internal-comms). The built-in skills without readable local files (claude-api, loop, simplify, keybindings-help) were excluded.

Our side: 1,144 published skills across 5 domains and 14 categories. Every skill scores 80.0+ on SupaScore (6 dimensions, weighted formula, minimum 6 research sources).

The rules: Claude Sonnet as blind judge. Same model, same evaluation criteria. Every score recorded, every judgement explained.

Anthropic vs SupaSkills Benchmark, March 2026
Both sides scored with the same SupaScore rubric (6 dimensions, blind judge):

67.57 — Anthropic (21 skills), 4 of 21 above 80.0
88.29 — SupaSkills (closest 21), 1,144 of 1,144 published skills above 80.0

Layer 1: Scoring all 21 Anthropic skills with SupaScore

We ran every readable Anthropic skill through the same six-dimension rubric our own skills must pass. Important: these scores come from our SupaScore system, not from Anthropic's own evaluation. Anthropic does not publish quality scores for their skills. We applied our rubric to both sides equally, using Claude Sonnet as a blind judge.

| Skill | RQ | PE | PU | CO | US | DU | Score | Tier |
|---|---|---|---|---|---|---|---|---|
| skill-creator | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 9.0 | 86.00 | Platinum |
| xlsx | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 8.0 | 86.00 | Platinum |
| mcp-builder | 8.0 | 9.0 | 8.0 | 9.0 | 0.0 | 8.0 | 84.00 | Gold |
| docx | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | 7.0 | 81.00 | Gold |
| slack-gif-creator | 6.0 | 8.0 | 9.0 | 8.0 | 9.0 | 7.0 | 79.00 | Gold |
| algorithmic-art | 7.0 | 8.0 | 0.0 | 8.0 | 8.0 | 7.0 | 76.00 | Gold |
| frontend-design | 4.0 | 8.0 | 9.0 | 7.0 | 0.0 | 8.0 | 74.00 | Gold |
| doc-coauthoring | 3.0 | 8.0 | 8.0 | 7.0 | 8.0 | 8.0 | 73.00 | Gold |
| tdd | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| webapp-testing | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| debug | 3.0 | 8.0 | 8.0 | 7.0 | 7.0 | 9.0 | 70.00 | Gold |
| review | 3.0 | 7.0 | 8.0 | 6.0 | 8.0 | 7.0 | 66.00 | Silver |
| pdf | 8.0 | 3.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| subagent-dev | 3.0 | 7.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| handoff | 2.0 | 6.0 | 8.0 | 5.0 | 7.0 | 8.0 | 62.00 | Silver |
| pptx | 2.0 | 4.0 | 8.0 | 7.0 | 8.0 | 6.0 | 61.00 | Silver |
| canvas-design | 4.0 | 7.0 | 7.0 | 6.0 | 6.0 | 5.0 | 59.00 | Bronze |
| theme-factory | 3.0 | 6.0 | 7.0 | 5.0 | 6.0 | 5.0 | 55.00 | Bronze |
| web-artifacts-builder | 2.0 | 6.0 | 7.0 | 6.0 | 5.0 | 6.0 | 55.00 | Bronze |
| brand-guidelines | 3.0 | 6.0 | 7.0 | 5.0 | 0.0 | 4.0 | 51.00 | Bronze |
| internal-comms | 2.0 | 3.0 | 4.0 | 2.0 | 4.0 | 3.0 | 31.00 | Bronze |

(RQ = Research Quality, PE = Prompt Engineering, PU = Practical Utility, CO = Completeness, US = User Satisfaction, DU = Decision Usefulness.)

Average Anthropic SupaScore: 67.57. Our average: 88.29. A 20.7-point gap.

2 Platinum. 9 Gold. 5 Silver. 5 Bronze. 4 skills above our 80.0 publishing threshold.
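For readers who want to check the arithmetic, here is a minimal sketch of the rollup, assuming a straight weighted sum of the six dimension scores scaled to 0-100. The function name is ours, and judge-side rounding means not every published row reproduces to the decimal, but xlsx does:

```python
# Illustrative sketch: roll six 0-10 dimension scores into a 0-100 SupaScore
# using the rubric weights (RQ 15%, PE 25%, PU 15%, CO 10%, US 20%, DU 15%).
WEIGHTS = {
    "research_quality": 0.15,
    "prompt_engineering": 0.25,
    "practical_utility": 0.15,
    "completeness": 0.10,
    "user_satisfaction": 0.20,
    "decision_usefulness": 0.15,
}

def supascore(dims: dict) -> float:
    """Weighted sum of 0-10 dimension scores, scaled to 0-100."""
    return round(sum(dims[k] * w for k, w in WEIGHTS.items()) * 10, 2)

# The xlsx row from the table above: 8.0, 9.0, 9.0, 8.0, 9.0, 8.0.
xlsx = {
    "research_quality": 8.0,
    "prompt_engineering": 9.0,
    "practical_utility": 9.0,
    "completeness": 8.0,
    "user_satisfaction": 9.0,
    "decision_usefulness": 8.0,
}
print(supascore(xlsx))  # 86.0, matching the published xlsx score
```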

All 21 Anthropic Skills, Scored With Our SupaScore Rubric

[Bar chart: the 21 per-skill scores and tiers from the table above, from skill-creator (86, Platinum) down to internal-comms (31, Bronze). Anthropic's skills evaluated by our 6-dimension rubric, not their own scoring. Our publishing threshold: 80.0; 4 of 21 pass.]

This is a meaningful improvement from the original 17-skill benchmark (average 57.23, zero above 80.0). Anthropic's newer skills (skill-creator, xlsx, tdd, debug) are significantly stronger than the original batch. Their team is clearly iterating.

Where they've gotten strong

skill-creator (86.00, Platinum) is genuinely impressive. It's a meta-skill for building other skills, with clear methodology, evaluation criteria, and structural guidance. This is one of the few Anthropic skills that would pass our quality gate.

xlsx (86.00, Platinum) is thorough and well-structured. Detailed API reference, examples, validation steps. It reads like production documentation, not a quick prompt.

docx (81.00, Gold) and mcp-builder (84.00, Gold) also cross our threshold. Both have real depth: docx includes XML reference patterns and validation workflows, mcp-builder covers the full MCP protocol with error handling patterns.

Average Dimension Scores (SupaScore Rubric)

Same 6 dimensions, same weights, same blind judge for both sides:

| Dimension | Weight | Anthropic (21 avg) | SupaSkills (closest 21 avg) | Delta |
|---|---|---|---|---|
| Research Quality | 15% | 4.7 | 8.8 | +4.1 |
| Prompt Engineering | 25% | 6.9 | 8.9 | +2.0 |
| Practical Utility | 15% | 7.4 | 8.7 | +1.3 |
| Completeness | 10% | 6.6 | 8.5 | +1.9 |
| User Satisfaction | 20% | 6.3 | 8.8 | +2.5 |
| Decision Usefulness | 15% | 6.7 | 8.7 | +2.0 |
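These per-dimension averages can be re-derived from the 21-skill table. As a spot check, a minimal sketch for the Research Quality column (values transcribed from the table above, in table order):

```python
from statistics import mean

# RQ (Research Quality) scores for the 21 Anthropic skills, in table order
# from skill-creator down to internal-comms.
rq_scores = [8.0, 8.0, 8.0, 8.0, 6.0, 7.0, 4.0, 3.0, 6.0, 6.0, 3.0,
             3.0, 8.0, 3.0, 2.0, 2.0, 4.0, 3.0, 2.0, 3.0, 2.0]

avg_rq = round(mean(rq_scores), 1)
print(avg_rq)  # 4.7, matching the Research Quality row above
```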

The consistent weakness

Average Research Quality across all 21 skills: 4.7 out of 10. Average Prompt Engineering: 6.9.

The pattern from the original benchmark persists: Anthropic's team writes clean, well-structured prompts, but they don't ground them in domain research. No cited papers. No referenced frameworks. No source hierarchy. Our minimum is 6 sources across 2+ types; their count is still zero.

internal-comms (31.00) remains the weakest: 1,511 characters pointing to example files that aren't included. brand-guidelines (51.00) is seven hex codes and two font names. Useful for Anthropic's brand specifically, but not transferable.

What changed since the original benchmark

The gap narrowed from 31 points to 20.7 points. That's real progress. But the structural issues remain the same: no sources, no governance, no quality gate, no IP audit. Individual skill quality went up; infrastructure stayed at zero.

Layer 2: Head-to-head

Same task, same model. Their skill as system prompt vs. our best matching skill.

| Matchup | Anthropic | SupaSkills | Our Skill | Winner |
|---|---|---|---|---|
| MCP: Weather Server | 4 | 7 | mcp-server-deployment-expert | SupaSkills |
| MCP: Auth + Resources | 5 | 7 | mcp-tool-designer | SupaSkills |
| Testing: Auth Flow | 4 | 8 | cypress-e2e-testing-expert | SupaSkills |
| Testing: Visual Regression | 5 | 6 | accessibility-testing-automation-engineer | SupaSkills |
| Frontend: Dashboard | 4 | 7 | react | SupaSkills |
| Brand: Developer Tool | 7 | 8 | brand-design | SupaSkills |
| Claude API: Pipeline | 4 | 7 | claude-code | SupaSkills |
| Themes: Token System | 3 | 8 | color-system-designer | SupaSkills |
Head-to-Head: Same Task, Same Model, Blind Judge

[Bar chart of the eight matchups above, all won by SupaSkills. Each task was scored 1-10 by a separate Claude instance that did not know which response came from which skill.]

SupaSkills 8 wins. Anthropic 0 wins. 0 ties. Average: Anthropic 4.5, SupaSkills 7.3.

In our original February benchmark, Anthropic won 2 matchups (MCP and Claude API) on home-field advantage. That advantage disappeared when we matched their skills against our current best rather than our February catalog. Our mcp-server-deployment-expert (88.78) and claude-code (90.95) now outperform Anthropic's own MCP and API skills on their own turf.

The closest matchup was Visual Regression (5 vs 6). The largest gap was Themes (3 vs 8), where our color-system-designer produced a complete multi-brand token system while their theme-factory generated a basic color set without semantic naming.
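Mechanically, blind judging just means the judge never learns which skill produced which response. A minimal sketch, assuming a `judge` callable that wraps the model call (the function and prompt format here are illustrative, not our exact harness):

```python
import random

def blind_judge(task: str, resp_a: str, resp_b: str, judge) -> str:
    """Pick a winner between two responses without revealing their sources.

    `judge` is a callable (hypothetical here) that takes a prompt and
    returns "1" or "2"; in our runs it wraps a Claude Sonnet call.
    """
    pair = [("anthropic", resp_a), ("supaskills", resp_b)]
    random.shuffle(pair)  # hide the source order from the judge
    prompt = (
        f"Task: {task}\n\n"
        f"Response 1:\n{pair[0][1]}\n\n"
        f"Response 2:\n{pair[1][1]}\n\n"
        "Which response better completes the task? Answer 1 or 2."
    )
    verdict = judge(prompt)
    # Map the judge's positional verdict back to the hidden source label.
    return pair[0][0] if verdict.strip() == "1" else pair[1][0]
```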

Layer 3: The stack effect

What happens when you combine two or three related SupaSkills against a single Anthropic skill?

| Matchup | Anthropic | SupaSkills Stack | Stack composition | Delta |
|---|---|---|---|---|
| MCP: Auth | 7 | 8 | mcp-tool-designer + mcp-server-deployment-expert + tool-using-agent-designer | +1 |
| Testing: Auth | 5 | 8 | playwright-e2e + cypress-e2e + react-testing | +3 |
| Claude API | 7 | 8 | structured-output-designer + system-prompt-architect + vercel-ai-sdk-developer | +1 |
| Themes | 6 | 8 | color-system-designer + react-design-system-ops + accessible-component-kit-developer | +2 |

The stack won 4 of 8 matchups. When it won, the margin over Anthropic's single skill ranged from +1 to +3; the testing stack scored +3 over Anthropic's webapp-testing.

This is the argument for a skills platform over a skills repository. With 1,144 skills, you combine complementary expertise. A color system designer plus a design system ops expert plus an accessibility specialist covers more ground than any single "theme-factory" skill can.

We call these combinations PowerPacks. 19 published bundles of 3 to 7 skills designed to work together. The benchmark suggests the concept is measurably better, not just convenient.
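Mechanically, a stack is just concatenation: each skill ships as a prompt file, and the stacked system prompt joins them with clear delimiters. A minimal sketch (the file paths, section headers, and delimiter format are illustrative, not our exact packaging):

```python
from pathlib import Path

def stack_skills(paths: list) -> str:
    """Concatenate skill prompt files into one stacked system prompt.

    Each file is assumed to be a self-contained markdown skill prompt;
    sections are separated so the model can tell the skills apart.
    """
    sections = []
    for p in paths:
        name = Path(p).stem
        body = Path(p).read_text(encoding="utf-8").strip()
        sections.append(f"## Skill: {name}\n\n{body}")
    return "\n\n---\n\n".join(sections)

# e.g. the testing stack from the table above (hypothetical file names):
# stack_skills(["playwright-e2e.md", "cypress-e2e.md", "react-testing.md"])
```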

Layer 4: Infrastructure comparison

Beyond individual skill quality, there is a structural gap.

| Capability | Anthropic | SupaSkills |
|---|---|---|
| Quality scoring | None | SupaScore 6D (avg 88.29) |
| Quality gate | None | Score >= 80.0 enforced |
| Research sources | None | Min 6 sources, 2+ types (9,000+) |
| IP/Copyright audit | None | 1,144/1,144 audited |
| Versioning | Git history | Semver, is_latest, changelog |
| Injection protection | None | Delivery guard + canary |
| Discovery | Filesystem | Hybrid semantic search |
| Distribution | Git clone | REST API + MCP + ChatGPT Actions |
| Governance | skill-creator has eval | Per-skill guardrails, stop conditions, risk_level |
| Safety disclaimers | None | Domain-specific (medical, finance, legal) |
| Multi-skill orchestration | None | PowerPacks (19 published) |
| Cross-validation | None | OpenAI cross-scores (avg delta 1.92) |
| Rate limiting | None | Tiered by plan |
| Usage analytics | None | Load counts, activation events, audit log |

Anthropic ships skills as markdown files in a git repo. No scoring, no search, no API, no versioning beyond git commits, no IP auditing, no delivery guards, no safety disclaimers for sensitive domains.

That is not a criticism. It is a different approach. They built starting points for their own tool. We built infrastructure for 1,144 production-grade skills.

What the data tells us

Anthropic is getting better. Their newer skills (skill-creator at 86, xlsx at 86) compete with our catalog. The gap narrowed from 31 points to 20.7 points. We take that seriously.

The weakness is still systematic. Every skill is missing the same things: sources, governance, parameters, validation, versioning, safety disclaimers. This is an infrastructure gap, not a talent gap. With a quality pipeline, their best skills would compete with anyone's.

Single skills have a ceiling. Even Anthropic's Platinum skills max out at 86. Our stacks of two or three skills consistently hit 8/10 on tasks where single skills plateau. Combinatorial expertise beats individual depth, but only if you have enough skills to combine.

Home-field advantage has an expiry date. In February, Anthropic won on MCP and Claude API because we didn't have dedicated skills for those domains. By March, our mcp-server-deployment-expert and claude-code skill beat their own tools on their own turf. Domain-specific research depth eventually outperforms insider knowledge when the research base is deep enough.

Summary

| Metric | Anthropic | SupaSkills |
|---|---|---|
| Skills scored | 21 | 1,144 |
| Avg quality score | 67.57 | 88.29 |
| Above 80.0 threshold | 4 | 1,144 |
| Platinum tier | 2 | 1,000+ |
| Research sources | 0 | 9,000+ |
| Domains covered | ~4 | 5 (14 categories) |
| H2H wins (single) | 0 | 8 |
| H2H wins (stack) | 0 | 4 |
| Infrastructure features | 0/14 | 14/14 |

The benchmark is reproducible. Same model, same prompts, same evaluation criteria. All raw data is available in our benchmark results JSON.
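A minimal sketch of that reproduction step, using the Layer 2 scores; the JSON field names here are illustrative, not the exact schema of our export:

```python
import json
from statistics import mean

def h2h_averages(raw: str):
    """Recompute the head-to-head averages from a results JSON string.

    Assumes a top-level "matchups" list with per-side scores; field
    names are our guess at a reasonable export format, not a spec.
    """
    results = json.loads(raw)
    a = mean(m["anthropic"] for m in results["matchups"])
    s = mean(m["supaskills"] for m in results["matchups"])
    return a, s

# The eight Layer 2 matchups, transcribed from the table above.
sample = json.dumps({"matchups": [
    {"task": "MCP: Weather Server", "anthropic": 4, "supaskills": 7},
    {"task": "MCP: Auth + Resources", "anthropic": 5, "supaskills": 7},
    {"task": "Testing: Auth Flow", "anthropic": 4, "supaskills": 8},
    {"task": "Testing: Visual Regression", "anthropic": 5, "supaskills": 6},
    {"task": "Frontend: Dashboard", "anthropic": 4, "supaskills": 7},
    {"task": "Brand: Developer Tool", "anthropic": 7, "supaskills": 8},
    {"task": "Claude API: Pipeline", "anthropic": 4, "supaskills": 7},
    {"task": "Themes: Token System", "anthropic": 3, "supaskills": 8},
]})
print(h2h_averages(sample))  # (4.5, 7.25) — reported above as 4.5 and 7.3
```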


Benchmark methodology: Claude Sonnet as executor + blind judge. 21 skills scored on 6 dimensions. 8 head-to-head matchups, 10 tasks, 3 conditions per task (Anthropic single, SupaSkills single, SupaSkills stack). Updated March 15, 2026.