
What Deep Research Adds to Claude's Built-In Skills: A Data Comparison

Max Jürschik · March 15, 2026 · 10 min read

Anthropic ships 24 official skills with Claude Code. These are designed as lightweight tooling aids, not deep domain-expert prompts. We were curious: what does a research-backed, quality-gated skill add on top of Claude's already strong foundation?

We scored 21 of Anthropic's skills (the ones with readable prompt files) alongside our closest equivalents using the same rubric. This is the March 2026 update, including 7 new Anthropic skills that are significantly stronger than the original batch.

The setup

Anthropic's side: 24 skills total, 21 with readable prompt files. Document generation (docx, pdf, xlsx, pptx), visual design (algorithmic-art, canvas-design, theme-factory), web development (frontend-design, web-artifacts-builder), developer tooling (claude-api, mcp-builder, skill-creator, webapp-testing, debug, review, tdd, subagent-dev), creative (slack-gif-creator), workflow (doc-coauthoring, handoff, loop, simplify), and internal-facing (brand-guidelines, internal-comms). The built-in skills without local prompt files (claude-api, loop, simplify, keybindings-help) were excluded from scoring.

Our side: 1,144 published skills across 5 domains and 14 categories. Every skill scores 80.0+ on SupaScore (6 dimensions, weighted formula, minimum 6 research sources).

The rules: Claude Sonnet as blind judge. Same model, same evaluation criteria. Every score recorded, every judgment explained.

[Benchmark summary — Anthropic vs SupaSkills, March 2026. Both sides scored with the same SupaScore rubric (6 dimensions, blind judge). Anthropic: 4 of 21 skills above 80.0. SupaSkills: 1,144 of 1,144 published skills above 80.0; the closest 21 were used for the comparison.]

Layer 1: Scoring all 21 Anthropic skills with SupaScore

We ran every readable Anthropic skill through the same six-dimension rubric our own skills must pass. Important: these scores come from our SupaScore system, not from Anthropic's own evaluation. Anthropic does not publish quality scores for their skills. We applied our rubric to both sides equally, using Claude Sonnet as a blind judge.
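To make that protocol concrete, here is a minimal sketch of what a rubric-scoring harness can look like. It assumes the official @anthropic-ai/sdk package; the model id, rubric prompt, and JSON shape are illustrative placeholders, not our production pipeline.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const JUDGE_MODEL = "claude-sonnet-latest"; // placeholder model id
const DIMENSIONS = [
  "research_quality",
  "prompt_engineering",
  "practical_utility",
  "completeness",
  "user_satisfaction",
  "decision_usefulness",
] as const;

type Scores = Record<(typeof DIMENSIONS)[number], number>;

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Score one skill prompt on the six dimensions. The judge sees only the
// prompt text, never which vendor wrote it, so the scoring stays blind.
async function scoreSkill(path: string): Promise<Scores> {
  const skillText = readFileSync(path, "utf8");
  const response = await client.messages.create({
    model: JUDGE_MODEL,
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content:
          `Score this skill prompt from 0-10 on each dimension: ` +
          `${DIMENSIONS.join(", ")}. Reply with one JSON object mapping ` +
          `dimension name to score.\n\n<skill>\n${skillText}\n</skill>`,
      },
    ],
  });
  const block = response.content[0];
  const text = block.type === "text" ? block.text : "";
  // A production harness would validate this against a schema before trusting it.
  return JSON.parse(text) as Scores;
}
```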

(RQ = Research Quality, PE = Prompt Engineering, PU = Practical Utility, CO = Completeness, US = User Satisfaction, DU = Decision Usefulness.)

| Skill | RQ | PE | PU | CO | US | DU | Score | Tier |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| skill-creator | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 9.0 | 86.00 | Platinum |
| xlsx | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 8.0 | 86.00 | Platinum |
| mcp-builder | 8.0 | 9.0 | 8.0 | 9.0 | 0.0 | 8.0 | 84.00 | Gold |
| docx | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | 7.0 | 81.00 | Gold |
| slack-gif-creator | 6.0 | 8.0 | 9.0 | 8.0 | 9.0 | 7.0 | 79.00 | Gold |
| algorithmic-art | 7.0 | 8.0 | 0.0 | 8.0 | 8.0 | 7.0 | 76.00 | Gold |
| frontend-design | 4.0 | 8.0 | 9.0 | 7.0 | 0.0 | 8.0 | 74.00 | Gold |
| doc-coauthoring | 3.0 | 8.0 | 8.0 | 7.0 | 8.0 | 8.0 | 73.00 | Gold |
| tdd | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| webapp-testing | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| debug | 3.0 | 8.0 | 8.0 | 7.0 | 7.0 | 9.0 | 70.00 | Gold |
| review | 3.0 | 7.0 | 8.0 | 6.0 | 8.0 | 7.0 | 66.00 | Silver |
| pdf | 8.0 | 3.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| subagent-dev | 3.0 | 7.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| handoff | 2.0 | 6.0 | 8.0 | 5.0 | 7.0 | 8.0 | 62.00 | Silver |
| pptx | 2.0 | 4.0 | 8.0 | 7.0 | 8.0 | 6.0 | 61.00 | Silver |
| canvas-design | 4.0 | 7.0 | 7.0 | 6.0 | 6.0 | 5.0 | 59.00 | Bronze |
| theme-factory | 3.0 | 6.0 | 7.0 | 5.0 | 6.0 | 5.0 | 55.00 | Bronze |
| web-artifacts-builder | 2.0 | 6.0 | 7.0 | 6.0 | 5.0 | 6.0 | 55.00 | Bronze |
| brand-guidelines | 3.0 | 6.0 | 7.0 | 5.0 | 0.0 | 4.0 | 51.00 | Bronze |
| internal-comms | 2.0 | 3.0 | 4.0 | 2.0 | 4.0 | 3.0 | 31.00 | Bronze |

Average Anthropic SupaScore: 67.57. Our average: 88.29. The gap reflects the different design goals: Anthropic's skills are concise tooling helpers; ours are research-backed domain specialists.

2 Platinum. 9 Gold. 5 Silver. 5 Bronze. 4 skills above our 80.0 publishing threshold, which requires minimum 6 research sources and structured governance.

[Chart: All 21 Anthropic Skills, Scored With Our SupaScore Rubric — Anthropic's skills evaluated by our 6-dimension quality rubric, not their own scoring. Same per-skill scores as the table above. Our publishing threshold: 80.0; 4 of 21 pass.]

This is a meaningful improvement from the original 17-skill benchmark (average 57.23, zero above 80.0). Anthropic's newer skills (skill-creator, xlsx, tdd, debug) are significantly stronger than the original batch. Their team is clearly iterating.

Where they've gotten strong

skill-creator (86.00, Platinum) is genuinely impressive. It's a meta-skill for building other skills, with clear methodology, evaluation criteria, and structural guidance. This is one of the few Anthropic skills that would pass our quality gate.

xlsx (86.00, Platinum) is thorough and well-structured. Detailed API reference, examples, validation steps. It reads like production documentation, not a quick prompt.

docx (81.00, Gold) and mcp-builder (84.00, Gold) also cross our threshold. Both have real depth: docx includes XML reference patterns and validation workflows, mcp-builder covers the full MCP protocol with error handling patterns.

Average Dimension Scores (SupaScore Rubric)

Same 6 dimensions, same weights, same blind judge for both sides:

| Dimension | Weight | Anthropic (21 avg) | SupaSkills (closest 21 avg) | Delta |
| --- | --- | --- | --- | --- |
| Research Quality | 15% | 4.7 | 8.8 | +4.1 |
| Prompt Engineering | 25% | 6.9 | 8.9 | +2.0 |
| Practical Utility | 15% | 7.4 | 8.7 | +1.3 |
| Completeness | 10% | 6.6 | 8.5 | +1.9 |
| User Satisfaction | 20% | 6.3 | 8.8 | +2.5 |
| Decision Usefulness | 15% | 6.7 | 8.7 | +2.0 |
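As a sanity check on those weights: a plain weighted sum of the six 0-10 dimension scores, scaled to 100, reproduces several rows of the Layer 1 table exactly (internal-comms 31.0, xlsx 86.0, tdd 72.0) and lands within a couple of points on the rest, so the real formula likely adds rounding we haven't reverse-engineered here. A minimal sketch under that assumption, with tier cutoffs (85/70/60) inferred from the table rather than published:

```typescript
// Dimension weights as published in the table above.
const WEIGHTS = {
  researchQuality: 0.15,
  promptEngineering: 0.25,
  practicalUtility: 0.15,
  completeness: 0.10,
  userSatisfaction: 0.20,
  decisionUsefulness: 0.15,
} as const;

type Dimension = keyof typeof WEIGHTS;

// Assumption: SupaScore = weighted sum of 0-10 dimension scores, scaled to 100.
function supaScore(scores: Record<Dimension, number>): number {
  return 10 * (Object.keys(WEIGHTS) as Dimension[])
    .reduce((sum, d) => sum + WEIGHTS[d] * scores[d], 0);
}

// Tier cutoffs inferred from the Layer 1 table, not officially published.
function tierFor(score: number): string {
  if (score >= 85) return "Platinum";
  if (score >= 70) return "Gold";
  if (score >= 60) return "Silver";
  return "Bronze";
}

// internal-comms from the table: 2, 3, 4, 2, 4, 3 -> ~31.0, matching the
// published 31.00 (modulo floating-point noise).
supaScore({
  researchQuality: 2, promptEngineering: 3, practicalUtility: 4,
  completeness: 2, userSatisfaction: 4, decisionUsefulness: 3,
});
```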

The consistent weakness

Average Research Quality across all 21 skills: 4.7 out of 10. Average Prompt Engineering: 6.9.

Anthropic's team writes clean, well-structured prompts. The gap is in domain research depth: their skills don't include cited sources or external frameworks, which is reasonable for quick-start tooling. Our quality gate requires minimum 6 sources across 2+ types, which produces deeper domain coverage but also requires significantly more effort per skill.

Some of Anthropic's skills serve very specific internal purposes (brand-guidelines for their own brand, internal-comms for their document formats) and are not designed to be general-purpose domain experts.

What changed since the original benchmark

The gap narrowed from 31 points to 20.7 points. That's real progress. But the structural issues remain the same: no sources, no governance, no quality gate, no IP audit. Individual skill quality went up; infrastructure stayed at zero.

Layer 2: Head-to-head

Same task, same model. Their skill as system prompt vs. our best matching skill.

| Matchup | Anthropic | SupaSkills | Our Skill | Winner |
| --- | --- | --- | --- | --- |
| MCP: Weather Server | 4 | 7 | mcp-server-deployment-expert | SupaSkills |
| MCP: Auth + Resources | 5 | 7 | mcp-tool-designer | SupaSkills |
| Testing: Auth Flow | 4 | 8 | cypress-e2e-testing-expert | SupaSkills |
| Testing: Visual Regression | 5 | 6 | accessibility-testing-automation-engineer | SupaSkills |
| Frontend: Dashboard | 4 | 7 | react | SupaSkills |
| Brand: Developer Tool | 7 | 8 | brand-design | SupaSkills |
| Claude API: Pipeline | 4 | 7 | claude-code | SupaSkills |
| Themes: Token System | 3 | 8 | color-system-designer | SupaSkills |
[Chart: Head-to-Head: Same Task, Same Model, Blind Judge — each task scored 1-10 by a separate Claude instance that did not know which response came from which skill. Anthropic: 0 wins, avg 4.5/10. SupaSkills: 8 wins, avg 7.3/10.]

SupaSkills led in all 8 matchups. Average: Anthropic 4.5, SupaSkills 7.3. This is expected: our skills are purpose-built for these domains with deep research, while Anthropic's are general-purpose helpers.

In our original February benchmark, Anthropic led in 2 matchups (MCP and Claude API) where their insider knowledge gave them an edge. Our dedicated skills for those domains have since closed that gap.

The closest matchup was Visual Regression (5 vs 6). The largest gap was Themes (3 vs 8), where our color-system-designer produced a complete multi-brand token system while their theme-factory generated a basic color set without semantic naming.
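Position bias is the classic failure mode in pairwise LLM judging, so the two responses have to be anonymized and randomly ordered before the judge sees them. A minimal sketch of that step; the labels and judge prompt are illustrative, not our exact harness:

```typescript
// Blind pairwise judging: the judge sees anonymized, randomly ordered
// responses and never learns which skill produced which.
interface Matchup {
  task: string;
  anthropicResponse: string;
  supaskillsResponse: string;
}

function buildJudgePrompt(m: Matchup): { prompt: string; aIsAnthropic: boolean } {
  // Randomize which response appears first to counter position bias.
  const aIsAnthropic = Math.random() < 0.5;
  const [a, b] = aIsAnthropic
    ? [m.anthropicResponse, m.supaskillsResponse]
    : [m.supaskillsResponse, m.anthropicResponse];
  const prompt =
    `Task: ${m.task}\n\n` +
    `Two responses are shown below, labeled A and B. Score each 1-10 ` +
    `for how well it completes the task. Reply as JSON: ` +
    `{"a": <score>, "b": <score>, "rationale": "..."}.\n\n` +
    `<response_a>\n${a}\n</response_a>\n\n` +
    `<response_b>\n${b}\n</response_b>`;
  return { prompt, aIsAnthropic };
}

// After the judge replies, map the anonymous labels back:
// anthropicScore = aIsAnthropic ? scores.a : scores.b, and vice versa.
```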

Layer 3: The stack effect

What happens when you combine two or three related SupaSkills against a single Anthropic skill?

| Matchup | Anthropic | SupaSkills Stack | Stack composition | Delta |
| --- | --- | --- | --- | --- |
| MCP: Auth | 7 | 8 | mcp-tool-designer + mcp-server-deployment-expert + tool-using-agent-designer | +1 |
| Testing: Auth | 5 | 8 | playwright-e2e + cypress-e2e + react-testing | +3 |
| Claude API | 7 | 8 | structured-output-designer + system-prompt-architect + vercel-ai-sdk-developer | +1 |
| Themes | 6 | 8 | color-system-designer + react-design-system-ops + accessible-component-kit-developer | +2 |

The stack won 4 of 8 matchups. When it won, the margin was significant. The testing stack scored +3 over Anthropic's webapp-testing.

This is the argument for a skills platform over a skills repository. With 1,144 skills, you combine complementary expertise. A color system designer plus a design system ops expert plus an accessibility specialist covers more ground than any single "theme-factory" skill can.
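Mechanically, a stack is simple: the skill prompts are concatenated into one system prompt. A sketch of that composition step; `loadSkill` is a hypothetical helper (real PowerPack delivery goes through the API, not local files), and the ordering convention is our assumption, not a published rule:

```typescript
interface Skill {
  slug: string;
  prompt: string;
}

// Hypothetical helper for illustration only.
declare function loadSkill(slug: string): Skill;

// Concatenate complementary skills into a single system prompt. Convention
// (our assumption): most task-specific skill last, since later instructions
// tend to carry the most weight.
function composeStack(skills: Skill[]): string {
  return skills
    .map((s, i) => `## Skill ${i + 1}: ${s.slug}\n\n${s.prompt.trim()}`)
    .join("\n\n---\n\n");
}

// The themes stack from the benchmark table above:
const systemPrompt = composeStack([
  loadSkill("accessible-component-kit-developer"),
  loadSkill("react-design-system-ops"),
  loadSkill("color-system-designer"),
]);
```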

We call these combinations PowerPacks. 19 published bundles of 3 to 7 skills designed to work together. The benchmark suggests the concept is measurably better, not just convenient.

Layer 4: Infrastructure comparison

Beyond individual skill quality, there is a structural gap.

| Capability | Anthropic | SupaSkills |
| --- | --- | --- |
| Quality scoring | None | SupaScore 6D (avg 88.29) |
| Quality gate | None | Score >= 80.0 enforced |
| Research sources | None | Min 6 sources, 2+ types (9,000+) |
| IP/Copyright audit | None | 1,144/1,144 audited |
| Versioning | Git history | Semver, is_latest, changelog |
| Injection protection | None | Delivery guard + canary |
| Discovery | Filesystem | Hybrid semantic search |
| Distribution | Git clone | REST API + MCP + ChatGPT Actions |
| Governance | skill-creator has eval | Per-skill guardrails, stop conditions, risk_level |
| Safety disclaimers | None | Domain-specific (medical, finance, legal) |
| Multi-skill orchestration | None | PowerPacks (19 published) |
| Cross-validation | None | OpenAI cross-scores (avg delta 1.92) |
| Rate limiting | None | Tiered by plan |
| Usage analytics | None | Load counts, activation events, audit log |

Anthropic ships skills as markdown files in their CLI. This is a clean, simple approach: easy to read, easy to fork, zero infrastructure overhead. It works well for the quick-start tooling helpers they are designed to be.

We built something complementary: infrastructure for 1,144 production-grade skills with scoring, search, versioning, and governance. Different goals, different trade-offs.
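For a sense of the distribution difference, consuming a skill from a platform like this could look something like the sketch below. The base URL, endpoint path, and response shape are all hypothetical, invented for illustration, not our documented API:

```typescript
// Hypothetical client sketch -- endpoint and response shape are invented
// for illustration, not the documented SupaSkills API.
interface SkillRecord {
  slug: string;
  version: string;   // semver; the server tracks an is_latest flag
  supaScore: number; // quality gate: published skills score >= 80.0
  prompt: string;
}

async function fetchSkill(slug: string, apiKey: string): Promise<SkillRecord> {
  const res = await fetch(`https://api.example.com/v1/skills/${slug}`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`skill fetch failed: ${res.status}`);
  return (await res.json()) as SkillRecord;
}
```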

What the data tells us

Anthropic is getting better, fast. Their newer skills (skill-creator at 86, xlsx at 86) meet our Platinum tier. The gap narrowed from 31 points to 20.7 points since February. Their team is clearly investing in skill quality.

The difference is infrastructure, not talent. Anthropic's best skills are well-written. The gap is in what surrounds them: sources, governance, versioning, safety disclaimers. With a quality pipeline, their strongest skills would compete with anyone's.

Skill stacking adds value. Combining two or three complementary skills consistently outperforms any single skill on complex tasks. This is the argument for a skills platform: with 1,144+ skills, you assemble domain-specific expertise for your exact use case.

Summary

| Metric | Anthropic | SupaSkills |
| --- | --- | --- |
| Skills scored | 21 | 1,144 |
| Avg quality score | 67.57 | 88.29 |
| Platinum tier | 2 | 1,000+ |
| Research sources per skill | 0 (by design) | 6+ (required) |
| Domains covered | ~4 | 5 (14 categories) |
| Design goal | Tooling helpers | Domain specialists |

The benchmark is reproducible. Same model, same prompts, same evaluation criteria. All raw data is available in our benchmark results JSON.


Benchmark methodology: Claude Sonnet as executor + blind judge. 21 skills scored on 6 dimensions. 8 head-to-head matchups, 10 tasks, 3 conditions per task (Anthropic single, SupaSkills single, SupaSkills stack). Updated March 15, 2026.