
What Deep Research Adds to Claude's Built-In Skills: A Data Comparison

Max Jürschik · March 15, 2026 · 10 min read

Anthropic ships 24 official skills with Claude Code. These are designed as lightweight tooling aids, not deep domain-expert prompts. We were curious: what does a research-backed, quality-gated skill add on top of Claude's already strong foundation?

We scored 21 of Anthropic's skills (the ones with readable prompt files) alongside our closest equivalents using the same rubric. This is the March 2026 update, including 7 new Anthropic skills that are significantly stronger than the original batch.

The setup

Anthropic's side: 24 skills total, 21 with readable prompt files. Document generation (docx, pdf, xlsx, pptx), visual design (algorithmic-art, canvas-design, theme-factory), web development (frontend-design, web-artifacts-builder), developer tooling (claude-api, mcp-builder, skill-creator, webapp-testing, debug, review, tdd, subagent-dev), creative (slack-gif-creator), workflow (doc-coauthoring, handoff, loop, simplify), and internal-facing (brand-guidelines, internal-comms). The built-in skills without local prompt files (claude-api, loop, simplify, keybindings-help) were excluded from scoring.

Our side: 1,144 published skills across 5 domains and 14 categories. Every skill scores 80.0+ on SupaScore (6 dimensions, weighted formula, minimum 6 research sources).

The rules: Claude Sonnet as blind judge. Same model, same evaluation criteria. Every score recorded, every judgment explained.

[Benchmark summary — Anthropic vs SupaSkills, March 2026. Both sides scored with the same SupaScore rubric (6 dimensions, blind judge). Anthropic: 4 of 21 skills above 80.0. SupaSkills: 1,144 of 1,144 published skills above 80.0; the closest 21 were used for the comparison.]

Layer 1: Scoring all 21 Anthropic skills with SupaScore

We ran every readable Anthropic skill through the same six-dimension rubric our own skills must pass. Important: these scores come from our SupaScore system, not from Anthropic's own evaluation. Anthropic does not publish quality scores for their skills. We applied our rubric to both sides equally, using Claude Sonnet as a blind judge.
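To make that protocol concrete, here is a minimal sketch of what a rubric-scoring harness can look like. It assumes the official @anthropic-ai/sdk package; the model id, rubric prompt, and JSON shape are illustrative placeholders, not our production pipeline.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const JUDGE_MODEL = "claude-sonnet-latest"; // placeholder model id
const DIMENSIONS = [
  "research_quality",
  "prompt_engineering",
  "practical_utility",
  "completeness",
  "user_satisfaction",
  "decision_usefulness",
] as const;

type Scores = Record<(typeof DIMENSIONS)[number], number>;

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Score one skill prompt on the six dimensions. The judge sees only the
// prompt text, never which vendor wrote it, so the scoring stays blind.
async function scoreSkill(path: string): Promise<Scores> {
  const skillText = readFileSync(path, "utf8");
  const response = await client.messages.create({
    model: JUDGE_MODEL,
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content:
          `Score this skill prompt from 0-10 on each dimension: ` +
          `${DIMENSIONS.join(", ")}. Reply with one JSON object mapping ` +
          `dimension name to score.\n\n<skill>\n${skillText}\n</skill>`,
      },
    ],
  });
  const block = response.content[0];
  const text = block.type === "text" ? block.text : "";
  // A production harness would validate this against a schema before trusting it.
  return JSON.parse(text) as Scores;
}
```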

(RQ = Research Quality, PE = Prompt Engineering, PU = Practical Utility, CO = Completeness, US = User Satisfaction, DU = Decision Usefulness.)

| Skill | RQ | PE | PU | CO | US | DU | Score | Tier |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| skill-creator | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 9.0 | 86.00 | Platinum |
| xlsx | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 8.0 | 86.00 | Platinum |
| mcp-builder | 8.0 | 9.0 | 8.0 | 9.0 | 0.0 | 8.0 | 84.00 | Gold |
| docx | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | 7.0 | 81.00 | Gold |
| slack-gif-creator | 6.0 | 8.0 | 9.0 | 8.0 | 9.0 | 7.0 | 79.00 | Gold |
| algorithmic-art | 7.0 | 8.0 | 0.0 | 8.0 | 8.0 | 7.0 | 76.00 | Gold |
| frontend-design | 4.0 | 8.0 | 9.0 | 7.0 | 0.0 | 8.0 | 74.00 | Gold |
| doc-coauthoring | 3.0 | 8.0 | 8.0 | 7.0 | 8.0 | 8.0 | 73.00 | Gold |
| tdd | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| webapp-testing | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| debug | 3.0 | 8.0 | 8.0 | 7.0 | 7.0 | 9.0 | 70.00 | Gold |
| review | 3.0 | 7.0 | 8.0 | 6.0 | 8.0 | 7.0 | 66.00 | Silver |
| pdf | 8.0 | 3.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| subagent-dev | 3.0 | 7.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| handoff | 2.0 | 6.0 | 8.0 | 5.0 | 7.0 | 8.0 | 62.00 | Silver |
| pptx | 2.0 | 4.0 | 8.0 | 7.0 | 8.0 | 6.0 | 61.00 | Silver |
| canvas-design | 4.0 | 7.0 | 7.0 | 6.0 | 6.0 | 5.0 | 59.00 | Bronze |
| theme-factory | 3.0 | 6.0 | 7.0 | 5.0 | 6.0 | 5.0 | 55.00 | Bronze |
| web-artifacts-builder | 2.0 | 6.0 | 7.0 | 6.0 | 5.0 | 6.0 | 55.00 | Bronze |
| brand-guidelines | 3.0 | 6.0 | 7.0 | 5.0 | 0.0 | 4.0 | 51.00 | Bronze |
| internal-comms | 2.0 | 3.0 | 4.0 | 2.0 | 4.0 | 3.0 | 31.00 | Bronze |

Average Anthropic SupaScore: 67.57. Our average: 88.29. The gap reflects the different design goals: Anthropic's skills are concise tooling helpers; ours are research-backed domain specialists.

2 Platinum. 9 Gold. 5 Silver. 5 Bronze. 4 skills above our 80.0 publishing threshold, which requires minimum 6 research sources and structured governance.

[Chart: All 21 Anthropic Skills, Scored With Our SupaScore Rubric — Anthropic's skills evaluated by our 6-dimension quality rubric, not their own scoring. Same per-skill scores as the table above. Our publishing threshold: 80.0; 4 of 21 pass.]

This is a meaningful improvement from the original 17-skill benchmark (average 57.23, zero above 80.0). Anthropic's newer skills (skill-creator, xlsx, tdd, debug) are significantly stronger than the original batch. Their team is clearly iterating.

Where they've gotten strong

skill-creator (86.00, Platinum) is genuinely impressive. It's a meta-skill for building other skills, with clear methodology, evaluation criteria, and structural guidance. This is one of the few Anthropic skills that would pass our quality gate.

xlsx (86.00, Platinum) is thorough and well-structured. Detailed API reference, examples, validation steps. It reads like production documentation, not a quick prompt.

docx (81.00, Gold) and mcp-builder (84.00, Gold) also cross our threshold. Both have real depth: docx includes XML reference patterns and validation workflows, mcp-builder covers the full MCP protocol with error handling patterns.

Average Dimension Scores (SupaScore Rubric)

Same 6 dimensions, same weights, same blind judge for both sides:

| Dimension | Weight | Anthropic (21 avg) | SupaSkills (closest 21 avg) | Delta |
| --- | --- | --- | --- | --- |
| Research Quality | 15% | 4.7 | 8.8 | +4.1 |
| Prompt Engineering | 25% | 6.9 | 8.9 | +2.0 |
| Practical Utility | 15% | 7.4 | 8.7 | +1.3 |
| Completeness | 10% | 6.6 | 8.5 | +1.9 |
| User Satisfaction | 20% | 6.3 | 8.8 | +2.5 |
| Decision Usefulness | 15% | 6.7 | 8.7 | +2.0 |
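As a sanity check on those weights: a plain weighted sum of the six 0-10 dimension scores, scaled to 100, reproduces several rows of the Layer 1 table exactly (internal-comms 31.0, xlsx 86.0, tdd 72.0) and lands within a couple of points on the rest, so the real formula likely adds rounding we haven't reverse-engineered here. A minimal sketch under that assumption, with tier cutoffs (85/70/60) inferred from the table rather than published:

```typescript
// Dimension weights as published in the table above.
const WEIGHTS = {
  researchQuality: 0.15,
  promptEngineering: 0.25,
  practicalUtility: 0.15,
  completeness: 0.10,
  userSatisfaction: 0.20,
  decisionUsefulness: 0.15,
} as const;

type Dimension = keyof typeof WEIGHTS;

// Assumption: SupaScore = weighted sum of 0-10 dimension scores, scaled to 100.
function supaScore(scores: Record<Dimension, number>): number {
  return 10 * (Object.keys(WEIGHTS) as Dimension[])
    .reduce((sum, d) => sum + WEIGHTS[d] * scores[d], 0);
}

// Tier cutoffs inferred from the Layer 1 table, not officially published.
function tierFor(score: number): string {
  if (score >= 85) return "Platinum";
  if (score >= 70) return "Gold";
  if (score >= 60) return "Silver";
  return "Bronze";
}

// internal-comms from the table: 2, 3, 4, 2, 4, 3 -> ~31.0, matching the
// published 31.00 (modulo floating-point noise).
supaScore({
  researchQuality: 2, promptEngineering: 3, practicalUtility: 4,
  completeness: 2, userSatisfaction: 4, decisionUsefulness: 3,
});
```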

The consistent weakness

Average Research Quality across all 21 skills: 4.7 out of 10. Average Prompt Engineering: 6.9.

Anthropic's team writes clean, well-structured prompts. The gap is in domain research depth: their skills don't include cited sources or external frameworks, which is reasonable for quick-start tooling. Our quality gate requires minimum 6 sources across 2+ types, which produces deeper domain coverage but also requires significantly more effort per skill.

Some of Anthropic's skills serve very specific internal purposes (brand-guidelines for their own brand, internal-comms for their document formats) and are not designed to be general-purpose domain experts.

What changed since the original benchmark

The gap narrowed from 31 points to 20.7 points. That's real progress. But the structural issues remain the same: no sources, no governance, no quality gate, no IP audit. Individual skill quality went up; infrastructure stayed at zero.

Layer 2: Head-to-head

Same task, same model. Their skill as system prompt vs. our best matching skill.

| Matchup | Anthropic | SupaSkills | Our Skill | Winner |
| --- | --- | --- | --- | --- |
| MCP: Weather Server | 4 | 7 | mcp-server-deployment-expert | SupaSkills |
| MCP: Auth + Resources | 5 | 7 | mcp-tool-designer | SupaSkills |
| Testing: Auth Flow | 4 | 8 | cypress-e2e-testing-expert | SupaSkills |
| Testing: Visual Regression | 5 | 6 | accessibility-testing-automation-engineer | SupaSkills |
| Frontend: Dashboard | 4 | 7 | react | SupaSkills |
| Brand: Developer Tool | 7 | 8 | brand-design | SupaSkills |
| Claude API: Pipeline | 4 | 7 | claude-code | SupaSkills |
| Themes: Token System | 3 | 8 | color-system-designer | SupaSkills |
[Chart: Head-to-Head: Same Task, Same Model, Blind Judge — each task scored 1-10 by a separate Claude instance that did not know which response came from which skill. Anthropic: 0 wins, avg 4.5/10. SupaSkills: 8 wins, avg 7.3/10.]

SupaSkills led in all 8 matchups. Average: Anthropic 4.5, SupaSkills 7.3. This is expected: our skills are purpose-built for these domains with deep research, while Anthropic's are general-purpose helpers.

In our original February benchmark, Anthropic led in 2 matchups (MCP and Claude API) where their insider knowledge gave them an edge. Our dedicated skills for those domains have since closed that gap.

The closest matchup was Visual Regression (5 vs 6). The largest gap was Themes (3 vs 8), where our color-system-designer produced a complete multi-brand token system while their theme-factory generated a basic color set without semantic naming.
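Position bias is the classic failure mode in pairwise LLM judging, so the two responses have to be anonymized and randomly ordered before the judge sees them. A minimal sketch of that step; the labels and judge prompt are illustrative, not our exact harness:

```typescript
// Blind pairwise judging: the judge sees anonymized, randomly ordered
// responses and never learns which skill produced which.
interface Matchup {
  task: string;
  anthropicResponse: string;
  supaskillsResponse: string;
}

function buildJudgePrompt(m: Matchup): { prompt: string; aIsAnthropic: boolean } {
  // Randomize which response appears first to counter position bias.
  const aIsAnthropic = Math.random() < 0.5;
  const [a, b] = aIsAnthropic
    ? [m.anthropicResponse, m.supaskillsResponse]
    : [m.supaskillsResponse, m.anthropicResponse];
  const prompt =
    `Task: ${m.task}\n\n` +
    `Two responses are shown below, labeled A and B. Score each 1-10 ` +
    `for how well it completes the task. Reply as JSON: ` +
    `{"a": <score>, "b": <score>, "rationale": "..."}.\n\n` +
    `<response_a>\n${a}\n</response_a>\n\n` +
    `<response_b>\n${b}\n</response_b>`;
  return { prompt, aIsAnthropic };
}

// After the judge replies, map the anonymous labels back:
// anthropicScore = aIsAnthropic ? scores.a : scores.b, and vice versa.
```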

Layer 3: The stack effect

What happens when you combine two or three related SupaSkills against a single Anthropic skill?

| Matchup | Anthropic | SupaSkills Stack | Stack composition | Delta |
| --- | --- | --- | --- | --- |
| MCP: Auth | 7 | 8 | mcp-tool-designer + mcp-server-deployment-expert + tool-using-agent-designer | +1 |
| Testing: Auth | 5 | 8 | playwright-e2e + cypress-e2e + react-testing | +3 |
| Claude API | 7 | 8 | structured-output-designer + system-prompt-architect + vercel-ai-sdk-developer | +1 |
| Themes | 6 | 8 | color-system-designer + react-design-system-ops + accessible-component-kit-developer | +2 |

The stack won 4 of 8 matchups. When it won, the margin was significant. The testing stack scored +3 over Anthropic's webapp-testing.

This is the argument for a skills platform over a skills repository. With 1,144 skills, you combine complementary expertise. A color system designer plus a design system ops expert plus an accessibility specialist covers more ground than any single "theme-factory" skill can.
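Mechanically, a stack is simple: the skill prompts are concatenated into one system prompt. A sketch of that composition step; `loadSkill` is a hypothetical helper (real PowerPack delivery goes through the API, not local files), and the ordering convention is our assumption, not a published rule:

```typescript
interface Skill {
  slug: string;
  prompt: string;
}

// Hypothetical helper for illustration only.
declare function loadSkill(slug: string): Skill;

// Concatenate complementary skills into a single system prompt. Convention
// (our assumption): most task-specific skill last, since later instructions
// tend to carry the most weight.
function composeStack(skills: Skill[]): string {
  return skills
    .map((s, i) => `## Skill ${i + 1}: ${s.slug}\n\n${s.prompt.trim()}`)
    .join("\n\n---\n\n");
}

// The themes stack from the benchmark table above:
const systemPrompt = composeStack([
  loadSkill("accessible-component-kit-developer"),
  loadSkill("react-design-system-ops"),
  loadSkill("color-system-designer"),
]);
```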

We call these combinations PowerPacks. 19 published bundles of 3 to 7 skills designed to work together. The benchmark suggests the concept is measurably better, not just convenient.

Layer 4: Infrastructure comparison

Beyond individual skill quality, there is a structural gap.

| Capability | Anthropic | SupaSkills |
| --- | --- | --- |
| Quality scoring | None | SupaScore 6D (avg 88.29) |
| Quality gate | None | Score >= 80.0 enforced |
| Research sources | None | Min 6 sources, 2+ types (9,000+) |
| IP/Copyright audit | None | 1,144/1,144 audited |
| Versioning | Git history | Semver, is_latest, changelog |
| Injection protection | None | Delivery guard + canary |
| Discovery | Filesystem | Hybrid semantic search |
| Distribution | Git clone | REST API + MCP + ChatGPT Actions |
| Governance | skill-creator has eval | Per-skill guardrails, stop conditions, risk_level |
| Safety disclaimers | None | Domain-specific (medical, finance, legal) |
| Multi-skill orchestration | None | PowerPacks (19 published) |
| Cross-validation | None | OpenAI cross-scores (avg delta 1.92) |
| Rate limiting | None | Tiered by plan |
| Usage analytics | None | Load counts, activation events, audit log |

Anthropic ships skills as markdown files in their CLI. This is a clean, simple approach: easy to read, easy to fork, zero infrastructure overhead. It works well for the quick-start tooling helpers they are designed to be.

We built something complementary: infrastructure for 1,144 production-grade skills with scoring, search, versioning, and governance. Different goals, different trade-offs.
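For a sense of the distribution difference, consuming a skill from a platform like this could look something like the sketch below. The base URL, endpoint path, and response shape are all hypothetical, invented for illustration, not our documented API:

```typescript
// Hypothetical client sketch -- endpoint and response shape are invented
// for illustration, not the documented SupaSkills API.
interface SkillRecord {
  slug: string;
  version: string;   // semver; the server tracks an is_latest flag
  supaScore: number; // quality gate: published skills score >= 80.0
  prompt: string;
}

async function fetchSkill(slug: string, apiKey: string): Promise<SkillRecord> {
  const res = await fetch(`https://api.example.com/v1/skills/${slug}`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`skill fetch failed: ${res.status}`);
  return (await res.json()) as SkillRecord;
}
```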

What the data tells us

Anthropic is getting better, fast. Their newer skills (skill-creator at 86, xlsx at 86) meet our Platinum tier. The gap narrowed from 31 points to 20.7 points since February. Their team is clearly investing in skill quality.

The difference is infrastructure, not talent. Anthropic's best skills are well-written. The gap is in what surrounds them: sources, governance, versioning, safety disclaimers. With a quality pipeline, their strongest skills would compete with anyone's.

Skill stacking adds value. Combining two or three complementary skills consistently outperforms any single skill on complex tasks. This is the argument for a skills platform: with 1,144+ skills, you assemble domain-specific expertise for your exact use case.

Summary

| Metric | Anthropic | SupaSkills |
| --- | --- | --- |
| Skills scored | 21 | 1,144 |
| Avg quality score | 67.57 | 88.29 |
| Platinum tier | 2 | 1,000+ |
| Research sources per skill | 0 (by design) | 6+ (required) |
| Domains covered | ~4 | 5 (14 categories) |
| Design goal | Tooling helpers | Domain specialists |

The benchmark is reproducible. Same model, same prompts, same evaluation criteria. All raw data is available in our benchmark results JSON.


Benchmark methodology: Claude Sonnet as executor + blind judge. 21 skills scored on 6 dimensions. 8 head-to-head matchups, 10 tasks, 3 conditions per task (Anthropic single, SupaSkills single, SupaSkills stack). Updated March 15, 2026.