Anthropic ships 24 official skills with Claude Code; 21 have readable prompt files. We scored every readable skill with the rubric we apply to our own catalog, then ran head-to-head matchups against our closest topic equivalents. Four layers, blind judging, reproducible methodology.
This is the March 2026 update. Anthropic has added 7 new skills since our original benchmark (debug, handoff, review, subagent-dev, tdd, slack-gif-creator, xlsx). Several of the new ones are significantly stronger than the original batch.
The setup
Anthropic's side: 24 skills total, 21 with readable prompt files. Document generation (docx, pdf, xlsx, pptx), visual design (algorithmic-art, canvas-design, theme-factory), web development (frontend-design, web-artifacts-builder), developer tooling (claude-api, mcp-builder, skill-creator, webapp-testing, debug, review, tdd, subagent-dev), creative (slack-gif-creator), workflow (doc-coauthoring, handoff, loop, simplify). The 4 built-in skills without local files (claude-api, loop, simplify, keybindings-help) were excluded.
Our side: 1,144 published skills across 5 domains and 14 categories. Every skill scores 80.0+ on SupaScore (6 dimensions, weighted formula, minimum 6 research sources).
The rules: Claude Sonnet as the blind judge. Same model, same evaluation criteria for both sides. Every score recorded, every judgment explained.
Layer 1: Scoring all 21 Anthropic skills with SupaScore
We ran every readable Anthropic skill through the same six-dimension rubric our own skills must pass. Important: these scores come from our SupaScore system, not from Anthropic's own evaluation. Anthropic does not publish quality scores for their skills. We applied our rubric to both sides equally, using Claude Sonnet as a blind judge.
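For concreteness, here is a minimal sketch of how a blind-judge scoring pass like this can be wired up with the Anthropic TypeScript SDK. The model id, the prompt wording, and the equal dimension weights are illustrative assumptions, not the actual SupaScore internals. Per-dimension results for all 21 skills follow.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// The six dimensions as they appear in the table below.
type Dimension = "RQ" | "PE" | "PU" | "CO" | "US" | "DU";
type DimensionScores = Record<Dimension, number>; // each 0-10

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Judge one skill prompt against the rubric. The model id and prompt wording
// are placeholders, not the exact ones used in the benchmark.
async function judgeSkill(skillMarkdown: string, rubric: string): Promise<DimensionScores> {
  const res = await client.messages.create({
    model: "claude-sonnet-4-20250514", // placeholder model id
    max_tokens: 512,
    system:
      "You are a blind judge. Score the skill prompt on each rubric dimension from 0 to 10. " +
      'Reply with JSON only: {"RQ":n,"PE":n,"PU":n,"CO":n,"US":n,"DU":n}.',
    messages: [
      { role: "user", content: `RUBRIC:\n${rubric}\n\nSKILL PROMPT:\n${skillMarkdown}` },
    ],
  });
  const text = res.content[0].type === "text" ? res.content[0].text : "{}";
  return JSON.parse(text) as DimensionScores;
}

// Roll the dimensions up into a 0-100 score. The real SupaScore weights are not
// published here, so equal weights are used purely as an illustration.
function supaScore(scores: DimensionScores, weights?: Partial<DimensionScores>): number {
  const dims: Dimension[] = ["RQ", "PE", "PU", "CO", "US", "DU"];
  const w = dims.map((d) => weights?.[d] ?? 1);
  const total = w.reduce((a, b) => a + b, 0);
  const weighted = dims.reduce((sum, d, i) => sum + scores[d] * w[i], 0);
  return (weighted / total) * 10; // scale the 0-10 weighted average to 0-100
}
```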
| Skill | RQ | PE | PU | CO | US | DU | Score | Tier |
|---|---|---|---|---|---|---|---|---|
| skill-creator | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 9.0 | 86.00 | Platinum |
| xlsx | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 8.0 | 86.00 | Platinum |
| mcp-builder | 8.0 | 9.0 | 8.0 | 9.0 | 0.0 | 8.0 | 84.00 | Gold |
| docx | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | 7.0 | 81.00 | Gold |
| slack-gif-creator | 6.0 | 8.0 | 9.0 | 8.0 | 9.0 | 7.0 | 79.00 | Gold |
| algorithmic-art | 7.0 | 8.0 | 0.0 | 8.0 | 8.0 | 7.0 | 76.00 | Gold |
| frontend-design | 4.0 | 8.0 | 9.0 | 7.0 | 0.0 | 8.0 | 74.00 | Gold |
| doc-coauthoring | 3.0 | 8.0 | 8.0 | 7.0 | 8.0 | 8.0 | 73.00 | Gold |
| tdd | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| webapp-testing | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| debug | 3.0 | 8.0 | 8.0 | 7.0 | 7.0 | 9.0 | 70.00 | Gold |
| review | 3.0 | 7.0 | 8.0 | 6.0 | 8.0 | 7.0 | 66.00 | Silver |
| pdf | 8.0 | 3.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| subagent-dev | 3.0 | 7.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| handoff | 2.0 | 6.0 | 8.0 | 5.0 | 7.0 | 8.0 | 62.00 | Silver |
| pptx | 2.0 | 4.0 | 8.0 | 7.0 | 8.0 | 6.0 | 61.00 | Silver |
| canvas-design | 4.0 | 7.0 | 7.0 | 6.0 | 6.0 | 5.0 | 59.00 | Bronze |
| theme-factory | 3.0 | 6.0 | 7.0 | 5.0 | 6.0 | 5.0 | 55.00 | Bronze |
| web-artifacts-builder | 2.0 | 6.0 | 7.0 | 6.0 | 5.0 | 6.0 | 55.00 | Bronze |
| brand-guidelines | 3.0 | 6.0 | 7.0 | 5.0 | 0.0 | 4.0 | 51.00 | Bronze |
| internal-comms | 2.0 | 3.0 | 4.0 | 2.0 | 4.0 | 3.0 | 31.00 | Bronze |
Average Anthropic SupaScore: 67.57. Our average: 88.29. A 20.7-point gap.
2 Platinum. 9 Gold. 5 Silver. 5 Bronze. 4 skills above our 80.0 publishing threshold.
Anthropic's skills evaluated by our 6-dimension quality rubric. Not their own scoring.
This is a meaningful improvement from the original 17-skill benchmark (average 57.23, zero above 80.0). Anthropic's newer skills (skill-creator, xlsx, tdd, debug) are significantly stronger than the original batch. Their team is clearly iterating.
Where they've gotten strong
skill-creator (86.00, Platinum) is genuinely impressive. It's a meta-skill for building other skills, with clear methodology, evaluation criteria, and structural guidance. This is one of the few Anthropic skills that would pass our quality gate.
xlsx (86.00, Platinum) is thorough and well-structured. Detailed API reference, examples, validation steps. It reads like production documentation, not a quick prompt.
docx (81.00, Gold) and mcp-builder (84.00, Gold) also cross our threshold. Both have real depth: docx includes XML reference patterns and validation workflows, mcp-builder covers the full MCP protocol with error handling patterns.
Same 6 dimensions, same weights, same blind judge for both sides
The consistent weakness
Average Research Quality across all 21 skills: 4.7 out of 10. Average Prompt Engineering: 6.9.
The pattern from the original benchmark persists: Anthropic's team writes clean, well-structured prompts, but they don't ground them in domain research. No cited papers. No referenced frameworks. No source hierarchy. Our minimum is 6 sources across 2+ types; their count is still zero.
internal-comms (31.00) remains the weakest: 1,511 characters pointing to example files that aren't included. brand-guidelines (51.00) is seven hex codes and two font names. Useful for Anthropic's brand specifically, but not transferable.
What changed since the original benchmark
The gap narrowed from 31 points to 20.7 points. That's real progress. But the structural issues remain the same: no sources, no governance, no quality gate, no IP audit. Individual skill quality went up; infrastructure stayed at zero.
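As a point of reference, the quality gate described in this post (SupaScore of at least 80.0, at least 6 research sources spanning 2+ source types) is the kind of check that can be enforced mechanically at publish time. A minimal sketch, with an invented data shape:

```typescript
// Illustrative publish-time gate based on the thresholds stated in this post:
// SupaScore >= 80.0, at least 6 research sources across at least 2 source types.
// The SkillDraft shape is invented for this example.
interface SkillDraft {
  slug: string;
  supaScore: number;
  sources: { url: string; type: "paper" | "docs" | "book" | "spec" | "article" }[];
}

function passesQualityGate(skill: SkillDraft): { ok: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (skill.supaScore < 80.0) reasons.push(`score ${skill.supaScore} is below 80.0`);
  if (skill.sources.length < 6) reasons.push(`only ${skill.sources.length} sources (min 6)`);
  const types = new Set(skill.sources.map((s) => s.type));
  if (types.size < 2) reasons.push(`only ${types.size} source type(s) (min 2)`);
  return { ok: reasons.length === 0, reasons };
}
```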
Layer 2: Head-to-head
Same task, same model. Their skill as system prompt vs. our best matching skill.
| Matchup | Anthropic | SupaSkills | Our Skill | Winner |
|---|---|---|---|---|
| MCP: Weather Server | 4 | 7 | mcp-server-deployment-expert | SupaSkills |
| MCP: Auth + Resources | 5 | 7 | mcp-tool-designer | SupaSkills |
| Testing: Auth Flow | 4 | 8 | cypress-e2e-testing-expert | SupaSkills |
| Testing: Visual Regression | 5 | 6 | accessibility-testing-automation-engineer | SupaSkills |
| Frontend: Dashboard | 4 | 7 | react | SupaSkills |
| Brand: Developer Tool | 7 | 8 | brand-design | SupaSkills |
| Claude API: Pipeline | 4 | 7 | claude-code | SupaSkills |
| Themes: Token System | 3 | 8 | color-system-designer | SupaSkills |
Each task scored 1-10 by a separate Claude instance that did not know which response came from which skill
SupaSkills 8 wins. Anthropic 0 wins. 0 ties. Average: Anthropic 4.5, SupaSkills 7.3.
In our original February benchmark, Anthropic won 2 matchups (MCP and Claude API) on home-field advantage. That advantage disappeared once we matched their skills against our current best rather than our February catalog. Our mcp-server-deployment-expert (88.78) and claude-code (90.95) now outperform Anthropic's own MCP and API skills on their own turf.
The closest matchup was Visual Regression (5 vs 6). The largest gap was Themes (3 vs 8), where our color-system-designer produced a complete multi-brand token system while their theme-factory generated a basic color set without semantic naming.
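A minimal sketch of the head-to-head harness, assuming the Anthropic TypeScript SDK: one executor call per skill with the skill's prompt as the system prompt, then a separate judge call that sees only shuffled, anonymously labeled responses. The model id, prompt wording, and JSON response format are assumptions for illustration.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { randomUUID } from "node:crypto";

const client = new Anthropic();
const MODEL = "claude-sonnet-4-20250514"; // placeholder model id

// Executor: run the same task with a given skill prompt as the system prompt.
async function runTask(task: string, skillPrompt: string): Promise<string> {
  const res = await client.messages.create({
    model: MODEL,
    max_tokens: 4096,
    system: skillPrompt,
    messages: [{ role: "user", content: task }],
  });
  return res.content[0].type === "text" ? res.content[0].text : "";
}

// Judge: a separate call scores each anonymized response 1-10. Opaque ids and a
// shuffle keep the judge from knowing which skill produced which output.
async function blindJudge(task: string, responses: string[]): Promise<number[]> {
  const labeled = responses.map((text) => ({ id: randomUUID().slice(0, 8), text }));
  const shuffled = [...labeled].sort(() => Math.random() - 0.5); // naive shuffle, fine for a sketch
  const res = await client.messages.create({
    model: MODEL,
    max_tokens: 512,
    system:
      "Score each response to the task from 1 (poor) to 10 (excellent). " +
      'Reply with JSON only: [{"id": "...", "score": n}, ...].',
    messages: [
      {
        role: "user",
        content:
          `TASK:\n${task}\n\n` +
          shuffled.map((r) => `RESPONSE ${r.id}:\n${r.text}`).join("\n\n"),
      },
    ],
  });
  const text = res.content[0].type === "text" ? res.content[0].text : "[]";
  const byId = new Map(
    (JSON.parse(text) as { id: string; score: number }[]).map((s) => [s.id, s.score])
  );
  return labeled.map((l) => byId.get(l.id) ?? 0); // scores back in original order
}
```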
Layer 3: The stack effect
What happens when you combine two or three related SupaSkills against a single Anthropic skill?
| Matchup | Anthropic | SupaSkills Stack | Stack composition | Delta |
|---|---|---|---|---|
| MCP: Auth | 7 | 8 | mcp-tool-designer + mcp-server-deployment-expert + tool-using-agent-designer | +1 |
| Testing: Auth | 5 | 8 | playwright-e2e + cypress-e2e + react-testing | +3 |
| Claude API | 7 | 8 | structured-output-designer + system-prompt-architect + vercel-ai-sdk-developer | +1 |
| Themes | 6 | 8 | color-system-designer + react-design-system-ops + accessible-component-kit-developer | +2 |
The stack won 4 of 8 matchups, with margins of +1 to +3. The largest gain came from the testing stack, which scored +3 over Anthropic's webapp-testing.
This is the argument for a skills platform over a skills repository. With 1,144 skills, you combine complementary expertise. A color system designer plus a design system ops expert plus an accessibility specialist covers more ground than any single "theme-factory" skill can.
We call these combinations PowerPacks. 19 published bundles of 3 to 7 skills designed to work together. The benchmark suggests the concept is measurably better, not just convenient.
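Mechanically, the stack condition can be as simple as concatenating the prompt files of complementary skills into one system prompt. A sketch, assuming a hypothetical local skills/ directory with one markdown file per skill:

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical local layout: one markdown prompt file per skill.
async function loadSkill(name: string): Promise<string> {
  return readFile(`skills/${name}.md`, "utf8");
}

// Compose a stack: concatenate complementary skill prompts into a single
// system prompt, with a short header so each skill's scope stays visible.
async function composeStack(names: string[]): Promise<string> {
  const parts = await Promise.all(
    names.map(async (name) => `## Skill: ${name}\n\n${await loadSkill(name)}`)
  );
  return parts.join("\n\n---\n\n");
}

// The testing stack from the table above (+3 over webapp-testing).
const stackPrompt = await composeStack([
  "playwright-e2e",
  "cypress-e2e",
  "react-testing",
]);
// stackPrompt is then passed as `system` to the same executor call
// used for the single-skill condition.
```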
Layer 4: Infrastructure comparison
Beyond individual skill quality, there is a structural gap.
| Capability | Anthropic | SupaSkills |
|---|---|---|
| Quality scoring | None | SupaScore 6D (avg 88.29) |
| Quality gate | None | Score >= 80.0 enforced |
| Research sources | None | Min 6 sources, 2+ types (9,000+) |
| IP/Copyright audit | None | 1,144/1,144 audited |
| Versioning | Git history | Semver, is_latest, changelog |
| Injection protection | None | Delivery guard + canary |
| Discovery | Filesystem | Hybrid semantic search |
| Distribution | Git clone | REST API + MCP + ChatGPT Actions |
| Governance | Eval guidance in skill-creator only | Per-skill guardrails, stop conditions, risk_level |
| Safety disclaimers | None | Domain-specific (medical, finance, legal) |
| Multi-skill orchestration | None | PowerPacks (19 published) |
| Cross-validation | None | OpenAI cross-scores (avg delta 1.92) |
| Rate limiting | None | Tiered by plan |
| Usage analytics | None | Load counts, activation events, audit log |
Anthropic ships skills as markdown files in a git repo. No scoring, no search, no API, no versioning beyond git commits, no IP auditing, no delivery guards, no safety disclaimers for sensitive domains.
That is not a criticism. It is a different approach. They built starting points for their own tool. We built infrastructure for 1,144 production-grade skills.
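For illustration, consuming a versioned skill over a REST API might look like the sketch below. The endpoint, query parameter, and response fields are invented for this example; the post only states that skills ship via REST API, MCP, and ChatGPT Actions, with semver versioning and an is_latest flag.

```typescript
// Hypothetical client-side fetch. The endpoint and response shape are
// illustrative only; the actual API is not documented in this post.
interface SkillRecord {
  slug: string;
  version: string;   // semver
  is_latest: boolean;
  supa_score: number;
  prompt: string;
}

async function fetchLatestSkill(baseUrl: string, slug: string, apiKey: string): Promise<SkillRecord> {
  const res = await fetch(`${baseUrl}/skills/${slug}?is_latest=true`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`skill fetch failed: ${res.status}`);
  return (await res.json()) as SkillRecord;
}
```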
What the data tells us
Anthropic is getting better. Their newer skills (skill-creator at 86, xlsx at 86) compete with our catalog. The gap narrowed from 31 points to 20.7 points. We take that seriously.
The weakness is still systematic. Every skill is missing the same things: sources, governance, parameters, validation, versioning, safety disclaimers. This is an infrastructure gap, not a talent gap. With a quality pipeline, their best skills would compete with anyone's.
Single skills have a ceiling. Even Anthropic's Platinum skills max out at 86. Our stacks of two or three skills consistently hit 8/10 on tasks where single skills plateau. Combinatorial expertise beats individual depth, but only if you have enough skills to combine.
Home-field advantage has an expiry date. In February, Anthropic won on MCP and Claude API because we didn't have dedicated skills for those domains. By March, our mcp-server-deployment-expert and claude-code skill beat their own tools on their own turf. Domain-specific research depth eventually outperforms insider knowledge when the research base is deep enough.
Summary
| Metric | Anthropic | SupaSkills |
|---|---|---|
| Skills scored | 21 | 1,144 |
| Avg quality score | 67.57 | 88.29 |
| Above 80.0 threshold | 4 | 1,144 |
| Platinum tier | 2 | 1,000+ |
| Research sources | 0 | 9,000+ |
| Domains covered | ~4 | 5 (14 categories) |
| H2H wins (single) | 0 | 8 |
| H2H wins (stack) | 0 | 4 |
| Infrastructure features | 0/14 | 14/14 |
The benchmark is reproducible. Same model, same prompts, same evaluation criteria. All raw data is available in our benchmark results JSON.
Benchmark methodology: Claude Sonnet as executor + blind judge. 21 skills scored on 6 dimensions. 8 head-to-head matchups, 10 tasks, 3 conditions per task (Anthropic single, SupaSkills single, SupaSkills stack). Updated March 15, 2026.