Anthropic ships 24 official skills with Claude Code. These are designed as lightweight tooling aids, not deep domain-expert prompts. We were curious: what does a research-backed, quality-gated skill add on top of Claude's already strong foundation?
We scored 21 of Anthropic's skills (the ones with readable prompt files) alongside our closest equivalents using the same rubric. This is the March 2026 update, including 7 new Anthropic skills that are significantly stronger than the original batch.
The setup
Anthropic's side: 24 skills total, 21 with readable prompt files. Document generation (docx, pdf, xlsx, pptx), visual design (algorithmic-art, canvas-design, theme-factory), web development (frontend-design, web-artifacts-builder), developer tooling (claude-api, mcp-builder, skill-creator, webapp-testing, debug, review, tdd, subagent-dev), creative (slack-gif-creator), workflow (doc-coauthoring, handoff, loop, simplify), and communications (brand-guidelines, internal-comms). Three of the 24 ship without readable local prompt files (claude-api, loop, simplify) and were excluded from scoring, along with the built-in keybindings-help.
Our side: 1,144 published skills across 5 domains and 14 categories. Every skill scores 80.0+ on SupaScore (6 dimensions, weighted formula, minimum 6 research sources).
The rules: Claude Sonnet as blind judge. Same model, same evaluation criteria. Every score recorded, every judgement explained.
Layer 1: Scoring all 21 Anthropic skills with SupaScore
We ran every readable Anthropic skill through the same six-dimension rubric our own skills must pass. Important: these scores come from our SupaScore system, not from Anthropic's own evaluation. Anthropic does not publish quality scores for their skills. We applied our rubric to both sides equally, using Claude Sonnet as a blind judge.
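For readers who want the mechanics, here is a minimal sketch of a six-dimension weighted score in TypeScript. The equal weights are an illustrative assumption, not SupaScore's published formula, and the dimension keys simply mirror the column abbreviations in the table below.

```ts
// A minimal sketch of a 6-dimension weighted score. The equal weights
// are an illustrative assumption, not SupaScore's actual formula.
const WEIGHTS = { RQ: 1 / 6, PE: 1 / 6, PU: 1 / 6, CO: 1 / 6, US: 1 / 6, DU: 1 / 6 };

type DimScores = Record<keyof typeof WEIGHTS, number>; // each dimension 0-10

function supaScore(dims: DimScores): number {
  // Weighted average of the 0-10 dimensions, scaled to 0-100.
  const keys = Object.keys(WEIGHTS) as (keyof typeof WEIGHTS)[];
  const weighted = keys.reduce((sum, k) => sum + dims[k] * WEIGHTS[k], 0);
  return Math.round(weighted * 10 * 100) / 100;
}

// skill-creator's row from the table below: equal weights give 86.67,
// close to (but not exactly) the published 86.00, so the real weights
// are presumably uneven.
console.log(supaScore({ RQ: 8, PE: 9, PU: 9, CO: 8, US: 9, DU: 9 })); // 86.67
```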
| Skill | RQ | PE | PU | CO | US | DU | Score | Tier |
|---|---|---|---|---|---|---|---|---|
| skill-creator | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 9.0 | 86.00 | Platinum |
| xlsx | 8.0 | 9.0 | 9.0 | 8.0 | 9.0 | 8.0 | 86.00 | Platinum |
| mcp-builder | 8.0 | 9.0 | 8.0 | 9.0 | 0.0 | 8.0 | 84.00 | Gold |
| docx | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | 7.0 | 81.00 | Gold |
| slack-gif-creator | 6.0 | 8.0 | 9.0 | 8.0 | 9.0 | 7.0 | 79.00 | Gold |
| algorithmic-art | 7.0 | 8.0 | 0.0 | 8.0 | 8.0 | 7.0 | 76.00 | Gold |
| frontend-design | 4.0 | 8.0 | 9.0 | 7.0 | 0.0 | 8.0 | 74.00 | Gold |
| doc-coauthoring | 3.0 | 8.0 | 8.0 | 7.0 | 8.0 | 8.0 | 73.00 | Gold |
| tdd | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| webapp-testing | 6.0 | 7.0 | 8.0 | 7.0 | 8.0 | 7.0 | 72.00 | Gold |
| debug | 3.0 | 8.0 | 8.0 | 7.0 | 7.0 | 9.0 | 70.00 | Gold |
| review | 3.0 | 7.0 | 8.0 | 6.0 | 8.0 | 7.0 | 66.00 | Silver |
| pdf | 8.0 | 3.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| subagent-dev | 3.0 | 7.0 | 8.0 | 6.0 | 7.0 | 6.0 | 63.00 | Silver |
| handoff | 2.0 | 6.0 | 8.0 | 5.0 | 7.0 | 8.0 | 62.00 | Silver |
| pptx | 2.0 | 4.0 | 8.0 | 7.0 | 8.0 | 6.0 | 61.00 | Silver |
| canvas-design | 4.0 | 7.0 | 7.0 | 6.0 | 6.0 | 5.0 | 59.00 | Bronze |
| theme-factory | 3.0 | 6.0 | 7.0 | 5.0 | 6.0 | 5.0 | 55.00 | Bronze |
| web-artifacts-builder | 2.0 | 6.0 | 7.0 | 6.0 | 5.0 | 6.0 | 55.00 | Bronze |
| brand-guidelines | 3.0 | 6.0 | 7.0 | 5.0 | 0.0 | 4.0 | 51.00 | Bronze |
| internal-comms | 2.0 | 3.0 | 4.0 | 2.0 | 4.0 | 3.0 | 31.00 | Bronze |
Average Anthropic SupaScore: 67.57. Our average: 88.29. The gap reflects the different design goals: Anthropic's skills are concise tooling helpers; ours are research-backed domain specialists.
2 Platinum. 9 Gold. 5 Silver. 5 Bronze. 4 skills score above our 80.0 publishing threshold, though publishing on our side also requires a minimum of 6 research sources and structured governance.
Anthropic's skills evaluated by our 6-dimension quality rubric. Not their own scoring.
This is a meaningful improvement from the original 17-skill benchmark (average 57.23, zero above 80.0). Anthropic's newer skills (skill-creator, xlsx, tdd, debug) are significantly stronger than the original batch. Their team is clearly iterating.
Where they've gotten strong
skill-creator (86.00, Platinum) is genuinely impressive. It's a meta-skill for building other skills, with clear methodology, evaluation criteria, and structural guidance. It is one of the few Anthropic skills that clears our 80.0 scoring threshold.
xlsx (86.00, Platinum) is thorough and well-structured. Detailed API reference, examples, validation steps. It reads like production documentation, not a quick prompt.
docx (81.00, Gold) and mcp-builder (84.00, Gold) also cross our threshold. Both have real depth: docx includes XML reference patterns and validation workflows, mcp-builder covers the full MCP protocol with error handling patterns.
Same 6 dimensions, same weights, same blind judge for both sides
The consistent weakness
Average Research Quality across all 21 skills: 4.7 out of 10. Average Prompt Engineering: 6.9.
Anthropic's team writes clean, well-structured prompts. The gap is in domain research depth: their skills don't include cited sources or external frameworks, which is reasonable for quick-start tooling. Our quality gate requires minimum 6 sources across 2+ types, which produces deeper domain coverage but also requires significantly more effort per skill.
Some of Anthropic's skills serve very specific internal purposes (brand-guidelines for their own brand, internal-comms for their document formats) and are not designed to be general-purpose domain experts.
What changed since the original benchmark
The gap narrowed from 31 points to 20.7 points. That's real progress. But the structural issues remain the same: no sources, no governance, no quality gate, no IP audit. Individual skill quality went up; infrastructure stayed at zero.
Layer 2: Head-to-head
Same task, same model. Their skill as system prompt vs. our best matching skill.
| Matchup | Anthropic | SupaSkills | Our Skill | Winner |
|---|---|---|---|---|
| MCP: Weather Server | 4 | 7 | mcp-server-deployment-expert | SupaSkills |
| MCP: Auth + Resources | 5 | 7 | mcp-tool-designer | SupaSkills |
| Testing: Auth Flow | 4 | 8 | cypress-e2e-testing-expert | SupaSkills |
| Testing: Visual Regression | 5 | 6 | accessibility-testing-automation-engineer | SupaSkills |
| Frontend: Dashboard | 4 | 7 | react | SupaSkills |
| Brand: Developer Tool | 7 | 8 | brand-design | SupaSkills |
| Claude API: Pipeline | 4 | 7 | claude-code | SupaSkills |
| Themes: Token System | 3 | 8 | color-system-designer | SupaSkills |
Each task scored 1-10 by a separate Claude instance that did not know which response came from which skill
SupaSkills led in all 8 matchups. Average: Anthropic 4.5, SupaSkills 7.3. This is expected: our skills are purpose-built for these domains with deep research, while Anthropic's are general-purpose helpers.
In our original February benchmark, Anthropic led in 2 matchups (MCP and Claude API) where their insider knowledge gave them an edge. Our dedicated skills for those domains have since closed that gap.
The closest matchup was Visual Regression (5 vs 6). The largest gap was Themes (3 vs 8), where our color-system-designer produced a complete multi-brand token system while their theme-factory generated a basic color set without semantic naming.
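As a sketch of the protocol: run the same task under each skill as a system prompt, then have a separate judge score the shuffled outputs. The model ID is a placeholder and the judge prompt is a paraphrase of our criteria, not the exact text; only the Anthropic SDK calls themselves are real.

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const MODEL = "claude-sonnet-4-5"; // placeholder: substitute the Sonnet model under test

async function complete(system: string, user: string): Promise<string> {
  const msg = await anthropic.messages.create({
    model: MODEL,
    max_tokens: 4096,
    system,
    messages: [{ role: "user", content: user }],
  });
  return msg.content.map((b) => (b.type === "text" ? b.text : "")).join("");
}

async function blindMatchup(task: string, skillA: string, skillB: string) {
  // Run the identical task once under each skill's prompt.
  const [outA, outB] = await Promise.all([complete(skillA, task), complete(skillB, task)]);

  // Shuffle so the judge cannot infer which skill produced which response.
  const swapped = Math.random() < 0.5;
  const [first, second] = swapped ? [outB, outA] : [outA, outB];

  const verdict = await complete(
    "You are a strict evaluator. Score each response 1-10 and explain both scores.",
    `Task:\n${task}\n\nResponse 1:\n${first}\n\nResponse 2:\n${second}`,
  );
  return { verdict, swapped }; // un-shuffle downstream using `swapped`
}
```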
Layer 3: The stack effect
What happens when you combine two or three related SupaSkills against a single Anthropic skill?
| Matchup | Anthropic | SupaSkills Stack | Stack composition | Delta |
|---|---|---|---|---|
| MCP: Auth | 7 | 8 | mcp-tool-designer + mcp-server-deployment-expert + tool-using-agent-designer | +1 |
| Testing: Auth | 5 | 8 | playwright-e2e + cypress-e2e + react-testing | +3 |
| Claude API | 7 | 8 | structured-output-designer + system-prompt-architect + vercel-ai-sdk-developer | +1 |
| Themes | 6 | 8 | color-system-designer + react-design-system-ops + accessible-component-kit-developer | +2 |
The stack won all 4 of these matchups, with margins from +1 to +3. The largest gap was testing, where the stack scored +3 over Anthropic's webapp-testing.
This is the argument for a skills platform over a skills repository. With 1,144 skills, you combine complementary expertise. A color system designer plus a design system ops expert plus an accessibility specialist covers more ground than any single "theme-factory" skill can.
We call these combinations PowerPacks. 19 published bundles of 3 to 7 skills designed to work together. The benchmark suggests the concept is measurably better, not just convenient.
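Mechanically, a stack is simple. Under our assumption that stacking just concatenates the member skills' prompts into one system prompt, a sketch looks like this (file paths are hypothetical):

```ts
import { readFileSync } from "node:fs";

// Build one system prompt from several skill files. Paths are hypothetical.
function buildStack(skillFiles: string[]): string {
  return skillFiles
    .map((path) => readFileSync(path, "utf8").trim())
    .join("\n\n---\n\n"); // separator keeps each skill's sections distinct
}

// The testing stack from the table above:
const testingStack = buildStack([
  "skills/playwright-e2e.md",
  "skills/cypress-e2e.md",
  "skills/react-testing.md",
]);
```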
Layer 4: Infrastructure comparison
Beyond individual skill quality, there is a structural gap.
| Capability | Anthropic | SupaSkills |
|---|---|---|
| Quality scoring | None | SupaScore 6D (avg 88.29) |
| Quality gate | None | Score >= 80.0 enforced |
| Research sources | None | Min 6 sources, 2+ types (9,000+ total) |
| IP/Copyright audit | None | 1,144/1,144 audited |
| Versioning | Git history | Semver, is_latest, changelog |
| Injection protection | None | Delivery guard + canary |
| Discovery | Filesystem | Hybrid semantic search |
| Distribution | Git clone | REST API + MCP + ChatGPT Actions |
| Governance | skill-creator has eval | Per-skill guardrails, stop conditions, risk_level |
| Safety disclaimers | None | Domain-specific (medical, finance, legal) |
| Multi-skill orchestration | None | PowerPacks (19 published) |
| Cross-validation | None | OpenAI cross-scores (avg delta 1.92) |
| Rate limiting | None | Tiered by plan |
| Usage analytics | None | Load counts, activation events, audit log |
Anthropic ships skills as markdown files in their CLI. This is a clean, simple approach: easy to read, easy to fork, zero infrastructure overhead. It works well for the quick-start tooling helpers they are designed to be.
We built something complementary: infrastructure for 1,144 production-grade skills with scoring, search, versioning, and governance. Different goals, different trade-offs.
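To make the distribution row concrete, here is a hypothetical sketch of pulling a skill over REST. The endpoint, query parameter, and response fields are assumptions modeled on the table above (semver, is_latest, risk_level), not a documented SupaSkills API.

```ts
// Hypothetical shape only: the endpoint, query parameter, and response
// fields below are assumptions for illustration, not a documented API.
const BASE = "https://api.example.com/v1"; // placeholder base URL

interface SkillRecord {
  slug: string;
  version: string;    // semver
  is_latest: boolean;
  prompt: string;
  score: number;      // SupaScore, 0-100
  risk_level: string; // per-skill governance metadata
}

async function fetchLatestSkill(slug: string, apiKey: string): Promise<SkillRecord> {
  const res = await fetch(`${BASE}/skills/${slug}?is_latest=true`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`skill fetch failed: ${res.status}`);
  return (await res.json()) as SkillRecord;
}
```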
What the data tells us
Anthropic is getting better, fast. Their newer skills (skill-creator at 86, xlsx at 86) meet our Platinum tier. The gap narrowed from 31 points to 20.7 points since February. Their team is clearly investing in skill quality.
The difference is infrastructure, not talent. Anthropic's best skills are well-written. The gap is in what surrounds them: sources, governance, versioning, safety disclaimers. With a quality pipeline, their strongest skills would compete with anyone's.
Skill stacking adds value. Combining two or three complementary skills consistently outperforms any single skill on complex tasks. This is the argument for a skills platform: with 1,144+ skills, you assemble domain-specific expertise for your exact use case.
Summary
| Metric | Anthropic | SupaSkills |
|---|---|---|
| Skills scored | 21 | 1,144 |
| Avg quality score | 67.57 | 88.29 |
| Platinum tier | 2 | 1,000+ |
| Research sources per skill | 0 (by design) | 6+ (required) |
| Domains covered | ~4 | 5 (14 categories) |
| Design goal | Tooling helpers | Domain specialists |
The benchmark is reproducible. Same model, same prompts, same evaluation criteria. All raw data is available in our benchmark results JSON.
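For orientation, one record in that JSON might look like the following TypeScript interface; the field names are our illustrative assumptions, not the published schema.

```ts
// Illustrative record shape; field names are our assumptions,
// not the published schema.
interface MatchupResult {
  task: string;                // e.g. "Testing: Auth Flow"
  condition: "anthropic_single" | "supaskills_single" | "supaskills_stack";
  skills: string[];            // skill(s) loaded as the system prompt
  judgeScore: number;          // 1-10, assigned blind
  judgeRationale: string;      // the judge's written explanation
}
```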
Benchmark methodology: Claude Sonnet as executor + blind judge. 21 skills scored on 6 dimensions. 8 head-to-head matchups, 10 tasks, 3 conditions per task (Anthropic single, SupaSkills single, SupaSkills stack). Updated March 15, 2026.