Benchmark

Does it actually work?

We ran a blind A/B test. 10 queries across 10 professional domains. A separate Claude instance judged both responses without knowing which was which. Plus 15 deep-dive comparisons. No cherry-picking, no retries.

Browse 1,000+ skills →See pricing →

Blind A/B Evaluation

10 queries. Blind judge. Real numbers.

Same prompt, same model, same settings. One response with a SupaSkill loaded, one without. A separate Claude instance scored both blind — randomised order, no labels, six dimensions scored 1–10.

8 of 9

Skill wins

89% win rate

+15%

Quality improvement

51.9 vs 45.0 out of 60

~1s

API overhead

865ms search + 201ms load

By domain

Domain	With Skill	Without	Delta
Legal	58/60	42/60	+38.1%
Finance	52/60	40/60	+30.0%
Engineering	53/60	41/60	+29.3%
Marketing	53/60	43/60	+23.3%
DevOps	53/60	48/60	+10.4%
Pricing	51/60	47/60	+8.5%
Data	49/60	47/60	+4.3%
Communication	49/60	47/60	+4.3%
Security	49/60	50/60	-2.0%

The more specialised the domain, the bigger the improvement. Legal, finance, and engineering — where precision and frameworks matter most — saw the largest gains. Security was effectively a tie: a single point out of 60.

Where the difference shows

Averages across 9 matched queries. Blue = with skill, grey = without.

Source Quality+33%

8.4

6.3

Specificity+22%

9.1

7.4

Practical Value+17%

7.7

Completeness+17%

8.4

7.2

Structure+5%

8.8

8.3

Actionability+1%

8.1

Source Quality shows the largest gap. Skills cite specific frameworks, standards, and methodologies that vanilla Claude doesn't reference. The output becomes verifiable, not just plausible.

Latency overhead

865ms

201ms

Load

∼1s

Total overhead

Claude takes 20–45 seconds to generate a response. One extra second for measurably better output.

Outlier: Query #7

The product domain query asked for a PRD, but search matched a press release specialist. The skill correctly refused to write a PRD — producing only 276 tokens vs 2,000 — scoring 12/60. This is a search relevance issue, not a skill quality issue. We excluded it from the averages and are improving search matching.

What we didn't do

• We didn't run 100 queries and pick the best 10
• We didn't tune prompts to favour skills
• We didn't exclude the security result where vanilla Claude won
• We didn't exclude the search mismatch — we reported it

The pattern

Across every test, the same pattern emerged. Vanilla Claude gives correct but generic advice. A SupaSkill transforms it into structured, actionable output.

Without Skill

Generic advice

Correct but surface-level

Stops at the recommendation

"Here’s what to do"

No risk awareness

With Skill

Structured methodology

Specific and deep

Full implementation lifecycle

Templates, code, and dashboards

Guardrails and monitoring built in

5-dimension comparison

Each of the 5 original tests scored on structure, specificity, completeness, actionability, and correctness. Red = baseline, blue = with skill.

Median improvement: +163%. The real difference is structural, not just length.

Coding

Platinum

Database Performance Tuner

89 · +133%

7-phase diagnostic methodology, partial indexes that skip 75% of rows, pg_stat_statements monitoring

Baseline

312w

Boosted

728w

Marketing

Platinum

Customer Acquisition Cost Optimizer

85.5 · +185%

Magic Number calculation (0.13 — critical sales efficiency failure baseline missed entirely)

Baseline

888w

Boosted

2529w

Legal

Platinum

Cross-Border Data Transfer Specialist

88.1 · +163%

Full TIA template with FISA 702 gap analysis, vendor-specific migration guides, EDPB references by number

Baseline

704w

Boosted

1855w

Data

Platinum

AI Cost Optimizer

86.55 · +1127%

From 3 bullet points to a 6-phase engineering plan with working Python code, cost projections at 10x scale

Baseline

243w

Boosted

2981w

Audit

Platinum

DevSecOps Pipeline Architect

89.2 · +92%

Quality gates with policy-as-code, SOC 2 control mapping (CC6.1, CC6.7, CC7.1), exception/waiver workflow

Baseline

817w

Boosted

1571w

Where vanilla Claude gets it wrong

Some tasks require domain expertise that general-purpose Claude doesn't have. We tested 10 scenarios across 5 domains — toggle each card to see the difference.

9/10 clear win. 1/10 marginal win. 0 ties. 0 boosted worse.

legal86.5

SaaS Contract Clause Analysis

Identifies GDPR vagueness, asymmetric termination, IP ownership concern, jurisdiction mismatch. Bullet-point format with no structured methodology or priority ordering.

236w→1061wClear Win

Baseline identifies problems but provides no redline language, no market-standard comparison, and no priority ordering — boosted transforms vague warnings into an actionable negotiation playbook with exact contract replacement text.

legal85.65

GDPR DSAR Deadline Calculation

Correct deadline (March 3). Extension conditions summarized. Controller responsibility for CRM vendor noted. No structured action plan.

258w→660wClear Win

Baseline misses identity verification as a prerequisite — a common GDPR trap that can invalidate a response — and omits international data transfer compliance for the US CRM vendor.

finance87.25

SaaS Unit Economics Error

Calculates CAC $2,375 correctly but LTV $8,712 and LTV:CAC 3.67:1 — considers business acceptable. Uses inflated LTV formula that masks the true crisis.

296w→722wClear Win

Baseline's LTV calculation inflates the result, producing an optimistic 3.67x ratio that masks the true crisis — boosted uses the correct LTV = (ARPC x Margin) / Churn formula and surfaces a 37.7-month payback.

finance88.85

Pricing Tier Cannibalisation

Feature list for new tier. Conservative/optimistic revenue scenarios (+3.7% to +11%). No explicit cannibalization modeling.

213w→634wClear Win

Baseline doesn't model Enterprise cannibalization risk — boosted shows the new tier could reduce total revenue if the decoy effect isn't engineered correctly, turning 'add a tier' into a strategic pricing architecture decision.

security86.55

Log4j CVE Risk Assessment

Correctly identifies exploitability. 4 attack vectors. Two mitigations (upgrade or JVM flag). No perimeter defense or compliance mapping.

245w→688wClear Win

Baseline stops at 'patch and use JVM flag' — boosted adds the critical perimeter defense layer (WAF JNDI blocking rule) and compliance mappings that matter for security audit reporting.

security86

AWS API Key Exposure Triage

Correct sequence: disable key, create new, investigate CloudTrail. Blast radius enumeration. No git history sanitization, no fork proliferation awareness.

401w→889wClear Win

Baseline misses git history sanitization entirely — the credential persists in git history and all forks even after key rotation. Boosted also addresses GDPR notification triggers if S3 personal data was accessed.

data84.9

ETL Pipeline Failure Diagnosis

Correctly diagnoses vendor-added column. Bash diff commands. Three fix options. Contact vendor. Schema validation recommendation. Misses dashboard impact assessment.

270w→822wClear Win

Baseline treats this as a one-time fix — boosted adds production-grade defensive engineering (schema evolution function, Airflow pre-flight task, dbt warn-not-fail) that prevents silent failure by design.

data85

A/B Test Statistical Error

Correctly flags small effect size, borderline p-value, short duration. Recommends extending to 1-2 weeks. Both reach correct 'do not ship' conclusion.

199w→385wMarginal Win

Both reach the correct 'do not ship' conclusion — boosted uniquely flags Sample Ratio Mismatch as a validity threat and identifies the observed effect is below MDE threshold.

compliance89.2

SOC 2 Evidence Gap Analysis

Identifies three critical gaps. 6-week action plan. Suggests Okta deprovisioning. Brief evidence checklist. No auditor-ready artifacts.

236w→914wClear Win

Baseline gives a process checklist — boosted delivers auditor-ready documentation artifacts (attestation template, YAML checklist, OPA policy code) that an auditor can directly review as evidence.

compliance86.25

Data Breach Notification Timeline

72-hour GDPR Art. 33 notification. Customer notification. US state law variability. SOC 2 auditor. Basic immediate actions. Treats individual notification as mandatory.

259w→590wClear Win

Baseline treats individual notification as mandatory — boosted identifies the pivotal nuance that password hash algorithm strength determines Art. 34 high-risk threshold, a material distinction affecting notification obligations for 50,000 users.

How we test

ModelClaude Sonnet 4 (same version for all calls)
MethodSame prompt sent twice — once without, once with skill system prompt
JudgingBlind A/B: a separate Claude instance scores randomised responses on 6 dimensions (1–10 each)
RulesSingle run per test, no retries, no cherry-picking. Prompts designed before running.
PromptsRealistic scenarios with specific numbers and constraints — not toy examples

We don't claim statistical significance from 25 tests. This is a demonstration, not a peer-reviewed study. We report every result including the ones where vanilla Claude won.

The best test is your own: grab a free account, load a skill in your domain, and compare the output to what you were getting before.

Read scoring methodology →

See for yourself

Activate a skill and compare the output to what you were getting before. 3 free skill slots. No credit card required.

Start Free Browse 1,000+ Skills

SupaSkills is built by Kill The Dragon, a strategy agency in Vienna.