Does it actually work?
We ran a blind A/B test. 10 queries across 10 professional domains. A separate Claude instance judged both responses without knowing which was which. Plus 15 deep-dive comparisons. No cherry-picking, no retries.
10 queries. Blind judge. Real numbers.
Same prompt, same model, same settings. One response with a SupaSkill loaded, one without. A separate Claude instance scored both blind — randomised order, no labels, six dimensions scored 1–10.
8 of 9
Skill wins
89% win rate
+15%
Quality improvement
51.9 vs 45.0 out of 60
~1s
API overhead
865ms search + 201ms load
By domain
| Domain | With Skill | Without | Delta |
|---|---|---|---|
| Legal | 58/60 | 42/60 | +38.1% |
| Finance | 52/60 | 40/60 | +30.0% |
| Engineering | 53/60 | 41/60 | +29.3% |
| Marketing | 53/60 | 43/60 | +23.3% |
| DevOps | 53/60 | 48/60 | +10.4% |
| Pricing | 51/60 | 47/60 | +8.5% |
| Data | 49/60 | 47/60 | +4.3% |
| Communication | 49/60 | 47/60 | +4.3% |
| Security | 49/60 | 50/60 | -2.0% |
The more specialised the domain, the bigger the improvement. Legal, finance, and engineering — where precision and frameworks matter most — saw the largest gains. Security was effectively a tie: a single point out of 60.
Where the difference shows
Averages across 9 matched queries. Blue = with skill, grey = without.
Source Quality shows the largest gap. Skills cite specific frameworks, standards, and methodologies that vanilla Claude doesn't reference. The output becomes verifiable, not just plausible.
Latency overhead
865ms
Search
201ms
Load
∼1s
Total overhead
Claude takes 20–45 seconds to generate a response. One extra second for measurably better output.
Outlier: Query #7
The product domain query asked for a PRD, but search matched a press release specialist. The skill correctly refused to write a PRD — producing only 276 tokens vs 2,000 — scoring 12/60. This is a search relevance issue, not a skill quality issue. We excluded it from the averages and are improving search matching.
What we didn't do
- • We didn't run 100 queries and pick the best 10
- • We didn't tune prompts to favour skills
- • We didn't exclude the security result where vanilla Claude won
- • We didn't exclude the search mismatch — we reported it
The pattern
Across every test, the same pattern emerged. Vanilla Claude gives correct but generic advice. A SupaSkill transforms it into structured, actionable output.
Without Skill
Generic advice
Correct but surface-level
Stops at the recommendation
"Here’s what to do"
No risk awareness
With Skill
Structured methodology
Specific and deep
Full implementation lifecycle
Templates, code, and dashboards
Guardrails and monitoring built in
5-dimension comparison
Each of the 5 original tests scored on structure, specificity, completeness, actionability, and correctness. Red = baseline, blue = with skill.
Median improvement: +163%. The real difference is structural, not just length.
Database Performance Tuner
89 · +133%
7-phase diagnostic methodology, partial indexes that skip 75% of rows, pg_stat_statements monitoring
Customer Acquisition Cost Optimizer
85.5 · +185%
Magic Number calculation (0.13 — critical sales efficiency failure baseline missed entirely)
Cross-Border Data Transfer Specialist
88.1 · +163%
Full TIA template with FISA 702 gap analysis, vendor-specific migration guides, EDPB references by number
AI Cost Optimizer
86.55 · +1127%
From 3 bullet points to a 6-phase engineering plan with working Python code, cost projections at 10x scale
DevSecOps Pipeline Architect
89.2 · +92%
Quality gates with policy-as-code, SOC 2 control mapping (CC6.1, CC6.7, CC7.1), exception/waiver workflow
Where vanilla Claude gets it wrong
Some tasks require domain expertise that general-purpose Claude doesn't have. We tested 10 scenarios across 5 domains — toggle each card to see the difference.
9/10 clear win. 1/10 marginal win. 0 ties. 0 boosted worse.
SaaS Contract Clause Analysis
Baseline identifies problems but provides no redline language, no market-standard comparison, and no priority ordering — boosted transforms vague warnings into an actionable negotiation playbook with exact contract replacement text.
GDPR DSAR Deadline Calculation
Baseline misses identity verification as a prerequisite — a common GDPR trap that can invalidate a response — and omits international data transfer compliance for the US CRM vendor.
SaaS Unit Economics Error
Baseline's LTV calculation inflates the result, producing an optimistic 3.67x ratio that masks the true crisis — boosted uses the correct LTV = (ARPC x Margin) / Churn formula and surfaces a 37.7-month payback.
Pricing Tier Cannibalisation
Baseline doesn't model Enterprise cannibalization risk — boosted shows the new tier could reduce total revenue if the decoy effect isn't engineered correctly, turning 'add a tier' into a strategic pricing architecture decision.
Log4j CVE Risk Assessment
Baseline stops at 'patch and use JVM flag' — boosted adds the critical perimeter defense layer (WAF JNDI blocking rule) and compliance mappings that matter for security audit reporting.
AWS API Key Exposure Triage
Baseline misses git history sanitization entirely — the credential persists in git history and all forks even after key rotation. Boosted also addresses GDPR notification triggers if S3 personal data was accessed.
ETL Pipeline Failure Diagnosis
Baseline treats this as a one-time fix — boosted adds production-grade defensive engineering (schema evolution function, Airflow pre-flight task, dbt warn-not-fail) that prevents silent failure by design.
A/B Test Statistical Error
Both reach the correct 'do not ship' conclusion — boosted uniquely flags Sample Ratio Mismatch as a validity threat and identifies the observed effect is below MDE threshold.
SOC 2 Evidence Gap Analysis
Baseline gives a process checklist — boosted delivers auditor-ready documentation artifacts (attestation template, YAML checklist, OPA policy code) that an auditor can directly review as evidence.
Data Breach Notification Timeline
Baseline treats individual notification as mandatory — boosted identifies the pivotal nuance that password hash algorithm strength determines Art. 34 high-risk threshold, a material distinction affecting notification obligations for 50,000 users.
How we test
- ModelClaude Sonnet 4 (same version for all calls)
- MethodSame prompt sent twice — once without, once with skill system prompt
- JudgingBlind A/B: a separate Claude instance scores randomised responses on 6 dimensions (1–10 each)
- RulesSingle run per test, no retries, no cherry-picking. Prompts designed before running.
- PromptsRealistic scenarios with specific numbers and constraints — not toy examples
We don't claim statistical significance from 25 tests. This is a demonstration, not a peer-reviewed study. We report every result including the ones where vanilla Claude won.
The best test is your own: grab a free account, load a skill in your domain, and compare the output to what you were getting before.
See for yourself
Activate a skill and compare the output to what you were getting before. 3 free skill slots. No credit card required.
SupaSkills is built by Kill The Dragon, a strategy agency in Vienna.