How We Score AI Skills
Every skill in the SupaSkills catalogue is scored on six dimensions using SupaScore. The rubric is public. The minimum score to publish is 80/100. Here's how it works.
Why we score
Not all system prompts are equal. A one-line “you are an expert” prompt and a 3,000-token researched methodology both technically work — but they produce fundamentally different output.
We built SupaScore to measure that difference. Every skill in our catalogue passes through the same quality evaluation before it goes live. If it scores below 80, it doesn't ship.
The scoring system isn't marketing. It's our quality gate.
The 6 dimensions
Each skill is evaluated on six dimensions. The composite SupaScore is a weighted average of the six dimension scores; a sketch of the calculation follows the rubric below.
Research Quality
Weight: 15%. Does the skill draw on verified, high-quality sources? Are frameworks correctly applied? Is the domain knowledge accurate and current?
- 5+ cited sources required
- Minimum 2 source types (e.g. book + official docs)
- No single-source dependency
- Factual accuracy verified
Prompt Engineering
Weight: 25%. Is the system prompt well-structured? Does it use clear instructions, structured output formats, and effective techniques?
- Clear role definition and task framing
- Structured output format specified
- Edge cases and constraints addressed
- Efficient token usage (no bloat)
Practical Utility
Weight: 15%. Does the skill produce output that is directly useful? Can the user act on it without significant rework?
- Output is actionable, not just informational
- Format matches real-world use (report, checklist, analysis)
- Reduces time vs. the manual approach
- Works across typical use cases in the domain
Completeness
Weight: 10%. Does the skill cover the domain adequately? Are there obvious gaps or missing perspectives?
- Core aspects of the domain addressed
- Common edge cases handled
- Guardrails for out-of-scope requests
- Appropriate depth (not superficial, not overloaded)
User Satisfaction
Weight: 20%. Does the output feel right? Is it clear, well-organised, and professional?
- Output is readable and well-structured
- Tone matches the domain (formal for legal, practical for engineering)
- No hallucination-prone instructions
- Consistent quality across different inputs
Decision Usefulness
Weight: 15%. Does the skill help the user make better decisions? Does it surface options, risks, and trade-offs?
- Presents alternatives, not just one answer
- Identifies risks and limitations
- Adapts to the user's specific context
- Supports informed decision-making
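The composite is simply the weighted average of the six dimension scores using the weights above. A minimal sketch in Python, assuming each dimension is scored 0–100 (the key names and function are illustrative, not the actual implementation):

```python
# Published rubric weights; they sum to 1.0.
WEIGHTS = {
    "research_quality": 0.15,
    "prompt_engineering": 0.25,
    "practical_utility": 0.15,
    "completeness": 0.10,
    "user_satisfaction": 0.20,
    "decision_usefulness": 0.15,
}

def supascore(dimension_scores: dict[str, float]) -> float:
    """Composite SupaScore: weighted average of the six dimension scores (0-100)."""
    return round(sum(WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS), 2)

# Example: solid scores across the board land in the mid-80s.
print(supascore({
    "research_quality": 88,
    "prompt_engineering": 86,
    "practical_utility": 85,
    "completeness": 82,
    "user_satisfaction": 87,
    "decision_usefulness": 84,
}))  # -> 85.65
```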
The score scale
Production floor: 80 (Gold tier). Nothing below 80 enters the catalogue. Current range: 80.00 – 89.75. Average: 84.2.
- 95 – 100: Expert-verified. Available to Max users.
- 85 – 94: Excellent. Available to Pro and Max users.
- 70 – 84: Published. Available to all users (the 80 production floor still applies).
- 60 – 69: Below the quality gate. Not published.
- < 60: Draft only. Internal use.
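A minimal sketch of how a composite score maps to the bands above (tier names and availability come from the scale; the function itself is illustrative):

```python
def tier(score: float) -> str:
    """Map a composite SupaScore to the bands listed above.

    Publication additionally requires the 80-point production floor.
    """
    if score >= 95:
        return "Expert-verified (Max users)"
    if score >= 85:
        return "Excellent (Pro and Max users)"
    if score >= 70:
        return "Published (all users)"
    if score >= 60:
        return "Below quality gate (not published)"
    return "Draft only (internal use)"
```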
How skills are built
Each skill goes through an 8-phase research pipeline.
1. Domain Scoping: Define the skill's domain, target user, and expected output.
2. Source Research: Collect 6+ sources (books, papers, frameworks, official docs).
3. Methodology Extraction: Identify key frameworks, diagnostic questions, and decision trees.
4. Prompt Drafting: Write the system prompt encoding the methodology.
5. Quality Scoring: Score the skill on the six dimensions using SupaScore.
6. Masterfile Creation: Create the canonical reference document (research + prompt + sources + score).
7. Quality Gate: Automated check (score ≥ 80, sources ≥ 6, masterfile complete).
8. Publication: The skill goes live in the catalogue with version tracking.
The pipeline has run 35+ production sessions with 0 failures.
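A minimal sketch of the phase-7 gate, assuming a skill record carries its composite score, source list, and masterfile (field names are illustrative, not the actual schema):

```python
REQUIRED_MASTERFILE_FIELDS = ("research", "prompt", "sources", "score")  # per phase 6

def passes_quality_gate(score: float, sources: list[str], masterfile: dict) -> bool:
    """Phase-7 automated check: score >= 80, sources >= 6, masterfile complete."""
    return (
        score >= 80
        and len(sources) >= 6
        and all(masterfile.get(field) for field in REQUIRED_MASTERFILE_FIELDS)
    )
```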
Source standards
Every skill cites its research sources. We require a minimum of 6 sources per skill and at least 2 source types, which prevents single-perspective bias.
Sources are displayed as “Research Sources” — we conducted the research, the skill is our original work.
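A minimal sketch of the source-standards check, assuming each source records a title and a type (the Source class and type labels are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Source:
    title: str
    kind: str  # e.g. "book", "paper", "framework", "official docs"

def meets_source_standards(sources: list[Source]) -> bool:
    """At least 6 sources spanning at least 2 distinct source types."""
    return len(sources) >= 6 and len({s.kind for s in sources}) >= 2
```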
What we don't claim
- We don't claim every skill is perfect. The scoring system exists because quality varies.
- We don't claim the score predicts your specific use case. Try the free tier and judge for yourself.
- We don't claim affiliation with or endorsement by any cited author or organisation.
- We don't claim system prompts replace domain expertise. They encode it for faster, more consistent access.
- Our benchmark is a demonstration (5 tests), not a peer-reviewed study. We report patterns and specific examples, not aggregate percentages.
SupaBoost results
We tested 5 Platinum-tier skills head-to-head against vanilla Claude Sonnet. Same prompt, same model, with and without a skill loaded.
The pattern was consistent across all 5 domains: vanilla Claude gives correct but general advice. A skill transforms it into a structured methodology with specific frameworks, templates, code, and monitoring.
This is a demonstration (5 tests, not a peer-reviewed study). Full case studies with before/after comparisons are available on the benchmark page.
Questions about the methodology? Get in touch
Browse Scored Skills
SupaSkills is built by Kill The Dragon, a strategy agency in Vienna.