You find two AI skills for financial analysis. One is scored 62. The other is scored 87. Both promise to analyze your SaaS metrics. Both produce outputs that read well. Both sound confident.
The 62 tells you your churn rate is "concerning" and suggests you "focus on retention." The 87 calculates your logo churn at 4.2% monthly, revenue churn at 6.8% (flagging the expansion revenue gap), identifies that your bottom quartile of accounts by ARR churns at 3x the rate of your top quartile, and recommends a specific segmentation-based retention strategy with expected impact ranges.
Same question. Same model underneath. One gives you a feeling. The other gives you a decision.
That difference is not random. It is measurable. Here is how.
Six Dimensions, Weighted by Impact
SupaScore evaluates every skill across six dimensions. The weights are not arbitrary — they reflect how much each dimension contributes to whether the skill's output actually helps you make better decisions.
| Dimension | Weight | What it measures |
|-----------|--------|------------------|
| Research Quality | 15% | Depth of domain knowledge embedded in the skill |
| Prompt Engineering | 25% | Structural sophistication of the skill's instructions |
| Practical Utility | 15% | Whether outputs are actionable, not just informative |
| Completeness | 10% | Coverage of edge cases and failure modes |
| User Satisfaction | 20% | Output clarity, format quality, and usability |
| Decision Usefulness | 15% | Whether the output helps you make a concrete decision |
Prompt Engineering gets the highest weight because it is the multiplier. Good domain knowledge in a poorly structured prompt produces inconsistent results. Good structure applied to shallow knowledge produces confident-sounding mediocrity. The skill needs both, but structure determines whether the knowledge actually reaches the output.
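If you want to see how the weights combine, here is a minimal sketch. It assumes the overall SupaScore is a simple weighted sum of per-dimension scores on a 0-10 scale, scaled to 100; the function and field names are illustrative, not the actual SupaScore implementation, and the real aggregation may differ.

```python
# Illustrative only: assumes SupaScore is a weighted sum of 0-10 dimension
# scores, scaled to 0-100. The actual aggregation formula may differ.
WEIGHTS = {
    "research_quality": 0.15,
    "prompt_engineering": 0.25,
    "practical_utility": 0.15,
    "completeness": 0.10,
    "user_satisfaction": 0.20,
    "decision_usefulness": 0.15,
}

def supascore(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) into a 0-100 score."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS) * 10

# A skill with strong structure but shallow domain research:
example = {
    "research_quality": 5,
    "prompt_engineering": 9,
    "practical_utility": 7,
    "completeness": 6,
    "user_satisfaction": 8,
    "decision_usefulness": 7,
}
print(round(supascore(example)))  # 73
```

Notice how the 25% weight on Prompt Engineering pulls the total up even when research is mediocre, which is exactly why structure alone is not enough.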
Let us walk through what low versus high looks like in each dimension.
Research Quality (15%)
At a score of 3/10: The skill works from general knowledge. It knows that financial analysis involves metrics like revenue and churn. It applies textbook definitions. When it encounters a nuanced scenario — say, a company with negative net revenue churn driven by expansion within a shrinking customer base — it treats the headline number as positive and moves on.
At a score of 9/10: The skill embeds frameworks from specific methodologies. It knows the difference between gross and net revenue retention and when each matters. It applies industry-specific benchmarks (B2B SaaS median gross retention of 90% versus B2C at 70%). It recognizes that negative net churn with high logo churn is a concentration risk, not a growth signal.
The gap: a low-research skill tells you what your numbers are. A high-research skill tells you what your numbers mean.
Prompt Engineering (25%)
This is the structural backbone. It determines whether the skill consistently produces high-quality output or only sometimes gets it right.
At a score of 3/10: The skill uses a simple instruction: "You are a financial analyst. Analyze the provided data." No output structure. No role constraints. No error handling. The model fills in the gaps with whatever pattern seems most likely, which varies by input. Ask it the same question twice with slightly different phrasing and you may get materially different analyses.
At a score of 9/10: The skill defines a clear analytical framework, specifies output structure, includes guardrails against common failure modes, and chains reasoning steps in a deliberate sequence. It tells the model what to analyze first, what to compare against, and what format to deliver the answer in. The result is reproducible. You get the same quality of analysis regardless of how you phrase the input.
The gap: poor prompt engineering means you are gambling on whether the model interprets your intent correctly. Strong prompt engineering means the skill controls the interpretation.
Practical Utility (15%)
This is the "so what" dimension. Does the output help you do something, or does it just tell you something?
At a score of 3/10: "Your customer acquisition cost is high relative to industry averages. Consider optimizing your marketing spend." That is an observation dressed as advice. What should you optimize? Which channels? By how much? No answers.
At a score of 9/10: "Your CAC of EUR 840 is 2.1x the B2B SaaS median for your ARR range. Your paid search CAC is EUR 1,200 versus EUR 380 for organic content. Shifting 30% of paid budget to content marketing, assuming a 6-month ramp, projects CAC reduction to EUR 620 based on comparable companies at your stage." Now you have a decision to make, not a vague concern to worry about.
The gap: low-utility skills describe problems. High-utility skills prescribe responses with enough specificity that you can act on them today.
Completeness (10%)
Completeness gets the lowest weight because you can compensate for missing edge cases with follow-up questions. But skills that handle them up front save you the round trips.
At a score of 3/10: The skill handles the happy path. Give it clean data and a straightforward question and it performs well. Give it messy data — missing values, conflicting figures, unusual business models — and it either ignores the issues or hallucinates through them.
At a score of 9/10: The skill identifies data gaps before analyzing. It flags when figures do not reconcile. It asks for clarification on ambiguous inputs instead of guessing. When it encounters a scenario outside its training — a hybrid SaaS/marketplace model, for instance — it states the limitation explicitly rather than applying a pure-SaaS framework and hoping it fits.
The gap: incomplete skills give you false confidence. Complete skills tell you where the analysis is strong and where it is uncertain.
User Satisfaction (20%)
This is the second-highest weight for a reason. The best analysis in the world is useless if it is buried in a wall of text, structured in a way that hides the key findings, or formatted so that you cannot extract the actionable pieces.
At a score of 3/10: A long paragraph that mixes observations, recommendations, and caveats without clear separation. Important findings sit in the middle of the output. No hierarchy. No structure. You have to read the entire thing to find what matters.
At a score of 9/10: The output starts with a summary of findings ranked by impact. Each finding has a clear label, supporting data, and a recommended action. Format is consistent across runs. Tables where tables help. Prose where context matters. The output respects that you are making decisions under time pressure and structures itself accordingly.
The gap: low-satisfaction skills make you work to extract value. High-satisfaction skills hand you the value in a format you can use immediately.
Decision Usefulness (15%)
The final dimension is the one that ties everything together. After reading the output, can you make a better decision than you could before?
At a score of 3/10: "Based on the analysis, your company's financial health appears moderate. There are areas of strength and areas that need improvement." This is the AI equivalent of a fortune cookie. It could apply to any company. It advances no decision.
At a score of 9/10: "Your runway at current burn is 14 months. If revenue growth decelerates to the trailing 3-month trend (8% MoM vs. 12% 6-month average), runway drops to 11 months. Recommendation: either close the Series A within 6 months at current terms or reduce burn by EUR 45K/month to extend runway to 18 months. The primary lever is the EUR 28K/month in paid acquisition that is not converting to retained revenue."
The gap is binary. After a low-scoring output, you still do not know what to do. After a high-scoring output, you know exactly what to do, why, and what happens if you do not.
The Tier System
Scores roll up into tiers that give you an instant read on what to expect:
- Bronze (below 60): Outputs are unreliable. Likely better off prompting the model yourself. Skills at this level typically have fundamental issues in prompt engineering or domain knowledge.
- Silver (60-69): Usable for simple tasks but falls apart on nuance. Good enough for first drafts that you will heavily edit. Not suitable for decisions with consequences.
- Gold (70-84): Solid for professional use. Handles most scenarios well. You still verify critical outputs, but the skill saves significant time and catches things you might miss.
- Platinum (85-94): Expert-grade. Outputs consistently match or exceed what a capable professional produces in a first pass. Identifies risks and opportunities that generalist analysis misses.
- Diamond (95+): Exceptional. Rare. Outputs demonstrate genuine domain mastery with sophisticated handling of edge cases, nuance, and competing considerations.
Every skill on supaskills clears a minimum quality gate: a SupaScore of 80, which puts it in the upper Gold range or above. Skills below that threshold do not get published.
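For a concrete read on how those cutoffs work, here is a minimal sketch that maps a score to its tier and checks the publication gate. The helper names are illustrative; only the thresholds come from the tier list above.

```python
def tier(score: float) -> str:
    """Map a SupaScore (0-100) to its tier using the cutoffs above."""
    if score >= 95:
        return "Diamond"
    if score >= 85:
        return "Platinum"
    if score >= 70:
        return "Gold"
    if score >= 60:
        return "Silver"
    return "Bronze"

def publishable(score: float) -> bool:
    """supaskills' minimum quality gate: SupaScore of 80 or above."""
    return score >= 80

# The two skills from the opening example:
print(tier(87), publishable(87))  # Platinum True
print(tier(62), publishable(62))  # Silver False
```

The 62 from the opening never makes it onto the platform. The 87 does.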
Why This Matters
The internet is about to be flooded with AI skills, prompts, agents, and assistants. Most of them will be untested. Many will sound impressive in their descriptions and produce mediocre outputs in practice. Some will produce outputs that are confidently wrong — the worst possible outcome when you are making business decisions.
SupaScore exists because quality is not optional when the output influences real decisions. A financial analysis that misses your runway problem is not "slightly less useful." It is dangerous. A contract review that overlooks a termination clause is not "incomplete." It is a liability.
The score gives you a way to know, before you rely on a skill, whether it has been held to a standard. Not a marketing standard. A measurable one.
Six dimensions. Weighted by impact. Applied to every skill. That is how you tell the difference between a prompt someone pasted from a forum and a skill that was built to get the answer right.