The AI discourse has a blind spot.
Every week, a new benchmark. GPT-5 vs Claude 4 vs Gemini Ultra vs Llama 4. MMLU scores, HumanEval pass rates, reasoning traces. The industry obsesses over model quality — and rightfully so. Better models matter.
But here is the thing nobody talks about: the prompt determines roughly 80% of output quality. The model is the engine. The prompt is the steering wheel, the GPS, and the road map combined. You can put a Formula 1 engine in a car with no steering and it will go fast — into a wall.
Most people are driving into walls.
The State of Prompts in 2026
There is no npm for prompts. No package manager. No version control convention. No test suite. No peer review. No quality score.
Think about what that means. In software engineering, you would never ship code without linting, testing, and review. In data science, you would never deploy a model without validation metrics. But in the prompt ecosystem — which drives billions of dollars of AI-generated output every day — the standard practice is: write something in a text box, see if the output looks okay, ship it.
This is where code was in 2005. Before CI/CD. Before pull requests. Before linting was standard. Before anyone agreed that automated testing was not optional. Developers wrote code, eyeballed it, and deployed on Fridays. We cringe at that now.
The prompt ecosystem is in its cowboy coding era.
Why Prompt Quality Is Hard
Prompt quality is hard to measure because the failure modes are subtle. Bad code crashes. Bad prompts produce output that looks right and is wrong.
A poorly engineered prompt for contract review will produce a summary that reads like a lawyer wrote it. Correct grammar, professional tone, structured paragraphs. But it misses the asymmetric indemnification clause. It does not flag the governing law problem. It summarizes instead of analyzing. The output passes the eye test and fails the expert test.
This is worse than a crash. A crash tells you something is broken. A plausible-but-wrong analysis tells you nothing — until it costs you money.
Prompt quality is also hard because it is multi-dimensional. A good prompt needs:
- Domain knowledge: What frameworks, standards, and methodologies apply?
- Task architecture: In what order should the model process the information?
- Output design: What format serves the user's decision-making process?
- Guardrails: What should the model refuse to do? Where should it flag uncertainty?
- Evaluation criteria: How do you know if the output is good?
Most prompts address zero of these. The typical prompt is a sentence or two of instruction followed by hope.
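To make that concrete, here is a sketch of a prompt skeleton that gives each of the five dimensions an explicit section, using the contract-review example from earlier. The section names and wording are illustrative, not a prescribed format; the point is that every dimension becomes something a reviewer can actually see and check.

```python
# Illustrative sketch only: a prompt skeleton that makes each of the five
# dimensions an explicit, reviewable section instead of leaving them implied.
CONTRACT_REVIEW_PROMPT = """
Domain knowledge:
You are reviewing a commercial contract. Apply standard concepts such as
indemnification, limitation of liability, and governing law.

Task architecture:
1. Read the full contract before commenting on any clause.
2. Identify every clause that allocates risk between the parties.
3. Analyze each risk-allocating clause; do not merely summarize it.

Output design:
Return a table with columns: clause, risk to our side, severity
(high / medium / low), recommended change.

Guardrails:
Do not give jurisdiction-specific legal advice. If a clause turns on local
law, flag it as "requires counsel" instead of guessing.

Evaluation criteria:
The review is incomplete if any indemnification, liability, or
governing-law clause is missing from the output table.
"""

def build_prompt(contract_text: str) -> str:
    """Attach the document to the skeleton so instructions and input ship together."""
    return f"{CONTRACT_REVIEW_PROMPT}\n\nContract:\n{contract_text}"
```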
The Code Quality Parallel
Software engineering solved this problem over 20 years, and the path is instructive.
Phase 1: Cowboy era. Developers wrote code however they wanted. No standards, no reviews, no automated checks. Quality was a function of individual skill and luck.
Phase 2: Linting and style guides. Tools like ESLint and Prettier, and style guides like PEP 8, enforced baseline consistency. Not quality: consistency. But consistency turned out to be a precondition for quality.
Phase 3: Testing and CI/CD. Automated test suites caught regressions. Continuous integration meant every change was validated before merge. Quality became measurable and enforceable.
Phase 4: Review culture. Pull requests, code review, pair programming. Human judgment layered on top of automated checks. The combination of machine enforcement and human evaluation produced reliably high-quality output.
Prompts are stuck at Phase 1. There are no linters for prompt quality. No test suites that validate prompt output against expert benchmarks. No review culture. No shared standards for what a good prompt looks like.
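As a thought experiment, a Phase 2 linter for prompts would not need to judge quality at all; it could start with purely structural checks, the way early code linters enforced consistency. The rule names and regexes below are invented for illustration; no such standard tool exists today.

```python
import re

# Hypothetical baseline rules: consistency checks, not quality checks,
# in the spirit of Phase 2 linting. All rule names and patterns are invented.
RULES = {
    "declares-role-or-domain": r"\b(you are|act as|domain|framework|standard)\b",
    "specifies-output-format": r"\b(table|json|markdown|bullet|columns|format)\b",
    "includes-guardrail": r"\b(refuse|do not|flag|uncertain|requires counsel)\b",
}

def lint_prompt(prompt: str) -> list[str]:
    """Return the baseline rules this prompt fails to satisfy."""
    text = prompt.lower()
    return [name for name, pattern in RULES.items()
            if not re.search(pattern, text)]

print(lint_prompt("Summarize this contract."))
# -> all three rules fail for a typical one-line prompt
```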
What a Standard Could Look Like
We are not claiming to have solved this. But we think about it constantly, because it is the core problem behind everything we build.
SupaScore is one attempt at a quality standard for prompts. It measures six dimensions: Research Quality, Prompt Engineering, Practical Utility, Completeness, User Satisfaction, and Decision Usefulness. Every skill on supaskills is scored across all six. The computation is independent — skills do not score themselves.
Is it perfect? No. Is it better than nothing? By a wide margin.
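To make the "independent computation" point concrete, here is a minimal sketch of an external evaluator that aggregates per-dimension scores it receives from somewhere else, never from the skill itself. The 0-100 scale and the unweighted average are assumptions for illustration; they are not the actual SupaScore computation.

```python
from statistics import mean

# The six dimensions named above. The 0-100 scale and the plain mean
# below are illustrative assumptions, not the actual SupaScore computation.
DIMENSIONS = (
    "Research Quality", "Prompt Engineering", "Practical Utility",
    "Completeness", "User Satisfaction", "Decision Usefulness",
)

def aggregate_score(dimension_scores: dict[str, float]) -> float:
    """External evaluator: consumes independently produced per-dimension
    scores and refuses to proceed if any dimension is missing."""
    missing = [d for d in DIMENSIONS if d not in dimension_scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return round(mean(dimension_scores[d] for d in DIMENSIONS), 1)
```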
Here is what we think a broader industry standard would need:
Source verification. Is the prompt grounded in real domain knowledge, or is it generating plausible-sounding methodology? A contract review prompt should reference actual legal principles. A financial analysis prompt should use established valuation frameworks. Sources should be citable and verifiable.
Output validation. Does the prompt consistently produce output that experts would approve? This requires evaluation against expert benchmarks, not just user satisfaction surveys. Users often cannot tell the difference between good output and confident-sounding bad output.
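Here is one hedged sketch of what evaluation against expert benchmarks could look like at its simplest: a fixed set of expert-labeled cases and a check of how many expert-flagged issues the prompt's output actually surfaces. The case format and the substring matching are deliberately crude and purely illustrative; real evaluation would need expert graders or a much stronger rubric.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkCase:
    """An expert-labeled case: an input document plus the issues an expert
    says a correct analysis must surface. Format invented for illustration."""
    document: str
    expert_flags: list[str]  # e.g. ["asymmetric indemnification", "governing law"]

def expert_recall(cases: list[BenchmarkCase],
                  run_prompt: Callable[[str], str]) -> float:
    """Fraction of expert-flagged issues that appear in the prompt's output.
    `run_prompt` is whatever callable sends a document to the model."""
    hits = total = 0
    for case in cases:
        output = run_prompt(case.document).lower()
        for flag in case.expert_flags:
            total += 1
            hits += flag.lower() in output
    return hits / total if total else 0.0
```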
Scope coverage. Does the prompt cover the full scope of its claimed domain? A GDPR prompt that ignores cross-border data transfers has a critical gap. Scope should be documented and testable.
Safety boundaries. Does the prompt know what it does not know? A legal prompt should flag when a question requires jurisdiction-specific advice. A medical prompt should refuse to diagnose. Guardrails should be explicit, not implied.
Versioning and regression testing. When a prompt is updated, does the new version perform at least as well as the old version across all dimensions? Without regression testing, iteration degrades quality as often as it improves it.
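And the regression check itself can borrow directly from the code-testing playbook: pin the previous version's per-dimension scores and block the update if any dimension gets worse. A minimal sketch, assuming per-dimension scores on some fixed scale already exist:

```python
def find_regressions(old_scores: dict[str, float],
                     new_scores: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Dimensions where the new prompt version scores worse than the old one
    (beyond the tolerance). An empty list means the update is safe to ship."""
    return [dim for dim, old in old_scores.items()
            if new_scores.get(dim, 0.0) < old - tolerance]

# Usage sketch with made-up numbers: one dimension improved, one regressed.
old = {"Completeness": 82, "Practical Utility": 77}
new = {"Completeness": 85, "Practical Utility": 71}
print(find_regressions(old, new))  # -> ['Practical Utility']: block this update
```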
The Industry Needs This
This is not a pitch for supaskills specifically. It is a pitch for the concept of prompt quality as a discipline.
Right now, every company building on AI is reinventing the same wheel: writing prompts internally, evaluating them by feel, iterating without metrics, and shipping without review. The aggregate waste — in time, in money, in decisions made on bad AI output — is enormous.
The software industry figured out that code quality requires infrastructure: tools, standards, processes, and culture. The prompt ecosystem needs the same infrastructure. Open standards for prompt quality. Shared benchmarks for evaluation. Tools that make quality measurable rather than subjective.
We built SupaScore because we needed it for our own work. We publish the scores because transparency drives trust. But the real goal is bigger than one company's scoring system. The goal is an industry where prompt quality is as measurable, enforceable, and expected as code quality.
We are a long way from there. But the first step is agreeing that the problem exists.