Blog
What separates expert AI output from the generic kind. Performance data, integration guides, and industry perspectives.
We Tested It: Does Loading the Same Skill (Prompt) Twice Make AI Better?
We ran a controlled experiment: no skill, one skill, the same skill loaded twice, and two similar skills combined. The results surprised us. Double-stacking improved quality by 8%, but at 2x the token cost.
Introducing SkillStreaming: Dynamic Expertise Retrieval Across 1,000+ AI Skills
We decomposed 1,279 AI skills into 13,381 retrievable fragments and built a system that assembles cross-domain expertise on every turn. Same concept coverage, 63% fewer tokens, zero manual skill selection.
The Ecosystem Audit: Scoring 167 Community Agent Skills
We scored 167 community-built Claude Code skills from 40+ organisations using the same SupaScore rubric we apply to our own. The tier distribution tells a clear story about what quality infrastructure adds.
What Deep Research Adds to Claude's Built-In Skills: A Data Comparison
We scored Anthropic's 21 Claude Code skills alongside our closest equivalents using the same rubric. The data shows where domain research and quality infrastructure make a measurable difference.
How Safety Skills Improve Claude's Responses in Sensitive Domains: A 68-Query Benchmark
We benchmarked Claude with and without safety skills on 68 real-world queries in sensitive domains. 6 scoring dimensions, 10 domains, 272 API calls. Skill-augmented responses scored 26.8% higher with a 96% win rate.
How We Tune AI: From Generic to Expert in 6 Dimensions
The instrument is the same. But untuned, it sounds wrong. Here's what tuning AI actually means, and what it changes in your output.
We Rebuilt All 1,078 Skills. Here's What 143 Hours of AI Told Us.
After our 10-skill pilot proved the framework, we ran the full pipeline. 1,070 skills rebuilt, average score up 3.9 points, 97% now Platinum. The results changed how we think about AI quality at scale.
The Hidden Cost of Bad AI Advice
Bad AI advice isn't free. It costs decisions. A wrong LTV:CAC calculation. A missed compliance deadline. A contract clause nobody flagged.
We Rebuilt 10 Skills with 4 AI Models. The Model Mattered Less Than We Expected.
We tested Gemini 3.1 Pro, Claude Opus 4.6, and a tag-team approach against our current pipeline. The framework gave 5x more improvement than the model swap.
10 Questions Where Expert Skills Outperform Generic Prompts
We tested 10 hard questions across legal, finance, security, and engineering. Expert-guided prompts consistently outperformed generic prompts on the details that matter.
How SupaScore Works: 6 Dimensions That Separate Good from Dangerous
What happens when you use an AI skill scored 62 versus one scored 87. The difference isn't academic. It's your next business decision.
What Expert Skills Catch in Contracts That Generic AI Misses
A SaaS contract review where an expert legal skill caught three deal-breaking clauses that a generic prompt missed. Here's what happened.