Blog

What separates expert AI output from the generic kind. Performance data, integration guides, and industry perspectives.

Performance · Mar 22, 2026 · 6 min read

We Tested It: Does Loading the Same Skill (Prompt) Twice Make AI Better?

We ran a controlled experiment: no skill, one skill, the same skill loaded twice, and two similar skills combined. The results surprised us. Double-stacking improved quality by 8%, but at 2x the token cost.

experiment · prompt-engineering · skills · benchmark

Performance · Mar 21, 2026 · 12 min read

Introducing SkillStreaming: Dynamic Expertise Retrieval Across 1,000+ AI Skills

We decomposed 1,279 AI skills into 13,381 retrievable fragments and built a system that assembles cross-domain expertise on every turn. Same concept coverage, 63% fewer tokens, zero manual skill selection.

skillstreaming · rag · retrieval · subskills

Performance · Mar 16, 2026 · 8 min read

The Ecosystem Audit: Scoring 167 Community Agent Skills

We scored 167 community-built Claude Code skills from 40+ organisations using the same SupaScore rubric we apply to our own. The tier distribution tells a clear story about what quality infrastructure adds.

benchmark · ecosystem · quality · community

Performance · Mar 15, 2026 · 10 min read

What Deep Research Adds to Claude's Built-In Skills: A Data Comparison

We scored Anthropic's 21 Claude Code skills alongside our closest equivalents using the same rubric. The data shows where domain research and quality infrastructure make a measurable difference.

benchmark · anthropic · quality · skills

Performance · Mar 15, 2026 · 8 min read

How Safety Skills Improve Claude's Responses in Sensitive Domains: A 68-Query Benchmark

We benchmarked Claude with and without safety skills on 68 real-world queries in sensitive domains. 6 scoring dimensions, 10 domains, 272 API calls. Skill-augmented responses scored 26.8% higher with a 96% win rate.

safety · eval · benchmark · society

Performance · Mar 12, 2026 · 5 min read

How We Tune AI: From Generic to Expert in 6 Dimensions

The instrument is the same. But untuned, it sounds wrong. Here's what tuning AI actually means, and what it changes in your output.

quality · methodology · comparison

Performance · Mar 10, 2026 · 8 min read

We Rebuilt All 1,078 Skills. Here's What 143 Hours of AI Told Us.

After our 10-skill pilot proved the framework, we ran the full pipeline. 1,070 skills rebuilt, average score up 3.9 points, 97% now Platinum. The results changed how we think about AI quality at scale.

pipeline · quality · benchmark · v2

Performance · Feb 26, 2026 · 5 min read

The Hidden Cost of Bad AI Advice

Bad AI advice isn't free. It costs decisions. A wrong LTV:CAC calculation. A missed compliance deadline. A contract clause nobody flagged.

business · finance · risk

Performance · Feb 24, 2026 · 7 min read

We Rebuilt 10 Skills with 4 AI Models. The Model Mattered Less Than We Expected.

We tested Gemini 3.1 Pro, Claude Opus 4.6, and a tag-team approach against our current pipeline. The framework gave 5x more improvement than the model swap.

multi-model · pipeline · quality · benchmark

Performance · Feb 23, 2026 · 8 min read

10 Questions Where Expert Skills Outperform Generic Prompts

We tested 10 hard questions across legal, finance, security, and engineering. Expert-guided prompts consistently outperformed generic prompts on the details that matter.

benchmark · comparison · hard-nuts

Performance · Feb 23, 2026 · 6 min read

How SupaScore Works: 6 Dimensions That Separate Good from Dangerous

What happens when you use an AI skill scored 62 versus one scored 87. The difference isn't academic. It's your next business decision.

supascore · quality · methodology

Performance · Feb 23, 2026 · 5 min read

What Expert Skills Catch in Contracts That Generic AI Misses

A SaaS contract review where an expert legal skill caught three deal-breaking clauses that a generic prompt missed. Here's what happened.

legal · contracts · benchmark
Blog — SupaSkills