Performance · skillstreaming · rag · retrieval

Introducing SkillStreaming: Dynamic Expertise Retrieval Across 1,000+ AI Skills

Max Scheurer·March 21, 2026·12 min read

What if you never had to pick a skill again?

Every AI skill system works the same way. You search for a skill. You load it. The full 12,000-token prompt drops into context. You work with it. Need another topic? Search again, load again. Three skills in, you have 36,000 tokens permanently burning through your context window, and 90% of it is irrelevant to what you actually asked.

We built something different.

SkillStreaming decomposes skills into semantic fragments called SubSkills, embeds them in a vector space, and retrieves only the relevant pieces on every turn. Instead of loading a full skill, you get focused expertise from 3 to 5 skills, assembled within an adaptive budget of roughly 4,000 tokens.

The result: same concept coverage at 63% fewer tokens. Cross-domain intelligence that no single skill can deliver. Zero manual skill selection.

Today, SkillStreaming is live on supaskills.ai.


The problem with static skill loading

The first generation of AI skills treated prompts as files: download them, load them, work inside them. PromptBase, FlowGPT, Custom GPTs, MCP servers, and even our own load_skill all follow this model.

It has three structural problems.

Low relevance density. A 12,000-token skill covers an entire domain: methodology, patterns, examples, guardrails, templates. On any given turn, you need maybe 10% of that. Research from Liu et al. (2024) showed that language models exhibit a U-shaped attention curve: information buried in the middle of long contexts is processed with 40-60% accuracy compared to 85-95% at the boundaries. You are paying full token cost for content the model functionally ignores.

No cross-domain assembly. Real tasks cross boundaries. A pricing page needs pricing strategy, conversion copywriting, SaaS packaging, and A/B testing. A GDPR compliance project needs legal frameworks, privacy engineering, cookie consent, and technical implementation. No single skill covers all of that. Static loading forces you to choose one domain and miss the others.

Manual routing. You have to know which skill you need before you ask the question. That assumption fails in exploratory work, project kick-offs, and any situation where the user's real need only becomes clear mid-conversation.


How SkillStreaming works

We decomposed all 1,279 published skills into 13,381 semantic fragments (SubSkills) averaging 965 tokens each. Each SubSkill is independently embedded, classified into one of 9 types (methodology, pattern, reference, guardrail, example, template, role, antipattern, general), and indexed in pgvector with an HNSW index.

When you ask a question, the system:

  1. Embeds your query using OpenAI text-embedding-3-small (1,536 dimensions)
  2. Searches 13,381 SubSkills via cosine similarity (under 2ms)
  3. Re-ranks using a weighted score: 70% semantic similarity + 30% SupaScore quality prior
  4. Enforces diversity with a maximum of 2 SubSkills per skill
  5. Assembles within an adaptive token budget (2,500 to 5,500 based on query complexity)
  6. Scans the composite through a 4-layer security guard (hash, keyword, context, semantic injection detection)

The entire pipeline runs in 502ms on average, including the embedding API call.
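Steps 2 through 5 can be sketched in a few lines. This is a toy illustration, not the production code: the `SubSkill` fields, the `assemble` helper, and the toy embeddings are assumptions, but the 70/30 re-rank weights, the 2-SubSkills-per-skill cap, and the token budget come straight from the list above.

```python
from dataclasses import dataclass
import math

@dataclass
class SubSkill:
    skill_id: str        # parent skill this fragment came from
    text: str
    tokens: int
    supascore: float     # quality prior, assumed normalized to [0, 1]
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def assemble(query_emb, candidates, budget=4000, max_per_skill=2):
    # Step 3: weighted re-rank -- 70% semantic similarity + 30% quality prior.
    scored = sorted(
        candidates,
        key=lambda s: 0.7 * cosine(query_emb, s.embedding) + 0.3 * s.supascore,
        reverse=True,
    )
    # Steps 4-5: enforce the per-skill diversity cap, then greedily
    # fill the token budget in score order.
    picked, used, per_skill = [], 0, {}
    for sub in scored:
        if per_skill.get(sub.skill_id, 0) >= max_per_skill:
            continue
        if used + sub.tokens > budget:
            continue
        picked.append(sub)
        used += sub.tokens
        per_skill[sub.skill_id] = per_skill.get(sub.skill_id, 0) + 1
    return picked
```

In production the candidate set comes from the pgvector HNSW search in step 2; the greedy budget fill here is one plausible reading of "assembles within an adaptive token budget."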


What the numbers say

RAGAS evaluation with external golden dataset

We built a 15-item golden dataset grounded in external standards: OWASP API Security Top 10, GDPR Articles 28/32/33, W3C WCAG 2.2, AICPA SOC 2 criteria, NIST SP 800-218, and relevant IETF RFCs.

Metric               Score
RAGAS Composite      0.875 (industry "good" threshold: >0.60)
Context Precision    98%
Context Recall       92%
Noise Ratio          2%
Score Distribution   15/15 Excellent

A/B comparison: SkillStreaming vs full skill loading

We ran the same 10 benchmark prompts from our existing benchmark page through both approaches.

Dimension            Full Skill   SkillStreaming
Concept Coverage     84%          84%
Methodology Depth    73%          60%
Cross-Domain Bonus   0%           68%
Token Usage          10,207       3,806
Skills per Query     1            3.0

SkillStreaming matches full skill loading on concept coverage while using 63% fewer tokens. Full loading still wins on methodology depth (73% vs 60%) because a complete skill carries more procedural detail. But SkillStreaming delivers a 68% cross-domain bonus that static loading structurally cannot match.

Session simulations across 5 project types

We simulated 38 turns across a SaaS website build (10 turns), B2B marketing campaign (8 turns), indie game development (8 turns), legal compliance setup (6 turns), and data platform build (6 turns).

  • Pass rate: 97% (37/38)
  • Token savings vs full loading: 90% (145k vs ~1.46M)
  • Unique skills accessed: 101 out of 1,279
  • Context evolution: 88-100% skill turnover between consecutive turns

That last metric matters most. It proves the system genuinely adapts on every turn rather than recycling the same fragments.
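The post does not spell out how turnover is computed, but one natural reading is: the fraction of skills in the current turn that were not present in the previous turn. A minimal sketch under that assumption (the session data below is invented for illustration):

```python
def turnover(prev: set[str], curr: set[str]) -> float:
    """Fraction of the current turn's skills absent from the previous turn.

    This is an assumed definition -- the post reports 88-100% turnover
    but does not define the formula.
    """
    if not curr:
        return 0.0
    return len(curr - prev) / len(curr)

# Hypothetical three-turn session: skill sets retrieved per turn.
session = [
    {"pricing-strategy", "conversion-copy", "saas-packaging"},
    {"ab-testing", "conversion-copy", "landing-pages"},
    {"gdpr-basics", "cookie-consent", "privacy-eng"},
]
rates = [turnover(a, b) for a, b in zip(session, session[1:])]
```

High values mean the retrieved context is genuinely regenerated each turn rather than carried over.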


What SkillStreaming is not

We want to be precise about what we built.

SkillStreaming is not a new retrieval mechanism. It uses standard bi-encoder embeddings and HNSW indexing. It is not a replacement for full skill loading: when you need deep, extended work with a single skill's complete methodology, load_skill is still the right choice.

The innovation is on the product level: recognizing that AI skills are a retrieval problem, and engineering a production system that treats curated expertise as a searchable, composable corpus.

The closest analogy is Mixture-of-Experts: SkillStreaming performs at the prompt level what MoE architectures do at the neural level. A gating network routes each input to the top-K experts in a large pool; SkillStreaming routes each query to the top-K SubSkills from 13,381 fragments. The principle is the same: activate what is relevant, skip what is not.
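The parallel is literally the same selection step. A toy sketch, with invented scores on both sides of the analogy:

```python
def top_k(scores: dict[str, float], k: int) -> list[str]:
    """Select the k highest-scoring entries -- the routing step an MoE
    gate applies to experts, applied here to SubSkill fragments."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# MoE view: gate logits over a pool of experts (values are illustrative).
experts = top_k(
    {"expert_0": 0.1, "expert_1": 2.3, "expert_2": 1.7, "expert_3": -0.4}, k=2
)

# SkillStreaming view: similarity scores over SubSkill fragments.
fragments = top_k(
    {"pricing/anchoring": 0.91, "copy/headlines": 0.84,
     "legal/gdpr-32": 0.12, "seo/meta-tags": 0.33}, k=2
)
```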


Two modes, one catalog

SkillStreaming and full skill loading coexist as complementary modes.

SkillStreaming (default): For exploration, cross-domain questions, and project work where topics change between turns. 3 to 5 skills per query, approximately 3,800 tokens per turn.

Full Skill Load (upgrade path): For deep, focused sessions on a single topic. One complete skill, 10,000 to 14,000 tokens.

Claude decides automatically. SkillStreaming is positioned as the default tool. When deeper expertise is needed, Claude escalates to a full skill load.


Security

Delivering dynamically retrieved content into system prompts creates a specific threat surface. We address it with four layers:

  1. Hash integrity: SHA-256 verification detects database tampering
  2. Keyword scan: 30+ regex patterns catch direct injection attempts
  3. Context analysis: 80-character windows reduce false positives from security-related content
  4. Semantic injection detection (new): Embedding similarity against 10 known injection pattern vectors catches obfuscated attacks that keyword scanning misses

Hard caps enforce boundaries: 8,000 tokens maximum, 20 chunks maximum, 1,000-character query limit. A full platform security audit on March 21, 2026 yielded 0 CRITICAL and 0 OPEN findings.
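Layers 1 and 2 plus the hard caps can be sketched as follows. The regex patterns shown are illustrative stand-ins (the production scanner's 30+ patterns are not public), and `guard` is a hypothetical helper name; the SHA-256 check and the cap values come from the post.

```python
import hashlib
import re

# Illustrative injection patterns -- assumptions, not the production set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"disregard .* system prompt", re.I),
]

MAX_CHUNKS, MAX_QUERY_CHARS = 20, 1_000

def guard(chunks: list[str], stored_hashes: list[str], query: str) -> bool:
    # Hard caps: 20 chunks max, 1,000-character query limit.
    if len(query) > MAX_QUERY_CHARS or len(chunks) > MAX_CHUNKS:
        return False
    # Layer 1: hash integrity -- each chunk must match its stored SHA-256.
    for chunk, expected in zip(chunks, stored_hashes):
        if hashlib.sha256(chunk.encode()).hexdigest() != expected:
            return False
    # Layer 2: keyword scan over the assembled composite.
    composite = "\n".join(chunks)
    return not any(p.search(composite) for p in INJECTION_PATTERNS)
```

Layers 3 and 4 (context windows and embedding-similarity checks against known injection vectors) would sit after the keyword scan, trading latency for recall on obfuscated attacks.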


The whitepaper

We published a full whitepaper covering the architecture, evaluation, competitive analysis, and limitations in detail. It includes 14 peer-reviewed references and positions SkillStreaming within the broader RAG literature.

Read the SkillStreaming Whitepaper (PDF)


Try it now

SkillStreaming is available to all supaskills users via MCP and REST API. No configuration needed. Just describe what you need.

MCP tool: ask_skills
REST endpoint: POST /api/v1/ask
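Calling the REST endpoint is a single authenticated POST. The request body, the Bearer auth scheme, and the response shape below are assumptions for illustration; check the docs for the exact schema.

```python
import json
import urllib.request

BASE_URL = "https://supaskills.ai"  # assumed host

def build_request(query: str, api_key: str) -> urllib.request.Request:
    """Build the POST /api/v1/ask request. Payload shape is assumed."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/ask",
        data=json.dumps({"query": query}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

def ask_skills(query: str, api_key: str) -> dict:
    """POST a question and return the assembled-expertise response."""
    with urllib.request.urlopen(build_request(query, api_key)) as resp:
        return json.load(resp)
```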

The system retrieves, ranks, and assembles the best SubSkills from 1,279 skills for your specific question. Every turn. Automatically.

Start Free or read the docs.