Performance · experiment · prompt-engineering · skills

We Tested It: Does Loading the Same Skill (Prompt) Twice Make AI Better?

Max Scheurer·March 22, 2026·6 min read

Someone told me loading a prompt twice makes AI responses better.

I heard this claim floating around the prompt engineering community: if you feed an LLM the same instructions twice, it follows them more closely. Something about reinforcement. Attention weights. The model "paying more attention" to repeated text.

Sounded like cargo cult prompting to me. But we have 1,279 quality-scored skills and an automated evaluation pipeline. So instead of arguing about it, I tested it.


The experiment

Four conditions, three real-world queries, Claude Sonnet as the model:

A) Baseline - Raw Claude, no skill loaded. Just "You are a helpful assistant."

B) Single Skill - One expert skill loaded as system prompt. The normal way.

C) Double-Stacked - The exact same skill loaded twice. Copy-paste, separated by "IMPORTANT REINFORCEMENT - RE-READ THE ABOVE INSTRUCTIONS."

D) Dual Similar - Two different but related skills combined. For example, content-strategy + content-marketing-strategist.
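In code, assembling the four system-prompt conditions looks roughly like this. This is a hypothetical sketch, not the actual implementation in `scripts/eval-double-prompt.ts`; `skill` and `relatedSkill` stand in for the real skill prompt texts.

```typescript
// Hypothetical sketch: build the four system-prompt conditions from two
// skill texts. The separator string is the one quoted in condition C.
const REINFORCEMENT = "IMPORTANT REINFORCEMENT - RE-READ THE ABOVE INSTRUCTIONS";

function buildConditions(
  skill: string,
  relatedSkill: string
): Record<string, string> {
  return {
    A_baseline: "You are a helpful assistant.",
    B_single: skill,
    C_doubleStacked: [skill, REINFORCEMENT, skill].join("\n\n"),
    D_dualSimilar: [skill, relatedSkill].join("\n\n"),
  };
}
```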

The queries were real tasks, not toy examples:

  1. Content planning - "Create a 4-week content plan for a B2B SaaS startup launching an AI developer tool."
  2. API review - "Review this API endpoint design and suggest improvements: POST /api/v1/users"
  3. Pricing strategy - "Design a 3-tier pricing model for a developer tool with both API and UI."

Each output was blindly scored by a separate Claude instance on five dimensions: specificity, actionability, expertise depth, structure, and completeness. Composite score out of 10.
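A minimal sketch of how a composite score can be derived from the five dimensions, assuming the evaluator returns one 0–10 score per dimension and the composite is an unweighted average (the actual rubric and weighting in the eval script may differ):

```typescript
// Hypothetical scoring shape: five 0-10 dimension scores from the
// evaluator, averaged into one composite rounded to one decimal.
type Scores = {
  specificity: number;
  actionability: number;
  expertiseDepth: number;
  structure: number;
  completeness: number;
};

function compositeScore(s: Scores): number {
  const values = Object.values(s);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.round(mean * 10) / 10; // e.g. 7.9
}
```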


The results

| Condition         | Avg Score | Avg Output Length | Token Cost |
|-------------------|-----------|-------------------|------------|
| A: No Skill       | ~7.0/10   | 4,628 chars       | $0.07      |
| B: Single Skill   | 7.9/10    | 6,964 chars       | $0.17      |
| C: Double-Stacked | 8.5/10    | 6,988 chars       | $0.25      |
| D: Dual Similar   | 8.0/10    | 7,651 chars       | $0.07      |

The delta that matters: Single Skill to Double-Stacked is +0.6 points (7.9 to 8.5). That is roughly an 8% quality improvement.

But it costs 2x the input tokens. The double-stacked system prompt averages 17,534 tokens versus 8,778 for a single skill.


Breaking it down by query

API Review showed the biggest win: 9.0 with a single skill jumped to 9.4 with double-stacking. The evaluator noted "exceptional response with concrete code examples, specific security attack scenarios, multiple implementation options with clear tradeoffs." The repeated instructions seemed to make the model more thorough in covering edge cases.

Pricing Strategy had the largest relative improvement: 7.0 to 8.0. The single-skill response cut off before delivering actual tier numbers. The double-stacked version got further into the analysis before hitting the token limit. Reinforcement may have helped the model prioritize the core ask over preamble.

Content Planning showed a modest improvement: 7.8 to 8.2. Both versions still struggled with the 4-week scope, tending to go deep on Week 1 and running out of output tokens. More instructions did not fix a fundamental output-length constraint.


What about combining two similar skills?

This was the result I did not expect.

Dual Similar Skills scored 8.0 - barely above the single skill (7.9) and below the double-stack (8.5). Loading content-strategy alongside content-marketing-strategist did not produce meaningfully better output than loading content-strategy alone.

The hypothesis was that complementary expertise would add depth. In practice, it mostly added noise. Two similar-but-not-identical instruction sets may create subtle conflicts in tone, structure, and priorities that cancel each other out.

Repetition beats diversity. At least for single-turn tasks.


The honest interpretation

Double-stacking works. But "works" needs context:

The improvement is real but small. 0.6 points on a 10-point scale. Noticeable in blind evaluation, but not dramatic. You would not look at two outputs side by side and immediately point to the double-stacked one as clearly better.

The cost is not small. 2x input tokens means 2x cost on that dimension. For a $3/M-token model, the difference is $0.04 per call on average. At scale, that adds up. At casual usage, nobody cares.

The mechanism is probably attention distribution. Longer system prompts get more attention weight relative to the user message. Repeating instructions does not teach the model anything new - it shifts the attention budget toward following those specific instructions more closely. The same effect could potentially be achieved by making the original prompt more emphatic or structured.

Output length did not change. Double-stacking produced almost identical output lengths (6,988 vs 6,964 chars). The improvement was in quality, not quantity. The model did not write more - it wrote better.


What this means for prompt engineering

If you are building a system that routes prompts to LLMs (like we do with SkillStreaming), there are practical takeaways:

  1. Single skill is the sweet spot. 7.9 quality for 8,778 input tokens is excellent value.

  2. Double-stacking is a luxury option. Save it for high-stakes tasks where a 0.6-point improvement justifies 2x token cost. Think legal review, not Slack summaries.

  3. Do not bother combining similar skills. The marginal gain (0.1 points) is not worth the complexity. If you want cross-domain expertise, use a system like SkillStreaming that retrieves specific fragments rather than loading entire overlapping skills.

  4. Invest in prompt quality instead. A well-written skill prompt at 1x is almost certainly better than a mediocre prompt at 2x. The 7.0 to 7.9 jump from "no skill" to "single skill" dwarfs the 7.9 to 8.5 jump from stacking.


Methodology notes

  • Model: Claude Sonnet (claude-sonnet-4-5-20250929)
  • Evaluator: Separate Claude instance with structured scoring rubric
  • Dimensions: Specificity, actionability, expertise depth, structure, completeness
  • Skills tested: content-strategy (89.2 SupaScore), api-design-architect, saas-pricing-strategist
  • max_tokens: 2,048 per generation
  • All outputs evaluated blind (evaluator did not know which condition produced which output)
  • Full script: scripts/eval-double-prompt.ts in the supaskills repo
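Putting the methodology parameters together, a single generation request can be sketched as below. This assumes the Anthropic Messages API request shape (`model`, `max_tokens`, `system`, `messages`); the real harness is in `scripts/eval-double-prompt.ts` and may differ.

```typescript
// Hypothetical sketch of one generation request using the methodology
// parameters above: the condition's system prompt plus one user query,
// capped at 2,048 output tokens.
interface GenerationRequest {
  model: string;
  max_tokens: number;
  system: string;
  messages: { role: "user"; content: string }[];
}

function buildRequest(systemPrompt: string, query: string): GenerationRequest {
  return {
    model: "claude-sonnet-4-5-20250929",
    max_tokens: 2048,
    system: systemPrompt,
    messages: [{ role: "user", content: query }],
  };
}
```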

This is a small experiment (n=3 queries). The results are directional, not definitive. We plan to run a larger version with 20+ queries and multiple models. But the signal is clear enough to share.


The skills used in this experiment are part of the supaskills.ai catalog. Try them yourself: claude mcp add supaskills