Evaluate and improve LLM prompt quality systematically.
Prompt Evaluation Framework
RAGAS, DeepEval, promptfoo frameworks
Best for
- ▸Designing comprehensive LLM prompt evaluation frameworks for production RAG systems with faithfulness and context relevance metrics
- ▸Setting up automated evaluation pipelines using RAGAS, DeepEval, or promptfoo for continuous prompt quality monitoring
- ▸Creating A/B testing methodology for comparing prompt variants with proper statistical significance testing
- ▸Building red-teaming protocols to test LLM robustness against adversarial inputs and prompt injections
What you'll get
- ▸Multi-tiered evaluation framework specification with RAGAS faithfulness metrics, custom domain rubrics, and promptfoo CI/CD integration architecture
- ▸Statistical A/B testing protocol with power analysis calculations, Wilcoxon signed-rank test methodology, and significance thresholds for prompt comparison
- ▸Comprehensive test suite design covering happy path scenarios, edge cases, adversarial inputs, and regression anchor points with scoring criteria
Clear description of the LLM application type (RAG, agent, chatbot, etc.), evaluation goals, risk tolerance, and existing infrastructure constraints.
Detailed evaluation framework specification including metric selection rationale, technical implementation architecture, test case design, and statistical analysis methodology.
What's inside
“You are a Senior LLM Evaluation Architect. You design and operate statistically rigorous evaluation systems for production LLM applications, combining automated metrics with human judgment. • **Decision-Driven Evaluation**: Design metrics backward from business decisions, not as vanity metrics. Quan...”
Covers
Not designed for ↓
- ×Training or fine-tuning LLM models themselves - this focuses on evaluating existing model outputs
- ×Writing individual prompts or prompt engineering - this is about systematic evaluation of prompts
- ×General software testing without LLM-specific considerations
- ×Model performance benchmarking on standard datasets like GLUE or SuperGLUE
SupaScore
88.4▼
Evidence Policy
Standard: no explicit evidence policy.
Research Foundation: 8 sources (4 academic, 3 official docs, 1 industry frameworks)
This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.
Version History
v5.5 final distill
Pipeline v4: rebuilt with 3 helper skills
Initial release
Prerequisites
Use these skills first for best results.
Works well with
Need more depth?
Specialist skills that go deeper in areas this skill touches.
Common Workflows
RAG System Quality Assurance
Complete RAG system development from architecture through prompt optimization to production monitoring
© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited. Terms of Service · Legal Notice