Create effective runbooks for on-call incident management.
On-Call Runbook Expert
Runbook design, SRE practices, MTTR reduction
Best for
- Creating actionable runbooks for high-severity production alerts with step-by-step diagnosis commands
- Designing escalation frameworks that specify when to page senior engineers vs. when to auto-resolve
- Building alert-to-runbook mapping systems that reduce MTTR from 45 minutes to 8 minutes
- Establishing on-call rotation schedules that prevent burnout while maintaining 99.9% SLA coverage
What you'll get
- Structured runbook template with numbered diagnosis steps, expected command outputs, and clear escalation triggers (e.g., 'If CPU > 90% for 10+ minutes AND memory > 85%, page senior SRE immediately')
- Alert-to-runbook mapping spreadsheet showing 100% coverage with direct links from PagerDuty alerts to specific runbook sections
- Toil reduction roadmap identifying 15 repetitive tasks that can be automated, with ROI calculations showing 20 hours/week savings
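The example escalation trigger quoted in the list above can be sketched as a small check. This is a minimal illustration, not part of the skill itself: the `Sample` type, the function name, and the one-sample-per-minute input are all assumptions, and the rule is read as "CPU has been above the limit for the whole trailing window, and the latest memory reading is above its limit".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One metric reading (hypothetical shape, for illustration only)."""
    minute: int      # minutes since the alert fired
    cpu_pct: float   # CPU utilisation in percent
    mem_pct: float   # memory utilisation in percent


def should_page_senior_sre(
    samples: List[Sample],
    cpu_limit: float = 90.0,
    mem_limit: float = 85.0,
    sustained_minutes: int = 10,
) -> bool:
    """Return True when the trigger 'CPU > 90% for 10+ minutes AND memory > 85%' holds.

    Assumes `samples` is ordered oldest to newest.
    """
    if not samples:
        return False

    latest = samples[-1]
    if latest.mem_pct <= mem_limit:
        return False

    # Walk backwards to find where the current unbroken run of high CPU began.
    run_start = None
    for sample in reversed(samples):
        if sample.cpu_pct <= cpu_limit:
            break
        run_start = sample.minute

    if run_start is None:
        return False
    return latest.minute - run_start >= sustained_minutes


if __name__ == "__main__":
    readings = [Sample(minute=m, cpu_pct=95.0, mem_pct=90.0) for m in range(11)]
    print(should_page_senior_sre(readings))
```

Writing the trigger as explicit thresholds with a duration keeps the escalation decision unambiguous for whoever is on call; the default values simply mirror the numbers in the quoted example.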
Details about your production services, existing alert definitions, current MTTR metrics, and team structure including on-call rotation size and experience levels.
Complete runbook templates with copy-pasteable commands, escalation decision trees, alert-to-runbook mappings, and measurable toil reduction recommendations.
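As a rough illustration of what an alert-to-runbook mapping check might look like, the sketch below computes coverage and lists alerts that have no runbook link. The alert names, the URLs, and the `runbook_coverage` function are hypothetical; a real mapping would be exported from your own alerting configuration.

```python
from typing import Dict, Iterable, List, Tuple


def runbook_coverage(
    alert_names: Iterable[str],
    runbook_links: Dict[str, str],
) -> Tuple[float, List[str]]:
    """Return (fraction of alerts with a runbook link, sorted unmapped alerts)."""
    alerts = set(alert_names)
    # An alert counts as unmapped if it has no entry or an empty link.
    unmapped = sorted(a for a in alerts if not runbook_links.get(a))
    if not alerts:
        return 1.0, unmapped
    return (len(alerts) - len(unmapped)) / len(alerts), unmapped


if __name__ == "__main__":
    # Hypothetical alert names and runbook URLs, for illustration only.
    alerts = ["HighCPU", "DiskFull", "APILatency", "ReplicationLag"]
    links = {
        "HighCPU": "https://runbooks.example.com/high-cpu#diagnosis",
        "DiskFull": "https://runbooks.example.com/disk-full#diagnosis",
        "APILatency": "https://runbooks.example.com/api-latency#diagnosis",
    }
    coverage, missing = runbook_coverage(alerts, links)
    print(f"coverage={coverage:.0%}, missing={missing}")
```

Running a check like this whenever alert definitions change is one way to notice an alert that ships without a runbook before it pages someone.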
What's inside
“You are a Runbook Architect. You design comprehensive, executable incident response procedures that measurably reduce Mean Time to Resolution (MTTR), minimize escalations, and prevent on-call burnout. - **Decision Trees Over Narrative**: Every diagnostic step includes explicit "if condition, then go...”
Not designed for
- Writing monitoring alerts or setting up observability tools (that's infrastructure setup, not runbook authoring)
- Designing the underlying system architecture or choosing which services to monitor
- Replacing incident management platforms like PagerDuty or Opsgenie with custom solutions
- Creating runbooks for non-production environments or development workflow issues
SupaScore
91.03
Evidence Policy
Standard: no explicit evidence policy.
Research Foundation: 7 sources (3 books, 2 official docs, 1 web source, 1 industry framework)
This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.
Version History
v5.5 distilled from v2 via Claude Sonnet
Pipeline v4: rebuilt with 3 helper skills
Initial release
Common Workflows
Production Readiness Pipeline
A complete workflow, from setting up monitoring to testing incident response procedures through controlled chaos experiments.
© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited.