DevOps & Infrastructure · Engineering · Platinum

Create effective runbooks for on-call incident management.

On-Call Runbook Expert

Runbook design, SRE practices, MTTR reduction

advanced · v5.0

Best for

  • Creating actionable runbooks for high-severity production alerts with step-by-step diagnosis commands
  • Designing escalation frameworks that specify when to page senior engineers vs. when to auto-resolve
  • Building alert-to-runbook mapping systems that reduce MTTR from 45 minutes to 8 minutes
  • Establishing on-call rotation schedules that prevent burnout while maintaining 99.9% SLA coverage

What you'll get

  • Structured runbook template with numbered diagnosis steps, expected command outputs, and clear escalation triggers (e.g., 'If CPU > 90% for 10+ minutes AND memory > 85%, page senior SRE immediately')
  • Alert-to-runbook mapping spreadsheet showing 100% coverage with direct links from PagerDuty alerts to specific runbook sections
  • Toil reduction roadmap identifying 15 repetitive tasks that can be automated, with ROI calculations showing 20 hours/week savings
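
The escalation trigger quoted in the first item above can be written as an executable check. The following is a minimal sketch, not the skill's actual output: the `Sample` type, the `should_page_senior` function, and the reading that memory is checked only on the latest sample are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    """One metrics reading (hypothetical shape, not from any real monitoring API)."""
    ts: float       # seconds since some epoch
    cpu_pct: float  # CPU utilisation, 0-100
    mem_pct: float  # memory utilisation, 0-100


def should_page_senior(
    samples: Iterable[Sample],
    cpu_limit: float = 90.0,
    mem_limit: float = 85.0,
    sustain_s: float = 600.0,
) -> bool:
    """Return True if CPU > cpu_limit for at least sustain_s AND latest memory > mem_limit."""
    ordered = sorted(samples, key=lambda s: s.ts)
    if not ordered:
        return False
    latest = ordered[-1]
    # Assumption: the memory condition applies to the most recent reading only.
    if latest.mem_pct <= mem_limit:
        return False
    # Walk backwards from the latest sample while CPU stays above the limit.
    streak_start = None
    for s in reversed(ordered):
        if s.cpu_pct <= cpu_limit:
            break
        streak_start = s.ts
    if streak_start is None:
        return False
    return latest.ts - streak_start >= sustain_s


# Usage: eleven one-minute samples (0 s to 600 s) with CPU at 95% and memory at 90%.
hot = [Sample(ts=60.0 * i, cpu_pct=95.0, mem_pct=90.0) for i in range(11)]
print(should_page_senior(hot))
```

Encoding the trigger as a pure function keeps the thresholds visible as parameters, so a runbook can state them once and reuse them in tests.
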
Expects

Details about your production services, existing alert definitions, current MTTR metrics, and team structure including on-call rotation size and experience levels.

Returns

Complete runbook templates with copy-pasteable commands, escalation decision trees, alert-to-runbook mappings, and measurable toil reduction recommendations.

What's inside

You are a Runbook Architect. You design comprehensive, executable incident response procedures that measurably reduce Mean Time to Resolution (MTTR), minimize escalations, and prevent on-call burnout. - **Decision Trees Over Narrative**: Every diagnostic step includes explicit "if condition, then go...

Covers

What You Do Differently · Methodology · Watch For

Not designed for

  • Writing monitoring alerts or setting up observability tools (that's infrastructure setup, not runbook authoring)
  • Designing the underlying system architecture or choosing which services to monitor
  • Replacing incident management platforms like PagerDuty or Opsgenie with custom solutions
  • Creating runbooks for non-production environments or development workflow issues

SupaScore: 91.03

  • Research Quality (15%): 9.1
  • Prompt Engineering (25%): 8.95
  • Practical Utility (15%): 9.3
  • Completeness (10%): 9.4
  • User Satisfaction (20%): 9
  • Decision Usefulness (15%): 9.1
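
The percentages next to each criterion read like weights. The page does not state how the overall figure is derived, so the following is only a sketch under the assumption that the SupaScore is the weighted mean of the six sub-scores, scaled from a 10-point to a 100-point range. The variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# (sub-score, weight) pairs copied from the listing; the weighting formula itself is an assumption.
criteria = {
    "Research Quality": (Decimal("9.1"), Decimal("0.15")),
    "Prompt Engineering": (Decimal("8.95"), Decimal("0.25")),
    "Practical Utility": (Decimal("9.3"), Decimal("0.15")),
    "Completeness": (Decimal("9.4"), Decimal("0.10")),
    "User Satisfaction": (Decimal("9"), Decimal("0.20")),
    "Decision Usefulness": (Decimal("9.1"), Decimal("0.15")),
}

# Weights should cover the whole score.
total_weight = sum(weight for _, weight in criteria.values())

# Weighted mean on the 10-point scale, then scaled to 100 and rounded half-up to two places.
weighted_mean = sum(score * weight for score, weight in criteria.values())
supascore = (weighted_mean * 10).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(total_weight)  # 1.00
print(supascore)     # 91.03
```

Under that assumption the result agrees with the published 91.03.
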

Evidence Policy

Standard: no explicit evidence policy.

Tags: on-call, runbook, incident-response, sre, mttr, escalation, pagerduty, toil-reduction, alert-mapping, game-day, observability, rotation-design

Research Foundation: 7 sources (3 books, 2 official docs, 1 web, 1 industry framework)

This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.

Version History

v5.0 · 3/25/2026

v5.5 distilled from v2 via Claude Sonnet

v2.0 · 2/25/2026

Pipeline v4: rebuilt with 3 helper skills

v1.0.0 · 2/16/2026

Initial release

Common Workflows

Production Readiness Pipeline

Complete workflow from setting up monitoring to testing incident response procedures through controlled chaos experiments

© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited.