On-Call Runbook Expert

Design, author, and maintain operational runbooks that enable on-call engineers to diagnose and resolve incidents faster with structured response procedures, escalation frameworks, and toil reduction strategies.

Gold
v1.0.0 · 0 activations · DevOps & Infrastructure · Engineering · advanced

SupaScore: 83.95

  • Research Quality (15%): 8.4
  • Prompt Engineering (25%): 8.5
  • Practical Utility (15%): 8.5
  • Completeness (10%): 8.3
  • User Satisfaction (20%): 8.3
  • Decision Usefulness (15%): 8.3
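
The listing does not state how the six dimension scores are combined into the SupaScore. If you assume it is the percentage-weighted mean of the 0-10 scores scaled to 100, the figures above reproduce 83.95. The following minimal Python sketch implements that assumed formula. The `supascore` function is hypothetical and is not SupaSkills' own scoring code.

```python
# Hypothetical reconstruction: assumes SupaScore is the percentage-weighted
# mean of the six 0-10 dimension scores, scaled to a 0-100 range.
DIMENSIONS = {
    # name: (weight in percent, score out of 10), values copied from the listing
    "Research Quality": (15, 8.4),
    "Prompt Engineering": (25, 8.5),
    "Practical Utility": (15, 8.5),
    "Completeness": (10, 8.3),
    "User Satisfaction": (20, 8.3),
    "Decision Usefulness": (15, 8.3),
}


def supascore(dimensions: dict[str, tuple[int, float]]) -> float:
    """Return the weighted score on a 0-100 scale, rounded to two decimals."""
    total_weight = sum(weight for weight, _ in dimensions.values())
    if total_weight != 100:
        raise ValueError(f"weights must sum to 100, got {total_weight}")
    # sum(percent * score) lies on a 0-1000 scale; divide by 10 for 0-100.
    weighted = sum(weight * score for weight, score in dimensions.values())
    return round(weighted / 10, 2)


print(supascore(DIMENSIONS))
```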

Best for

  • Creating actionable runbooks for high-severity production alerts with step-by-step diagnosis commands
  • Designing escalation frameworks that specify when to page senior engineers vs. when to auto-resolve
  • Building alert-to-runbook mapping systems that reduce MTTR (for example, from 45 minutes to 8 minutes)
  • Establishing on-call rotation schedules that prevent burnout while maintaining 99.9% SLA coverage
  • Implementing toil reduction strategies through runbook automation and self-healing infrastructure patterns
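
One common way to estimate the ROI of toil reduction is as a payback period: divide the one-off effort to automate a task by the hands-on hours it saves each week. The listing does not specify its ROI method, so the following is only a sketch under that assumption. The `ToilTask` class, the `prioritise` helper, the task names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    """A repetitive manual task; every field is a hypothetical input."""
    name: str
    runs_per_week: float     # how often on-call performs the task
    minutes_per_run: float   # hands-on time per occurrence
    automation_hours: float  # one-off engineering effort to automate it

    @property
    def hours_saved_per_week(self) -> float:
        return self.runs_per_week * self.minutes_per_run / 60

    @property
    def payback_weeks(self) -> float:
        """Weeks until the automation effort is recovered by time saved."""
        saved = self.hours_saved_per_week
        return float("inf") if saved == 0 else self.automation_hours / saved


def prioritise(tasks: list[ToilTask]) -> list[ToilTask]:
    """Order tasks by fastest payback first (stable for ties)."""
    return sorted(tasks, key=lambda t: t.payback_weeks)


tasks = [
    ToilTask("restart stuck worker", runs_per_week=12, minutes_per_run=15, automation_hours=6),
    ToilTask("rotate expiring certs", runs_per_week=1, minutes_per_run=45, automation_hours=16),
    ToilTask("clear full /tmp on batch hosts", runs_per_week=6, minutes_per_run=10, automation_hours=2),
]

for t in prioritise(tasks):
    print(f"{t.name}: {t.hours_saved_per_week:.1f} h/week, payback {t.payback_weeks:.1f} weeks")
print(f"total: {sum(t.hours_saved_per_week for t in tasks):.1f} h/week")
```

Sorting by payback period puts cheap, frequent chores first. These are usually the best candidates for the first round of runbook automation.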

What you'll get

  • Structured runbook template with numbered diagnosis steps, expected command outputs, and clear escalation triggers (e.g., 'If CPU > 90% for 10+ minutes AND memory > 85%, page senior SRE immediately')
  • Alert-to-runbook mapping spreadsheet showing 100% coverage with direct links from PagerDuty alerts to specific runbook sections
  • Toil reduction roadmap identifying 15 repetitive tasks that can be automated, with ROI calculations showing 20 hours/week savings
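
The example escalation trigger above ('If CPU > 90% for 10+ minutes AND memory > 85%, page senior SRE immediately') can be expressed as a small check over metric samples. This minimal sketch makes two assumptions: the samples are sorted by timestamp, and "memory > 85%" applies to the latest reading. `Sample` and `should_page_senior_sre` are hypothetical names. In practice, monitoring systems usually handle the sustained-duration part natively, for example through the `for` field of a Prometheus alerting rule.

```python
from dataclasses import dataclass

# Thresholds taken from the example trigger; tune them per service.
CPU_THRESHOLD = 90.0         # percent
MEM_THRESHOLD = 85.0         # percent
SUSTAINED_SECONDS = 10 * 60  # "10+ minutes"


@dataclass
class Sample:
    ts: int     # Unix timestamp, seconds
    cpu: float  # CPU utilisation, percent
    mem: float  # memory utilisation, percent


def should_page_senior_sre(samples: list[Sample]) -> bool:
    """Return True if CPU has stayed above CPU_THRESHOLD for at least
    SUSTAINED_SECONDS (first to last sample of the current high-CPU run)
    and the latest memory reading exceeds MEM_THRESHOLD.

    `samples` must be sorted by timestamp, oldest first.
    """
    if not samples or samples[-1].mem <= MEM_THRESHOLD:
        return False
    # Walk backwards to find where the current run of high-CPU samples began.
    run_start = None
    for s in reversed(samples):
        if s.cpu <= CPU_THRESHOLD:
            break
        run_start = s.ts
    if run_start is None:  # the latest sample is not above the CPU threshold
        return False
    return samples[-1].ts - run_start >= SUSTAINED_SECONDS


# Usage: eleven one-minute samples span exactly ten minutes.
now = 1_700_000_000
window = [Sample(now + 60 * i, cpu=95.0, mem=88.0) for i in range(11)]
print(should_page_senior_sre(window))  # True
```
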

Not designed for

  • Writing monitoring alerts or setting up observability tools (that's infrastructure setup, not runbook authoring)
  • Designing the underlying system architecture or choosing which services to monitor
  • Replacing incident management platforms like PagerDuty or Opsgenie with custom solutions
  • Creating runbooks for non-production environments or development workflow issues

Expects

Details about your production services, existing alert definitions, current MTTR metrics, and team structure including on-call rotation size and experience levels.
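
Rotation size directly determines how often each engineer is on call. This minimal sketch builds a weekly round-robin rotation with a primary and a secondary. It assumes the secondary is whoever will be primary the following week; that is a design choice for illustration, not something the listing prescribes. `weekly_rotation`, `on_call_load`, and the team names are hypothetical.

```python
from collections import Counter


def weekly_rotation(engineers: list[str], weeks: int) -> list[tuple[str, str]]:
    """Round-robin weekly schedule of (primary, secondary) pairs.

    The secondary for week w is the primary for week w + 1, so the next
    primary shadows the current week before taking over.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to staff primary and secondary")
    n = len(engineers)
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]


def on_call_load(schedule: list[tuple[str, str]]) -> Counter:
    """Count the weeks in which each engineer holds either role."""
    load: Counter = Counter()
    for primary, secondary in schedule:
        load[primary] += 1
        load[secondary] += 1
    return load


# Usage with a hypothetical four-person team over one year.
team = ["ana", "ben", "chen", "dara"]
year = weekly_rotation(team, weeks=52)
for engineer, weeks_on_call in sorted(on_call_load(year).items()):
    print(f"{engineer}: {weeks_on_call}/52 weeks on call ({weeks_on_call / 52:.0%})")
```

With four engineers, each person holds a role in 26 of 52 weeks and is on call for half the year. Computing this figure makes an undersized rotation visible before it turns into burnout.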

Returns

Complete runbook templates with copy-pasteable commands, escalation decision trees, alert-to-runbook mappings, and measurable toil reduction recommendations.
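
An alert-to-runbook mapping that claims 100% coverage can be checked mechanically. This minimal sketch assumes the alert names and the alert-to-URL mapping have already been exported into plain Python collections, for example from `runbook_url` annotations on alert rules. It reports three things: the coverage percentage, alerts with no runbook, and mapping entries whose alert no longer exists. The function name, alert names, and URLs are hypothetical.

```python
def runbook_coverage(alerts: set[str], runbooks: dict[str, str]) -> dict:
    """Check an alert -> runbook URL mapping against the set of live alerts.

    Returns the coverage percentage, alerts with no (or an empty) runbook
    link, and mapping entries whose alert no longer exists.
    """
    mapped = {a for a in alerts if runbooks.get(a)}  # empty URL counts as unmapped
    coverage = 100.0 * len(mapped) / len(alerts) if alerts else 100.0
    return {
        "coverage_pct": round(coverage, 1),
        "unmapped": sorted(alerts - mapped),
        "orphaned": sorted(set(runbooks) - alerts),
    }


# Usage with hypothetical alert names and runbook URLs.
alerts = {"HighCPU", "DiskFull", "CertExpiring", "QueueBacklog"}
runbooks = {
    "HighCPU": "https://runbooks.example.com/api#high-cpu",
    "DiskFull": "https://runbooks.example.com/storage#disk-full",
    "CertExpiring": "",  # placeholder entry without a link
    "LegacyHeartbeat": "https://runbooks.example.com/legacy",
}
report = runbook_coverage(alerts, runbooks)
print(report)
```

You can run this check in CI and fail the build whenever `coverage_pct` drops below 100. That stops the mapping from quietly drifting as alerts are added or renamed.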

Evidence Policy

Enabled: this skill cites sources and distinguishes evidence from opinion.

on-call · runbook · incident-response · sre · mttr · escalation · pagerduty · toil-reduction · alert-mapping · game-day · observability · rotation-design

Research Foundation: 7 sources (3 books, 2 official docs, 1 web, 1 industry framework)

This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.

Version History

v1.0.0 · 02/16/2026

Initial release

Common Workflows

Production Readiness Pipeline

Complete workflow from setting up monitoring to testing incident response procedures through controlled chaos experiments

Activate this skill in Claude Code

Sign up for free to access the full system prompt via REST API or MCP.

© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited. Terms of Service · Legal Notice