On-Call Runbook Expert

Design, author, and maintain operational runbooks that enable on-call engineers to diagnose and resolve incidents faster with structured response procedures, escalation frameworks, and toil reduction strategies.

Gold

v1.0.0 · 0 activations · DevOps & Infrastructure · Engineering · advanced

SupaScore: 83.95

Research Quality (15%): 8.4
Prompt Engineering (25%): 8.5
Practical Utility (15%): 8.5
Completeness (10%): 8.3
User Satisfaction (20%): 8.3
Decision Usefulness (15%): 8.3
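The headline score appears consistent with a weighted average of the six category scores, scaled from a 0-10 range to 0-100. That formula is an assumption inferred from the numbers listed here, not a documented SupaScore method; the sketch below only checks the arithmetic.

```python
# Weights and scores copied from the breakdown above.
# Assumption: overall score = 10 * sum(weight * score) / sum(weights).
# The page itself does not state how SupaScore is computed.
CATEGORIES = {
    "Research Quality": (0.15, 8.4),
    "Prompt Engineering": (0.25, 8.5),
    "Practical Utility": (0.15, 8.5),
    "Completeness": (0.10, 8.3),
    "User Satisfaction": (0.20, 8.3),
    "Decision Usefulness": (0.15, 8.3),
}


def weighted_score(categories):
    """Return the weighted average of 0-10 scores, scaled to 0-100."""
    total_weight = sum(weight for weight, _ in categories.values())
    weighted_sum = sum(weight * score for weight, score in categories.values())
    return round(10 * weighted_sum / total_weight, 2)


if __name__ == "__main__":
    print(weighted_score(CATEGORIES))
```

Under this assumed formula the result agrees with the 83.95 shown on the page.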

Best for

▸Creating actionable runbooks for high-severity production alerts with step-by-step diagnosis commands
▸Designing escalation frameworks that specify when to page senior engineers vs. when to auto-resolve
▸Building alert-to-runbook mapping systems that reduce MTTR (for example, from 45 minutes to 8 minutes)
▸Establishing on-call rotation schedules that prevent burnout while maintaining 99.9% SLA coverage
▸Implementing toil reduction strategies through runbook automation and self-healing infrastructure patterns
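As a rough illustration of the alert-to-runbook mapping idea in the list above, the following sketch keeps the mapping in a plain dictionary and reports which alerts still lack a runbook. The alert names, the `runbooks.example.com` URLs, and the helper names `runbook_for` and `coverage` are hypothetical; a real mapping would more likely live in the alerting platform or a shared spreadsheet.

```python
# Hypothetical alert names and runbook links, for illustration only.
ALERT_RUNBOOKS = {
    "HighCPU": "https://runbooks.example.com/high-cpu#diagnosis",
    "DiskFull": "https://runbooks.example.com/disk-full#diagnosis",
}


def runbook_for(alert_name, mapping=ALERT_RUNBOOKS):
    """Return the runbook link for an alert, or None if it is unmapped."""
    return mapping.get(alert_name)


def coverage(alert_names, mapping=ALERT_RUNBOOKS):
    """Return (fraction of distinct alerts with a runbook, sorted unmapped alerts)."""
    alerts = set(alert_names)
    if not alerts:
        return 1.0, []
    missing = sorted(name for name in alerts if name not in mapping)
    return 1 - len(missing) / len(alerts), missing


if __name__ == "__main__":
    fraction, missing = coverage(["HighCPU", "DiskFull", "OOMKilled"])
    print(f"coverage: {fraction:.0%}, unmapped: {missing}")
```

Listing the unmapped alerts alongside the fraction makes the gap to full coverage actionable rather than just a number.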

What you'll get

●Structured runbook template with numbered diagnosis steps, expected command outputs, and clear escalation triggers (e.g., 'If CPU > 90% for 10+ minutes AND memory > 85%, page senior SRE immediately')
●Alert-to-runbook mapping spreadsheet showing 100% coverage with direct links from PagerDuty alerts to specific runbook sections
●Toil reduction roadmap identifying 15 repetitive tasks that can be automated, with ROI calculations showing 20 hours/week savings
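The example escalation trigger quoted above ("CPU > 90% for 10+ minutes AND memory > 85%") could be encoded roughly as follows. This is a minimal sketch under stated assumptions: the `Sample` type and `should_page_senior` function are hypothetical names, metrics are assumed to arrive as per-minute samples, and the memory condition is read as applying to the latest sample only, which the original wording leaves open.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int      # minutes since the alert fired (hypothetical time base)
    cpu_pct: float   # CPU utilisation, 0-100
    mem_pct: float   # memory utilisation, 0-100


def should_page_senior(samples, cpu_limit=90.0, mem_limit=85.0, sustained_minutes=10):
    """Return True when CPU has stayed above cpu_limit for at least
    sustained_minutes up to the latest sample AND the latest memory
    reading is above mem_limit."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.minute)
    latest = ordered[-1]
    if latest.mem_pct <= mem_limit:
        return False
    # Walk backwards to find where the current unbroken CPU breach started.
    breach_start = None
    for sample in reversed(ordered):
        if sample.cpu_pct > cpu_limit:
            breach_start = sample.minute
        else:
            break
    if breach_start is None:
        return False
    return latest.minute - breach_start >= sustained_minutes


if __name__ == "__main__":
    hot = [Sample(minute=m, cpu_pct=95.0, mem_pct=88.0) for m in range(11)]
    print(should_page_senior(hot))
```

Keeping the thresholds as parameters lets each runbook state its own trigger while sharing one evaluation rule.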

Not designed for

×Writing monitoring alerts or setting up observability tools (that's infrastructure setup, not runbook authoring)
×Designing the underlying system architecture or choosing which services to monitor
×Replacing incident management platforms like PagerDuty or Opsgenie with custom solutions
×Creating runbooks for non-production environments or development workflow issues

Expects

Details about your production services, existing alert definitions, current MTTR metrics, and team structure including on-call rotation size and experience levels.

Returns

Complete runbook templates with copy-pasteable commands, escalation decision trees, alert-to-runbook mappings, and measurable toil reduction recommendations.

Evidence Policy

Enabled: this skill cites sources and distinguishes evidence from opinion.

on-call, runbook, incident-response, sre, mttr, escalation, pagerduty, toil-reduction, alert-mapping, game-day, observability, rotation-design

Research Foundation: 7 sources (3 books, 2 official docs, 1 web, 1 industry framework)

This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.

Version History

v1.0.0 · 2/16/2026

Initial release

Prerequisites

Use these skills first for best results.

Monitoring & Observability Designer (Gold)

Works well with

Chaos Engineering Practitioner (Gold)
Incident Response Playbook Builder (Gold)
Infrastructure as Code Architect (Gold)
Monitoring & Observability Designer (Gold)
SRE Incident Response Expert (Gold)

Need more depth?

Specialist skills that go deeper in areas this skill touches.

Incident Postmortem Facilitator (Gold)
Observability Pipeline Designer (Gold)
Site Reliability Engineer (Platinum)

Common Workflows

Production Readiness Pipeline

Complete workflow from setting up monitoring to testing incident response procedures through controlled chaos experiments.

Monitoring & Observability Designer → On-Call Runbook Expert → Incident Response Playbook Builder → Chaos Engineering Practitioner
