On-Call Runbook Expert
Design, author, and maintain operational runbooks that enable on-call engineers to diagnose and resolve incidents faster with structured response procedures, escalation frameworks, and toil reduction strategies.
SupaScore
83.95
Best for
- Creating actionable runbooks for high-severity production alerts with step-by-step diagnosis commands
- Designing escalation frameworks that specify when to page senior engineers vs. when to auto-resolve
- Building alert-to-runbook mapping systems that reduce MTTR from 45 minutes to 8 minutes
- Establishing on-call rotation schedules that prevent burnout while maintaining 99.9% SLA coverage
- Implementing toil reduction strategies through runbook automation and self-healing infrastructure patterns
What you'll get
- Structured runbook template with numbered diagnosis steps, expected command outputs, and clear escalation triggers (e.g., 'If CPU > 90% for 10+ minutes AND memory > 85%, page senior SRE immediately')
- Alert-to-runbook mapping spreadsheet showing 100% coverage with direct links from PagerDuty alerts to specific runbook sections
- Toil reduction roadmap identifying 15 repetitive tasks that can be automated, with ROI calculations showing 20 hours/week savings
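The example escalation trigger quoted above ('If CPU > 90% for 10+ minutes AND memory > 85%, page senior SRE immediately') can be written as a small, testable rule. The following is a minimal sketch, not part of the skill itself. It assumes one metric sample per minute, ordered oldest to newest, and it reads "memory > 85%" as applying to the most recent sample. The names `Sample` and `should_page_senior_sre` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One per-minute metric reading (hypothetical shape)."""
    cpu_pct: float
    mem_pct: float


def should_page_senior_sre(
    samples: Sequence[Sample],
    cpu_threshold: float = 90.0,
    mem_threshold: float = 85.0,
    sustained_minutes: int = 10,
) -> bool:
    """Return True when CPU exceeded the threshold in each of the last
    `sustained_minutes` samples AND the latest memory reading is high."""
    if len(samples) < sustained_minutes:
        # Not enough history to claim the condition was sustained.
        return False
    recent = samples[-sustained_minutes:]
    cpu_sustained = all(s.cpu_pct > cpu_threshold for s in recent)
    mem_high = samples[-1].mem_pct > mem_threshold
    return cpu_sustained and mem_high


# Usage: ten hot minutes trigger a page; a brief spike does not.
sustained = [Sample(cpu_pct=95.0, mem_pct=88.0)] * 10
brief_spike = [Sample(cpu_pct=50.0, mem_pct=88.0)] + sustained[:9]
page_on_sustained = should_page_senior_sre(sustained)
page_on_spike = should_page_senior_sre(brief_spike)
```

Keeping the thresholds as parameters lets each runbook state its own trigger values without changing the rule's logic.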
Not designed for
- Writing monitoring alerts or setting up observability tools (that's infrastructure setup, not runbook authoring)
- Designing the underlying system architecture or choosing which services to monitor
- Replacing incident management platforms like PagerDuty or Opsgenie with custom solutions
- Creating runbooks for non-production environments or development workflow issues
Details about your production services, existing alert definitions, current MTTR metrics, and team structure including on-call rotation size and experience levels.
Complete runbook templates with copy-pasteable commands, escalation decision trees, alert-to-runbook mappings, and measurable toil reduction recommendations.
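One way to verify alert-to-runbook mappings is a coverage check that lists every alert without a runbook link. The sketch below is illustrative only. The alert names, the `runbooks.example.com` URLs, and the function `runbook_coverage` are hypothetical and do not come from PagerDuty or any other platform's API.

```python
from typing import Dict, List, Tuple


def runbook_coverage(
    alert_names: List[str], runbook_links: Dict[str, str]
) -> Tuple[float, List[str]]:
    """Return (coverage percentage, sorted list of alerts with no runbook link)."""
    # An alert counts as unmapped if it has no entry or an empty link.
    unmapped = sorted(a for a in alert_names if not runbook_links.get(a))
    if not alert_names:
        return 100.0, unmapped
    covered = len(alert_names) - len(unmapped)
    return 100.0 * covered / len(alert_names), unmapped


# Usage with made-up alerts: two of three alerts link to a runbook section.
alerts = ["HighCPU", "DiskFull", "APILatency"]
links = {
    "HighCPU": "https://runbooks.example.com/high-cpu#diagnosis",
    "DiskFull": "https://runbooks.example.com/disk-full#cleanup",
}
coverage_pct, missing = runbook_coverage(alerts, links)
```

Running a check like this in CI would flag new alerts that ship without a linked runbook section.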
Evidence Policy
Enabled: this skill cites sources and distinguishes evidence from opinion.
Research Foundation: 7 sources (3 books, 2 official docs, 1 web source, 1 industry framework)
This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.
Version History
Initial release
Common Workflows
Production Readiness Pipeline
A complete workflow, from setting up monitoring to testing incident response procedures through controlled chaos experiments.
© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited.