Ensure your systems are reliable and handle high traffic smoothly.
Site Reliability Engineer
Google SRE principles, high availability
Best for
- Implementing an SLO/SLI/SLA framework with error budgets for high-traffic production services
- Designing on-call runbooks and incident response procedures for 99.9%+ availability targets
- Building blameless postmortem processes and implementing toil reduction automation
- Establishing capacity planning models and reliability metrics for multi-service platforms
What you'll get
- SLO specification documents with precise availability targets (99.95%), latency percentiles (p95 < 200 ms), and measurement windows, with corresponding SLI definitions and alerting thresholds
- Comprehensive incident response playbooks with escalation matrices, communication templates, and step-by-step troubleshooting procedures organized by service criticality tier
- Error budget policy frameworks with deployment freeze triggers, burn rate calculations, and quarterly budget allocation strategies tied to business objectives
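As a rough illustration of the burn rate calculations and deployment freeze triggers mentioned above, here is a minimal Python sketch. It is not taken from the skill itself: the names `BudgetStatus`, `error_budget_status`, and `freeze_at` are hypothetical, the burn rate is taken to be the observed error rate divided by the budget fraction (1 - SLO), and traffic is assumed to be roughly uniform across the measurement window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    burn_rate: float        # 1.0 means the budget lasts exactly one window
    budget_consumed: float  # fraction of the window's budget already spent
    freeze_deploys: bool    # hypothetical policy flag, not a standard field


def error_budget_status(slo, total_requests, failed_requests,
                        window_elapsed, freeze_at=1.0):
    """Estimate burn rate and budget consumption for a request-based SLO.

    slo            -- target success ratio, e.g. 0.9995 for 99.95%
    window_elapsed -- fraction of the measurement window that has passed
    freeze_at      -- assumed policy threshold: freeze once this fraction
                      of the budget has been consumed
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < window_elapsed <= 1.0:
        raise ValueError("window_elapsed must be in (0, 1]")

    budget_fraction = 1.0 - slo                   # allowed failure ratio
    error_rate = failed_requests / total_requests
    burn_rate = error_rate / budget_fraction
    # Assumes uniform traffic: spending at this rate for the elapsed part
    # of the window consumes burn_rate * window_elapsed of the budget.
    budget_consumed = burn_rate * window_elapsed
    return BudgetStatus(burn_rate, budget_consumed,
                        budget_consumed >= freeze_at)


if __name__ == "__main__":
    status = error_budget_status(
        slo=0.9995, total_requests=1_000_000, failed_requests=1_000,
        window_elapsed=0.25,
    )
    print(status)
```

With these example inputs the service fails 0.1% of requests against a 0.05% budget, so it burns budget at about twice the sustainable rate. A real policy would likely evaluate several look-back windows rather than a single one.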
Input: Production service architecture details, current reliability metrics, incident history, and business criticality requirements for implementing comprehensive SRE practices.
Output: Detailed SRE implementation plans including SLO definitions, error budget policies, monitoring strategies, incident response procedures, and automation roadmaps with specific reliability targets.
What's inside
“You are a Site Reliability Engineer. You design and implement systems for sustainable operational excellence, turning reliability from a feature into a competitive advantage. - **Error budgets as negotiation currency**: You frame reliability vs. velocity trade-offs quantitatively (99.95% SLO = 21.6 ...”
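The excerpt is cut off after "21.6", so its unit and window are not stated. One reading that fits the number: over an assumed 30-day window, a 99.95% availability target leaves 21.6 minutes of allowed downtime. The sketch below shows that arithmetic; the function name `allowed_downtime_minutes` and the 30-day default are assumptions for illustration, not part of the skill.

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Convert an availability target into allowed downtime for a window.

    slo         -- availability target as a ratio, e.g. 0.9995
    window_days -- assumed measurement window length in days
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    for target in (0.999, 0.9995):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days: {minutes:.1f} minutes")
```

Stating the budget in minutes gives product and engineering teams a concrete quantity to trade against release velocity.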
Not designed for
- Basic system monitoring or simple uptime checks without reliability engineering methodology
- Pure infrastructure provisioning or deployment automation without SRE practices
- Security incident response or compliance-focused operational procedures
- Application performance optimization without service level objective context
SupaScore
89.03
Evidence Policy
Standard: no explicit evidence policy.
Research Foundation: 8 sources (7 books, 1 official docs)
This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.
Version History
v5.5 distilled from v2 via Claude Sonnet
Pipeline v4: rebuilt with 3 helper skills
Initial release
Common Workflows
Production Reliability Implementation
Complete SRE program implementation from SLO definition through monitoring setup, incident management, and reliability testing
© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited.