Site Reliability Engineer
Applies Google SRE principles to design reliable, observable, and operable production systems. Covers SLO/SLI/SLA definition, error budgets, toil reduction, capacity planning, incident management, and blameless postmortem practices.
SupaScore: 86
Best for
- Implementing an SLO/SLI/SLA framework with error budgets for high-traffic production services
- Designing on-call runbooks and incident response procedures for 99.9%+ availability targets
- Building blameless postmortem processes and implementing toil reduction automation
- Establishing capacity planning models and reliability metrics for multi-service platforms
- Creating chaos engineering experiments and failure injection testing strategies
What you'll get
- SLO specification documents with precise availability targets (99.95%), latency percentiles (p95 < 200ms), and measurement windows with corresponding SLI definitions and alerting thresholds
- Comprehensive incident response playbooks with escalation matrices, communication templates, and step-by-step troubleshooting procedures organized by service criticality tiers
- Error budget policy frameworks with deployment freeze triggers, burn rate calculations, and quarterly budget allocation strategies tied to business objectives
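To make the availability target and burn rate terms above concrete, here is a minimal Python sketch of the arithmetic behind an error budget. It is an illustration under stated assumptions, not part of the skill itself: the function names, the 30-day window, and the freeze threshold are all hypothetical choices. Burn rate is taken in its usual sense, the observed error ratio divided by the error ratio the SLO allows.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime the SLO tolerates over the measurement window."""
    return window * (1.0 - slo_target)


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """Speed of budget consumption; 1.0 means spending exactly on budget."""
    return observed_error_ratio / (1.0 - slo_target)


def time_to_exhaustion(window: timedelta, rate: float) -> timedelta:
    """Time until a full budget is spent if the current burn rate holds."""
    return window / rate


def should_freeze_deploys(budget_spent_fraction: float,
                          threshold: float = 1.0) -> bool:
    """Hypothetical freeze trigger: stop releases once the spent share
    of the budget reaches the threshold (assumed to be 100% here)."""
    return budget_spent_fraction >= threshold


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    slo = 0.9995                 # the 99.95% availability target

    print("Budget:", error_budget(slo, window))
    rate = burn_rate(observed_error_ratio=0.005, slo_target=slo)
    print("Burn rate:", round(rate, 2))
    print("Exhausted in:", time_to_exhaustion(window, rate))
    print("Freeze deploys:", should_freeze_deploys(budget_spent_fraction=0.4))
```

With a 99.95% target over 30 days the budget works out to about 21.6 minutes of downtime, and a sustained 0.5% error ratio burns it roughly ten times faster than planned, which would exhaust it in about three days. A real policy would normally evaluate burn rate over more than one lookback window before triggering a freeze.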
Not designed for
- Basic system monitoring or simple uptime checks without reliability engineering methodology
- Pure infrastructure provisioning or deployment automation without SRE practices
- Security incident response or compliance-focused operational procedures
- Application performance optimization without service level objective context
Input: Production service architecture details, current reliability metrics, incident history, and business criticality requirements for implementing comprehensive SRE practices.
Output: Detailed SRE implementation plans including SLO definitions, error budget policies, monitoring strategies, incident response procedures, and automation roadmaps with specific reliability targets.
Evidence Policy
Enabled: this skill cites sources and distinguishes evidence from opinion.
Research Foundation: 8 sources (7 books, 1 official docs)
This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.
Version History
Initial release
Common Workflows
Production Reliability Implementation
Complete SRE program implementation from SLO definition through monitoring setup, incident management, and reliability testing
Activate this skill in Claude Code
Sign up for free to access the full system prompt via REST API or MCP.
© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited.