Site Reliability Engineer

Applies Google SRE principles to design reliable, observable, and operable production systems. Covers SLO/SLI/SLA definition, error budgets, toil reduction, capacity planning, incident management, and blameless postmortem practices.

Platinum
v1.0.0 · 0 activations · DevOps & Infrastructure · Engineering · expert

SupaScore: 86

  • Research Quality (15%): 8.6
  • Prompt Engineering (25%): 8.8
  • Practical Utility (15%): 8.5
  • Completeness (10%): 8.5
  • User Satisfaction (20%): 8.4
  • Decision Usefulness (15%): 8.7
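
The six weights above sum to 100%, and a weighted average of the sub-scores, scaled from a 10-point to a 100-point range, reproduces the headline figure of 86. The page does not say how SupaScore is computed, so the formula in this sketch is an assumption that happens to match the listed numbers, and the names `SUB_SCORES` and `weighted_score` are illustrative only.

```python
# Assumed formula: weighted average of sub-scores (0-10), scaled to 0-100.
# The page lists the weights and values but not the calculation itself.
SUB_SCORES = {
    "Research Quality": (15, 8.6),      # (weight in percent, score out of 10)
    "Prompt Engineering": (25, 8.8),
    "Practical Utility": (15, 8.5),
    "Completeness": (10, 8.5),
    "User Satisfaction": (20, 8.4),
    "Decision Usefulness": (15, 8.7),
}


def weighted_score(sub_scores: dict[str, tuple[int, float]]) -> float:
    """Return the weighted average of the sub-scores on a 0-100 scale."""
    total_weight = sum(weight for weight, _ in sub_scores.values())
    weighted_sum = sum(weight * score for weight, score in sub_scores.values())
    return weighted_sum / total_weight * 10


if __name__ == "__main__":
    print(round(weighted_score(SUB_SCORES)))
```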

Best for

  • Implementing an SLO/SLI/SLA framework with error budgets for high-traffic production services
  • Designing on-call runbooks and incident response procedures for 99.9%+ availability targets
  • Building blameless postmortem processes and implementing toil reduction automation
  • Establishing capacity planning models and reliability metrics for multi-service platforms
  • Creating chaos engineering experiments and failure injection testing strategies

What you'll get

  • SLO specification documents with precise availability targets (99.95%), latency percentiles (p95 < 200ms), and measurement windows with corresponding SLI definitions and alerting thresholds
  • Comprehensive incident response playbooks with escalation matrices, communication templates, and step-by-step troubleshooting procedures organized by service criticality tiers
  • Error budget policy frameworks with deployment freeze triggers, burn rate calculations, and quarterly budget allocation strategies tied to business objectives
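
To make the error budget and burn rate terms in the list above concrete, here is a minimal sketch of the standard arithmetic, using the 99.95% availability target quoted in the first bullet. The 30-day window, the `Slo` class, the function names, and the freeze threshold are assumptions chosen for illustration; they are not taken from the skill's own output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float      # e.g. 0.9995 for a 99.95% availability SLO
    window_days: int   # measurement window; 30 days is an assumed value

    @property
    def error_budget(self) -> float:
        """Fraction of events (or time) allowed to fail within the window."""
        return 1.0 - self.target

    def allowed_downtime_minutes(self) -> float:
        """Error budget expressed as minutes of full outage per window."""
        return self.error_budget * self.window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: Slo) -> float:
    """How fast the budget is being spent; 1.0 exhausts it exactly at window end."""
    return observed_error_ratio / slo.error_budget


def budget_remaining(bad_events: int, total_events: int, slo: Slo) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if total_events == 0:
        return 1.0
    allowed_bad = slo.error_budget * total_events
    return 1.0 - bad_events / allowed_bad


def should_freeze_deploys(remaining: float, threshold: float = 0.0) -> bool:
    """Hypothetical policy: freeze deployments once the remaining budget hits the threshold."""
    return remaining <= threshold


if __name__ == "__main__":
    slo = Slo(target=0.9995, window_days=30)
    print(round(slo.allowed_downtime_minutes(), 1))          # minutes of budget per window
    print(round(burn_rate(0.005, slo), 2))                   # 0.5% errors against a 0.05% budget
    print(round(budget_remaining(250, 1_000_000, slo), 2))   # share of budget left
```

At a 99.95% target over 30 days the budget is about 21.6 minutes of full outage, and a sustained 0.5% error ratio burns that budget roughly ten times faster than the SLO allows. A real policy would normally evaluate burn rate over several alerting windows rather than a single ratio.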

Not designed for

  • Basic system monitoring or simple uptime checks without reliability engineering methodology
  • Pure infrastructure provisioning or deployment automation without SRE practices
  • Security incident response or compliance-focused operational procedures
  • Application performance optimization without service level objective context

Expects

Production service architecture details, current reliability metrics, incident history, and business criticality requirements for implementing comprehensive SRE practices.

Returns

Detailed SRE implementation plans including SLO definitions, error budget policies, monitoring strategies, incident response procedures, and automation roadmaps with specific reliability targets.

Evidence Policy

Enabled: this skill cites sources and distinguishes evidence from opinion.

sre · reliability · slo · sli · error-budget · incident-management · postmortem · toil-reduction · capacity-planning · observability · on-call · dora-metrics

Research Foundation: 8 sources (7 books, 1 official docs)

This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.

Version History

v1.0.0 · 2/15/2026

Initial release

Common Workflows

Production Reliability Implementation

Complete SRE program implementation from SLO definition through monitoring setup, incident management, and reliability testing

© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited. Terms of Service · Legal Notice