DevOps & Infrastructure · Engineering · Platinum

Ensure your systems are reliable and handle high traffic smoothly.

Site Reliability Engineer

Google SRE principles, high availability

expert · v5.0

Best for

  • Implementing an SLO/SLI/SLA framework with error budgets for high-traffic production services
  • Designing on-call runbooks and incident response procedures for 99.9%+ availability targets
  • Building blameless postmortem processes and implementing toil reduction automation
  • Establishing capacity planning models and reliability metrics for multi-service platforms

What you'll get

  • SLO specification documents with precise availability targets (99.95%), latency percentiles (p95 < 200ms), and measurement windows with corresponding SLI definitions and alerting thresholds
  • Comprehensive incident response playbooks with escalation matrices, communication templates, and step-by-step troubleshooting procedures organized by service criticality tiers
  • Error budget policy frameworks with deployment freeze triggers, burn rate calculations, and quarterly budget allocation strategies tied to business objectives
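To make the SLO targets and burn rate calculations listed above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the skill's own method: the `SLO` dataclass, its field names, the nearest-rank percentile, and the freeze threshold of 1.0 are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO spec; field names are illustrative only."""
    availability_target: float  # e.g. 0.9995 for 99.95%
    latency_p95_ms: float       # e.g. 200.0 for "p95 < 200ms"
    window_days: int            # measurement window (assumed, e.g. 30)


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def burn_rate(slo: SLO, good: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly on schedule;
    2.0 means it would be exhausted halfway through the window.
    """
    if total == 0:
        return 0.0
    observed_error = 1.0 - good / total
    allowed_error = 1.0 - slo.availability_target
    return observed_error / allowed_error


def meets_slo(slo: SLO, good: int, total: int, latencies_ms: list[float]) -> bool:
    """True if both the availability and the p95 latency targets hold."""
    availability_ok = total > 0 and good / total >= slo.availability_target
    latency_ok = p95(latencies_ms) < slo.latency_p95_ms
    return availability_ok and latency_ok


def should_freeze(slo: SLO, good: int, total: int, threshold: float = 1.0) -> bool:
    """Assumed policy: freeze deployments when the burn rate exceeds threshold."""
    return burn_rate(slo, good, total) > threshold


if __name__ == "__main__":
    slo = SLO(availability_target=0.9995, latency_p95_ms=200.0, window_days=30)
    print(burn_rate(slo, good=9_990, total=10_000))      # roughly 2x budget spend
    print(should_freeze(slo, good=9_990, total=10_000))
```

In practice, burn rate alerts are usually evaluated over several windows at once (for example a short and a long window); the single-window check here is kept deliberately simple.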

Expects

Production service architecture details, current reliability metrics, incident history, and business criticality requirements for implementing comprehensive SRE practices.

Returns

Detailed SRE implementation plans including SLO definitions, error budget policies, monitoring strategies, incident response procedures, and automation roadmaps with specific reliability targets.

What's inside

You are a Site Reliability Engineer. You design and implement systems for sustainable operational excellence, turning reliability from a feature into a competitive advantage.

  • **Error budgets as negotiation currency**: You frame reliability vs. velocity trade-offs quantitatively (99.95% SLO = 21.6 ...
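The excerpt above is cut off before it states the unit or window for the 21.6 figure. As a hedged illustration of the underlying arithmetic, the sketch below converts an availability target into allowed downtime, assuming a 30-day window measured in minutes; under that assumption a 99.95% target works out to about 21.6 minutes. The function name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window.

    Assumes the window is measured in calendar minutes (window_days * 24 * 60)
    and that the whole budget is expressed as full-outage time.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # Compare a few common targets over the assumed 30-day window.
    for target in (0.999, 0.9995, 0.9999):
        minutes = error_budget_minutes(target)
        print(f"{target:.2%} -> {minutes:.1f} min per 30 days")
```

Expressing the budget in minutes is what makes it usable as a shared number between teams: a planned risky release can be weighed against how many of those minutes remain.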

Covers

What You Do Differently · Methodology · Watch For

Not designed for

  • Basic system monitoring or simple uptime checks without reliability engineering methodology
  • Pure infrastructure provisioning or deployment automation without SRE practices
  • Security incident response or compliance-focused operational procedures
  • Application performance optimization without service level objective context

SupaScore: 89.03

  • Research Quality (15%): 9.1
  • Prompt Engineering (25%): 8.95
  • Practical Utility (15%): 8.65
  • Completeness (10%): 9.3
  • User Satisfaction (20%): 8.8
  • Decision Usefulness (15%): 8.75
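The page does not say how the overall SupaScore is derived from the category scores. One reading that is consistent with the listed numbers is a weighted average of the 0-10 category scores scaled to 100; the sketch below checks that assumption. The dictionary layout and the function name `weighted_score` are hypothetical.

```python
# Category weights and scores exactly as listed on the page.
CATEGORIES: dict[str, tuple[float, float]] = {
    "Research Quality": (0.15, 9.1),
    "Prompt Engineering": (0.25, 8.95),
    "Practical Utility": (0.15, 8.65),
    "Completeness": (0.10, 9.3),
    "User Satisfaction": (0.20, 8.8),
    "Decision Usefulness": (0.15, 8.75),
}


def weighted_score(categories: dict[str, tuple[float, float]]) -> float:
    """Weighted average of 0-10 scores, scaled to a 0-100 range (assumed formula)."""
    total_weight = sum(weight for weight, _ in categories.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1.0")
    return 10.0 * sum(weight * score for weight, score in categories.values())


if __name__ == "__main__":
    print(weighted_score(CATEGORIES))
```

Under this assumption the result is approximately 89.025, which agrees with the displayed 89.03 when rounded to two decimals.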

Evidence Policy

Standard: no explicit evidence policy.

sre · reliability · slo · sli · error-budget · incident-management · postmortem · toil-reduction · capacity-planning · observability · on-call · dora-metrics

Research Foundation: 8 sources (7 books, 1 official docs)

This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.

Version History

v5.0 · 3/25/2026

v5.5 distilled from v2 via Claude Sonnet

v2.0 · 2/19/2026

Pipeline v4: rebuilt with 3 helper skills

v1.0.0 · 2/15/2026

Initial release

Common Workflows

Production Reliability Implementation

Complete SRE program implementation from SLO definition through monitoring setup, incident management, and reliability testing

© 2026 Kill The Dragon GmbH. This skill and its system prompt are protected by copyright. Unauthorised redistribution is prohibited. Terms of Service · Legal Notice