DevOps & Infrastructure · Engineering · Platinum

Ensure your systems are reliable and handle high traffic smoothly.

Site Reliability Engineer

Google SRE principles, high availability

intermediate · v6.0

Best for

▸Implementing an SLO/SLI/SLA framework with error budgets for high-traffic production services
▸Designing on-call runbooks and incident response procedures for 99.9%+ availability targets
▸Building blameless postmortem processes and implementing toil reduction automation
▸Establishing capacity planning models and reliability metrics for multi-service platforms

What you'll get

▸SLO specification documents with precise availability targets (e.g. 99.95%), latency percentiles (e.g. p95 < 200ms), measurement windows, and the corresponding SLI definitions and alerting thresholds
▸Comprehensive incident response playbooks with escalation matrices, communication templates, and step-by-step troubleshooting procedures organized by service criticality tiers
▸Error budget policy frameworks with deployment freeze triggers, burn rate calculations, and quarterly budget allocation strategies tied to business objectives
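The error budget and burn rate arithmetic mentioned in the list above can be sketched as follows. This is a minimal illustration, not the skill's own output: the function names, the 30-day window, and the simple "freeze once the budget is spent" rule are all assumptions chosen for the example.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed unavailability for an availability SLO over a measurement window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return window * (1.0 - slo)


def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    A value of 1.0 means the budget would be spent exactly at the end of
    the window; higher values mean it runs out sooner.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (bad_events / total_events) / (1.0 - slo)


def should_freeze_deployments(consumed: timedelta, budget: timedelta) -> bool:
    """Hypothetical freeze trigger: stop risky deploys once the budget is spent."""
    return consumed >= budget


# Example inputs (assumed): a 99.95% availability SLO over a 30-day window.
budget = error_budget(0.9995, timedelta(days=30))
rate = burn_rate(bad_events=50, total_events=10_000, slo=0.9995)
freeze = should_freeze_deployments(timedelta(minutes=25), budget)
```

Under these assumptions, a 99.95% target over 30 days leaves roughly 21.6 minutes of error budget, and 50 failed requests out of 10,000 corresponds to a burn rate of about 10. A real policy would usually evaluate burn rate over several alerting windows rather than a single ratio.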

Expects

Production service architecture details, current reliability metrics, incident history, and business criticality requirements for implementing comprehensive SRE practices.

Returns

Detailed SRE implementation plans including SLO definitions, error budget policies, monitoring strategies, incident response procedures, and automation roadmaps with specific reliability targets.

What's inside

“You are a Senior Site Reliability Engineer. You apply Google SRE methodology to design and maintain reliable production systems, translating reliability principles into concrete frameworks, policies, and operational practices. - **Error budgets as the negotiation tool.** You never recommend 100% SLO...”

Covers

What You Do Differently · Methodology · Watch For

Not designed for

×Basic system monitoring or simple uptime checks without reliability engineering methodology
×Pure infrastructure provisioning or deployment automation without SRE practices
×Security incident response or compliance-focused operational procedures
×Application performance optimization without service level objective context

SupaScore

89.03

Research Quality (15%)

9.1

Prompt Engineering (25%)

8.95

Practical Utility (15%)

8.65

Completeness (10%)

9.3

User Satisfaction (20%)

8.8

Decision Usefulness (15%)

8.75

Evidence Policy

Standard: no explicit evidence policy.

sre, reliability, slo, sli, error-budget, incident-management, postmortem, toil-reduction, capacity-planning, observability, on-call, dora-metrics

Research Foundation: 8 sources (7 books, 1 official docs)

This skill was developed through independent research and synthesis. SupaSkills is not affiliated with or endorsed by any cited author or organisation.

Version History

v6.0 · 6/12/2026

v6.0 wave-1 repair: re-distilled from masterfile/v2 (truncation incident 2026-06, delta-first rules)

v5.0 · 3/25/2026

v5.5 distilled from v2 via Claude Sonnet

v2.0 · 2/19/2026

Pipeline v4: rebuilt with 3 helper skills

v1.0.0 · 2/15/2026

Initial release

Works well with

▸Chaos Engineering Practitioner (Platinum)
▸Distributed Tracing Engineer (Platinum)
▸Incident Postmortem Facilitator (Platinum)
▸Monitoring & Observability Designer (Platinum)
▸On-Call Runbook Expert (Platinum)

Need more depth?

Specialist skills that go deeper in areas this skill touches.

▸Kubernetes Operations Advisor (Platinum)
▸Terraform Infrastructure Architect (Platinum)
▸Grafana & Prometheus Expert (Platinum)

Common Workflows

Production Reliability Implementation

Complete SRE program implementation, from SLO definition through monitoring setup and incident management to reliability testing

site-reliability-engineer → Monitoring & Observability Designer → Incident Postmortem Facilitator → Chaos Engineering Practitioner