Mercury Skills
v1.0.0 cosmicstack-labs

SRE Practices

SLIs/SLOs/SLAs, error budgets, incident response, postmortems, and reliability patterns

Tags: sre, reliability, slo, incident-response, postmortem

SRE Practices#

Apply Site Reliability Engineering principles.

Service Level Concepts#

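Term | Definition | Example
---- | ---------- | -------
SLI | Measured indicator of service behavior | Fraction of requests served in < 500 ms (or p95 request latency)
SLO | Target the SLI must meet over a time window | 99.9% of requests < 500 ms
SLA | Contractual commitment (usually looser than the SLO) | 99.5% uptime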
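
To make the SLI/SLO distinction concrete, here is a minimal sketch that computes a latency SLI from a batch of requests and checks it against an SLO target. The `Request` type, the function names, and the decision to treat an empty window as compliant are illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def latency_sli(requests: list[Request], threshold_ms: float = 500.0) -> float:
    """SLI: fraction of requests that succeeded and finished under the threshold."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for r in requests if r.ok and r.latency_ms < threshold_ms)
    return good / len(requests)


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """SLO: the target the measured SLI has to reach."""
    return sli >= slo_target


# 998 fast successes, one slow success, one fast failure.
sample = [Request(120, True)] * 998 + [Request(900, True), Request(80, False)]
sli = latency_sli(sample)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli)}")
```

In this sample 998 of 1,000 requests are good, so the SLI is 0.998 and the 99.9% objective is missed.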

Choosing SLOs#

  • Pick metrics users actually care about
  • Start with availability + latency + durability
  • Tighten SLOs over time as reliability improves
  • Don't over-constrain (cost/effort vs. benefit)

Error Budgets#

Error Budget = 100% - SLO
Example: 99.9% SLO → 0.1% error budget = ~43 minutes per 30-day month (~8.76 hours/year)
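
As a check on the arithmetic, the sketch below converts an availability SLO into allowed downtime for a given window. The function name and the 30-day default window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


for slo in (0.99, 0.999, 0.9999):
    minutes = error_budget_minutes(slo)
    print(f"SLO {slo:.2%}: {minutes:.1f} minutes per 30 days")
```

For a 99.9% SLO this gives about 43.2 minutes per 30 days, or about 525.6 minutes (8.76 hours) with `window_days=365`.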

How Error Budgets Work#

  • If error budget remaining → can deploy new features
  • If error budget exhausted → freeze deployments, focus on reliability
  • Error budget burn rate alerts trigger incident response
  • Balance innovation velocity with system stability
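
The policy above can be sketched as two small calculations: how much budget is left in the window, and how fast it is currently burning. All names here are hypothetical, and the paging threshold of 14.4 (a rate that spends 2% of a 30-day budget in one hour) is one commonly cited choice, not a requirement.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the budgeted ratio; 1.0 means on budget."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def budget_remaining(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo) * total_events
    if allowed_bad <= 0:
        return 0.0
    return 1.0 - bad_events / allowed_bad


def can_deploy(bad_events: int, total_events: int, slo: float) -> bool:
    """Feature deploys are allowed only while some budget remains."""
    return budget_remaining(bad_events, total_events, slo) > 0


PAGE_BURN_RATE = 14.4  # assumption: illustrative fast-burn paging threshold


def should_page(error_ratio: float, slo: float) -> bool:
    return burn_rate(error_ratio, slo) >= PAGE_BURN_RATE


print(can_deploy(bad_events=400, total_events=1_000_000, slo=0.999))
print(should_page(error_ratio=0.02, slo=0.999))
```

With 1,000,000 requests and a 99.9% SLO, roughly 1,000 bad requests are allowed, so 400 failures leave about 60% of the budget and deploys continue. A 2% error ratio is a burn rate of about 20, which would page under the assumed threshold.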

Incident Response#

Severity Levels#

Level | Definition | Response
----- | ---------- | --------
SEV1 | System down, affecting many users | Immediate, all hands
SEV2 | Degraded but operational | 30-minute response
SEV3 | Minor issue, workaround exists | Next business day
SEV4 | Cosmetic, non-critical | Next sprint
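
One way to keep triage consistent is to encode the table as data. The sketch below is a hypothetical mapping: the three boolean inputs and the `classify` rules are assumptions about how a team might interpret the definitions, and real triage usually needs human judgment.

```python
from enum import Enum


class Severity(Enum):
    """Severity level paired with its expected response."""
    SEV1 = "Immediate, all hands"
    SEV2 = "30-minute response"
    SEV3 = "Next business day"
    SEV4 = "Next sprint"


def classify(system_down: bool, degraded: bool, workaround_exists: bool) -> Severity:
    """Map simple triage answers to a severity level (illustrative rules only)."""
    if system_down:
        return Severity.SEV1
    if degraded and not workaround_exists:
        return Severity.SEV2
    if degraded:
        return Severity.SEV3
    return Severity.SEV4


level = classify(system_down=False, degraded=True, workaround_exists=False)
print(level.name, "->", level.value)
```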

Incident Command System#

  • Incident Commander: Coordinates response
  • Communications Lead: Status updates, stakeholder comms
  • Operations Lead: Technical investigation
  • Scribe: Timeline and action log

Postmortem Culture#

  • Blameless: systems failed, not people
  • Focus on: detection, response, prevention
  • Action items with owners and due dates
  • Share postmortems org-wide
  • Track action items to completion

Reliability Patterns#

  • Circuit breaker: stop cascading failures
  • Bulkhead: isolate failure domains
  • Retry with exponential backoff + jitter
  • Rate limiting: protect against traffic spikes
  • Graceful degradation: degrade features, not the whole system
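
Two of these patterns are sketched below: retry with exponential backoff plus full jitter, and a minimal circuit breaker. Every name, default, and threshold is an illustrative assumption. The breaker is single-threaded and counts consecutive failures only, so treat it as a starting point and not as a production implementation.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn, retrying on any exception with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap an outbound request as `breaker.call(lambda: retry_with_backoff(fetch))`, where `fetch` is a hypothetical function that performs the request. The `sleep`, `rng`, and `clock` parameters exist so the behavior can be tested without real waiting.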
