Cloud DevOps SRE Playbook

A playbook for senior DevOps and SRE engineers

PLAYBOOK #1: INFRASTRUCTURE MATURITY & SENIOR SIGNALS

For: Senior DevOps, SRE, Platform Engineer & Tech Lead

“Infrastructure is not just code, it is an operations culture.” This playbook helps you assess system health, prepare for senior-level interviews, and, most importantly, learn how to read an organization through its ownership culture.

PART 1: 5 Signals of a Mature System

Use these signals to quickly evaluate infrastructure maturity when you take over a system or build one from scratch.

1. Observability 2.0 (Beyond Monitoring)

Standard: You have “The Three Pillars”: metrics, logs, and traces.

Senior signal: Each microservice has its own dashboard. Alerts fire on user-facing symptoms (latency, error rate) rather than on internal causes (CPU, disk), which reduces alert fatigue.

Reality check: Can you immediately identify which service is slow (latency) without checking each server one by one?
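One way to make this reality check concrete is to rank services by tail latency instead of inspecting servers one by one. The sketch below is only an illustration: the service names and latency samples are hypothetical, and in practice the percentiles would come from your metrics backend rather than from in-memory lists.

```python
from statistics import quantiles

def p95(samples_ms):
    """Return the 95th percentile of latency samples (milliseconds)."""
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    return quantiles(samples_ms, n=20, method="inclusive")[-1]

def slowest_service(latencies_by_service):
    """Return (service_name, p95_ms) for the service with the highest p95."""
    ranked = {name: p95(samples) for name, samples in latencies_by_service.items()}
    worst = max(ranked, key=ranked.get)
    return worst, ranked[worst]

# Hypothetical per-service samples, for illustration only.
latencies = {
    "checkout": [120, 130, 125, 900, 140, 135],
    "search": [40, 45, 50, 42, 48, 44],
}
```

Calling slowest_service(latencies) points at "checkout", because a single 900 ms outlier dominates its tail even though its typical latency looks healthy.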

2. SLO & Incident Culture (Accountability without blame)

Standard: Clear SLIs and SLOs per service, not just “99.9% uptime”.

Senior signal: Blameless postmortems after incidents. Every incident produces action items that go into a prioritized backlog. Use an error budget to balance feature velocity and reliability.
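The error budget idea can be sketched as simple arithmetic: the budget is the share of requests the SLO allows to fail, and the remaining budget drives the release decision. The function names, the 99.9% target, and the "freeze" policy below are assumptions for illustration, not a prescribed policy.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for a request-based SLO.

    slo_target is a fraction, for example 0.999 for a 99.9% SLO.
    """
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining": allowed - failed_requests,
        "consumed_fraction": failed_requests / allowed if allowed else float("inf"),
    }

def release_policy(budget):
    """Hypothetical policy: stop feature releases once the budget is spent."""
    return "freeze features" if budget["remaining"] <= 0 else "ship features"
```

With a 99.9% SLO over 1,000,000 requests the budget is about 1,000 failed requests, so 400 failures consume roughly 40% of it and releases continue, while 1,200 failures exhaust it.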

3. Immutable Infrastructure & GitOps

Standard: Infrastructure is managed as code (Terraform, CloudFormation, Pulumi).

Senior signal: No manual console changes. Drift detection is in place. Every change goes through pull requests, reviews, and an audit trail.
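Drift detection boils down to comparing the state declared in code with the state observed in the cloud account. The following is a minimal sketch of that comparison over plain dictionaries; it is not how Terraform, CloudFormation, or Pulumi implement it, and the setting names are hypothetical.

```python
_MISSING = "<absent>"

def detect_drift(declared, actual):
    """Return {key: (declared_value, actual_value)} for every setting that differs."""
    drift = {}
    # Walk the union of keys so settings added or removed by hand are also caught.
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key, _MISSING)
        have = actual.get(key, _MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift
```

An empty result means the live environment matches the code. Anything else is a candidate for either a pull request (to adopt the change) or a re-apply (to revert it).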

4. CI/CD & Progressive Delivery

Standard: Automated pipeline from code to deploy.

Senior signal: Automated rollback within 5 minutes. Use safer release strategies such as canary or blue-green deployments to isolate risk.
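A canary release needs an automated gate that decides whether to promote or roll back. The sketch below compares the canary's error rate with the baseline's. The thresholds (max_ratio, min_error_rate, min_requests) are illustrative assumptions, and a real gate would usually also look at latency and saturation.

```python
def canary_decision(baseline_errors, baseline_total, canary_errors, canary_total,
                    max_ratio=2.0, min_error_rate=0.005, min_requests=100):
    """Return "wait", "promote", or "rollback" for a canary deployment."""
    # Too little traffic on the canary: no decision yet.
    if canary_total < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Tolerate up to max_ratio times the baseline error rate, with an absolute
    # floor so that a near-zero baseline does not turn one error into a rollback.
    threshold = max(baseline_rate * max_ratio, min_error_rate)
    return "rollback" if canary_rate > threshold else "promote"
```

Running this check on a short interval is one way to keep the rollback decision inside a few minutes instead of waiting for a human to notice.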

5. FinOps & Capacity Planning

Standard: Monthly cloud cost reporting.

Senior signal: Workload-based optimization (right-sizing, spot instances, reserved instances). Capacity planning based on user growth forecasts.
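Capacity thinking can start from a very simple forecast: given the current user count and a growth rate, how many months remain before the current capacity is reached? This sketch assumes steady compound monthly growth, which is a strong simplification, and all parameter names are hypothetical.

```python
import math

def months_until_capacity(current_users, monthly_growth, capacity_users):
    """Return whole months until users reach capacity, or None if they never do.

    monthly_growth is a fraction, for example 0.10 for 10% growth per month.
    """
    if current_users <= 0:
        raise ValueError("current_users must be positive")
    if current_users >= capacity_users:
        return 0
    if monthly_growth <= 0:
        return None
    # Solve current * (1 + g) ** n >= capacity for the smallest integer n.
    months = math.log(capacity_users / current_users) / math.log(1 + monthly_growth)
    return math.ceil(months)
```

For example, 10,000 users growing 10% per month reach a 20,000-user capacity limit in the eighth month, which gives a concrete deadline for the next scaling or reservation decision.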

PART 2: Senior Interview Framework (The Architect Mindset)

In senior interviews, the right answer matters less than your thinking process.

1. Formula: Problem → Constraints → Trade-offs

Do not jump to a technology choice right away (for example, “Use Kubernetes”). Start with:

Problem: What is the real problem? (scalability, latency, cost)

Constraints: What constraints exist? (budget, team size, legacy)

Trade-offs: Why this solution over others? (for example, latency vs consistency, CAP theorem)

2. Strategy triangle: Reliability – Security – Cost

A senior must balance these three. Improving reliability increases cost. Adding security can reduce velocity. What do you prioritize in which context?

3. Quantify success (Success metrics)

Use numbers to answer:

DORA metrics: How did you improve deployment frequency or MTTR?

Efficiency: How much did you reduce infra cost or change failure rate?
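To answer with numbers you need to be able to derive them from deployment history. The sketch below computes the four DORA metrics from a list of deployment records. The Deployment record and its field names are hypothetical; real data would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it degrade service?
    restored_at: Optional[datetime] = None  # when service was restored

def dora_summary(deployments, window_days):
    """Compute the four DORA metrics over a window of window_days days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_recovery_time": (sum(recoveries, timedelta()) / len(recoveries)
                               if recoveries else None),
    }
```

Comparing two such summaries, one before and one after an improvement, gives the kind of before-and-after figure interviewers look for.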

PART 3: Reverse Interviewing: Reading Ownership Signals

Do not end up in a “ticket center”. Use these questions to evaluate the team you may join.

On-call: “How is the on-call rotation organized? Is there compensation or a follow-the-sun model? What are the after-hours expectations?”

SLO: “Do you measure success by concrete SLOs or generic uptime? What happens when the error budget is exhausted?”

Postmortem: “Share an example of the most recent major incident. How did the postmortem work and who owned the action items?”

Autonomy: “Is the platform team self-service or ticket-based?”

References

This playbook is based on principles from the Google SRE Books, a standard reference for operating large-scale systems.

1. Google SRE Books

Google publishes the full content online for free. This is where the concept of SRE (site reliability engineering) was defined.

Books homepage: https://sre.google/books/

Site Reliability Engineering (The original SRE book): https://sre.google/sre-book/table-of-contents/

The Site Reliability Workbook: https://sre.google/workbook/table-of-contents/

Building Secure and Reliable Systems: https://google.github.io/building-secure-and-reliable-systems/raw/toc.html

2. DORA Report (State of DevOps)

DORA (DevOps Research and Assessment) is a globally recognized research program on DevOps performance, now part of Google Cloud. DORA metrics are a widely used standard for measuring software delivery performance.

DORA homepage: https://dora.dev/

Latest report (2024 - 2025): https://cloud.google.com/resources/content/2025-dora-ai-assisted-software-development-report

DORA metrics guide (The four keys): https://dora.dev/guides/dora-metrics/

Deployment frequency
Lead time for changes
Change failure rate
Failed deployment recovery time (formerly time to restore service, often called MTTR)
