Introduction
Site Reliability Engineering (SRE) roles have exploded in demand. In 2026, companies aren't just looking for someone who can run kubectl — they want engineers who understand reliability as a feature, who can design service level objectives from business requirements, and who know when to slow down deployments versus when to push through.
This guide collects the 50 most common SRE interview questions asked at companies ranging from hypergrowth startups to FAANG. Every question comes with a concise answer grounded in real production practice — not textbook definitions. We cover SRE fundamentals, SLI/SLO/SLA trade-offs, error budget policy, incident management, observability with Prometheus and Grafana, Kubernetes reliability patterns, toil automation, and the cultural side of blameless postmortems.
Whether you're preparing for your first SRE role or leveling up to Senior, these questions reflect what hiring managers actually care about in 2026.
SRE Interview Questions 1–10: Fundamentals
1. What is SRE and how is it different from DevOps?
Question: What is Site Reliability Engineering, and how does it differ from DevOps?
Answer: SRE is a discipline that applies software engineering principles to operations work — treating reliability as a measurable, engineered feature rather than a reactive afterthought. While DevOps is a cultural philosophy focused on breaking down silos between dev and ops, SRE is a concrete implementation of that philosophy with specific practices: service level objectives (SLOs), error budgets, blameless postmortems, and a 50% cap on operational toil. The key distinction: DevOps says "you build it, you run it" at a cultural level; SRE says "here's the exact SLO, here's your error budget, and here's what happens when you burn through it." Many organizations run both — DevOps culture with an embedded SRE team.
2. Explain the concept of Error Budget
Question: What is an error budget and how is it used in practice?
Answer: An error budget is the amount of unreliability a service is allowed before users become unhappy, calculated as 1 - SLO. For example, if your SLO is 99.9% availability (three nines), your error budget is 0.1% of the measurement window: roughly 43 minutes of downtime in a 30-day month. The error budget serves as a release valve: when the budget is healthy, teams can ship faster and take more risks; when it's exhausted, feature launches freeze and the team focuses entirely on reliability improvements until the budget recovers. This replaces the endless arguments between product and SRE with a single objective number. Real teams tie error budget burn rate alerts directly to their PagerDuty: for example, burning 5% of the monthly budget in one hour triggers an immediate page.
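The arithmetic behind that answer can be sketched in a few lines of Python. This is a minimal illustration rather than a production alerting rule; the function names and the 30-day default window are assumptions made for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def budget_consumed(error_ratio: float, slo: float, hours: float,
                    window_days: int = 30) -> float:
    """Fraction of the whole window's budget burned in `hours` at this error ratio."""
    return burn_rate(error_ratio, slo) * hours / (window_days * 24)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))  # minutes of budget in 30 days
    print(round(budget_consumed(0.036, 0.999, hours=1), 3))
```

Under these assumptions, a sustained 3.6% error ratio against a 99.9% SLO is a burn rate of about 36, which uses up roughly 5% of a 30-day budget in one hour.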
3. What are SLI, SLO, and SLA?
Question: Define SLI, SLO, and SLA. How do they relate to each other?
Answer: An SLI (Service Level Indicator) is the actual measurement — latency at p99, availability percentage, error rate. An SLO (Service Level Objective) is the internal target you set: "p99 latency must be under 200ms over a 30-day rolling window." An SLA (Service Level Agreement) is the external contractual promise to customers, usually looser than the SLO to give buffer room. The relationship is: SLI → SLO → SLA. If your SLI shows p99 at 180ms and your SLO is 200ms, you're healthy. If your SLA promises 300ms, you have 100ms of buffer before legal consequences. Most teams keep SLOs tighter than SLAs precisely so they get paged before customers notice.
4. What is a Postmortem and why is it important?
Question: What is a blameless postmortem, and why does Google's SRE book emphasize it?
Answer: A postmortem is a written document analyzing an incident — what happened, the timeline, the impact, root causes, and action items to prevent recurrence. "Blameless" means the document focuses on systems and processes that failed, not individuals who made mistakes. Google's SRE book emphasizes blamelessness because blame kills learning: engineers hide mistakes, incidents go unreported, and the same failures repeat. A good postmortem asks "how did our systems allow this human error to cause an outage?" rather than "who typed the wrong command?" The output is always concrete action items — add a safety check, improve the runbook, add a canary step — assigned to specific people with deadlines.
5. Explain the concept of Toil in SRE
Question: What is toil, and why does Google recommend capping it at 50% of an SRE's time?
Answer: Toil is operational work that is manual, repetitive, automatable, tactical, without enduring value, and scales linearly with service growth. Think: manually resizing disks every week, SSH-ing into 50 servers to rotate logs, copy-pasting SQL queries for ad-hoc reports. Google's SRE book prescribes a 50% cap because beyond that threshold, toil consumes the engineering capacity needed to actually reduce future toil — it becomes a doom loop. The remaining 50% must go to engineering projects that eliminate toil sources: writing self-service tooling, automating runbooks, building auto-remediation. Teams track toil hours weekly and escalate when the 50% boundary is breached for two consecutive quarters.
6. What monitoring tools are essential for SRE?
Question: What monitoring and observability tools does a production SRE team need?
Answer: A modern SRE stack starts with Prometheus for metrics collection and alerting — it pulls time-series data from instrumented applications and infrastructure, evaluates PromQL alert rules, and fires alerts to Alertmanager. Grafana sits on top for dashboards and visualization. For logs, Loki (or Elasticsearch) with structured logging is essential — grep-able, indexed, correlated to trace IDs. For distributed tracing, OpenTelemetry has become the standard in 2026, exporting to Jaeger or Grafana Tempo. PagerDuty or Opsgenie handles on-call escalation. The critical insight: tools alone don't make you observable — you need instrumentation strategy, meaningful dashboards (not 50 random panels), and alerts tied to SLO burn rate, not just "CPU > 80%."
7. How do you handle incident management?
Question: Walk through your incident management process from alert to resolution.
Answer: A mature incident management flow starts with an alert firing — ideally based on SLO burn rate, not a static threshold. The on-call engineer acknowledges within 5 minutes, declares severity (SEV1 = user-facing outage, SEV2 = degraded, SEV3 = minor), and opens a dedicated incident channel (Slack + video bridge). An Incident Commander (IC) takes coordination duties; the Operations Lead (OL) investigates and mitigates. The IC runs a 15-minute timer: if no progress, escalate. Communication goes to a status page within 15 minutes of SEV1 declaration. The goal is always mitigation first — stop the bleeding — then root cause later. No one deploys fixes during an active incident without IC approval. Post-incident, a blameless postmortem is written within 48 hours.
8. What is the role of automation in SRE?
Question: How does automation fit into the SRE role, and what should be automated first?
Answer: Automation is the core tool SREs use to reduce toil and increase reliability — if a human is doing the same task twice, it should be scripted the third time. Prioritize automating: (1) deployment pipelines and rollbacks, because manual deploys are the #1 cause of self-inflicted incidents; (2) alert response runbooks — auto-remediation for known failure modes like "restart the service when it hits memory limit"; (3) capacity provisioning — cluster autoscaling, disk resizing, certificate renewal. The SRE principle is "automate this year's toil to free up time for next year's reliability engineering." But don't automate blindly: if a process changes monthly and takes 5 minutes, the automation ROI may be negative. Always measure toil hours before and after.
9. Explain Canary Deployments vs Blue-Green Deployments
Question: Compare canary deployments and blue-green deployments. When would you use each?
Answer: A blue-green deployment runs two identical environments (blue = current, green = new) and shifts 100% of traffic at once after validation — fast rollback (just point back to blue) but requires double the infrastructure. A canary deployment gradually shifts a small percentage of traffic (5% → 25% → 50% → 100%) to the new version, monitoring error rates and latency at each step — slower but catches problems before affecting all users. Use blue-green for stateless services where rollback speed is critical and infrastructure cost isn't a concern. Use canary when you need real-user validation with gradual blast radius reduction, especially for stateful or data-sensitive services. Many teams combine both: canary to validate, then blue-green cutover once the canary proves stable. Argo Rollouts and Flagger automate both patterns in Kubernetes.
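The canary progression can be sketched as a loop that checks a metric at each traffic step. This is a simplified model, not how Argo Rollouts or Flagger are implemented; `error_rate_at` is a hypothetical callback standing in for a real metrics query, and the 1% tolerance is an assumed threshold.

```python
from typing import Callable, Sequence, Tuple


def run_canary(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    baseline_error_rate: float,
    tolerance: float = 0.01,
) -> Tuple[str, int]:
    """Walk the traffic steps and stop at the first one that breaches the threshold.

    Returns ("promoted", final_step) if every step stayed healthy,
    or ("rolled_back", failing_step) otherwise.
    """
    if not steps:
        raise ValueError("at least one traffic step is required")
    for weight in steps:
        observed = error_rate_at(weight)  # hypothetical metrics lookup
        if observed > baseline_error_rate + tolerance:
            return ("rolled_back", weight)
    return ("promoted", steps[-1])


if __name__ == "__main__":
    # Simulated release that starts failing once it receives half the traffic.
    flaky = lambda weight: 0.001 if weight < 50 else 0.08
    print(run_canary([5, 25, 50, 100], flaky, baseline_error_rate=0.001))
```

In this model a blue-green cutover is the degenerate case with a single step of 100, which is why its blast radius is larger but its decision logic is simpler.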
10. What is Chaos Engineering?
Question: What is chaos engineering, and how is it practiced in production?
Answer: Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions — think of it as "fire drills for your infrastructure." The practice follows a scientific method: form a hypothesis about steady-state behavior, inject a controlled failure (kill a pod, drop 50% of network traffic, introduce 200ms latency), and observe whether the system degrades gracefully or collapses. Tools like Chaos Mesh, LitmusChaos, and Gremlin automate these experiments. The key principle: start small — never run your first chaos experiment on Black Friday. Begin with a single pod kill in staging, measure blast radius, build runbooks for what you find, then progressively move to production during low-traffic windows. Netflix's Chaos Monkey pioneered this, but modern practice goes far beyond random instance termination.
SRE Interview Questions 11–20: Operations & Culture
11. What is a Runbook and how do you create one?
Question: What is a runbook in SRE, and what are the essential elements of a good runbook?
Answer: A runbook is a documented, step-by-step procedure for responding to a specific alert or operational task: essentially a checklist that reduces cognitive load during incidents. A good runbook contains: (1) the alert that triggers it, (2) severity classification, (3) a 3-step diagnostic section ("check this log first, then this metric, then run this command"), (4) explicit mitigation steps with exact commands, and (5) an escalation path if mitigation fails. Runbooks must be treated as living documents: every postmortem that reveals a missing step should produce a runbook update. The ultimate goal is making runbooks executable: an SRE should be able to run a command such as "runbook disk-full" and have the system auto-diagnose, then prompt for human approval before remediation.
12. Explain Infrastructure as Code (IaC) in SRE context
Question: How does Infrastructure as Code fit into the SRE discipline?
Answer: Infrastructure as Code (IaC) is the practice of defining servers, networks, and configurations in version-controlled declarative files rather than clicking through a cloud console. For SRE, IaC eliminates configuration drift — every server, Kubernetes node pool, and firewall rule is defined in Terraform or Pulumi, reviewed in pull requests, and applied through CI/CD. This means disaster recovery becomes reproducible: you can rebuild an entire environment from code, not memory. IaC also enables pre-production validation — terraform plan shows exactly what changes before they're applied, preventing the "I thought that security group was open" class of incidents. Tools like Terraform, Ansible, and Crossplane are the standard in 2026, often combined with GitOps for Kubernetes (ArgoCD syncing from a Git repo).
13. What is Observability vs Monitoring?
Question: What's the difference between monitoring and observability? Does it matter?
Answer: Monitoring tells you "something is wrong" — it's based on known failure modes and predefined dashboards and alerts. Observability lets you ask "why is it wrong?" about unknown failure modes by exploring high-cardinality telemetry data (metrics, logs, traces) without having predicted the problem in advance. The distinction matters because complex distributed systems fail in novel ways that no one anticipated. Monitoring says "CPU is high on this host." Observability — using tools like Honeycomb or Grafana with OpenTelemetry traces — lets you trace a specific slow user request across 17 microservices and discover that a background job is locking a database table that your API reads from. Modern SRE teams invest in observability (structured logging, distributed tracing, high-cardinality metrics) alongside traditional monitoring (alerting, dashboards).
14. How do you define and measure Reliability?
Question: How do you define and measure the reliability of a service?
Answer: Reliability is defined from the user's perspective: is the service working correctly when users need it? It's measured through Service Level Indicators (SLIs) — typically availability (proportion of successful requests), latency (how fast, usually at p95 or p99), error rate, and throughput. The key insight from Google SRE: don't measure everything. Pick 2-4 SLIs that map directly to user happiness. For an e-commerce checkout service, that's availability of POST /checkout and p99 latency under 2 seconds. For a metrics pipeline, it's data freshness and correctness. Once SLIs are defined, you set SLOs (targets) and track error budgets. Reliability is not about 100% uptime — it's about staying within the SLO threshold your users actually need, because the cost of the last 0.01% is astronomical.
15. Explain the concept of Blameless Culture
Question: What is a blameless culture and how does it improve incident response?
Answer: A blameless culture is an organizational norm where incidents are investigated to find systemic failures rather than individual scapegoats. When an engineer fat-fingers a production database, a blaming culture asks "who did this?" and may fire them; a blameless culture asks "why did our tooling allow a single command to drop a production table without confirmation?" Blamelessness improves incident response because engineers report problems faster (no fear), share more details during postmortems, and surface near-misses that would otherwise stay hidden. It doesn't mean zero accountability — it means accountability for process improvement, not punishment for honest mistakes. Google's research shows teams with strong blameless postmortem cultures have lower MTTR because they fix root causes faster.
16. What is Capacity Planning?
Question: How do you approach capacity planning for a growing service?
Answer: Capacity planning is the process of forecasting resource needs so that a service can handle projected traffic without over-provisioning and wasting money. The SRE approach is data-driven: (1) establish baseline resource consumption per unit of traffic (e.g., "1000 RPS requires 4 CPU cores and 8 GB RAM"), (2) project traffic growth from historical trends and business forecasts (Black Friday, product launch), (3) determine lead time for provisioning new capacity (cloud = minutes, physical hardware = months), and (4) set thresholds that trigger capacity additions well before limits are hit. A good rule of thumb: trigger provisioning at 60-70% utilization, not 90%. In Kubernetes, this translates to HPA (Horizontal Pod Autoscaler) for pods and Cluster Autoscaler for nodes, plus event-driven scaling with tools like KEDA for queue- or event-based workloads.
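The baseline-plus-headroom calculation can be sketched as follows. This is a rough illustration that assumes CPU scales linearly with traffic; the function names and the 65% default target are assumptions chosen to sit inside the 60-70% rule of thumb.

```python
import math


def cores_needed(peak_rps: float, rps_per_core: float,
                 target_utilization: float = 0.65) -> int:
    """Cores required to serve `peak_rps` while staying at the target utilization."""
    if rps_per_core <= 0 or not 0.0 < target_utilization <= 1.0:
        raise ValueError("rps_per_core must be > 0 and utilization in (0, 1]")
    return math.ceil(peak_rps / rps_per_core / target_utilization)


def should_provision(current_utilization: float, threshold: float = 0.65) -> bool:
    """True once utilization reaches the point where new capacity should be ordered."""
    return current_utilization >= threshold


if __name__ == "__main__":
    # Baseline from the example above: 1000 RPS on 4 cores, so 250 RPS per core.
    print(cores_needed(peak_rps=3000, rps_per_core=250))
    print(should_provision(0.72))
```

With that baseline, a projected peak of 3000 RPS needs 12 cores at full utilization but 19 cores to stay at 65%, which is the headroom that absorbs provisioning lead time.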
17. How do you handle On-Call rotations?
Question: How should an SRE team structure its on-call rotation? What makes it sustainable?
Answer: On-call rotations should follow a "follow the sun" model where possible: three shifts across time zones so no one carries the pager through their night. Each rotation stint should be at least one week long (shorter creates too much context-switching) and staffed by at least two people (primary + secondary) to prevent single points of failure. The critical sustainability metric is page load per shift: if an engineer is paged more than twice per shift on average, the rotation is understaffed or the service is too unreliable. Google's SRE book recommends spending at most 25% of time on-call, and each incident that generates a page should produce either an automated remediation or a permanent fix. No one should carry a pager for a problem that could be solved by a cron job. Tools like PagerDuty and Opsgenie handle scheduling, escalation policies, and override management.
18. What is MTTR and MTBF?
Question: Explain MTTR and MTBF. Which matters more for SRE?
Answer: MTTR (Mean Time to Resolve/Recover) measures how long it takes to fix an incident from detection to resolution. MTBF (Mean Time Between Failures) measures the average time between incidents. SRE cares far more about MTTR than MTBF because in complex distributed systems, failures are inevitable; the question is how fast you recover. A team with 10 incidents per month but a 3-minute MTTR (about 30 minutes of impact) is far more reliable than a team with 1 incident per month and a 4-hour MTTR (240 minutes of impact). The modern practice is to split MTTD (Mean Time to Detect) from MTTR (Mean Time to Resolve): how long did it take to notice the problem versus how long to fix it? Alerting on SLO burn rate shortens MTTD dramatically compared to threshold-based alerts that may fire only after users are already impacted.
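The comparison between the two teams is simple arithmetic, sketched here in Python. It assumes every incident is a full outage for its whole duration and uses a 30-day month; the function names are illustrative.

```python
def monthly_downtime_minutes(incidents_per_month: int, mttr_minutes: float) -> float:
    """Total downtime if every incident lasts exactly the mean time to resolve."""
    return incidents_per_month * mttr_minutes


def availability(downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the window during which the service was up."""
    return 1.0 - downtime_minutes / (window_days * 24 * 60)


def mtbf_hours(incidents_per_month: int, window_days: int = 30) -> float:
    """Average spacing between incidents, ignoring the repair time itself."""
    return window_days * 24 / incidents_per_month


if __name__ == "__main__":
    for name, incidents, mttr in [("frequent/fast", 10, 3), ("rare/slow", 1, 240)]:
        down = monthly_downtime_minutes(incidents, mttr)
        print(name, down, round(availability(down) * 100, 2), mtbf_hours(incidents))
```

The team with ten short incidents has a much worse MTBF (72 hours versus 720) yet ends the month at about 99.93% availability, compared with about 99.44% for the team with one long incident.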
19. Explain the concept of Service Level Indicators
Question: How do you choose the right SLIs for a service?
Answer: SLIs should reflect what users actually care about, not what's easy to measure. For a user-facing web service, the standard "golden signals" apply: latency (how long requests take, measured at p95/p99), traffic (requests per second), errors (rate of failed requests), and saturation (how "full" the service is — queue depth, CPU throttling, memory pressure). For a data pipeline, SLIs might be freshness (age of latest processed data), correctness (percentage of records matching schema), and throughput. The selection process: (1) define the critical user journeys, (2) instrument each step with metrics, (3) pick the 3-5 metrics that best reflect user experience, and (4) validate by asking "if this SLI breaks, will users notice?" Avoid vanity SLIs like "server uptime" when users care about "checkout success rate."
20. What are the key differences between SRE and System Administrator?
Question: How does an SRE differ from a traditional System Administrator?
Answer: A System Administrator (sysadmin) manages servers reactively — they configure, patch, troubleshoot, and manually fix things when they break. The role is execution-oriented: someone says "we need a new PostgreSQL instance," and the sysadmin provisions it. An SRE is engineering-oriented: they write software to provision databases automatically, build self-service platforms, and design the system so the database never needs to be provisioned manually again. The sysadmin mindset is "I fixed it." The SRE mindset is "I fixed it, then I automated the fix so no human ever has to do this again." SREs also operate with SLOs and error budgets — they don't just keep servers running, they measure and negotiate acceptable reliability levels with product teams. The SRE role typically requires stronger programming skills and systems design thinking beyond OS-level administration.
SRE Interview Questions 21–30: Architecture & Patterns
21. What is the difference between Latency and Throughput?
Question: Explain the difference between latency and throughput, and why both matter for SRE.
Answer: Latency is the time it takes to process a single request — measured in milliseconds at percentiles (p50, p95, p99). Throughput is the rate of requests a system can handle — measured in requests per second (RPS) or transactions per second (TPS). They interact in non-obvious ways: a system under high throughput may maintain good p50 latency but p99 latency can spike dramatically due to queueing effects. SRE teams must monitor both because users experience latency (how fast did my checkout load?) while the business depends on throughput (how many checkouts per second during flash sale?). The critical SRE insight: you can't just scale horizontally for throughput and expect latency to stay flat — adding instances often increases tail latency due to coordination overhead. Always measure latency at percentiles, not averages; average latency hides the worst user experiences.
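A small example shows why averages hide tail latency. The sample data is invented for illustration, and the percentile uses the simple nearest-rank definition; monitoring systems typically estimate percentiles from histograms instead.

```python
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p / 100 * n) in integer math
    return ordered[rank - 1]


# Invented sample: 98 fast requests and 2 very slow ones.
latencies_ms = [100.0] * 98 + [5000.0] * 2

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
    print("p99:", percentile(latencies_ms, 99))
```

For this sample the mean is 198 ms and both p50 and p95 are 100 ms, while p99 is 5000 ms: only the high percentile reveals the two users who waited five seconds.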
22. Explain the RED Method (Rate, Errors, Duration)
Question: What is the RED method and how does it differ from the USE method?
Answer: The RED method is a monitoring framework for services, focusing on three metrics per endpoint: Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution). It's inspired by Google's "Four Golden Signals" and is designed for request-driven services such as HTTP APIs, gRPC endpoints, and message queue consumers. The USE method (Utilization, Saturation, Errors) applies to resources: CPUs, disks, network interfaces. RED tells you "this API endpoint has a 5% error rate and a p99 latency of 3 seconds"; USE tells you "this server's disk is 95% utilized." Modern SRE teams use both: RED for service-level observability via Prometheus metrics exported from application code, and USE for infrastructure health via node_exporter metrics. Tools like Grafana can combine RED dashboards (one row per service) with USE panels (one row per node) for a complete observability picture.
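To make the three RED numbers concrete, here is a toy calculation over an in-memory list of requests. It is only a sketch: the `Request` type is hypothetical, errors are assumed to be HTTP 5xx responses, and a real setup would usually derive these values from Prometheus counters and histograms rather than raw records.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the request


def red_summary(requests: Sequence[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, and a Duration summary for one endpoint over one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    failed = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    durations = [r.duration_ms for r in requests]
    return {
        "rate_rps": len(requests) / window_seconds,
        "errors_rps": failed / window_seconds,
        "duration_p50_ms": median(durations) if durations else 0.0,
    }


if __name__ == "__main__":
    sample = [Request(200, 50.0)] * 95 + [Request(500, 900.0)] * 5
    print(red_summary(sample, window_seconds=10.0))
```

The median is used here only to keep the example short; in practice the Duration signal is usually tracked at p95 or p99.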
23. What is a Service Mesh and how does it help SRE?
Question: What is a service mesh, and what SRE problems does it solve?
Answer: A service mesh is a dedicated infrastructure layer (typically implemented via sidecar proxies like Envoy, managed by Istio or Linkerd) that handles service-to-service communication transparently. For SRE, it solves several critical problems: (1) automatic mTLS encryption between services without modifying application code, (2) fine-grained traffic control for canary deployments and A/B testing via traffic splitting rules, (3) built-in observability: every request gets metrics and access logs without instrumentation, and distributed traces as long as applications propagate the trace headers, (4) circuit breaking and retry policies enforced at the proxy layer, and (5) fault injection for chaos experiments. The trade-off is operational complexity: a service mesh adds latency (typically 1-5ms per hop) and requires its own SRE attention. Most teams adopt it when they hit 10+ microservices and manual traffic management becomes untenable.
24. How do you implement Health Checks?
Question: How do you design effective health checks for production services?
Answer: Health checks should be layered: (1) Liveness probes — "is the process running?" — lightweight checks that Kubernetes uses to restart crashed containers. Keep these dumb and fast (HTTP 200 from /healthz, under 100ms). (2) Readiness probes — "can this instance serve traffic?" — check database connectivity, cache availability, or any upstream dependency. Kubernetes removes failing pods from service endpoints. (3) Startup probes — for slow-initializing services (JVM warmup, connection pool fill), give them extra time before liveness kicks in. The critical design principle: a failing health check on one instance should never cascade. Never make a health check depend on another service's health check — use a local connection pool status instead. Also, health checks should be cheap: don't run a full integration test on /healthz. A readiness probe that takes 5 seconds will thrash your load balancer during traffic spikes.
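The split between liveness and readiness can be sketched with the standard library. This is a minimal illustration, not a recommended server setup: the /readyz path and the `dependencies_ready` flag are assumptions, and the flag stands in for a cheap local check such as connection pool status.

```python
from http.server import BaseHTTPRequestHandler


def probe_status(path: str, dependencies_ready: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer. Nothing else is checked.
        return 200
    if path == "/readyz":
        # Readiness: only accept traffic when local dependencies look usable.
        return 200 if dependencies_ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    # Hypothetical flag that the application would update from local state,
    # for example from its own connection pool, never from another service's probe.
    dependencies_ready = True

    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, self.dependencies_ready))
        self.end_headers()
```

The handler can be served with `http.server.HTTPServer(("", 8080), ProbeHandler).serve_forever()`. Keeping liveness independent of dependencies means a database outage removes pods from rotation without also triggering restarts.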
25. What is Circuit Breaker pattern?
Question: Explain the circuit breaker pattern and how it prevents cascading failures.
Answer: A circuit breaker wraps calls to external services and monitors failure rates. It has three states: closed (normal operation, requests pass through), open (tripped: requests immediately fail without calling the downstream service), and half-open (a test request probes whether the downstream has recovered). When error rates exceed a threshold (e.g., 50% of the last 20 requests), the breaker opens and stops all traffic, preventing a slow downstream from consuming all your thread pools and cascading the outage. After a cooldown period, it transitions to half-open and allows one probe request. If it succeeds, the breaker closes; if it fails, the breaker reopens and the cooldown starts again. On Kubernetes, Istio's DestinationRule provides comparable behavior at the proxy layer through connection pool limits and outlier detection. Without circuit breakers, a single slow payment service can exhaust connection pools across your entire platform: every service that calls payments starts failing.
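The three states can be sketched as a small Python class. This is a single-threaded illustration under stated assumptions, not a production library: the class and parameter names are invented, the breaker trips when the failure ratio over a full window reaches the threshold, and the clock is injectable so the cooldown can be tested.

```python
import time
from collections import deque
from typing import Any, Callable, Deque


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being made."""


class CircuitBreaker:
    def __init__(self, window: int = 20, failure_ratio: float = 0.5,
                 cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window
        self.failure_ratio = failure_ratio
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.results: Deque[bool] = deque(maxlen=window)  # True means failure
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn: Callable[[], Any]) -> Any:
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast while downstream recovers")
            self.state = "half-open"  # cooldown elapsed: allow one probe call
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        if self.state == "half-open":
            self.state = "closed"  # probe succeeded: resume normal traffic
            self.results.clear()
        else:
            self.results.append(False)

    def _on_failure(self) -> None:
        if self.state == "half-open":
            self._trip()  # probe failed: reopen and restart the cooldown
            return
        self.results.append(True)
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.failure_ratio:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()
```

A caller wraps each downstream request, for example `breaker.call(lambda: client.charge(order))` where `client.charge` is a hypothetical payment call, and treats `CircuitOpenError` as the signal to serve a fallback.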
26. Explain Feature Flags and their SRE benefits
Question: What are feature flags, and how do they improve reliability?
Answer: Feature flags are conditional toggles in code that enable or disable functionality at runtime without deploying new code. From an SRE perspective, they're a safety mechanism: (1) if a new feature causes errors, disable it instantly via a config change — no rollback needed, (2) enable features gradually for 1% → 10% → 100% of users, monitoring error budgets at each step, (3) run dark launches where backend code executes but results aren't shown to users, validating performance impact. The SRE requirement: feature flag changes must be decoupled from deployments, ideally controlled via a dedicated service like LaunchDarkly or an open-source option like Unleash. The anti-pattern is flags controlled via environment variables that require a redeploy — that defeats the purpose. Every flag must have an owner and an expiry date, or you end up with "flag debt" where half your codebase is unreachable, untested branches.
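A gradual percentage rollout is commonly built on stable hashing, sketched below. This is an illustration of the idea rather than how LaunchDarkly or Unleash work internally; the function name and the flag name used in the example are assumptions.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a bucket from 0 to 99 and compare to the rollout.

    The same user always lands in the same bucket for a given flag, so raising
    the percentage only adds users and never flips existing ones back off.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (1, 10, 100):
        enabled = sum(is_enabled("new-checkout", u, percent) for u in users)
        print(percent, enabled)
```

Because the rollout percentage is an input rather than a constant, it can live in a dynamic config service and be set to 0 as an instant kill switch without a deploy.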
27. What is GitOps?
Question: Explain GitOps and why SRE teams adopt it.
Answer: GitOps is an operational model where Git repositories serve as the single source of truth for declarative infrastructure and application configuration. An agent (typically ArgoCD or Flux) continuously reconciles the desired state in Git with the actual state in the cluster: if someone manually changes a deployment's replica count in Kubernetes, the agent detects the drift and, when automated sync is enabled, reverts it within minutes. For SRE, GitOps provides: (1) audit trails for every infrastructure change (Git history = change log), (2) pull-request-based change approval replacing direct kubectl access, (3) disaster recovery, since a new cluster can be bootstrapped by pointing ArgoCD at the same Git repo, and (4) drift detection without custom scripts. The SRE caveat: GitOps handles desired state reconciliation well, but it doesn't replace incident response. You still need runbooks for "the database is on fire" scenarios where waiting for a Git commit cycle is unacceptable.
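The reconciliation idea reduces to diffing desired state against actual state. The sketch below models both as plain dictionaries keyed by resource name; it illustrates the concept only and does not reflect how ArgoCD or Flux are implemented.

```python
from typing import Any, Dict, List, Tuple


def plan_reconcile(desired: Dict[str, Any], actual: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return the actions that would make `actual` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))  # drift: live state differs from Git
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))  # exists in the cluster but not in Git
    return actions


if __name__ == "__main__":
    in_git = {"web": {"replicas": 3}, "api": {"replicas": 2}}
    in_cluster = {"web": {"replicas": 5}, "legacy": {"replicas": 1}}
    print(plan_reconcile(in_git, in_cluster))
```

An empty plan means the cluster matches Git; a non-empty plan is exactly the drift report that otherwise requires custom scripts.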
28. How do you handle Configuration Management?
Question: How should configuration be managed in a modern SRE environment?
Answer: Configuration should follow a strict hierarchy: (1) code (in the repo), (2) environment-specific config (also in the repo, via config files like Helm values.yaml per environment), and (3) secrets (in a secrets manager — Vault, AWS Secrets Manager, or Kubernetes External Secrets Operator — never in Git). The key principle: every config change must be auditable and reversible. Tools like Helm for Kubernetes package templating, Kustomize for overlay-based config patching, and Terraform for infrastructure config are standard in 2026. The SRE anti-pattern is "config via SSH" — someone edits nginx.conf on a running server and nobody knows. Config changes should go through the same CI/CD pipeline as code changes, with automated validation (linters, dry-run diffs). For runtime config that needs instant changes (feature flags, rate limits), use a dedicated dynamic config service, not static config files.
29. What is the role of Containers in SRE?
Question: How do containers and container orchestration contribute to reliability?
Answer: Containers provide immutable, reproducible deployment artifacts: the exact same image runs in staging and production, eliminating "works on my machine" incidents. Kubernetes adds: (1) self-healing, where crashed containers are automatically restarted and unhealthy pods are replaced, (2) declarative desired state, where you specify "I want 5 replicas" and Kubernetes maintains that count regardless of node failures, (3) rolling updates gated by health checks, where a rollout stalls instead of replacing healthy pods when new pods fail their readiness probes (automatic rollback needs a progressive delivery tool such as Argo Rollouts or Flagger), and (4) resource isolation via cgroups, where with memory limits set a memory-leaking service is OOM-killed itself instead of starving its neighbor. For SRE, containers shift the failure domain from "the server" to "the pod": a single misbehaving process is isolated and can be killed without affecting co-located services. The trade-off: Kubernetes itself requires SRE attention; it's reliability infrastructure, not magic. A poorly configured liveness probe can cause restart loops that are worse than the original problem.
30. Explain the concept of Service Dependencies
Question: How should SRE teams manage and monitor service dependencies?
Answer: Service dependencies should be explicitly mapped and monitored, because you can't protect what you don't know exists. Every service should declare its critical dependencies (databases, upstream APIs, message queues, caches) and their expected SLOs. The SRE practice: (1) maintain a service dependency graph (automated via distributed tracing, not manual spreadsheets), (2) set SLOs that account for dependency SLOs: if your authentication service has 99.9% availability and you depend on it for every request, your maximum possible SLO is 99.9% minus your own errors, (3) implement graceful degradation when dependencies fail, such as showing cached data instead of an error page or queueing writes for later replay, and (4) never call a dependency synchronously on a critical path without a fallback. Chaos engineering exercises should specifically target dependency failures: what happens when Redis disappears? Does your service degrade or crash?
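The ceiling that hard dependencies put on an SLO can be estimated by multiplying availabilities. The sketch below assumes failures are independent and that every request needs every listed dependency; the function names and the 0.9995 and 0.9 figures are illustrative.

```python
from math import prod
from typing import Sequence


def serial_availability(own: float, dependencies: Sequence[float]) -> float:
    """Best-case availability when every request needs all hard dependencies."""
    return own * prod(dependencies)


def with_fallback(dependency: float, fallback_success: float) -> float:
    """Effective availability of a dependency when a fallback covers some failures.

    `fallback_success` is the fraction of dependency failures the fallback
    (cached data, queued writes) can absorb.
    """
    return 1.0 - (1.0 - dependency) * (1.0 - fallback_success)


if __name__ == "__main__":
    # A service that is 99.95% available on its own, behind a 99.9% auth service.
    print(round(serial_availability(0.9995, [0.999]), 6))
    # The same auth dependency if cached sessions absorb 90% of its failures.
    print(round(with_fallback(0.999, 0.9), 6))
```

Under these assumptions the combined ceiling is about 99.85%, below either component on its own, while a fallback that absorbs 90% of failures raises the dependency's effective availability to about 99.99%.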
SRE Interview Questions 31–40: Reliability Engineering in Practice
31. What is the difference between Blue-Green and Canary Deployments?
Question: Compare blue-green and canary deployment strategies. When should you use each?
Answer: A blue-green deployment maintains two complete environments — blue (live) and green (staging the new version) — and switches 100% of traffic at once via a load balancer flip. Rollback is near-instant (point back to blue), but you pay for double infrastructure. A canary deployment gradually shifts a small percentage of real-user traffic (5%, then 25%, then 50%, then 100%) to the new version while comparing error rates, latency, and behavioral metrics against the baseline. Canary catches problems before they hit everyone; blue-green prioritizes rollback speed. Use blue-green for stateless services where infrastructure cost is low and fast rollback matters most. Use canary for high-risk changes, user-facing features, or stateful services where you need real-user validation with a controlled blast radius. In Kubernetes, Argo Rollouts supports both natively.
32. Explain the concept of Self-Healing Systems
Question: What are self-healing systems, and how far should SRE teams push automation?
Answer: A self-healing system automatically detects failures, diagnoses the cause, and takes corrective action without human intervention. The SRE approach goes through maturity levels: Level 1 — detect and alert (Prometheus fires a page). Level 2 — detect and auto-mitigate with known fixes (liveness probe restarts a deadlocked process). Level 3 — detect, diagnose, and remediate novel failures (an operator that detects a split-brain database cluster and executes a safe rejoin). The SRE principle: automate healing for failure modes you've seen twice. The first time a disk fills up, a human investigates and writes a runbook. The second time, that runbook becomes a script. The third time, the script becomes an auto-remediation that doesn't page anyone. The boundary is safety: never auto-heal failures that could cause data corruption — those always need human judgment.
33. What is Chaos Engineering?
Question: Define chaos engineering and describe how to introduce it safely into an organization.
Answer: Chaos engineering is the practice of deliberately injecting failures into production systems to validate that they degrade gracefully rather than collapse catastrophically. The methodology follows the scientific method: (1) define steady-state behavior (normal error rate, latency, throughput), (2) form a hypothesis ("if we kill 50% of Redis replicas, the application will keep serving from remaining replicas with < 5% error rate increase"), (3) inject the failure in a controlled manner, (4) measure the blast radius, and (5) abort immediately if steady-state deviates beyond expected bounds. Start in staging, then production during low-traffic windows on non-critical services first. Tools: Chaos Mesh (Kubernetes-native), LitmusChaos, Gremlin. The cultural prerequisite is a blameless postmortem culture — if every chaos experiment that reveals a weakness triggers finger-pointing, you'll never get past the first experiment.
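The experiment loop above (inject, measure, abort on deviation) might look like the following sketch. It does not use the API of Chaos Mesh, LitmusChaos, or Gremlin; `inject`, `revert`, and the sample stream are hypothetical callables you would supply, and the 5% bound is read here as an absolute increase in error rate, which is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SteadyState:
    error_rate: float      # fraction of failed requests, e.g. 0.01 for 1%
    p99_latency_ms: float


def run_experiment(baseline: SteadyState,
                   inject: Callable[[], None],
                   revert: Callable[[], None],
                   samples: Iterable[SteadyState],
                   max_error_increase: float = 0.05) -> str:
    """Inject a failure, watch measurements, abort if the bound is exceeded."""
    inject()
    try:
        for observed in samples:
            if observed.error_rate - baseline.error_rate > max_error_increase:
                return "aborted"  # steady state deviated beyond the hypothesis
        return "hypothesis_held"
    finally:
        revert()  # always remove the injected failure, even on abort


log = []
result = run_experiment(
    baseline=SteadyState(error_rate=0.01, p99_latency_ms=120.0),
    inject=lambda: log.append("kill 50% of Redis replicas"),
    revert=lambda: log.append("restore replicas"),
    samples=[SteadyState(0.02, 150.0), SteadyState(0.09, 400.0)],
)
print(result, log)
```

Putting the revert step in `finally` is the important design choice: the blast radius is bounded even when the measuring code itself fails.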
34. How do you measure Service Maturity?
Question: How do you assess and track the operational maturity of a service?
Answer: Service maturity is typically measured across dimensions using a maturity model: (1) observability: does the service export structured logs, metrics, and traces? (2) deployment: is CI/CD fully automated with canary/blue-green and one-click rollback? (3) reliability: are SLOs defined, measured, and tied to error budgets? (4) incident response: are runbooks documented and tested, and is there a defined on-call rotation? (5) capacity planning: is scaling automated and capacity forecast 3+ months ahead? (6) security: are dependencies patched, vulnerabilities scanned in CI, and secrets managed properly? Score each dimension from 1 (ad-hoc/manual) to 5 (fully automated/optimized). The SRE team's goal is to move services from "someone SSHes in to fix things" (Level 1) to "the platform handles it" (Level 5). Track maturity over time and tie it to production readiness reviews for new services.
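A scorecard for the six dimensions can be kept as plain data. The sketch below is one possible shape; treating the lowest-scoring dimension as the service's overall level is an assumption made for the example, not something the model above prescribes.

```python
DIMENSIONS = (
    "observability", "deployment", "reliability",
    "incident_response", "capacity_planning", "security",
)


def maturity_report(scores: dict) -> dict:
    """Validate 1-5 scores for every dimension and summarize them."""
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing score for {dim}")
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be scored 1-5, got {scores[dim]}")
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    return {
        "average": sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS),
        "weakest": weakest,
        # Assumption: a service is only as mature as its weakest dimension.
        "overall_level": scores[weakest],
    }


report = maturity_report({
    "observability": 4, "deployment": 3, "reliability": 5,
    "incident_response": 2, "capacity_planning": 3, "security": 4,
})
print(report)
# {'average': 3.5, 'weakest': 'incident_response', 'overall_level': 2}
```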
35. What is the role of Automation in SRE?
Question: How should SRE teams prioritize what to automate?
Answer: Automation is SRE's primary lever for scaling reliability — the fewer manual steps in production operations, the fewer opportunities for human error. Prioritize automation by toil impact: (1) automate repetitive alerts with known fixes first (runbook automation), (2) automate deployment and rollback pipelines with integrated health checks, (3) automate capacity management (auto-scaling, disk expansion, certificate renewal), (4) automate onboarding — a new service should be production-ready via a template or CLI, not a 20-step wiki page. The litmus test: if an SRE performs a task more than twice per quarter, it's an automation candidate. Track time saved and reinvest it into reliability engineering, not more automation. The anti-pattern is automating an irrelevance — don't spend a week scripting a task that takes 5 minutes per year. Always measure toil hours before and after automation.
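The "don't automate an irrelevance" test is simple arithmetic: compare the engineering time spent automating against the toil hours saved per year. A minimal sketch, with a hypothetical function name and made-up example figures:

```python
def payback_years(minutes_per_run: float, runs_per_year: float,
                  hours_to_automate: float) -> float:
    """Years until the automation effort is repaid by saved toil."""
    hours_saved_per_year = minutes_per_run * runs_per_year / 60
    if hours_saved_per_year == 0:
        return float("inf")
    return hours_to_automate / hours_saved_per_year


# A 5-minute task done once a year, scripted in one 40-hour week:
print(round(payback_years(5, 1, 40)))   # prints 480 (years), not worth it

# A 30-minute task done weekly, scripted in 13 hours:
print(payback_years(30, 52, 13))        # prints 0.5 (years)
```

This ignores second-order benefits such as fewer human errors, so treat the result as a floor on the value of automating, not a verdict.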
36. Explain the difference between Availability and Durability
Question: What is the difference between availability and durability? Why does it matter?
Answer: Availability measures whether a service is accessible and responding to requests right now: "can I read my file?" Durability measures whether data is safe from loss over time: "will my file still exist next year?" A system can be highly available but not durable (an in-memory cache that loses all data on restart), or highly durable but not available (offline tape backups that take days to restore). For SRE, the distinction drives very different architectural decisions: availability requires redundancy, load balancing, and fast failover. Durability requires replication, snapshots, off-site backups, and corruption detection. Object stores show the split clearly: S3 Standard is designed for 99.999999999% (eleven nines) durability but only 99.99% availability. Databases require both: replicas for availability, backups and WAL archiving for durability. Never confuse your backup strategy (durability) with your failover strategy (availability).
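To make the two kinds of "nines" concrete, the sketch below converts an availability target into allowed downtime and a durability figure into expected object loss. The function names are hypothetical, and the durability number is treated as an annual per-object probability, which is an assumption.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    return (1 - availability_pct / 100) * days * 24 * 60


def expected_objects_lost_per_year(durability_nines: int, objects: int) -> float:
    """Expected annual object loss, e.g. durability_nines=11 for eleven nines."""
    return objects * 10.0 ** -durability_nines


# 99.9% availability over a 30-day month:
print(round(allowed_downtime_minutes(99.9, days=30), 1))   # prints 43.2

# Eleven nines of durability across 10 million objects:
loss = expected_objects_lost_per_year(11, 10_000_000)
print(round(1 / loss))   # prints 10000 (years per lost object, on average)
```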
37. What is a Postmortem and how do you run one?
Question: What are the essential elements of running an effective postmortem process?
Answer: A postmortem is a structured analysis document and process following every significant incident. The essential elements: (1) timeline — minute-by-minute from first alert to resolution, including communication timestamps; (2) impact — users affected, revenue lost, data at risk; (3) root causes — always expressed as systemic failures ("the deployment pipeline allowed a config change without canary validation"), not human errors; (4) what went well — preserve practices that helped; (5) what went wrong — gaps in monitoring, runbooks, tooling; (6) action items — concrete, assigned, with deadlines. The process: draft within 24 hours while details are fresh, review with the incident team (blameless), publish to the entire engineering org, track action items in the team's backlog. The acid test: if your postmortems repeatedly produce action items that don't get done, your incident management process is broken.
38. How do you handle Database Failover?
Question: How do you design and test database failover for production reliability?
Answer: Database failover strategy depends on the database type: for PostgreSQL with streaming replication, run a primary + one or more hot standbys with synchronous or asynchronous replication. Failover involves promoting a standby using pg_ctl promote or a tool like Patroni with etcd for automated leader election. For MySQL, Group Replication or InnoDB Cluster provides native HA. The critical SRE practice is testing failover regularly — not just during incidents. Run planned failovers monthly in production (during low-traffic windows) and measure: how long does promotion take? Does the application reconnect? Are read replicas still consistent? Use connection string features (JDBC failover hosts, libpq target_session_attrs=read-write) so applications survive failover without code changes. The biggest failover risk is split-brain: the old primary comes back and both instances accept writes. Always configure STONITH (Shoot The Other Node In The Head) — fence the old primary before promoting the standby.
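As an illustration of the connection-string approach, the sketch below builds a multi-host libpq URI with `target_session_attrs=read-write`, so the client tries each host in turn and keeps only a connection that accepts writes. The host names, database, user, and the `failover_dsn` helper are hypothetical; `connect_timeout` is added here as an assumption so a dead host is skipped quickly.

```python
from urllib.parse import quote, urlencode


def failover_dsn(hosts, dbname, user, port=5432):
    """Build a libpq connection URI listing every candidate host."""
    host_list = ",".join(f"{host}:{port}" for host in hosts)
    params = urlencode({
        "target_session_attrs": "read-write",  # only accept a writable server
        "connect_timeout": "3",                # seconds per host (assumed value)
    })
    return f"postgresql://{quote(user)}@{host_list}/{quote(dbname)}?{params}"


print(failover_dsn(["db1.internal", "db2.internal"], "orders", "app"))
# postgresql://app@db1.internal:5432,db2.internal:5432/orders?target_session_attrs=read-write&connect_timeout=3
```

The application passes this string to its PostgreSQL driver unchanged; after a promotion, new connections land on whichever host is currently the primary.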
39. What is Log Tailing?
Question: Explain log tailing and how SREs use it during incidents.
Answer: Log tailing is the practice of streaming the most recent entries from log files in real-time — typically via tail -f on a server or kubectl logs -f for a Kubernetes pod. During incidents, tailing is the fastest way to see what's happening right now: error spikes, stack traces, request patterns. However, relying on manual tailing is an SRE anti-pattern for mature systems. Instead, ship logs to a centralized platform (Loki, Elasticsearch, or CloudWatch Logs) where you can query across all instances simultaneously with structured filters: {app="checkout"} |= "OutOfMemoryError". The modern SRE workflow is: click the alert → open the pre-built Grafana dashboard → if the dashboard doesn't explain the issue, query centralized logs with a timestamp window around the alert time. Manual tail -f should be reserved for services where centralized logging isn't available or when you need sub-second real-time visibility.
40. Explain Service Level Management
Question: What is service level management and how do you implement it across an organization?
Answer: Service level management (SLM) is the organizational practice of defining, negotiating, measuring, and reporting on SLOs across all production services. Implementation: (1) identify critical user journeys per service, (2) define SLIs that measure those journeys from the user's perspective, (3) set initial SLOs based on historical data (not aspirational numbers — if your p99 is currently 500ms, don't set SLO to 100ms), (4) negotiate with product teams on the trade-off between reliability and feature velocity, (5) build dashboards that show SLO compliance and error budget burn in real-time, (6) tie error budget status to release gates — healthy budget = auto-deploy, burned budget = freeze. SLM fails when SLOs are set by management without engineering input, or when SLOs are aspirational numbers that no one actually monitors. The measure of success: every product manager knows their service's error budget status without asking an SRE.
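Steps (5) and (6) come down to one calculation: how much of the error budget is left, and what the release gate should do about it. A minimal sketch follows; the function names are hypothetical, and the simple "any budget left means deploy" gate is an assumption (real policies often add burn-rate thresholds).

```python
def error_budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    if total_events == 0:
        return 1.0
    allowed_bad = (1 - slo) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def release_gate(remaining: float) -> str:
    """Healthy budget allows auto-deploy; a burned budget freezes releases."""
    return "auto-deploy" if remaining > 0 else "freeze"


# 99.9% SLO, 1,000,000 requests, 500 failures: half the budget is left.
remaining = error_budget_remaining(0.999, 999_500, 1_000_000)
print(round(remaining, 2), release_gate(remaining))   # prints 0.5 auto-deploy
```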
SRE Interview Questions 41–45: Scaling & Advanced Topics
41. What is the difference between Vertical and Horizontal Scaling?
Question: Compare vertical and horizontal scaling. When should you choose one over the other?
Answer: Vertical scaling (scaling up) means adding more resources — CPU, RAM, disk — to a single machine or instance. Horizontal scaling (scaling out) means adding more instances of a service behind a load balancer. Vertical scaling is simpler — no distributed systems complexity, no data partitioning — but hits a hard ceiling (the largest available instance size) and creates a single point of failure. Horizontal scaling provides theoretically unlimited capacity and high availability (no single instance is critical), but introduces coordination complexity, eventual consistency, and the need for distributed tracing. The SRE rule of thumb: start vertical while traffic is predictable and a single instance handles the load comfortably. When you hit ~70% of the maximum vertical capacity or need high availability, shift to horizontal. Databases traditionally scale vertically (read replicas are the exception); stateless services almost always scale horizontally. Kubernetes HorizontalPodAutoscaler automates horizontal scaling based on CPU/memory or custom metrics.
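The core of the HorizontalPodAutoscaler decision is a ratio: desired replicas equal the current replica count times current metric over target metric, rounded up. The sketch below reproduces that documented formula with a tolerance band; the function name, the replica bounds, and the example numbers are illustrative, and the real controller layers stabilization windows and readiness handling on top.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """ceil(current_replicas * current/target), clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_metric=90, target_metric=60))  # prints 6
print(desired_replicas(4, current_metric=63, target_metric=60))  # prints 4
print(desired_replicas(4, current_metric=20, target_metric=60))  # prints 2
```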
42. Explain the concept of Dead Letter Queue
Question: What is a dead letter queue, and how should SRE teams monitor it?
Answer: A dead letter queue (DLQ) is a holding area for messages or events that a consumer repeatedly fails to process: after N retries, the message is moved to the DLQ instead of blocking the main queue or being silently discarded. For SRE, a DLQ is both a safety mechanism and an observability signal. Every message in the DLQ represents a failure case that needs investigation: malformed payload, schema mismatch, downstream service unavailability, or a genuine bug. Monitor DLQ depth as a critical metric, because a growing DLQ means something is silently broken. Set up alerts when DLQ depth exceeds a threshold, and build a replay mechanism so operators can inspect, fix, and re-process DLQ messages after the root cause is resolved. AWS SQS has native DLQ support through a redrive policy; Kafka has no broker-level DLQ, so consumers (or Kafka Connect sink connectors) route failed records to a separate error topic; RabbitMQ uses dead-letter exchanges.
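The retry-then-park behavior can be shown with an in-memory queue. This is a toy model, not the SQS, Kafka, or RabbitMQ API: the `consume` function, the `max_attempts` parameter, and the shape of the DLQ records are all hypothetical.

```python
from collections import deque


def consume(messages, handler, max_attempts=3):
    """Process messages; park any that fail max_attempts times in a DLQ."""
    main = deque((msg, 0) for msg in messages)
    processed, dlq = [], []
    while main:
        msg, attempts = main.popleft()
        try:
            handler(msg)
            processed.append(msg)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the error with the message so operators can triage it.
                dlq.append({"message": msg, "attempts": attempts,
                            "error": repr(exc)})
            else:
                main.append((msg, attempts))  # requeue behind other messages
    return processed, dlq


def handler(msg):
    if msg == "bad":
        raise ValueError("malformed payload")


processed, dlq = consume(["a", "bad", "b"], handler)
print(processed, len(dlq))   # prints ['a', 'b'] 1
```

`len(dlq)` is the "DLQ depth" metric from the answer: the poison message did not block "b", but it is still visible and replayable.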
43. What is Rate Limiting and how do you implement it?
Question: Explain rate limiting strategies and where they should be applied in a system.
Answer: Rate limiting restricts how many requests a client can make in a time window, protecting services from overload, abuse, or noisy-neighbor problems. Common algorithms: (1) token bucket — tokens refill at a steady rate, each request consumes one; bursts are allowed up to bucket capacity, (2) sliding window log — tracks request timestamps and counts within a rolling window for precise control, (3) fixed window — simple counter resetting at intervals but suffers edge-triggered bursts at boundaries. Implement rate limiting at multiple layers: API gateway (Kong, Envoy) for per-client limits, application-level for per-user or per-endpoint limits, and infrastructure-level for DoS protection. Return HTTP 429 with a Retry-After header so well-behaved clients back off. For SRE, rate limiting preserves error budgets — a buggy mobile client making 10,000 req/s shouldn't burn your entire error budget in 60 seconds.
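A token bucket is small enough to sketch in full. The class below is a single-process illustration with the clock passed in explicitly so the behavior is deterministic; the class name, the `allow` method, and the header helper are hypothetical, and a production limiter would need shared state (for example in Redis) and locking.

```python
import math


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last = now

    def allow(self, now: float):
        """Return (allowed, seconds_until_next_token) for a request at `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0.0
        return False, (1 - self.tokens) / self.rate


def retry_after_header(seconds: float) -> str:
    """Retry-After takes whole seconds, so round up."""
    return str(math.ceil(seconds))


bucket = TokenBucket(rate_per_sec=1, capacity=2)
print(bucket.allow(0.0))   # prints (True, 0.0)
print(bucket.allow(0.0))   # prints (True, 0.0)
print(bucket.allow(0.0))   # prints (False, 1.0), respond with HTTP 429
print(bucket.allow(1.0))   # prints (True, 0.0)
```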
44. How do you approach Capacity Planning?
Question: Describe a systematic approach to capacity planning for production services.
Answer: Capacity planning answers the question: "will we have enough resources to meet our SLOs N months from now?" The SRE approach: (1) baseline — measure current peak utilization of every critical resource (CPU, memory, disk, network, database connections) over at least 4 weeks, (2) forecast — project traffic growth using business projections (marketing campaigns, seasonal patterns, user growth rate), not just linear extrapolation, (3) model — calculate headroom: if you run at 60% utilization during peak and traffic grows 20% per quarter, when do you hit 80%? (4) provision — order infrastructure with lead time factored in (cloud: minutes; on-prem: months), (5) validate — load test with projected peak × 1.5 to confirm the model. The SRE anti-pattern is reactive capacity planning — waiting for pager alerts about resource exhaustion. Track headroom as a dashboard metric and set alerts when headroom drops below 30 days.
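The headroom question in step (3) has a closed-form answer if you assume compound growth and utilization that scales linearly with traffic (both are assumptions; real systems often degrade non-linearly near saturation). A minimal sketch with a hypothetical function name:

```python
import math


def quarters_until(current_util: float, threshold: float,
                   growth_per_quarter: float) -> float:
    """Quarters until utilization reaches the threshold under compound growth."""
    if current_util >= threshold:
        return 0.0
    if growth_per_quarter <= 0:
        return float("inf")
    return math.log(threshold / current_util) / math.log(1 + growth_per_quarter)


# 60% peak utilization today, 20% traffic growth per quarter, 80% ceiling:
q = quarters_until(0.60, 0.80, 0.20)
print(round(q, 2), "quarters, about", round(q * 3, 1), "months")
# prints 1.58 quarters, about 4.7 months
```

With on-prem lead times measured in months, an answer under five months means the order should already be placed.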
45. What is the role of AI/ML in SRE?
Question: How are AI and machine learning being applied to SRE in 2026?
Answer: AI/ML in SRE (sometimes called AIOps) augments, rather than replaces, human operators in three areas: (1) anomaly detection: ML models trained on historical metric patterns detect subtle deviations that static thresholds miss (e.g., latency increasing 3% every hour for 8 hours, still below threshold but trending wrong), (2) alert correlation: during large incidents, ML groups hundreds of related alerts into a single incident ticket, reducing noise, (3) root cause suggestion: models analyze the topology graph + recent changes + alert timing to surface likely causes ("87% probability this latency spike correlates with the deployment 12 minutes ago"). The SRE caveat: ML generates suggestions, not decisions. A model suggesting "restart the database" during a split-brain scenario could cause data loss. Treat AI outputs as inputs to human judgment, not automated remediations. Tools: Datadog Watchdog, Google Cloud's AIOps offerings, and open-source projects like ThirdEye (originally developed at LinkedIn).
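The slow-drift example in (1) is worth seeing in numbers. The sketch below is a plain rule-based stand-in, not an ML model and not how any of the listed tools work: it flags a metric that keeps rising step after step even though it never crosses a static threshold. The function name and both parameters are hypothetical.

```python
def sustained_drift(samples, min_step_growth=0.02, min_consecutive=6):
    """True if the metric grew by at least min_step_growth for
    min_consecutive consecutive steps."""
    run = 0
    for prev, cur in zip(samples, samples[1:]):
        if prev > 0 and (cur - prev) / prev >= min_step_growth:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # the upward streak was broken
    return False


# Latency starts at 100 ms and grows 3% per hour for 8 hours.
latency = [100 * 1.03 ** hour for hour in range(9)]
static_threshold_ms = 150

print(max(latency) > static_threshold_ms)   # prints False, no threshold alert
print(sustained_drift(latency))             # prints True, the trend is caught
```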
SRE Interview Questions 46–50: Leadership & Strategy
46. How do SREs manage on-call burnout?
Question: What strategies prevent burnout in on-call rotations?
Answer: On-call burnout is the #1 retention risk for SRE teams. The foundational strategy is keeping incident volume below the threshold where sleep deprivation compounds: target fewer than 2 pages per on-call shift outside business hours. When alert volume exceeds that, the team must prioritize: (1) tune alert thresholds aggressively — if it doesn't require immediate human action, it's a ticket, not a page; (2) automate runbooks for known failure patterns; (3) fix the underlying reliability gaps causing repeat alerts. Operational practices: never schedule on-call back-to-back with project work (follow the +1 model — when someone finishes on-call, they have a full day to decompress); rotate shifts weekly, not bi-weekly; compensate on-call with time off or pay; and maintain a "follow the sun" model across time zones so no one owns nights permanently. The cultural rule: if someone says "I'm tired," the team treats it as a production incident — coverage failures from exhaustion are reliability failures.
47. What is the role of SRE in cloud migration?
Question: How should SRE teams contribute to cloud migration projects?
Answer: SRE brings the production-first mindset to cloud migration, preventing the common pattern where applications are "lifted and shifted" without operational readiness. The SRE role in migration: (1) define reliability targets for the migrated service before migration starts, keeping the same SLOs and the same error budgets; (2) instrument cloud-native observability (Amazon CloudWatch, Google Cloud Monitoring [formerly Stackdriver], or open-source equivalents) so the new environment is not a monitoring blind spot; (3) design the migration as a gradual cutover: run both environments in parallel with traffic splitting, not a big-bang switch; (4) define rollback criteria and practice the rollback before go-live; (5) after migration, run a retrospective comparing pre- and post-migration reliability metrics. The SRE anti-pattern: treating cloud migration as purely an infrastructure project. If you migrate infrastructure but don't migrate operational practices (runbooks, on-call, monitoring, SLOs), you've only changed where the outage happens, not whether it happens.
48. Explain the concept of "hope is not a strategy" in SRE.
Question: What does "hope is not a strategy" mean in the context of site reliability engineering?
Answer: This phrase encapsulates the SRE discipline of replacing assumptions with measurements, and wishful thinking with deliberate verification. Examples: hoping that the database can handle Black Friday traffic is not a strategy; load testing at 2× projected peak is. Hoping the backup works is not a strategy; restoring from backup in a quarterly fire drill is. Hoping the new hire knows the deployment process is not a strategy; a runbook that anyone can execute is. The phrase becomes a team culture: every time someone says "I think it should work" or "it worked last time," the SRE response is "prove it." Operational practices that embody this: chaos engineering (prove failure modes are handled), SLO-based alerts (prove the user experience is within bounds), capacity forecasting (prove resources will exist 3 months from now). Interviewers ask this question to gauge whether a candidate has an engineering mindset or an operator mindset: SREs engineer reliability; they don't merely administer it.
49. How do SREs work with security teams?
Question: Describe the intersection of SRE and security and how they collaborate effectively.
Answer: SRE and security share a common goal — system integrity — but operate at different velocities: security wants careful review gates; SRE wants fast, safe deployments. Effective collaboration: (1) embed security scanning in CI/CD so vulnerabilities block deployments at the pipeline stage, not through manual security review meetings; (2) treat security incidents with the same incident management process as reliability incidents — same on-call escalation, same postmortem template, same blameless culture; (3) share SLOs — security can define "time to patch critical CVE" as an SLO, and SRE's automation platform enforces it; (4) run joint chaos experiments for security failure modes — what happens when IAM permissions are accidentally revoked? what happens when a secret expires?; (5) use shared tooling — HashiCorp Vault for secret rotation, Trivy for container scanning, Falco for runtime threat detection. The friction point to manage: security's default answer is "no" to maintain control; SRE's default is "yes, safely" to maintain velocity. Build shared error budgets that balance both.
50. What's the future of SRE in 2026 and beyond?
Question: Where is the SRE field headed, and what skills should engineers invest in?
Answer: Three trends define SRE's near-future: (1) platform engineering maturation — SRE teams are evolving from service-embedded firefighters into platform builders who create self-service reliability for all engineering teams (Backstage, internal developer portals, golden-path templates); (2) AI-augmented operations — LLMs will handle incident triage drafts, anomaly correlation, and runbook generation, but the SRE's role shifts to prompt engineering, model validation, and keeping AI outputs from causing outages; (3) FinOps convergence — reliability and cost are the same conversation now; an auto-scaling policy that's too aggressive wastes money, one that's too conservative causes outages. SREs who understand cloud economics will be invaluable. Skills to invest in: Go or Rust for operator development, eBPF for deep observability, Kubernetes controller patterns, and — most importantly — systems thinking. The tools change, but the ability to reason about complex distributed systems under uncertainty remains SRE's core value. The SRE who can explain why a system is reliable, not just configure a dashboard to show it's reliable, will always be in demand.
Conclusion
Preparing for an SRE interview isn't about memorizing answers — it's about internalizing the principles that make systems reliable. Throughout these 50 questions, you've seen the recurring themes: measure everything with SLIs and SLOs, automate the manual with runbooks and self-healing, embrace failure through chaos engineering and blameless postmortems, and always keep the user's experience at the center of every reliability decision.
The best SRE candidates don't just recite definitions. They connect concepts: they explain how an error budget connects to a deployment pipeline, how a dead letter queue feeds into monitoring, how capacity planning prevents on-call burnout. When you walk into that interview, bring stories from production — the incident that taught you a lesson, the automation that saved the team 20 hours a week, the SLO that changed how your organization thought about reliability.
If you found this guide useful, continue your learning with our deep-dive articles: Error Budgets: Stop Wasting Your SRE Team's Time, SLI vs SLO vs SLA: The Real SRE Guide, and Incident Management Runbook Template 2026. Reliability is a practice, not a destination — and every great SRE started exactly where you are now. Good luck.