Eugene Chernenko

AI, Engineering Management, Distributed Systems, SRE, Productivity

Blue/Green vs Canary Deployments

2026-05-01

Blue/green and canary are the two foundational patterns for deploying new versions of a service without taking it down. They're often discussed interchangeably, but they have meaningfully different shapes – and those differences cascade into how you validate, roll back, handle data, and budget infrastructure.

The Core Difference

The core difference is the shape of the traffic shift.

Blue/green is binary – stand up a parallel "green" environment, validate it, then flip the load balancer so 100% of traffic moves at once.

Canary is gradual – route a small slice of real production traffic (1% → 5% → 25% → 100%) to the new version while the old keeps serving the rest.

That single difference cascades into almost everything else.

What Cascades From That Difference

Blast radius vs. rollback speed

Blue/green has the fastest rollback (flip back) but the worst blast radius if something slips past pre-cutover validation – every user hits the bad version simultaneously. Canary inverts this: rollback is slightly messier (drain canary, route back), but only the canary slice is ever exposed, so SLO-gated promotion bounds the damage.

Validation signal

This is the underrated one. Blue/green's pre-cutover validation is necessarily synthetic – smoke tests, load tests, maybe shadow traffic. You don't get real production signal until you flip, and by then it's everyone. Canary gives you real production signal – real cardinality, real traffic mix, real downstream interactions – at controlled exposure. For changes whose failure modes only emerge under production conditions (perf refactors, ML models, algorithm swaps, infra changes), canary is strictly more informative.

Infrastructure cost

Blue/green needs ~2x capacity during the cutover window. Canary needs baseline + canary slice. For a 500-node fleet that math gets noticeable.
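To put rough numbers on that, here is a small sketch of the peak-capacity arithmetic. It assumes the canary slice is added on top of the baseline fleet and sized in proportion to its traffic share; the function names are illustrative, not taken from any tool.

```python
def peak_nodes_blue_green(baseline: int) -> int:
    """Blue and green both run at full size during the cutover window."""
    return 2 * baseline


def peak_nodes_canary(baseline: int, canary_percent: int) -> int:
    """Baseline fleet plus a canary slice, rounded up to whole nodes."""
    canary_nodes = (baseline * canary_percent + 99) // 100  # integer ceiling
    return baseline + canary_nodes


print(peak_nodes_blue_green(500))   # 1000
print(peak_nodes_canary(500, 5))    # 525
```

Under these assumptions, a 500-node fleet peaks at 1000 nodes for blue/green versus 525 for a 5% canary. If the old version is scaled down as the canary grows, later stages stay close to the baseline.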

Soak

In blue/green, soak is "let green sit then flip, then watch carefully with rollback prepped." In canary, soak is structurally built into each stage – hold at 5% for N minutes against your SLOs (Service Level Objectives) before promoting. The automation story is cleaner for canary because each gate is a quantitative check on real metrics.

State and data – the part that bites both

Both deployment styles require forward/backward compatibility of data, schemas, message formats, and caches: canary because both versions are simultaneously live, blue/green because you might cut back over after the new version has written data. Expand/contract migrations are non-negotiable for either. The difference is that canary forces you to confront this on every deploy; blue/green lets teams cheat for a while, until a rollback blows up spectacularly.

When to pick which

Canary fits anything behavior-affecting or high-risk where production signal matters more than atomic cutover. Blue/green fits when you need atomic semantics (some compliance/audit cases), when canary routing is hard (stateful long-lived connections, certain stream protocols), or when "two versions live at once" itself creates correctness problems your contracts can't paper over (e.g., a money-flow change where mixed routing produces ambiguous accounting).

The pattern most mature orgs converge on is hybrid: blue/green at the infrastructure / cluster / region level (big platform moves, K8s (Kubernetes) upgrades, AZ (Availability Zone) shifts), canary at the service/app level for everyday deploys, and feature flags layered on top to decouple deploy from release – so the binary is rolled out via canary, but the new behavior is gated and ramped independently. That last decoupling is what lets you actually deploy on Fridays.

A Ladder of Examples

To make the canary pattern concrete, here are seven levels of growing complexity.

Level 1 — The textbook canary

Fleet of 100 pods, all v1. Deploy 5 pods of v2 alongside, route 5% of traffic there. Watch error rate and latency for 10 min. Healthy → bump to 25 v2 / 75 v1, then 50/50, then 100% v2. Two versions live simultaneously, gradual shift. Classic canary.
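The stage ladder above can be written down as a small plan. This is a sketch only: `Stage` and `plan_stages` are hypothetical names, and it assumes the first stage adds canary pods alongside a full v1 fleet while later stages trade v1 pods for v2 pods one-for-one.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    v2_percent: int
    v1_pods: int
    v2_pods: int


def plan_stages(fleet_size: int, percents: List[int]) -> List[Stage]:
    """Turn a list of traffic percentages into per-stage pod counts."""
    stages = []
    for i, pct in enumerate(percents):
        # Round up so the canary is never under-provisioned for its share.
        v2_pods = (fleet_size * pct + 99) // 100
        # First stage: v2 runs alongside the untouched v1 fleet.
        v1_pods = fleet_size if i == 0 else fleet_size - v2_pods
        stages.append(Stage(pct, v1_pods, v2_pods))
    return stages


for stage in plan_stages(100, [5, 25, 50, 100]):
    print(stage)
```

For the 100-pod fleet this yields 100/5, 75/25, 50/50, and 0/100 (v1/v2), matching the steps described above.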

Level 2 — Blue/green for contrast

Same fleet. Stand up a separate 100-pod environment running v2 (the "green" stack). Smoke-test it via an internal hostname. When ready, flip the load balancer: 0% → 100% to green in one step. Blue stays warm for 30 min as a rollback target, then gets torn down. No gradual mixing – atomic switch.
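A minimal sketch of the flip, modelling the load balancer as a single pointer to the active environment. `BlueGreenRouter`, the hostnames, and the `smoke_test` callback are all hypothetical; a real load balancer exposes this through its own API or config.

```python
from typing import Callable


class BlueGreenRouter:
    """Sends 100% of traffic to whichever environment is active."""

    def __init__(self, blue: str, green: str) -> None:
        self.backends = {"blue": blue, "green": green}
        self.active = "blue"

    def _idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def route(self) -> str:
        return self.backends[self.active]

    def cut_over(self, smoke_test: Callable[[str], bool]) -> bool:
        """Flip to the idle environment only if its smoke test passes."""
        if not smoke_test(self.backends[self._idle()]):
            return False
        self.active = self._idle()
        return True

    def roll_back(self) -> None:
        """Flip back; only safe while the previous environment is still warm."""
        self.active = self._idle()


router = BlueGreenRouter("blue.internal.example", "green.internal.example")
router.cut_over(lambda host: True)   # all traffic now goes to green
print(router.route())
```

There is no intermediate state: `route()` returns either the blue or the green backend, which is the atomic property the pattern is chosen for.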

Level 3 — Canary with automated SLO gates

Same canary as Level 1, but Argo Rollouts (or Spinnaker, Flagger) runs it. Each stage has a quantitative gate: "p99 latency must stay under 250ms and 5xx rate under 0.1% for 10 min before promoting." Fails the gate → automatic rollback, no human in the loop. This is where canary stops being "watch dashboards manually" and becomes a real progressive delivery pipeline.
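The gate logic itself is simple enough to sketch in plain Python. This is not the Argo Rollouts, Spinnaker, or Flagger API; `Sample`, `gate_passes`, `run_canary`, and the `metrics_for_stage` callback are hypothetical names, and the thresholds are the ones quoted above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    p99_latency_ms: float
    error_rate_5xx: float  # fraction of requests, so 0.001 means 0.1%


def gate_passes(samples: Sequence[Sample],
                max_p99_ms: float = 250.0,
                max_5xx: float = 0.001) -> bool:
    """Every sample in the hold window must be under both thresholds."""
    return bool(samples) and all(
        s.p99_latency_ms < max_p99_ms and s.error_rate_5xx < max_5xx
        for s in samples
    )


def run_canary(stages: List[int],
               metrics_for_stage: Callable[[int], Sequence[Sample]]
               ) -> Tuple[str, int]:
    """Promote stage by stage; stop and roll back at the first failed gate."""
    for pct in stages:
        # Assumption: metrics_for_stage blocks for the hold window (e.g. 10 min)
        # and returns the samples collected while traffic sat at this stage.
        if not gate_passes(metrics_for_stage(pct)):
            return ("rolled_back", pct)
    return ("promoted", stages[-1])
```

An empty sample window is treated as a failed gate here, on the assumption that missing metrics should never count as evidence of health.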

Level 4 — Canary on a stateful service

Now v2 changes the database schema. You can't just route 5% of traffic to v2 – v1 and v2 are reading/writing the same DB. So you do expand/contract: add the new column, then deploy v1.5, which writes both old and new columns but reads old. Backfill existing rows, then canary v2, which reads new and keeps writing both so that v1.5 pods and any rollback still see correct data. Then a cleanup release stops writing and drops the old column. Three deploys for one logical change. This is where people learn the hard way that canary requires forward/backward compatible data contracts.
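Here is a sketch of the expand/contract sequence using an in-memory SQLite database. The `users` table and the `name` and `display_name` columns are made up for illustration, and the sketch assumes old rows are backfilled before v2 starts reading the new column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")  # written by v1

# Expand: add the new column as nullable so v1 writers keep working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def write_v1_5(db: sqlite3.Connection, user_id: int, value: str) -> None:
    """v1.5 dual-writes the old and new columns."""
    db.execute(
        "INSERT INTO users (id, name, display_name) VALUES (?, ?, ?)",
        (user_id, value, value),
    )


def read_v1_5(db: sqlite3.Connection, user_id: int) -> str:
    """v1.5 still reads the old column."""
    row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


def read_v2(db: sqlite3.Connection, user_id: int) -> str:
    """v2 reads the new column."""
    row = db.execute(
        "SELECT display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


write_v1_5(conn, 2, "Grace")

# Backfill rows that predate the dual-write, before any v2 pod reads them.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

print(read_v1_5(conn, 2), read_v2(conn, 1), read_v2(conn, 2))
# Contract (a later release): stop writing `name`, then drop the column.
```

At every point in this sequence, a row written by one version is readable by the other, which is what makes both the canary and a rollback safe.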

Level 5 — Canary decoupled from release via feature flags

v2 ships the new checkout algorithm, but it's behind a flag defaulted off. Canary the binary to 100% with the flag off – pure infra rollout, zero behavior change, low risk. Then separately ramp the flag 1% → 10% → 50% → 100% over days, gated on business metrics (conversion, revenue per session), not just SRE (Site Reliability Engineering) metrics. Two independent ramps: deploy and release. This is what most mature orgs actually do.
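The release ramp can be sketched as a percentage flag evaluated per user. This is an illustration, not any flag vendor's SDK: `bucket`, `flag_enabled`, `checkout`, and the `new_checkout` flag name are hypothetical, and hashing the flag name together with the user ID is one common way to keep assignments stable as the percentage grows.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Stable bucket in [0, 100) derived from the flag name and user ID."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the ramp when their bucket falls below the percentage."""
    return bucket(flag_name, user_id) < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    # The v2 binary is already deployed everywhere; only the flag decides
    # which code path a given user gets.
    if flag_enabled("new_checkout", user_id, rollout_percent):
        return "v2-algorithm"
    return "v1-algorithm"


print(checkout("user-42", 0))     # v1-algorithm: flag is off for everyone
print(checkout("user-42", 100))   # v2-algorithm: flag is fully ramped
```

Because a user's bucket never changes, anyone enabled at 10% stays enabled at 50%, so the ramp only ever adds users and business metrics can be compared cleanly between cohorts.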

Level 6 — Hybrid, multi-region, the real world

You're rolling v2 across 5 AWS regions. Within each region, canary 5% → 25% → 100% with SLO gates. Across regions, blue/green-style: fully complete us-west-2, soak 24h, then us-east-1, then EU, etc. Feature flag controls behavior independently. If region 3 detects regression, automated rollback in that region only; other regions hold. This is what a large-scale rollout actually looks like – canary at the service layer, blue/green at the region layer, flags at the feature layer, all composed.
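The composition can be sketched as two nested loops: regions in sequence, canary stages within each region. `roll_out` and the `region_is_healthy` callback are hypothetical, the region names beyond us-west-2 and us-east-1 are assumed for illustration, and the 24h soak between regions is reduced to a comment.

```python
from typing import Callable, Dict, List


def roll_out(regions: List[str],
             stages: List[int],
             region_is_healthy: Callable[[str, int], bool]) -> Dict[str, str]:
    """Roll regions one at a time; a failed gate rolls back that region only."""
    status = {region: "pending" for region in regions}
    for region in regions:
        for pct in stages:
            if not region_is_healthy(region, pct):
                status[region] = "rolled_back"
                # Earlier regions stay on v2, later regions hold on v1.
                return status
        status[region] = "done"
        # A real pipeline would soak here (e.g. 24h) before the next region.
    return status


regions = ["us-west-2", "us-east-1", "eu-west-1", "eu-central-1", "ap-southeast-1"]
result = roll_out(
    regions,
    [5, 25, 100],
    lambda region, pct: not (region == "eu-west-1" and pct == 25),
)
print(result)
```

In this run the third region fails its 25% gate, so the first two regions finish, the third is rolled back, and the last two are never touched.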

Level 7 — The weird ones

Takeaway

The mental model at the bottom of the ladder is simple: canary is two versions running simultaneously with traffic gradually shifting from one to the other; blue/green is an atomic flip between two parallel environments. The upper rungs are all variations of three questions: what's the selection function, what's the gate, and what's the unit of rollout?

Most production-grade rollouts you'll see in mature orgs aren't pure canary or pure blue/green – they compose both, plus feature flags, across different layers of the stack.