Blue/Green vs Canary Deployments
Blue/green and canary are the two foundational patterns for deploying new versions of a service without taking it down. They're often discussed interchangeably, but they have meaningfully different shapes – and those differences cascade into how you validate, roll back, handle data, and budget infrastructure.
The Core Difference
The core difference is the shape of the traffic shift.
Blue/green is binary – stand up a parallel "green" environment, validate it, then flip the load balancer so 100% of traffic moves at once.
Canary is gradual – route a small slice of real production traffic (1% → 5% → 25% → 100%) to the new version while the old keeps serving the rest.
That single difference cascades into almost everything else.
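As a rough sketch of the two shapes, each strategy can be modeled as a schedule of traffic fractions sent to the new version. The function names below are illustrative assumptions, not part of any deployment tool; the canary stages are the ones quoted above.

```python
from typing import List

def blue_green_schedule() -> List[float]:
    """Fraction of traffic on the new version at each step: one atomic flip."""
    return [0.0, 1.0]

def canary_schedule(stages: List[float]) -> List[float]:
    """Fraction of traffic on the new version at each step: a gradual ramp."""
    if not stages or stages[-1] != 1.0:
        raise ValueError("stages must end at 1.0 (full promotion)")
    if any(b <= a for a, b in zip([0.0] + stages, stages)):
        raise ValueError("stages must be strictly increasing and above 0")
    return [0.0] + stages

def first_exposure(schedule: List[float]) -> float:
    """Share of users exposed the first time the new version takes real traffic."""
    return schedule[1]

print(first_exposure(blue_green_schedule()))                      # 1.0
print(first_exposure(canary_schedule([0.01, 0.05, 0.25, 1.0])))   # 0.01
```

The only thing that differs between the two is the list of steps, which is why the difference shows up in everything that depends on how many users see a new version before anyone has checked it.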
What Cascades From That Difference
Blast radius vs. rollback speed
Blue/green has the fastest rollback (flip back) but the worst blast radius if something slips past pre-cutover validation – every user hits the bad version simultaneously. Canary inverts this: rollback is slightly messier (drain canary, route back), but only the canary slice is ever exposed, so SLO-gated promotion bounds the damage.
Validation signal
This is the underrated one. Blue/green's pre-cutover validation is necessarily synthetic – smoke tests, load tests, maybe shadow traffic. You don't get real production signal until you flip, and by then it's everyone. Canary gives you real production signal – real cardinality, real traffic mix, real downstream interactions – at controlled exposure. For changes whose failure modes only emerge under production conditions (perf refactors, ML models, algorithm swaps, infra changes), canary is strictly more informative.
Infrastructure cost
Blue/green needs ~2x capacity during the cutover window. Canary needs baseline + canary slice. For a 500-node fleet, that is roughly 500 extra nodes for blue/green versus about 25 for a 5% canary slice.
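The capacity arithmetic can be sketched in a few lines. This assumes the canary slice is provisioned on top of an untouched baseline; setups that replace old pods in place as the canary grows would stay closer to baseline. The function names are hypothetical.

```python
def blue_green_peak(baseline_nodes: int) -> int:
    """Both environments are fully provisioned during the cutover window."""
    return 2 * baseline_nodes

def canary_peak(baseline_nodes: int, canary_percent: int) -> int:
    """Baseline plus a canary slice sized as a percentage of baseline."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Ceiling division, so a small fleet still gets at least one canary node.
    canary_nodes = -(-baseline_nodes * canary_percent // 100)
    return baseline_nodes + canary_nodes

print(blue_green_peak(500))   # 1000
print(canary_peak(500, 5))    # 525
```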
Soak
In blue/green, soak is "let green sit then flip, then watch carefully with rollback prepped." In canary, soak is structurally built into each stage – hold at 5% for N minutes against your SLOs (Service Level Objectives) before promoting. The automation story is cleaner for canary because each gate is a quantitative check on real metrics.
State and data – the part that bites both
Both deployment styles require forward/backward compatibility of data, schemas, message formats, and caches. Canary because both versions are simultaneously live; blue/green because you might cut back over after the new version has written data. Expand/contract migrations are non-negotiable for either. The difference is canary forces you to confront this on every deploy; blue/green lets teams cheat for a while and then explode spectacularly during a rollback.
When to pick which
Canary fits anything behavior-affecting or high-risk where production signal matters more than atomic cutover. Blue/green fits when you need atomic semantics (some compliance/audit cases), when canary routing is hard (stateful long-lived connections, certain stream protocols), or when "two versions live at once" itself creates correctness problems your contracts can't paper over (e.g., a money-flow change where mixed routing produces ambiguous accounting).
The pattern most mature orgs converge on is hybrid: blue/green at the infrastructure / cluster / region level (big platform moves, K8s (Kubernetes) upgrades, AZ (Availability Zone) shifts), canary at the service/app level for everyday deploys, and feature flags layered on top to decouple deploy from release – so the binary is rolled out via canary, but the new behavior is gated and ramped independently. That last decoupling is what lets you actually deploy on Fridays.
A Ladder of Examples
To make the canary pattern concrete, here are seven levels of growing complexity.
Level 1 — The textbook canary
Fleet of 100 pods, all v1. Deploy 5 pods of v2 alongside, route 5% of traffic there. Watch error rate and latency for 10 min. Healthy → bump to 25 v2 / 75 v1, then 50/50, then 100% v2. Two versions live simultaneously, gradual shift. Classic canary.
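A minimal sketch of the routing decision behind this, assuming a purely random per-request split (real load balancers and service meshes implement weighting in their own ways). `pick_version` and `simulate` are illustrative names.

```python
import random
from collections import Counter

STAGES = [5, 25, 50, 100]  # percent of traffic on v2 at each stage, as in the text

def pick_version(canary_percent: int, rng: random.Random) -> str:
    """Send one request to v2 with probability canary_percent / 100."""
    return "v2" if rng.random() * 100 < canary_percent else "v1"

def simulate(canary_percent: int, requests: int = 100_000, seed: int = 0) -> float:
    """Return the observed share of requests that landed on v2."""
    rng = random.Random(seed)
    counts = Counter(pick_version(canary_percent, rng) for _ in range(requests))
    return counts["v2"] / requests

for stage in STAGES:
    print(f"stage {stage:>3}% -> observed v2 share {simulate(stage):.3f}")
```

Because the split is per request, a single user can bounce between v1 and v2 across requests. That is acceptable for stateless endpoints and is the reason the stickier selection functions at the top of the ladder exist.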
Level 2 — Blue/green for contrast
Same fleet. Stand up a separate 100-pod environment running v2 (the "green" stack). Smoke-test it via an internal hostname. When ready, flip the load balancer: 0% → 100% to green in one step. Blue stays warm for 30 min as a rollback target, then gets torn down. No gradual mixing – atomic switch.
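The flip itself is little more than swapping a pointer, which is what makes rollback so fast. The sketch below models that with a hypothetical `BlueGreenSwitch` class; the smoke test is passed in as a callable because what it checks is specific to each service.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BlueGreenSwitch:
    live: str = "blue"    # environment receiving 100% of traffic
    idle: str = "green"   # parallel environment, warm but not routed to

    def cut_over(self, smoke_test: Callable[[str], bool]) -> bool:
        """Flip all traffic to the idle environment, but only if it passes validation."""
        if not smoke_test(self.idle):
            return False
        self.live, self.idle = self.idle, self.live
        return True

    def roll_back(self) -> None:
        """The same swap in reverse; only valid while the old stack is still warm."""
        self.live, self.idle = self.idle, self.live

switch = BlueGreenSwitch()
switch.cut_over(lambda env: env == "green")  # stand-in for a real smoke test
print(switch.live)   # green
switch.roll_back()
print(switch.live)   # blue
```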
Level 3 — Canary with automated SLO gates
Same canary as Level 1, but Argo Rollouts (or Spinnaker, Flagger) runs it. Each stage has a quantitative gate: "p99 latency must stay under 250ms and 5xx rate under 0.1% for 10 min before promoting." Fails the gate → automatic rollback, no human in the loop. This is where canary stops being "watch dashboards manually" and becomes a real progressive delivery pipeline.
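The gate logic can be sketched independently of any tool. The code below is not the Argo Rollouts, Spinnaker, or Flagger API; it is a plain-Python illustration using the thresholds quoted above, with `Sample`, `gate_passes`, and `run_canary` as assumed names and `observe` standing in for whatever metrics backend supplies the data.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass(frozen=True)
class Sample:
    p99_latency_ms: float
    rate_5xx: float  # fraction of responses with a 5xx status

def gate_passes(samples: Sequence[Sample],
                max_p99_ms: float = 250.0,
                max_5xx: float = 0.001) -> bool:
    """Every sample in the soak window must stay under both thresholds."""
    if not samples:
        return False  # missing data is treated as a failure, not a pass
    return all(s.p99_latency_ms < max_p99_ms and s.rate_5xx < max_5xx
               for s in samples)

def run_canary(stages: Sequence[int],
               observe: Callable[[int], Sequence[Sample]]) -> Tuple[str, int]:
    """Walk the stages; roll back at the first stage whose gate fails."""
    for percent in stages:
        if not gate_passes(observe(percent)):
            return ("rolled_back", percent)
    return ("promoted", stages[-1])

def healthy(percent: int) -> Sequence[Sample]:
    # One sample per minute of a 10-minute soak, all within SLO.
    return [Sample(p99_latency_ms=180.0, rate_5xx=0.0002)] * 10

print(run_canary([5, 25, 50, 100], healthy))  # ('promoted', 100)
```

Treating an empty sample window as a failure is a deliberate choice in this sketch: a gate that passes when metrics are missing would promote straight through a monitoring outage.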
Level 4 — Canary on a stateful service
Now v2 changes the database schema. You can't just route 5% of traffic to v2 – v1 and v2 are reading/writing the same DB. So you do expand/contract: add the new column, then deploy v1.5, which writes both old and new columns but reads old. Once existing rows are backfilled, canary v2, which reads new while still writing both, so v1.5 pods and any rollback keep seeing current data. Then, once v2 is at 100%, a cleanup release stops the dual write and drops the old column. Three deploys for one logical change. This is where people learn the hard way that canary requires forward/backward compatible data contracts.
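A toy model of the compatibility contract, with a dict standing in for a table row. The column names `name` and `full_name` are hypothetical, and the sketch assumes v2 keeps writing the old column during the canary so that v1.5 can still read rows v2 has touched.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[str]]

def new_row() -> Row:
    # Schema after the "expand" migration: the new column exists but is nullable.
    return {"name": None, "full_name": None}

def v1_write(row: Row, value: str) -> None:
    row["name"] = value            # v1 only knows the old column

def v1_5_write(row: Row, value: str) -> None:
    row["name"] = value            # dual write: old column...
    row["full_name"] = value       # ...and new column

def v1_5_read(row: Row) -> Optional[str]:
    return row["name"]             # still reads old

def v2_write(row: Row, value: str) -> None:
    row["name"] = value            # assumed: v2 keeps the dual write until cleanup
    row["full_name"] = value

def v2_read(row: Row) -> Optional[str]:
    return row["full_name"]        # reads new

a = new_row(); v1_5_write(a, "Ada")
b = new_row(); v2_write(b, "Grace")
c = new_row(); v1_write(c, "Linus")

print(v2_read(a))    # Ada   -> v2 can read what v1.5 wrote
print(v1_5_read(b))  # Grace -> v1.5 can read what v2 wrote
print(v2_read(c))    # None  -> rows written only by v1 need a backfill before v2 reads them
```

The last line is the failure mode that the intermediate deploy and a backfill exist to prevent.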
Level 5 — Canary decoupled from release via feature flags
v2 ships the new checkout algorithm, but it's behind a flag defaulted off. Canary the binary to 100% with the flag off – pure infra rollout, zero behavior change, low risk. Then separately ramp the flag 1% → 10% → 50% → 100% over days, gated on business metrics (conversion, revenue per session), not just SRE (Site Reliability Engineering) metrics. Two independent ramps: deploy and release. This is what most mature orgs actually do.
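A common way to implement the flag ramp is deterministic hashing, so each user gets a stable bucket and stays enabled as the percentage rises. The sketch below assumes that approach; commercial and open-source flag systems differ in detail, and `bucket`, `flag_enabled`, and the flag name `new-checkout` are illustrative.

```python
import hashlib

def bucket(user_id: str, flag: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def flag_enabled(user_id: str, flag: str, rollout_percent: int) -> bool:
    """A user is enabled once the ramp passes their bucket, and stays enabled."""
    return bucket(user_id, flag) < rollout_percent

users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):
    enabled = sum(flag_enabled(u, "new-checkout", percent) for u in users)
    print(f"ramp {percent:>3}% -> {enabled} of {len(users)} users enabled")
```

Including the flag name in the hash input means different flags pick different cohorts, so the same unlucky users are not first in line for every experiment.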
Level 6 — Hybrid, multi-region, the real world
You're rolling v2 across 5 AWS regions. Within each region, canary 5% → 25% → 100% with SLO gates. Across regions, blue/green-style: fully complete us-west-2, soak 24h, then us-east-1, then EU, etc. Feature flag controls behavior independently. If the third region detects a regression, automated rollback happens in that region only; other regions hold. This is what a large-scale rollout actually looks like – canary at the service layer, blue/green at the region layer, flags at the feature layer, all composed.
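The composition can be sketched as two nested loops: regions in sequence on the outside, canary stages with a gate on the inside. Only us-west-2 and us-east-1 come from the text; the other region names, the `roll_out` function, and the `gate` callable are assumptions for illustration.

```python
from typing import Callable, Dict, Sequence

REGIONS = ["us-west-2", "us-east-1", "eu-west-1", "eu-central-1", "ap-southeast-2"]
STAGES = [5, 25, 100]  # percent of traffic on v2 within a region

def roll_out(regions: Sequence[str],
             stages: Sequence[int],
             gate: Callable[[str, int], bool]) -> Dict[str, str]:
    """Finish one region completely before starting the next; stop on the first failure."""
    status = {region: "pending" for region in regions}
    for region in regions:
        for percent in stages:
            if not gate(region, percent):
                status[region] = "rolled_back"
                return status  # completed regions hold; later ones are never started
        status[region] = "complete"
        # A real pipeline would soak here (24h in the text) before moving on.
    return status

def gate(region: str, percent: int) -> bool:
    # Stand-in for an SLO check: pretend the third region regresses at 25%.
    return not (region == "eu-west-1" and percent == 25)

for region, state in roll_out(REGIONS, STAGES, gate).items():
    print(f"{region}: {state}")
```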
Level 7 — The weird ones
- Shadow / dark traffic: v2 receives a copy of production traffic but its responses are discarded. Zero user impact, full production signal. Great for ML models or perf refactors. Not really canary (no user is served by it) but often confused for it.
- Sticky canary: route specific users (internal employees, beta cohort) to v2 instead of a random 5%. Same shape, different selection function.
- Per-tenant canary: in B2B SaaS, canary by customer tenant, not by request. Tenant 1 is on v2, tenants 2–100 on v1. Useful when behavior must be consistent within a customer.
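The three variants above can be sketched side by side: two of them only swap the selection function, and shadowing changes what happens to the response. All names here (`Request`, `sticky_selector`, `tenant_selector`, `serve_with_shadow`) are hypothetical, and real traffic mirroring is usually asynchronous rather than inline as shown.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class Request:
    user_id: str
    tenant_id: str

Selector = Callable[[Request], str]
Handler = Callable[[Request], str]

def sticky_selector(beta_users: FrozenSet[str]) -> Selector:
    """Sticky canary: a fixed cohort of users always gets v2."""
    return lambda req: "v2" if req.user_id in beta_users else "v1"

def tenant_selector(canary_tenants: FrozenSet[str]) -> Selector:
    """Per-tenant canary: every request from a chosen tenant gets v2."""
    return lambda req: "v2" if req.tenant_id in canary_tenants else "v1"

def serve_with_shadow(req: Request, primary: Handler, shadow: Handler) -> str:
    """Shadow traffic: v2 sees a copy of the request, but the user only gets v1's answer."""
    response = primary(req)
    try:
        shadow(req)        # result intentionally discarded
    except Exception:
        pass               # a shadow failure must never reach the user
    return response

req = Request(user_id="alice", tenant_id="tenant-1")
print(sticky_selector(frozenset({"alice"}))(req))      # v2
print(tenant_selector(frozenset({"tenant-7"}))(req))   # v1
```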
Takeaway
The mental model at the bottom of the ladder is simple: canary is two versions running simultaneously with traffic gradually shifting from one to the other; blue/green is an atomic flip between two parallel environments. The upper rungs are all variations on three questions: what's the selection function, what's the gate, and what's the unit of rollout?
Most production-grade rollouts you'll see in mature orgs aren't pure canary or pure blue/green – they compose both, plus feature flags, across different layers of the stack.