Zero-Downtime Deployments on Kubernetes: From Rolling Updates to Progressive Delivery

Deploying new code to production without taking your service offline is table stakes for modern applications. Kubernetes gives you rolling updates out of the box, but achieving genuine zero-downtime deployments—and graduating to progressive delivery strategies like canary and blue-green—requires understanding the gaps in the defaults and knowing when to reach for specialized tooling.

This post walks through the progression: making standard rolling updates bulletproof, then layering on blue-green and canary strategies with progressive delivery controllers that analyze metrics and roll back automatically when something goes wrong.

The Rolling Update Trap

Kubernetes Deployment resources use a RollingUpdate strategy by default. The controller brings up new pods and retires old ones in batches (by default up to 25% extra pods and up to 25% unavailable), and a new pod only counts as available once its readiness probe passes. In theory, this means zero downtime. In practice, three things routinely break it:

In-flight requests get cut when the old pod receives SIGTERM before finishing active connections.
New pods pass readiness checks but aren’t ready for real traffic (e.g., connection pools not warmed up, caches not populated).
Node drain operations evict pods without regard for how many replicas remain available.

Fix 1: Add a preStop Hook and Extend Grace Period

When Kubernetes terminates a pod, it sends SIGTERM and waits for terminationGracePeriodSeconds (default 30s). But the endpoint controller removes the pod from Service endpoints asynchronously—there’s a window where the kube-proxy hasn’t updated its iptables rules yet, and new connections still arrive. A preStop hook adds a delay before the signal:

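apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 4
  # selector is required and must match the pod template labels
  selector:
    matchLabels:
      app: api-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api-server
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: myapp:v2
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                # runs before SIGTERM; requires a sleep binary in the image
                command: ["sleep", "15"]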

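The preStop sleep 15 keeps the container alive for 15 seconds before SIGTERM is delivered. During that window, the pod is removed from Service endpoints and kube-proxy converges, so new traffic stops arriving and the application can then drain in-flight requests gracefully. The hook’s runtime counts against terminationGracePeriodSeconds, so with a 60-second grace period the application has roughly 45 seconds to finish draining after SIGTERM before it is killed. maxUnavailable: 0 ensures the controller never terminates a healthy pod before a replacement is ready.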
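
The Deployment above does not address the second failure mode, pods that report ready before they can take real load. One way to narrow that gap, sketched here as a fragment of the same pod template, is to add a startupProbe and point the readiness probe at an endpoint that only succeeds once warm-up is done. The /ready path is a hypothetical endpoint the application would have to implement, and the timing values are placeholders rather than recommendations.

```yaml
# Fragment of spec.template.spec.containers[0]; merge into the api container.
startupProbe:
  httpGet:
    path: /healthz        # process is up and serving
    port: 8080
  periodSeconds: 2
  failureThreshold: 30    # allow roughly 60s of startup before the container is restarted
readinessProbe:
  httpGet:
    path: /ready          # hypothetical: returns 200 only after pools and caches are warm
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```

Setting minReadySeconds on the Deployment spec is a complementary knob: the rollout waits that long after a pod turns ready before counting it as available.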

Fix 2: Pod Disruption Budgets

Voluntary disruptions—node drains during cluster upgrades, autoscaling, or maintenance—can evict multiple pods simultaneously. A PodDisruptionBudget enforces a minimum available count:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server

The eviction API respects this budget. If only 2 of 4 replicas are healthy, no further voluntary evictions are granted until capacity recovers.
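
A fixed minAvailable has to be revisited whenever the replica count changes. As a hedged alternative, the budget can be written with maxUnavailable, which lets at most that many pods be evicted at once regardless of scale. This sketch assumes the same app: api-server label and would replace the budget above rather than sit alongside it, since a single PodDisruptionBudget can set only one of the two fields.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  maxUnavailable: 1       # at most one pod may be voluntarily evicted at a time
  selector:
    matchLabels:
      app: api-server
```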

When Rolling Updates Aren’t Enough

Rolling updates answer “can the new pods start?” but not “is the new version actually correct?” A readiness probe checks process health, not business correctness. If a bad deploy causes elevated error rates or latency spikes, the rolling update keeps going until every pod runs the broken version.

This is where progressive delivery comes in. Instead of replacing all pods and hoping for the best, you shift traffic gradually, measure real-world metrics, and abort automatically if something looks wrong.

Blue-Green Deployments with Argo Rollouts

Argo Rollouts (v1.9.0, released March 2026) is a Kubernetes controller that replaces your Deployment with a Rollout custom resource. The simplest strategy is blue-green: deploy the new version alongside the old one, then flip traffic all at once.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myapp:v2
          ports:
            - containerPort: 8080
  strategy:
    blueGreen:
      activeService: api-server
      previewService: api-server-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 60

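When you update the image, Argo Rollouts creates a new ReplicaSet (green) alongside the active one (blue). Two Services point at the two versions: api-server routes production traffic, and api-server-preview lets you test the new version in isolation. With autoPromotionEnabled: false, promotion is manual: run kubectl argo rollouts promote api-server when you’re satisfied, and the controller repoints the active Service’s selector at the green ReplicaSet. If the preview looks wrong before promotion, kubectl argo rollouts abort api-server cancels the update and production traffic never leaves blue. After promotion, kubectl argo rollouts undo api-server switches back to the previous version, typically within seconds as long as the blue ReplicaSet is still running, which scaleDownDelaySeconds: 60 ensures for one minute after the switch.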
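
The Rollout refers to two Services by name but does not create them. A minimal sketch of what they might look like follows. Port 80 is an assumption, while the names, label, and container port come from the manifest above. Both start with the same selector, and the controller adds a rollouts-pod-template-hash entry to each selector at runtime to pin the Service to one ReplicaSet.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-server            # activeService: carries production traffic
spec:
  selector:
    app: api-server
  ports:
    - port: 80                # assumed Service port
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api-server-preview    # previewService: new version before promotion
spec:
  selector:
    app: api-server
  ports:
    - port: 80                # assumed Service port
      targetPort: 8080
```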

Canary Deployments with Metric Analysis

Blue-green is binary: all traffic switches at once. Canary deployments are more granular—they shift traffic incrementally and analyze metrics at each step. If error rates spike at 20% traffic, the rollout aborts before the majority of users are affected.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
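  replicas: 4
  # selector and template omitted for brevity; they match the blue-green example
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              # the template declares service-name without a default, so pass it here
              - name: service-name
                value: api-server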
        - setWeight: 40
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 5m}
        - setWeight: 80
        - pause: {duration: 5m}

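This rollout targets 20% of traffic for the canary, waits 5 minutes, then runs an analysis before stepping through 40%, 60%, and 80%. Without a trafficRouting block, Argo Rollouts approximates each weight by adjusting the ratio of canary to stable pods, so with only 4 replicas the split is coarse; exact percentages require a traffic router such as a service mesh or ingress controller. The AnalysisTemplate defines the metric check, typically a Prometheus query comparing the canary’s success rate against a threshold: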

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
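      interval: 1m
      count: 5            # stop after 5 measurements so the inline step can complete
      successCondition: result[0] >= 0.99
      failureLimit: 3     # tolerate up to 3 failed measurements; a 4th fails the run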
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              code!~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[2m]))

With failureLimit: 3, the analysis tolerates up to three measurements below 99%; a fourth failed measurement fails the run, at which point Argo Rollouts aborts the rollout and returns traffic to the stable version. The failures do not need to be consecutive. No human intervention required. The controller integrates with Prometheus, Datadog, New Relic, Wavefront, InfluxDB, and others as metric providers, and with Istio, NGINX, Traefik, and the Gateway API for traffic shaping.
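
For the weights in the canary steps to be enforced per request rather than approximated by pod counts, the Rollout needs a trafficRouting block. The fragment below is a sketch using the NGINX ingress integration. The Service names api-server-stable and api-server-canary and the Ingress name api-server-ingress are hypothetical and would need to exist in the cluster, with the Ingress routing to the stable Service.

```yaml
# Fragment of a Rollout spec; remaining steps are unchanged from the canary example.
strategy:
  canary:
    stableService: api-server-stable       # hypothetical Service for the stable ReplicaSet
    canaryService: api-server-canary       # hypothetical Service for the canary ReplicaSet
    trafficRouting:
      nginx:
        stableIngress: api-server-ingress  # hypothetical Ingress pointing at the stable Service
    steps:
      - setWeight: 20
      - pause: {duration: 5m}
```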

Flagger: The GitOps-Native Alternative

Flagger (v1.43.0, released April 2026) takes a different approach. Rather than introducing a new workload type, it wraps your existing Deployment and HorizontalPodAutoscaler with a Canary custom resource. The controller clones your deployment, creates shadow services, and drives the traffic shift through your service mesh or ingress controller.

Flagger is part of the Flux family of projects (Flux is a CNCF graduated project) and integrates natively with Flux for GitOps workflows. If your delivery pipeline is already Flux-based, Flagger is the natural fit, because your canary promotion policy lives in Git alongside your manifests:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
  namespace: production
spec:
  provider: istio
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s

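Flagger runs a check every minute and, while checks pass, shifts another 5% of traffic to the canary until it reaches the 50% cap, then promotes. Each check verifies the success rate (minimum 99%) and P99 latency (maximum 500ms). Once the number of failed checks reaches the threshold of 10, Flagger routes all traffic back to the primary and rolls the canary back. The built-in metrics work with Prometheus out of the box, with no AnalysisTemplate boilerplate required.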
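
The Canary above targets only the Deployment. If the workload is also scaled by a HorizontalPodAutoscaler, Flagger can be pointed at it through autoscalerRef so that the primary workload it generates gets a matching autoscaler. A sketch of the extra field, assuming an existing HPA named api-server in the same namespace:

```yaml
# Fragment of the Canary spec, placed alongside targetRef.
spec:
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: api-server      # assumed name of an existing HPA
```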

Choosing the Right Strategy

Rolling update with preStop + PDB: Sufficient for most internal services. No extra controller needed. Start here.
Blue-green: Best when you need instant rollback and can afford 2x resource overhead during the switch. Good for database migrations or breaking changes where you want to test the full deployment before committing, provided any schema change stays compatible with the old version while both are running.
Canary with metric analysis: The default for high-traffic, user-facing services. Limits blast radius to a small percentage of traffic before a bad deploy affects everyone.
Flagger vs Argo Rollouts: If you’re on Flux, Flagger integrates more seamlessly. If you’re on Argo CD or need blue-green in addition to canary, Argo Rollouts is more feature-complete. Both are production-grade CNCF projects.

The progression matters. Don’t jump straight to canary deployments with metric analysis if your rolling updates still drop connections. Get the fundamentals right—readiness probes, preStop hooks, disruption budgets—then add progressive delivery when you need automated rollback based on real signal. The tools are mature, the patterns are well-established, and the cost of a bad deploy that ships to 20% of users is dramatically lower than one that ships to 100%.