Zero-Downtime Deployments with Azure Kubernetes Service
How to implement rolling updates, health checks, and automated rollback strategies for production AKS clusters.
Deploying updates to production without impacting users is a critical requirement for any modern application. Azure Kubernetes Service (AKS) provides powerful primitives for achieving zero-downtime deployments, but they need to be configured correctly.
Rolling Update Strategy
The default deployment strategy in Kubernetes is RollingUpdate, which gradually replaces old pods with new ones. The key parameters to tune are maxUnavailable and maxSurge:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: api
          image: myregistry.azurecr.io/web-api:v2.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```
Setting maxUnavailable: 0 ensures that no existing pod is terminated before a new one is ready. Combined with maxSurge: 1, Kubernetes will create one extra pod at a time, wait for it to pass readiness checks, then terminate an old pod.
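While a rollout is in flight you can watch its progress and, if needed, revert by hand with the standard kubectl rollout subcommands:

```shell
# Block until the rollout completes (or exceeds its progress deadline)
kubectl rollout status deployment/web-api

# List the revisions Kubernetes has recorded for this Deployment
kubectl rollout history deployment/web-api

# Revert to the previous revision if the new version misbehaves
kubectl rollout undo deployment/web-api
```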
Health Checks Are Non-Negotiable
Without proper readiness and liveness probes, Kubernetes has no way to know if your new pod is actually serving traffic correctly. A readiness probe gates traffic routing — a pod that fails its readiness probe won't receive requests from the Service. A liveness probe restarts pods that are stuck or deadlocked.
For APIs, I recommend a dedicated /healthz endpoint that checks:
- Database connectivity
- Cache availability
- Essential downstream service health
```go
package main

import (
	"database/sql"
	"encoding/json"
	"net/http"
)

var db *sql.DB // opened at startup, e.g. via sql.Open in main

// healthHandler returns 503 while the database is unreachable, 200 otherwise.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if err := db.Ping(); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{"status": "unhealthy", "reason": err.Error()})
		return
	}
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}
```
Pod Disruption Budgets
When AKS performs node upgrades or scaling events, it needs to evict pods. A PodDisruptionBudget (PDB) ensures that a minimum number of pods remain available during voluntary disruptions:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-api
```
This keeps at least 2 pods running during voluntary disruptions such as node maintenance or cluster upgrades. Note that a PDB does not protect against involuntary failures like a node crash — for that you rely on replica count and scheduling spread.
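If you scale the deployment frequently, a PDB can instead cap how many pods may be down at once with maxUnavailable, which tracks the replica count automatically rather than pinning an absolute number (the fragment below swaps only the budget field; values are illustrative):

```yaml
spec:
  maxUnavailable: 25%   # alternative to minAvailable; percentages scale with replicas
  selector:
    matchLabels:
      app: web-api
```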
Automated Rollbacks with Flagger
For production workloads, I use Flagger to automate canary deployments on AKS. Flagger progressively shifts traffic to the new version while monitoring metrics. If error rates or latency exceed thresholds, it automatically rolls back:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  progressDeadlineSeconds: 600
  analysis:
    interval: 30s
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
```
This configuration shifts an additional 10% of traffic to the canary every 30 seconds, up to a maximum of 50%, and rolls back automatically once the success rate has dropped below 99% for five failed metric checks (the threshold value).
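Since the rollback criteria mention latency as well as error rate, it is worth knowing that Flagger also ships a built-in request-duration metric (P99 latency in milliseconds) that can sit alongside the success rate; a sketch with an illustrative 500 ms budget:

```yaml
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration    # built-in Flagger metric: P99 latency in ms
        thresholdRange:
          max: 500                # roll back if P99 exceeds 500 ms (example value)
        interval: 1m
```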
Key Takeaways
Zero-downtime deployments on AKS are achievable with the right configuration. The essential ingredients are proper rolling update settings, robust health checks, pod disruption budgets, and ideally automated canary analysis. Don't deploy to production without them.