Zero-Downtime Deployments with Azure Kubernetes Service
How to implement rolling updates, health checks, and automated rollback strategies for production AKS clusters.
Deploying updates to production without impacting users is a critical requirement for any modern application. Azure Kubernetes Service (AKS) provides powerful primitives for achieving zero-downtime deployments, but they need to be configured correctly.
Rolling Update Strategy
The default deployment strategy in Kubernetes is RollingUpdate, which gradually replaces old pods with new ones. The key parameters to tune are maxUnavailable and maxSurge:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: api
          image: myregistry.azurecr.io/web-api:v2.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```
Setting maxUnavailable: 0 ensures that no existing pod is terminated before a new one is ready. Combined with maxSurge: 1, Kubernetes will create one extra pod at a time, wait for it to pass readiness checks, then terminate an old pod.
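While a rollout is in flight you can watch its progress and, if needed, revert by hand with the standard kubectl rollout subcommands:

```shell
# Block until the rollout completes (or exceeds its progress deadline)
kubectl rollout status deployment/web-api

# List the revisions Kubernetes has recorded for this Deployment
kubectl rollout history deployment/web-api

# Revert to the previous revision if the new version misbehaves
kubectl rollout undo deployment/web-api
```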
Health Checks Are Non-Negotiable
Without proper readiness and liveness probes, Kubernetes has no way to know if your new pod is actually serving traffic correctly. A readiness probe gates traffic routing — a pod that fails its readiness probe won't receive requests from the Service. A liveness probe restarts pods that are stuck or deadlocked.
For APIs, I recommend a dedicated /healthz endpoint that checks:
- Database connectivity
- Cache availability
- Essential downstream service health
```go
package main

import (
	"database/sql"
	"encoding/json"
	"net/http"
)

var db *sql.DB // opened at startup, e.g. via sql.Open in main

// healthHandler returns 503 while the database is unreachable, 200 otherwise.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if err := db.Ping(); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{"status": "unhealthy", "reason": err.Error()})
		return
	}
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}
```
Pod Disruption Budgets
When AKS performs node upgrades or scaling events, it needs to evict pods. A PodDisruptionBudget (PDB) ensures that a minimum number of pods remain available during voluntary disruptions:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-api
```
This keeps at least 2 pods running during voluntary disruptions such as node maintenance or cluster upgrades. Note that a PDB does not protect against involuntary failures like a node crash — for that you rely on replica count and scheduling spread.
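If you scale the deployment frequently, a PDB can instead cap how many pods may be down at once with maxUnavailable, which tracks the replica count automatically rather than pinning an absolute number (the fragment below swaps only the budget field; values are illustrative):

```yaml
spec:
  maxUnavailable: 25%   # alternative to minAvailable; percentages scale with replicas
  selector:
    matchLabels:
      app: web-api
```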
Automated Rollbacks with Flagger
For production workloads, I use Flagger to automate canary deployments on AKS. Flagger progressively shifts traffic to the new version while monitoring metrics. If error rates or latency exceed thresholds, it automatically rolls back:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  progressDeadlineSeconds: 600
  analysis:
    interval: 30s
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
```
This configuration shifts an additional 10% of traffic to the canary every 30 seconds, up to a maximum of 50%, and rolls back automatically once the success rate has dropped below 99% for five failed metric checks (the threshold value).
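Since the rollback criteria mention latency as well as error rate, it is worth knowing that Flagger also ships a built-in request-duration metric (P99 latency in milliseconds) that can sit alongside the success rate; a sketch with an illustrative 500 ms budget:

```yaml
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration    # built-in Flagger metric: P99 latency in ms
        thresholdRange:
          max: 500                # roll back if P99 exceeds 500 ms (example value)
        interval: 1m
```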
Key Takeaways
Zero-downtime deployments on AKS are achievable with the right configuration. The essential ingredients are proper rolling update settings, robust health checks, pod disruption budgets, and ideally automated canary analysis. Don't deploy to production without them.