Kubernetes at Scale: Our Journey to 99.9% Uptime
How we achieved enterprise-grade reliability using Kubernetes, including our monitoring setup, deployment strategies, incident response, and the hard lessons learned along the way.
Achieving 99.9% uptime means your service can only be down for 8.76 hours per year—or about 43 minutes per month. When you're running critical infrastructure for enterprise clients, every second of downtime matters. At ElseBlock Labs, we've spent the last three years refining our Kubernetes infrastructure to consistently achieve and exceed this reliability target. This is our story of transformation, filled with technical challenges, architectural decisions, and hard-won lessons.
The Starting Point: Chaos and Fire-Fighting
Three years ago, our infrastructure was a collection of VMs running Docker containers, managed through a combination of bash scripts and hope. Our "deployment strategy" involved SSH-ing into servers and running docker-compose up. Monitoring meant checking if the website was up. On-call meant waking up to angry customer emails.
Our reliability metrics were sobering:
- Average uptime: 97.5% (18 hours of downtime per month)
- Mean Time To Recovery (MTTR): 2.5 hours
- Deployment failure rate: 15%
- Number of 3 AM wake-up calls: Too many to count
We knew we needed a complete infrastructure overhaul. Kubernetes was the obvious choice, but the journey from chaos to 99.9% uptime was anything but straightforward.
Phase 1: Foundation - Building on Solid Ground
Cluster Architecture
We started with a highly available architecture spread across three availability zones in a single region (multi-region comes later in the roadmap):
# EKS Cluster Configuration
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production-cluster
region: us-east-1
version: "1.28"
availabilityZones:
- us-east-1a
- us-east-1b
- us-east-1c
nodeGroups:
- name: system-nodes
instanceType: t3.large
desiredCapacity: 3
minSize: 3
maxSize: 9
availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
labels:
role: system
taints:
- key: CriticalAddonsOnly
value: "true"
effect: NoSchedule
- name: application-nodes
instanceType: c5.2xlarge
desiredCapacity: 6
minSize: 6
maxSize: 30
availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
labels:
role: application
spot: true
spotInstancePools: 3
Key architectural decisions:
- Multi-AZ deployment: Spreading nodes across availability zones for resilience
- Node segregation: System components on dedicated nodes with taints
- Spot instances: 70% spot instances for cost optimization with proper disruption handling
- Auto-scaling: Both cluster and pod autoscaling based on metrics
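As a sketch of the pod side, a minimal HorizontalPodAutoscaler for a hypothetical api-service might look like the following (thresholds are illustrative, not our exact production values); the cluster autoscaler then adds nodes whenever pods become unschedulable:
# Example HPA (illustrative values)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 6
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU crosses 70%
Scaling on custom metrics (requests per second, queue depth) works the same way by adding entries to the metrics list; CPU is simply the easiest place to start.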
GitOps with ArgoCD
We adopted GitOps as our deployment philosophy, using ArgoCD for declarative, version-controlled deployments:
# ArgoCD Application Definition
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: production-services
namespace: argocd
spec:
project: production
source:
repoURL: https://github.com/elseblock/k8s-manifests
targetRevision: main
path: production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
Phase 2: Observability - You Can't Fix What You Can't See
The Monitoring Stack
We built a comprehensive monitoring stack using the Prometheus ecosystem:
# Prometheus Stack Components
monitoring-stack:
prometheus-operator: v0.68.0
prometheus: v2.45.0
alertmanager: v0.26.0
grafana: v10.0.0
thanos: v0.32.0
loki: v2.9.0
tempo: v2.2.0
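With the Prometheus Operator in the stack, scrape targets are declared as ServiceMonitor resources that live alongside each workload. A minimal sketch for a hypothetical api-service exposing metrics on a port named metrics (the labels and interval are illustrative):
# Example ServiceMonitor (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: monitoring
  labels:
    release: prometheus    # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api-service
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics        # named port on the Service
      path: /metrics
      interval: 30s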
Custom Metrics That Matter
Beyond standard metrics, we track business-critical KPIs:
// Custom metrics example
package metrics
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
RequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "app_request_duration_seconds",
Help: "Duration of HTTP requests in seconds",
Buckets: []float64{0.001, 0.01, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0},
},
[]string{"service", "method", "endpoint", "status"},
)
BusinessTransactions = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "app_business_transactions_total",
Help: "Total number of business transactions",
},
[]string{"service", "transaction_type", "status"},
)
ErrorBudgetRemaining = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "slo_error_budget_remaining_ratio",
Help: "Remaining error budget as a ratio",
},
[]string{"service", "slo_name"},
)
)
SLOs and Error Budgets
We define Service Level Objectives (SLOs) for all critical services:
# SLO Definition
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: api-service-slo
spec:
service: "api"
labels:
team: "platform"
slos:
- name: "requests-availability"
objective: 99.9
description: "99.9% of requests should be successful"
sli:
events:
error_query: |
sum(rate(app_request_duration_seconds_count{service="api",status=~"5.."}[5m]))
total_query: |
sum(rate(app_request_duration_seconds_count{service="api"}[5m]))
alerting:
name: APIServiceHighErrorRate
page_alert:
labels:
severity: critical
ticket_alert:
labels:
severity: warning
Phase 3: Deployment Strategies - Rolling Out Without Rolling Over
Progressive Delivery with Flagger
We use Flagger for automated canary deployments and progressive traffic shifting:
# Canary Deployment Configuration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-service
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-service
service:
port: 80
targetPort: 8080
gateways:
- public-gateway
hosts:
- api.elseblock.io
analysis:
interval: 1m
threshold: 10
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
webhooks:
- name: load-test
url: http://flagger-loadtester/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://api-service-canary/"
Blue-Green Deployments for Critical Services
For services that can't tolerate any risk, we use blue-green deployments:
#!/bin/bash
# Blue-Green Deployment Script
NAMESPACE="production"
SERVICE="critical-service"
NEW_VERSION=$1
echo "Deploying $SERVICE version $NEW_VERSION as green..."
# Deploy green version
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: ${SERVICE}-green
namespace: ${NAMESPACE}
spec:
replicas: 3
selector:
matchLabels:
app: ${SERVICE}
version: green
template:
metadata:
labels:
app: ${SERVICE}
version: green
spec:
containers:
- name: app
image: elseblock/${SERVICE}:${NEW_VERSION}
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 5
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
EOF
# Wait for green deployment to be ready
kubectl wait --for=condition=available \
--timeout=600s \
deployment/${SERVICE}-green \
-n ${NAMESPACE}
# Run smoke tests
./run-smoke-tests.sh ${SERVICE}-green
if [ $? -eq 0 ]; then
echo "Smoke tests passed. Switching traffic to green..."
# Update service selector to green
kubectl patch service ${SERVICE} -n ${NAMESPACE} \
-p '{"spec":{"selector":{"version":"green"}}}'
echo "Traffic switched. Monitoring for 5 minutes..."
sleep 300
# Check error rate (prometheus_query is a small helper that queries the Prometheus HTTP API and returns the result as a number)
ERROR_RATE=$(prometheus_query "rate(app_errors_total{service=\"${SERVICE}\"}[5m])")
if (( $(echo "$ERROR_RATE < 0.001" | bc -l) )); then
echo "Deployment successful. Cleaning up blue deployment..."
kubectl delete deployment ${SERVICE}-blue -n ${NAMESPACE}
kubectl patch deployment ${SERVICE}-green -n ${NAMESPACE} \
--type='json' -p='[{"op": "replace", "path": "/metadata/name", "value":"'${SERVICE}'-blue"}]'
else
echo "High error rate detected! Rolling back..."
kubectl patch service ${SERVICE} -n ${NAMESPACE} \
-p '{"spec":{"selector":{"version":"blue"}}}'
kubectl delete deployment ${SERVICE}-green -n ${NAMESPACE}
exit 1
fi
else
echo "Smoke tests failed. Aborting deployment."
kubectl delete deployment ${SERVICE}-green -n ${NAMESPACE}
exit 1
fi
Phase 4: Resilience - Preparing for the Worst
Chaos Engineering with Litmus
We regularly test our system's resilience using chaos experiments:
# Chaos Experiment: Node Failure
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: node-failure-chaos
spec:
appinfo:
appns: production
applabel: app=critical-service
chaosServiceAccount: litmus-admin
experiments:
- name: node-drain
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: '300'
- name: NODE_LABEL
value: 'role=application'
- name: DRAIN_TIMEOUT
value: '90'
probe:
- name: check-service-availability
type: httpProbe
httpProbe/inputs:
url: https://api.elseblock.io/health
insecureSkipVerify: false
responseTimeout: 5000
method:
get:
criteria: ==
responseCode: "200"
mode: Continuous
runProperties:
probeTimeout: 5
interval: 2
retry: 3
Disaster Recovery and Backup
We maintain comprehensive backup and recovery procedures:
# Velero Backup Configuration
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
hooks:
resources:
- name: database-backup
includedNamespaces:
- production
labelSelector:
matchLabels:
component: database
pre:
- exec:
container: postgres
command:
- /bin/bash
- -c
- "pg_dump -U $POSTGRES_USER $POSTGRES_DB > /backup/db-$(date +%Y%m%d).sql"
onError: Fail
timeout: 10m
includedNamespaces:
- production
- monitoring
excludedResources:
- events
- events.events.k8s.io
ttl: 720h # 30 days retention
storageLocation: s3-backup
volumeSnapshotLocations:
- aws-ebs-snapshots
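Backups are only useful if the restore path works, so it helps to spell that side out too. A hedged sketch of a Velero restore of the production namespace from one of the scheduled backups (the backup name here is illustrative):
# Example Velero restore (illustrative backup name)
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: production-restore
  namespace: velero
spec:
  backupName: daily-backup-20240101020000
  includedNamespaces:
    - production
  restorePVs: true         # also restore volumes from the EBS snapshots
The same restore can be triggered from the CLI with velero restore create --from-backup and the backup name.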
Phase 5: Incident Response - When Things Go Wrong
Automated Incident Management
We've automated much of our incident response process:
# Incident Response Bot
import asyncio
import os
from datetime import datetime
from typing import Dict, List
import aiohttp
from kubernetes import client, config
class IncidentResponseBot:
def __init__(self):
self.k8s_client = self._init_k8s_client()
self.slack_webhook = os.environ['SLACK_WEBHOOK']
self.pagerduty_key = os.environ['PAGERDUTY_KEY']
async def handle_alert(self, alert: Dict):
severity = alert['labels']['severity']
service = alert['labels']['service']
# Create incident record
incident_id = await self.create_incident(alert)
# Execute automatic remediation if available
if remediation := self.get_remediation_action(alert):
await self.execute_remediation(remediation, incident_id)
# Escalate based on severity
if severity == 'critical':
await self.page_oncall(alert, incident_id)
elif severity == 'warning':
await self.notify_slack(alert, incident_id)
# Start collecting diagnostic data
asyncio.create_task(
self.collect_diagnostics(service, incident_id)
)
async def execute_remediation(self, action: str, incident_id: str):
remediation_actions = {
'restart_pod': self.restart_unhealthy_pods,
'scale_up': self.scale_deployment,
'clear_cache': self.clear_application_cache,
'rotate_credentials': self.rotate_credentials,
'failover_database': self.initiate_db_failover,
}
if handler := remediation_actions.get(action):
try:
await handler()
await self.update_incident(
incident_id,
f"Automatic remediation '{action}' executed successfully"
)
except Exception as e:
await self.update_incident(
incident_id,
f"Automatic remediation '{action}' failed: {str(e)}"
)
await self.escalate_to_human(incident_id)
async def collect_diagnostics(self, service: str, incident_id: str):
diagnostics = {
'timestamp': datetime.utcnow().isoformat(),
'service': service,
'pod_logs': await self.get_pod_logs(service),
'metrics': await self.get_metrics_snapshot(service),
'events': await self.get_k8s_events(service),
'traces': await self.get_distributed_traces(service),
}
# Store diagnostics for post-mortem
await self.store_diagnostics(incident_id, diagnostics)
Runbook Automation
We've codified our runbooks into executable automation:
# Automated Runbook Example
apiVersion: v1
kind: ConfigMap
metadata:
name: runbook-high-memory
data:
runbook.yaml: |
name: High Memory Usage
trigger:
alert: HighMemoryUsage
threshold: 85
steps:
- name: identify_memory_consumers
action: execute
command: |
kubectl top pods -n production --sort-by=memory | head -10
- name: check_for_memory_leaks
action: analyze
metrics:
- container_memory_usage_bytes
- container_memory_working_set_bytes
duration: 1h
- name: restart_if_leak_detected
action: conditional
condition: memory_leak_detected
true_action:
restart_deployment:
graceful: true
max_surge: 1
max_unavailable: 0
false_action:
scale_horizontally:
max_replicas: 10
- name: notify_team
action: notify
channels:
- slack: '#platform-alerts'
- email: '[email protected]'
message: "High memory usage detected and remediated"
Results: The 99.9% Achievement
After implementing these strategies, our metrics tell a compelling story:
Before vs. After
| Metric | Before | After | Improvement |
|---|---|---|---|
| Uptime | 97.5% | 99.94% | +2.44% |
| MTTR | 2.5 hours | 12 minutes | -92% |
| Deployment Failures | 15% | 0.3% | -98% |
| Incident Volume | 45/month | 3/month | -93.3% |
| On-Call Pages | 28/month | 2/month | -92.9% |
Cost Optimization
Reliability improvements also led to cost savings:
- 70% reduction in incident-related overtime
- 40% infrastructure cost reduction through better resource utilization
- 85% reduction in customer credits due to SLA violations
Key Lessons Learned
1. Observability is Non-Negotiable
You cannot operate what you cannot observe. Invest heavily in monitoring, logging, and tracing from day one. The cost of comprehensive observability is a fraction of the cost of downtime.
2. Automate Everything, But Stay in Control
Automation is powerful, but it needs circuit breakers. Every automated system should have:
- Manual override capabilities
- Audit logging
- Rollback mechanisms
- Human approval gates for critical actions
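One way to get an approval gate with the GitOps setup described earlier is simply to leave syncPolicy.automated off for the most critical applications: changes still flow through Git, but nothing is applied until a human reviews the diff and triggers the sync, and the Git history plus Argo CD's audit log record who approved it. A minimal sketch, assuming a hypothetical payments application:
# Example: Argo CD Application with a manual approval gate (no automated sync)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/elseblock/k8s-manifests
    targetRevision: main
    path: production/payments   # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  # Intentionally no syncPolicy.automated: deploys require an explicit
  # Sync from the UI or from the argocd CLI.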
3. Practice Failures Before They Find You
Our chaos engineering exercises have prevented countless production incidents. Regular failure injection helps you:
- Discover unknown failure modes
- Validate your assumptions
- Train your team in a safe environment
- Build confidence in your systems
4. Culture Matters as Much as Technology
Technology alone doesn't create reliability. You need:
- Blameless post-mortems
- Shared ownership of reliability
- Investment in learning and improvement
- Recognition that reliability is a feature, not a nice-to-have
5. Error Budgets Change Everything
Implementing error budgets transformed our culture:
- Product and engineering now share reliability goals
- Teams can make informed risk/speed trade-offs
- Innovation is encouraged within acceptable risk bounds
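For reference, the remaining-budget gauge exported by the metrics package above can also be derived directly in Prometheus. A hedged sketch of a recording rule for the api availability SLO over a rolling 30-day window (the rule and label names are illustrative):
# Example recording rule: remaining error budget for a 99.9% SLO (illustrative)
groups:
  - name: slo-api-error-budget
    rules:
      - record: slo:error_budget_remaining:ratio
        labels:
          service: api
          slo_name: requests-availability
        expr: |
          1 - (
            (
              sum(rate(app_request_duration_seconds_count{service="api",status=~"5.."}[30d]))
              /
              sum(rate(app_request_duration_seconds_count{service="api"}[30d]))
            )
            /
            (1 - 0.999)
          )
In practice a tool like Sloth (used above) generates rules like this for you; the point here is just to make the arithmetic visible. When the value approaches zero the budget is spent, and the trade-off flips from shipping features to paying down reliability work.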
The Road Ahead
We're not stopping at 99.9%. Our roadmap includes:
Service Mesh with Istio
We're implementing Istio for:
- Advanced traffic management
- End-to-end encryption
- Fine-grained observability
- Circuit breaking and retry logic
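As a taste of the circuit-breaking piece, a DestinationRule along these lines (illustrative thresholds, assuming the api-service host) caps connection pools and ejects endpoints that keep returning 5xx responses:
# Example DestinationRule with circuit breaking (illustrative thresholds)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service
  namespace: production
spec:
  host: api-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5    # eject after five consecutive server errors
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50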
Multi-Region Active-Active
Moving beyond single-region availability to true global resilience.
AI-Powered Incident Prevention
Using machine learning to predict and prevent incidents before they occur.
Conclusion
Achieving 99.9% uptime isn't about any single technology or practice—it's about building a comprehensive reliability culture supported by the right tools, processes, and people. Every organization's journey will be different, but the principles remain the same: observe everything, automate wisely, prepare for failure, and never stop improving.
The investment in reliability pays dividends not just in reduced downtime, but in team happiness, customer trust, and the ability to innovate with confidence. When your platform is rock-solid, your engineers can focus on building features instead of fighting fires.
Remember: perfection is impossible, but excellence is achievable. Start where you are, measure everything, and improve incrementally. Your future on-call engineers will thank you.
Are you on a similar reliability journey? We'd love to hear about your experiences and challenges. Reach out to our team to share stories or discuss how we can help improve your infrastructure reliability.