Kubernetes at Scale: Our Journey to 99.9% Uptime
How we achieved enterprise-grade reliability using Kubernetes, including our monitoring setup, deployment strategies, incident response, and the hard lessons learned along the way.
Achieving 99.9% uptime means your service can only be down for 8.76 hours per year—or about 43 minutes per month. When you're running critical infrastructure for enterprise clients, every second of downtime matters. At ElseBlock Labs, we've spent the last three years refining our Kubernetes infrastructure to consistently achieve and exceed this reliability target. This is our story of transformation, filled with technical challenges, architectural decisions, and hard-won lessons.
The Starting Point: Chaos and Fire-Fighting
Three years ago, our infrastructure was a collection of VMs running Docker containers, managed through a combination of bash scripts and hope. Our "deployment strategy" involved SSH-ing into servers and running docker-compose up. Monitoring meant checking if the website was up. On-call meant waking up to angry customer emails.
Our reliability metrics were sobering:
- Average uptime: 97.5% (18 hours of downtime per month)
- Mean Time To Recovery (MTTR): 2.5 hours
- Deployment failure rate: 15%
- Number of 3 AM wake-up calls: Too many to count
We knew we needed a complete infrastructure overhaul. Kubernetes was the obvious choice, but the journey from chaos to 99.9% uptime was anything but straightforward.
Phase 1: Foundation - Building on Solid Ground
Cluster Architecture
We started with a highly available architecture spread across three availability zones in a single region (multi-region comes later in the roadmap):
# EKS Cluster Configuration
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production-cluster
region: us-east-1
version: "1.28"
availabilityZones:
- us-east-1a
- us-east-1b
- us-east-1c
nodeGroups:
- name: system-nodes
instanceType: t3.large
desiredCapacity: 3
minSize: 3
maxSize: 9
availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
labels:
role: system
taints:
- key: CriticalAddonsOnly
value: "true"
effect: NoSchedule
- name: application-nodes
instanceType: c5.2xlarge
desiredCapacity: 6
minSize: 6
maxSize: 30
availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
labels:
role: application
spot: true
spotInstancePools: 3
Key architectural decisions:
- Multi-AZ deployment: Spreading nodes across availability zones for resilience
- Node segregation: System components on dedicated nodes with taints
- Spot instances: 70% spot instances for cost optimization with proper disruption handling
- Auto-scaling: Both cluster and pod autoscaling based on metrics
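As a sketch of the pod side, a minimal HorizontalPodAutoscaler for a hypothetical api-service might look like the following (thresholds are illustrative, not our exact production values); the cluster autoscaler then adds nodes whenever pods become unschedulable:
# Example HPA (illustrative values)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 6
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU crosses 70%
Scaling on custom metrics (requests per second, queue depth) works the same way by adding entries to the metrics list; CPU is simply the easiest place to start.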
GitOps with ArgoCD
We adopted GitOps as our deployment philosophy, using ArgoCD for declarative, version-controlled deployments:
# ArgoCD Application Definition
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: production-services
namespace: argocd
spec:
project: production
source:
repoURL: https://github.com/elseblock/k8s-manifests
targetRevision: main
path: production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
Phase 2: Observability - You Can't Fix What You Can't See
The Monitoring Stack
We built a comprehensive monitoring stack using the Prometheus ecosystem:
# Prometheus Stack Components
monitoring-stack:
prometheus-operator: v0.68.0
prometheus: v2.45.0
alertmanager: v0.26.0
grafana: v10.0.0
thanos: v0.32.0
loki: v2.9.0
tempo: v2.2.0
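With the Prometheus Operator in the stack, scrape targets are declared as ServiceMonitor resources that live alongside each workload. A minimal sketch for a hypothetical api-service exposing metrics on a port named metrics (the labels and interval are illustrative):
# Example ServiceMonitor (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: monitoring
  labels:
    release: prometheus    # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api-service
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics        # named port on the Service
      path: /metrics
      interval: 30s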
Custom Metrics That Matter
Beyond standard metrics, we track business-critical KPIs:
// Custom metrics example
package metrics
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
RequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "app_request_duration_seconds",
Help: "Duration of HTTP requests in seconds",
Buckets: []float64{0.001, 0.01, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0},
},
[]string{"service", "method", "endpoint", "status"},
)
BusinessTransactions = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "app_business_transactions_total",
Help: "Total number of business transactions",
},
[]string{"service", "transaction_type", "status"},
)
ErrorBudgetRemaining = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "slo_error_budget_remaining_ratio",
Help: "Remaining error budget as a ratio",
},
[]string{"service", "slo_name"},
)
)
SLOs and Error Budgets
We define Service Level Objectives (SLOs) for all critical services:
# SLO Definition
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: api-service-slo
spec:
service: "api"
labels:
team: "platform"
slos:
- name: "requests-availability"
objective: 99.9
description: "99.9% of requests should be successful"
sli:
events:
error_query: |
sum(rate(app_request_duration_seconds_count{service="api",status=~"5.."}[5m]))
total_query: |
sum(rate(app_request_duration_seconds_count{service="api"}[5m]))
alerting:
name: APIServiceHighErrorRate
page_alert:
labels:
severity: critical
ticket_alert:
labels:
severity: warning
Phase 3: Deployment Strategies - Rolling Out Without Rolling Over
Progressive Delivery with Flagger
We use Flagger for automated canary deployments and progressive traffic shifting:
# Canary Deployment Configuration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-service
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-service
service:
port: 80
targetPort: 8080
gateways:
- public-gateway
hosts:
- api.elseblock.io
analysis:
interval: 1m
threshold: 10
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
webhooks:
- name: load-test
url: http://flagger-loadtester/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://api-service-canary/"
Blue-Green Deployments for Critical Services
For services that can't tolerate any risk, we use blue-green deployments:
#!/bin/bash
# Blue-Green Deployment Script
NAMESPACE="production"
SERVICE="critical-service"
NEW_VERSION=$1
echo "Deploying $SERVICE version $NEW_VERSION as green..."
# Deploy green version
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: ${SERVICE}-green
namespace: ${NAMESPACE}
spec:
replicas: 3
selector:
matchLabels:
app: ${SERVICE}
version: green
template:
metadata:
labels:
app: ${SERVICE}
version: green
spec:
containers:
- name: app
image: elseblock/${SERVICE}:${NEW_VERSION}
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 5
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
EOF
# Wait for green deployment to be ready
kubectl wait --for=condition=available \
--timeout=600s \
deployment/${SERVICE}-green \
-n ${NAMESPACE}
# Run smoke tests
./run-smoke-tests.sh ${SERVICE}-green
if [ $? -eq 0 ]; then
echo "Smoke tests passed. Switching traffic to green..."
# Update service selector to green
kubectl patch service ${SERVICE} -n ${NAMESPACE} \
-p '{"spec":{"selector":{"version":"green"}}}'
echo "Traffic switched. Monitoring for 5 minutes..."
sleep 300
# Check error rate (prometheus_query is a small helper that queries the Prometheus HTTP API and returns the result as a number)
ERROR_RATE=$(prometheus_query "rate(app_errors_total{service=\"${SERVICE}\"}[5m])")
if (( $(echo "$ERROR_RATE < 0.001" | bc -l) )); then
echo "Deployment successful. Cleaning up blue deployment..."
kubectl delete deployment ${SERVICE}-blue -n ${NAMESPACE}
kubectl patch deployment ${SERVICE}-green -n ${NAMESPACE} \
--type='json' -p='[{"op": "replace", "path": "/metadata/name", "value":"'${SERVICE}'-blue"}]'
else
echo "High error rate detected! Rolling back..."
kubectl patch service ${SERVICE} -n ${NAMESPACE} \
-p '{"spec":{"selector":{"version":"blue"}}}'
kubectl delete deployment ${SERVICE}-green -n ${NAMESPACE}
exit 1
fi
else
echo "Smoke tests failed. Aborting deployment."
kubectl delete deployment ${SERVICE}-green -n ${NAMESPACE}
exit 1
fi
Phase 4: Resilience - Preparing for the Worst
Chaos Engineering with Litmus
We regularly test our system's resilience using chaos experiments:
# Chaos Experiment: Node Failure
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: node-failure-chaos
spec:
appinfo:
appns: production
applabel: app=critical-service
chaosServiceAccount: litmus-admin
experiments:
- name: node-drain
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: '300'
- name: NODE_LABEL
value: 'role=application'
- name: DRAIN_TIMEOUT
value: '90'
probe:
- name: check-service-availability
type: httpProbe
httpProbe/inputs:
url: https://api.elseblock.io/health
insecureSkipVerify: false
responseTimeout: 5000
method:
get:
criteria: ==
responseCode: "200"
mode: Continuous
runProperties:
probeTimeout: 5
interval: 2
retry: 3
Disaster Recovery and Backup
We maintain comprehensive backup and recovery procedures:
# Velero Backup Configuration
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
hooks:
resources:
- name: database-backup
includedNamespaces:
- production
labelSelector:
matchLabels:
component: database
pre:
- exec:
container: postgres
command:
- /bin/bash
- -c
- "pg_dump -U $POSTGRES_USER $POSTGRES_DB > /backup/db-$(date +%Y%m%d).sql"
onError: Fail
timeout: 10m
includedNamespaces:
- production
- monitoring
excludedResources:
- events
- events.events.k8s.io
ttl: 720h # 30 days retention
storageLocation: s3-backup
volumeSnapshotLocations:
- aws-ebs-snapshots
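Backups are only useful if the restore path works, so it helps to spell that side out too. A hedged sketch of a Velero restore of the production namespace from one of the scheduled backups (the backup name here is illustrative):
# Example Velero restore (illustrative backup name)
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: production-restore
  namespace: velero
spec:
  backupName: daily-backup-20240101020000
  includedNamespaces:
    - production
  restorePVs: true         # also restore volumes from the EBS snapshots
The same restore can be triggered from the CLI with velero restore create --from-backup and the backup name.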
Phase 5: Incident Response - When Things Go Wrong
Automated Incident Management
We've automated much of our incident response process:
# Incident Response Bot
import asyncio
import os
from datetime import datetime
from typing import Dict, List
import aiohttp
from kubernetes import client, config
class IncidentResponseBot:
def __init__(self):
self.k8s_client = self._init_k8s_client()
self.slack_webhook = os.environ['SLACK_WEBHOOK']
self.pagerduty_key = os.environ['PAGERDUTY_KEY']
async def handle_alert(self, alert: Dict):
severity = alert['labels']['severity']
service = alert['labels']['service']
# Create incident record
incident_id = await self.create_incident(alert)
# Execute automatic remediation if available
if remediation := self.get_remediation_action(alert):
await self.execute_remediation(remediation, incident_id)
# Escalate based on severity
if severity == 'critical':
await self.page_oncall(alert, incident_id)
elif severity == 'warning':
await self.notify_slack(alert, incident_id)
# Start collecting diagnostic data
asyncio.create_task(
self.collect_diagnostics(service, incident_id)
)
async def execute_remediation(self, action: str, incident_id: str):
remediation_actions = {
'restart_pod': self.restart_unhealthy_pods,
'scale_up': self.scale_deployment,
'clear_cache': self.clear_application_cache,
'rotate_credentials': self.rotate_credentials,
'failover_database': self.initiate_db_failover,
}
if handler := remediation_actions.get(action):
try:
await handler()
await self.update_incident(
incident_id,
f"Automatic remediation '{action}' executed successfully"
)
except Exception as e:
await self.update_incident(
incident_id,
f"Automatic remediation '{action}' failed: {str(e)}"
)
await self.escalate_to_human(incident_id)
async def collect_diagnostics(self, service: str, incident_id: str):
diagnostics = {
'timestamp': datetime.utcnow().isoformat(),
'service': service,
'pod_logs': await self.get_pod_logs(service),
'metrics': await self.get_metrics_snapshot(service),
'events': await self.get_k8s_events(service),
'traces': await self.get_distributed_traces(service),
}
# Store diagnostics for post-mortem
await self.store_diagnostics(incident_id, diagnostics)
Runbook Automation
We've codified our runbooks into executable automation:
# Automated Runbook Example
apiVersion: v1
kind: ConfigMap
metadata:
name: runbook-high-memory
data:
runbook.yaml: |
name: High Memory Usage
trigger:
alert: HighMemoryUsage
threshold: 85
steps:
- name: identify_memory_consumers
action: execute
command: |
kubectl top pods -n production --sort-by=memory | head -10
- name: check_for_memory_leaks
action: analyze
metrics:
- container_memory_usage_bytes
- container_memory_working_set_bytes
duration: 1h
- name: restart_if_leak_detected
action: conditional
condition: memory_leak_detected
true_action:
restart_deployment:
graceful: true
max_surge: 1
max_unavailable: 0
false_action:
scale_horizontally:
max_replicas: 10
- name: notify_team
action: notify
channels:
- slack: '#platform-alerts'
- email: '[email protected]'
message: "High memory usage detected and remediated"
Results: The 99.9% Achievement
After implementing these strategies, our metrics tell a compelling story:
Before vs. After
| Metric | Before | After | Improvement |
|---|---|---|---|
| Uptime | 97.5% | 99.94% | +2.44% |
| MTTR | 2.5 hours | 12 minutes | -92% |
| Deployment Failures | 15% | 0.3% | -98% |
| Incident Volume | 45/month | 3/month | -93.3% |
| On-Call Pages | 28/month | 2/month | -92.9% |
Cost Optimization
Reliability improvements also led to cost savings:
- 70% reduction in incident-related overtime
- 40% infrastructure cost reduction through better resource utilization
- 85% reduction in customer credits due to SLA violations
Key Lessons Learned
1. Observability is Non-Negotiable
You cannot operate what you cannot observe. Invest heavily in monitoring, logging, and tracing from day one. The cost of comprehensive observability is a fraction of the cost of downtime.
2. Automate Everything, But Stay in Control
Automation is powerful, but it needs circuit breakers. Every automated system should have:
- Manual override capabilities
- Audit logging
- Rollback mechanisms
- Human approval gates for critical actions
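One way to get an approval gate with the GitOps setup described earlier is simply to leave syncPolicy.automated off for the most critical applications: changes still flow through Git, but nothing is applied until a human reviews the diff and triggers the sync, and the Git history plus Argo CD's audit log record who approved it. A minimal sketch, assuming a hypothetical payments application:
# Example: Argo CD Application with a manual approval gate (no automated sync)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/elseblock/k8s-manifests
    targetRevision: main
    path: production/payments   # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  # Intentionally no syncPolicy.automated: deploys require an explicit
  # Sync from the UI or from the argocd CLI.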
3. Practice Failures Before They Find You
Our chaos engineering exercises have prevented countless production incidents. Regular failure injection helps you:
- Discover unknown failure modes
- Validate your assumptions
- Train your team in a safe environment
- Build confidence in your systems
4. Culture Matters as Much as Technology
Technology alone doesn't create reliability. You need:
- Blameless post-mortems
- Shared ownership of reliability
- Investment in learning and improvement
- Recognition that reliability is a feature, not a nice-to-have
5. Error Budgets Change Everything
Implementing error budgets transformed our culture:
- Product and engineering now share reliability goals
- Teams can make informed risk/speed trade-offs
- Innovation is encouraged within acceptable risk bounds
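For reference, the remaining-budget gauge exported by the metrics package above can also be derived directly in Prometheus. A hedged sketch of a recording rule for the api availability SLO over a rolling 30-day window (the rule and label names are illustrative):
# Example recording rule: remaining error budget for a 99.9% SLO (illustrative)
groups:
  - name: slo-api-error-budget
    rules:
      - record: slo:error_budget_remaining:ratio
        labels:
          service: api
          slo_name: requests-availability
        expr: |
          1 - (
            (
              sum(rate(app_request_duration_seconds_count{service="api",status=~"5.."}[30d]))
              /
              sum(rate(app_request_duration_seconds_count{service="api"}[30d]))
            )
            /
            (1 - 0.999)
          )
In practice a tool like Sloth (used above) generates rules like this for you; the point here is just to make the arithmetic visible. When the value approaches zero the budget is spent, and the trade-off flips from shipping features to paying down reliability work.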
The Road Ahead
We're not stopping at 99.9%. Our roadmap includes:
Service Mesh with Istio
We're implementing Istio for:
- Advanced traffic management
- End-to-end encryption
- Fine-grained observability
- Circuit breaking and retry logic
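As a taste of the circuit-breaking piece, a DestinationRule along these lines (illustrative thresholds, assuming the api-service host) caps connection pools and ejects endpoints that keep returning 5xx responses:
# Example DestinationRule with circuit breaking (illustrative thresholds)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service
  namespace: production
spec:
  host: api-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5    # eject after five consecutive server errors
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50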
Multi-Region Active-Active
Moving beyond single-region availability to true global resilience.
AI-Powered Incident Prevention
Using machine learning to predict and prevent incidents before they occur.
Conclusion
Achieving 99.9% uptime isn't about any single technology or practice—it's about building a comprehensive reliability culture supported by the right tools, processes, and people. Every organization's journey will be different, but the principles remain the same: observe everything, automate wisely, prepare for failure, and never stop improving.
The investment in reliability pays dividends not just in reduced downtime, but in team happiness, customer trust, and the ability to innovate with confidence. When your platform is rock-solid, your engineers can focus on building features instead of fighting fires.
Remember: perfection is impossible, but excellence is achievable. Start where you are, measure everything, and improve incrementally. Your future on-call engineers will thank you.
Are you on a similar reliability journey? We'd love to hear about your experiences and challenges. Reach out to our team to share stories or discuss how we can help improve your infrastructure reliability.