Chaos Engineering

Automated Experiments

Chaos Mesh Configuration

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: multi-cloud-latency
spec:
  action: delay   # delay network packets for the selected pod
  mode: one       # inject into one randomly chosen matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'payment-service'
  delay:
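    latency: '100ms'     # fixed delay added to each packet
    correlation: '100'   # percent correlation with the previous packet's delay
    jitter: '0ms'        # no variation around the latency value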
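
To run the experiment above on a recurring basis rather than once, Chaos Mesh provides a Schedule resource that wraps a chaos spec. The following is a minimal sketch rather than a tested configuration: the resource name, the hourly cron expression, the history limit, and the 60s duration are assumptions chosen for illustration.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: payment-latency-hourly   # hypothetical name
spec:
  schedule: '0 * * * *'          # assumed cadence: top of every hour
  historyLimit: 5                # assumed: keep the last five runs
  concurrencyPolicy: Forbid      # skip a run if the previous one is still active
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'payment-service'
    delay:
      latency: '100ms'
      correlation: '100'
      jitter: '0ms'
    duration: '60s'              # assumed: each run ends after one minute
```

Setting concurrencyPolicy to Forbid keeps runs from stacking up if one takes longer than the schedule interval.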

Multi-Cloud Resilience

AWS Fault Injection

apiVersion: fis.aws.k8s.aws/v1alpha1
kind: Experiment
metadata:
  name: availability-zone-failure
spec:
  description: "Simulate AZ failure"
  targets:
    - name: instances
      resourceType: aws:ec2:instance
      selectionMode: ALL
      filters:
        - path: Placement.AvailabilityZone
          values:
            - us-west-2a
  actions:
    - name: stop-instances
      actionId: aws:ec2:stop-instances
      targets:
        Instances: instances   # bind the action to the target defined above
  stopConditions:
    - source: none
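
A stop condition of none means the experiment keeps running no matter how the system responds. AWS FIS also accepts a CloudWatch alarm as a stop-condition source, which halts the experiment when the alarm fires. The fragment below is a sketch that assumes the manifest mirrors the FIS experiment template fields; the account ID, region, and alarm name in the ARN are placeholders, not values from this document.

```yaml
  stopConditions:
    - source: aws:cloudwatch:alarm
      # Placeholder ARN: substitute an alarm that tracks user-facing health
      value: arn:aws:cloudwatch:us-west-2:123456789012:alarm:payment-error-rate
```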

Service Resilience Testing

LitmusChaos Experiments

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: service-disruption
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=payment'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: chaos-admin
  monitoring: true
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
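            # Total duration of the experiment, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Seconds to wait between successive pod deletions
            - name: CHAOS_INTERVAL
              value: '10'
            # 'false' requests graceful termination rather than a forced delete
            - name: FORCE
              value: 'false'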
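
The ChaosEngine above refers to a chaos-admin service account, which has to exist in the application namespace with enough permissions to run the experiment. The manifests below are a minimal sketch using standard Kubernetes RBAC resources. The rule set is an assumption about what pod-delete needs and should be checked against the experiment's own documentation before use.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-admin
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-admin
  namespace: default
rules:
  # Assumed permissions: manage target pods and record events
  - apiGroups: ['']
    resources: ['pods', 'events']
    verbs: ['create', 'get', 'list', 'patch', 'update', 'delete', 'deletecollection']
  - apiGroups: ['']
    resources: ['pods/log']
    verbs: ['get', 'list', 'watch']
  # Assumed permissions: look up the deployment that owns the target pods
  - apiGroups: ['apps']
    resources: ['deployments', 'replicasets']
    verbs: ['get', 'list']
  # Assumed permissions: run the experiment as a job
  - apiGroups: ['batch']
    resources: ['jobs']
    verbs: ['create', 'get', 'list', 'delete', 'deletecollection']
  # Assumed permissions: read and update the Litmus custom resources
  - apiGroups: ['litmuschaos.io']
    resources: ['chaosengines', 'chaosexperiments', 'chaosresults']
    verbs: ['create', 'get', 'list', 'patch', 'update', 'delete']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-admin
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-admin
subjects:
  - kind: ServiceAccount
    name: chaos-admin
    namespace: default
```

A namespaced Role is used here instead of a ClusterRole so that the service account cannot touch workloads outside the default namespace.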

Metrics Collection

Prometheus Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-metrics
spec:
  groups:
    - name: chaos.rules
      rules:
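        # Per-second rate of completed experiments over 5 minutes, by result and experiment
        - record: chaos_experiment_status
          expr: sum(rate(chaos_experiment_complete[5m])) by (result, experiment)
        # Fire when an experiment has reported failures continuously for 5 minutes
        - alert: ChaosExperimentFailure
          expr: chaos_experiment_status{result="failed"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Chaos experiment {{ $labels.experiment }} failed"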

Best Practices

  1. Experiment Design

    • Start small

    • Hypothesis-driven

    • Blast radius control

    • Automated rollback

  2. Monitoring

    • Real-time metrics

    • Business KPIs

    • User impact

    • System resilience

  3. Documentation

    • Experiment results

    • Lessons learned

    • Remediation steps

    • System improvements

  4. Team Culture

    • Blameless postmortems

    • Regular gamedays

    • Knowledge sharing

    • Continuous learning
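
The blast-radius and rollback points under Experiment Design can be expressed directly in a Chaos Mesh manifest. The sketch below is one possible approach, not a recommended configuration: it limits the fault to a fixed percentage of matching pods and sets a duration after which Chaos Mesh removes the injected fault. The resource name, the 10 percent figure, and the 60s duration are assumptions chosen for illustration.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency-limited   # hypothetical name
spec:
  action: delay
  mode: fixed-percent             # affect only a share of the matching pods
  value: '10'                     # assumed: 10 percent of matching pods
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'payment-service'
  delay:
    latency: '100ms'
  duration: '60s'                 # assumed: fault is lifted after one minute
```

Starting with a small percentage and a short duration, then widening both as confidence grows, follows the "start small" practice listed above.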
