Chaos Engineering
Automated Experiments
Chaos Mesh Configuration
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: multi-cloud-latency
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'payment-service'
  delay:
    latency: '100ms'
    correlation: '100'
    jitter: '0ms'
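Because this section is about automated experiments, it may help to see how the same delay could be run on a recurring basis. The following is a minimal sketch that wraps the NetworkChaos spec in a Chaos Mesh Schedule resource. The resource name, the cron expression, and the 5-minute duration are illustrative assumptions, not values from the original configuration.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: scheduled-multi-cloud-latency   # hypothetical name
spec:
  schedule: '0 3 * * 1'                 # assumed cadence: Mondays at 03:00
  historyLimit: 3                       # keep the last three runs for review
  concurrencyPolicy: 'Forbid'           # do not start a run while one is active
  type: 'NetworkChaos'
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'payment-service'
    delay:
      latency: '100ms'
      correlation: '100'
      jitter: '0ms'
    duration: '5m'                      # assumed; bounds each injection
```

Setting a duration means each run ends on its own, so nobody has to delete the chaos object by hand.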
Multi-Cloud Resilience
AWS Fault Injection
apiVersion: fis.aws.k8s.aws/v1alpha1
kind: Experiment
metadata:
  name: availability-zone-failure
spec:
  description: "Simulate AZ failure"
  targets:
    - name: instances
      resourceType: aws:ec2:instance
      selectionMode: ALL
      filters:
        - path: Placement.AvailabilityZone
          values:
            - us-west-2a
  actions:
    - name: stop-instances
      actionId: aws:ec2:stop-instances
  stopConditions:
    - source: none
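With the stop condition source set to none, nothing halts the experiment automatically. AWS Fault Injection Service also accepts a CloudWatch alarm as a stop condition. The fragment below sketches how that might look, assuming this manifest mirrors the fields of an FIS experiment template. The alarm name and the account ID in the ARN are placeholders.

```yaml
spec:
  stopConditions:
    # Halt the experiment if this alarm enters the ALARM state.
    - source: aws:cloudwatch:alarm
      # Placeholder ARN: replace the account ID and alarm name with real values.
      value: arn:aws:cloudwatch:us-west-2:123456789012:alarm:payment-error-rate
```

An alarm on a user-facing metric, such as error rate or latency, is usually a more useful guard than one on infrastructure health alone.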
Service Resilience Testing
LitmusChaos Experiments
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: service-disruption
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=payment'
    appkind: 'deployment'
  chaosServiceAccount: chaos-admin
  monitoring: true
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
Metrics Collection
Prometheus Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-metrics
spec:
  groups:
    - name: chaos.rules
      rules:
        - record: chaos_experiment_status
          expr: sum(rate(chaos_experiment_complete[5m])) by (result, experiment)
        - alert: ChaosExperimentFailure
          expr: chaos_experiment_status{result="failed"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Chaos experiment failed"
Best Practices
Experiment Design
Start small
Hypothesis-driven
Blast radius control
Automated rollback
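As one way to apply these points, the sketch below limits a Chaos Mesh delay experiment to a fixed percentage of matching pods and gives it a bounded duration, so the fault is removed automatically when the time is up. The resource name, the staging namespace, the 10 percent value, and the 2-minute duration are assumptions chosen for illustration.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency-small-blast   # hypothetical name
spec:
  action: delay
  mode: fixed-percent                 # affect only a share of matching pods
  value: '10'                         # assumed: 10 percent of selected pods
  selector:
    namespaces:
      - staging                       # assumed: start outside production
    labelSelectors:
      'app': 'payment-service'
  delay:
    latency: '100ms'
  duration: '2m'                      # assumed; injection ends after 2 minutes
```

The hypothesis itself, for example "checkout success rate stays above its target with 100ms of added latency", still has to be written down and checked against metrics. The manifest only controls scope and duration.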
Monitoring
Real-time metrics
Business KPIs
User impact
System resilience
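To tie user impact to an experiment, an alert on a user-facing signal can run alongside the chaos status rules. The following is a minimal sketch of such a PrometheusRule. The metric name http_requests_total, its app and code labels, the 1 percent threshold, and the rule names are all assumptions, and should be replaced with whatever the service actually exports.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-user-impact             # hypothetical name
spec:
  groups:
    - name: chaos.user-impact.rules
      rules:
        - alert: PaymentErrorRateHigh
          # Ratio of 5xx responses to all responses over 5 minutes.
          # Metric and label names are assumed, not taken from the source.
          expr: |
            sum(rate(http_requests_total{app="payment-service", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{app="payment-service"}[5m]))
              > 0.01
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Payment service error rate above 1% (check running chaos experiments)"
```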
Documentation
Experiment results
Lessons learned
Remediation steps
System improvements
Team Culture
Blameless postmortems
Regular gamedays
Knowledge sharing
Continuous learning