Fault Injection Testing

Last updated 5 days ago

Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability. The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.

When To Use

Problem Addressed

Systems need to be resilient to the conditions that cause inevitable production disruptions. Modern applications are built with an increasing number of dependencies on infrastructure, platform, network, third-party software, APIs, and so on. Each dependency increases the risk of impact from disruptions: any dependent component may fail, and its interactions with other components may propagate the failure.

Fault injection methods are a way to increase coverage and validate software robustness and error handling, either at build-time or at run-time, with the intention of "embracing failure" as part of the development lifecycle. These methods assist engineering teams in designing and continuously validating for failure: accounting for known and unknown failure conditions, architecting for redundancy, employing retry and back-off mechanisms, etc.
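As an illustration of validating a retry and back-off mechanism with an injected fault, here is a minimal Python sketch. The names `TransientError`, `make_flaky`, `call_with_retry` and `fetch_balance` are hypothetical and not part of any library; the `sleep` parameter is injectable so the example runs without real delays.

```python
import random
import time


class TransientError(Exception):
    """Raised by the wrapped dependency to simulate an intermittent fault."""


def make_flaky(func, failures):
    """Wrap func so that its first `failures` calls raise TransientError."""
    state = {"remaining": failures}

    def wrapper(*args, **kwargs):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientError("injected fault")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying on TransientError with exponential back-off and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the failure
            # The delay doubles on each attempt; jitter spreads out retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


def fetch_balance():
    """Stand-in for a call to a remote dependency."""
    return 42


if __name__ == "__main__":
    recorded_delays = []
    flaky = make_flaky(fetch_balance, failures=2)
    # Record the delays instead of sleeping so the demo finishes instantly.
    result = call_with_retry(flaky, sleep=recorded_delays.append)
    print(result, len(recorded_delays))
```

The same wrapper can be pointed at a real client in a test environment to check that callers survive a bounded number of transient failures and give up cleanly once the retry budget is spent.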

Applicable to

  • Software - Error handling code paths, in-process memory management.

    • Example tests: Edge-case unit/integration tests and/or load tests (e.g. stress and soak).

  • Protocol - Vulnerabilities in communication interfaces such as command line parameters or APIs.

    • Example tests: Fuzzing provides invalid, unexpected, or random data as input, which lets us assess the level of protocol stability of a component.

  • Infrastructure - Outages, networking issues, hardware failures.

    • Example tests: Using different methods to cause faults in the underlying infrastructure, such as shutting down virtual machine (VM) instances, crashing processes, expiring certificates, introducing network latency, etc. This level of testing relies on observing statistical metrics over time and measuring the deviation of the observed behavior during a fault, or the recovery time.

  • Cloud-Native Applications - Microservice failures, container orchestration issues, auto-scaling errors.

    • Example tests: Targeted pod termination, network policy restrictions, resource throttling, and service mesh fault injection.
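To make the protocol-level item above more concrete, the following is a minimal sketch of a random-input fuzz harness in Python. `parse_port` is a hypothetical function under test and the alphabet is an arbitrary choice; dedicated fuzzers are far more sophisticated (coverage guidance, corpus management), so treat this only as an illustration of the idea.

```python
import random


def parse_port(text):
    """Hypothetical function under test: parse a TCP port number from text."""
    value = int(text.strip())
    if not 0 < value < 65536:
        raise ValueError(f"port out of range: {value}")
    return value


def random_input(rng, max_len=12):
    """Build a random string mixing digits, signs, whitespace and odd characters."""
    alphabet = "0123456789-+ \t\n.eEx\x00\u00e9"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def fuzz(target, iterations=5000, seed=1234):
    """Feed random inputs to target and collect every undocumented failure."""
    rng = random.Random(seed)  # a fixed seed keeps a failing run reproducible
    crashes = []
    for _ in range(iterations):
        data = random_input(rng)
        try:
            result = target(data)
            # Property check: any accepted input must yield a valid port.
            assert 0 < result < 65536
        except ValueError:
            pass  # documented rejection path
        except Exception as exc:  # anything else is an unexpected failure mode
            crashes.append((data, repr(exc)))
    return crashes


if __name__ == "__main__":
    found = fuzz(parse_port)
    print(f"{len(found)} unexpected failures")
```

Each recorded crash keeps the offending input, so it can be turned into a regression test once the defect is fixed.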

How to Use

Architecture

Terminology

  • Fault - The adjudged or hypothesized cause of an error.

  • Error - That part of the system state that may cause a subsequent failure.

  • Failure - An event that occurs when the delivered service deviates from correct state.

  • Blast Radius - The scope of impact that a fault might have on the system or users.

  • Game Day - A scheduled exercise where teams deliberately inject faults into production systems under controlled conditions.

Fault Injection Testing Basics

Fault injection is an advanced form of testing in which the system is subjected to different failure modes. The testing engineer may know the expected outcome in advance, as in release validation tests, or may be exploring to find potential issues in the product that should be mitigated.

Fault Injection and Chaos Engineering

Fault injection testing is a specific approach to testing one condition. It introduces a failure into a system to validate its robustness. Chaos engineering, coined by Netflix, is a practice for generating new information. There is an overlap in concerns and often in tooling between the terms, and many times chaos engineering uses fault injection to introduce the required effects to the system.

High-level Step-by-step

Fault injection testing in the development cycle

  • Using fuzzing tools in CI.

  • Execute existing end-to-end scenario tests (such as integration or stress tests), which are augmented with fault injection.

  • Write regression and acceptance tests based on issues that were found and fixed or based on resolved service incidents.

  • Ad-hoc (manual) validations of fault in the dev environment for new features.
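One way to augment an existing test with fault injection, sketched below in Python, is to patch a dependency so that it fails on demand and then assert on the fallback behavior. `RateService` and `PriceCalculator` are hypothetical classes invented for this example; only `unittest.mock` from the standard library is used to inject the fault.

```python
from unittest import mock


class RateService:
    """Hypothetical client for a remote exchange-rate dependency."""

    def get_rate(self, currency):
        raise NotImplementedError("network call omitted in this sketch")


class PriceCalculator:
    """Converts prices, falling back to the last known rate on a timeout."""

    def __init__(self, rates):
        self.rates = rates
        self.cache = {}

    def convert(self, amount, currency):
        try:
            rate = self.rates.get_rate(currency)
            self.cache[currency] = rate
        except TimeoutError:
            if currency not in self.cache:
                raise  # nothing to fall back to
            rate = self.cache[currency]  # degrade gracefully with stale data
        return round(amount * rate, 2)


def test_timeout_uses_cached_rate():
    rates = RateService()
    calc = PriceCalculator(rates)
    # The first call succeeds; the second call has a timeout injected.
    faults = [1.25, TimeoutError("injected")]
    with mock.patch.object(rates, "get_rate", side_effect=faults):
        assert calc.convert(100, "EUR") == 125.0
        assert calc.convert(200, "EUR") == 250.0


def test_timeout_without_cache_propagates():
    rates = RateService()
    calc = PriceCalculator(rates)
    with mock.patch.object(rates, "get_rate", side_effect=TimeoutError("injected")):
        try:
            calc.convert(100, "EUR")
        except TimeoutError:
            return
        raise AssertionError("expected TimeoutError to propagate")


if __name__ == "__main__":
    test_timeout_uses_cached_rate()
    test_timeout_without_cache_propagates()
    print("fault injection tests passed")
```

Tests written this way double as regression tests: when an incident reveals an unhandled failure mode, the same patching approach can reproduce it in CI.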

Cloud-Native Fault Injection Patterns

Modern cloud-native applications run in complex, distributed environments and require specialized fault injection approaches:

  1. Infrastructure Layer

    • Terminate or restart compute instances (VMs, containers)

    • Simulate region or availability zone failures

    • Introduce network partitioning between zones or regions

    • Exhaust resources (CPU, memory, disk) on nodes

  2. Platform Layer

    • Corrupt or delay API responses from cloud services

    • Simulate cloud provider throttling

    • Introduce latency in managed service interactions

    • Force failover of managed databases or cache services

  3. Application Layer

    • Inject errors into service-to-service communication

    • Degrade performance of specific microservices

    • Simulate partial failures in distributed transactions

    • Force circuit breaker activation in resilience patterns
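The last application-layer item, forcing circuit breaker activation, can be illustrated with a small Python sketch. The `CircuitBreaker` class below is a simplified, hypothetical implementation written for this example (production code would normally rely on an established resilience library); the injectable `clock` lets the demo simulate elapsed time without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def failing_dependency():
    """Injected fault: the downstream service is unreachable."""
    raise ConnectionError("injected fault")


if __name__ == "__main__":
    now = [0.0]  # fake clock, advanced manually
    breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0, clock=lambda: now[0])
    for _ in range(3):
        try:
            breaker.call(failing_dependency)
        except ConnectionError:
            pass
    try:
        breaker.call(lambda: "ok")
    except CircuitOpenError as exc:
        print(exc)
    now[0] = 31.0  # simulate the reset timeout elapsing
    print(breaker.call(lambda: "ok"))
```

A fault injection test for this pattern asserts three things: the breaker opens after the configured number of failures, calls fail fast while it is open, and traffic resumes after the reset timeout.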

Multi-Cloud Fault Injection Examples

Fault injection approaches vary slightly between cloud providers:

AWS:

# AWS FIS (Fault Injection Service) experiment template example
# Assumes the target functions are already configured for FIS Lambda fault injection
{
  "description": "Invocation error test on Lambda functions",
  "targets": {
    "paymentFunctions": {
      "resourceType": "aws:lambda:function",
      "resourceTags": {
        "Application": "payment-processor"
      },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "injectInvocationErrors": {
      "actionId": "aws:lambda:invocation-error",
      "parameters": {
        "duration": "PT5M",
        "invocationPercentage": "50"
      },
      "targets": {
        "Functions": "paymentFunctions"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole"
}

Azure:

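# Azure Chaos Studio experiment template example
# Values in angle brackets are placeholders for your own subscription, resource group and cluster
{
  "properties": {
    "selectors": [
      {
        "type": "List",
        "id": "aksSelector",
        "targets": [
          {
            "type": "ChaosTarget",
            "id": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerService/managedClusters/<cluster-name>/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
          }
        ]
      }
    ],
    "steps": [
      {
        "name": "Terminate AKS Pods",
        "branches": [
          {
            "name": "Kill Pods",
            "actions": [
              {
                "type": "continuous",
                "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2",
                "selectorId": "aksSelector",
                "duration": "PT5M",
                "parameters": [
                  {
                    "key": "jsonSpec",
                    "value": "{\"action\":\"pod-kill\",\"mode\":\"fixed\",\"value\":\"2\",\"selector\":{\"namespaces\":[\"api-services\"],\"labelSelectors\":{\"app\":\"payment-api\"}}}"
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}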

GCP:

# Example using Chaos Mesh on GKE
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-api-network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - payment-services
    labelSelectors:
      app: "payment-gateway"
  delay:
    latency: "200ms"
    correlation: "25"
    jitter: "50ms"
  duration: "5m"

LLM-Assisted Resilience Testing

Large Language Models (LLMs) can enhance fault injection testing in several ways:

  1. Automated Scenario Generation

    • LLMs can analyze system architecture diagrams and generate targeted fault scenarios

    • They can identify non-obvious failure modes by examining system dependencies

  2. Intelligent Test Creation

    • Generate Chaos Mesh or Litmus Chaos experiments based on architecture descriptions

    • Create targeted fault injection scenarios for specific application concerns

  3. Root Cause Analysis

    • Process observability data during fault injection to identify cascading failures

    • Correlate metrics, logs, and traces to understand failure propagation

Here's an example of using an LLM to generate infrastructure fault scenarios:

# Example script using OpenAI to generate infrastructure fault scenarios
# Requires the openai Python package, version 1.0 or later, and PyYAML
import json
import os

import yaml
from openai import OpenAI

# Read the API key from the environment rather than hard-coding it
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_fault_scenarios(system_description):
    """Generate fault scenarios based on system description."""
    prompt = f"""
    Based on the following system description, generate 3 realistic fault injection scenarios
    that would test the system's resilience. For each scenario, include:
    1. The component or service being targeted
    2. The type of fault to inject
    3. The expected impact
    4. How to measure resilience
    5. A Kubernetes YAML for Litmus or Chaos Mesh to implement this test

    System description:
    {system_description}

    Format your response as JSON only, with no surrounding text.
    """

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a DevOps engineer specializing in chaos engineering and fault injection testing."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7
    )

    # The model is asked for JSON, but the reply is not guaranteed to parse
    content = response.choices[0].message.content
    try:
        return json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {content[:200]}") from exc

# Example usage
system = """
Our application is a microservices-based payment processing system deployed on Kubernetes with the following components:
- payment-api (3 replicas): Frontend API that receives payment requests from customers
- transaction-service (2 replicas): Validates and processes transactions
- fraud-detection (2 replicas): Analyzes transactions for fraudulent activity
- payment-db (PostgreSQL): Stores transaction data with a primary and read replica
- redis-cache (3 node cluster): Caches user session and transaction data
- message-queue (Kafka, 3 brokers): Handles asynchronous processing of events
"""

scenarios = generate_fault_scenarios(system)
print(yaml.dump(scenarios))

This would generate targeted fault scenarios that you can implement in your testing pipeline.

Fault injection testing in the release cycle

Much like Synthetic Monitoring Tests, fault injection testing in the release cycle is part of the Shift-Right testing approach, which uses safe methods to perform tests in a production or pre-production environment. Given the nature of distributed, cloud-based applications, it is very difficult to simulate the real behavior of services outside their production environment. Testers are encouraged to run tests where it really matters, on a live system with customer traffic.

Fault injection tests rely on metrics observability and are usually statistical. The following high-level steps outline a sample practice of fault injection and chaos engineering:

  • Measure and define a steady (healthy) state for the system's interoperability.

  • Create hypotheses based on predicted behavior when a fault is introduced.

  • Introduce real-world fault-events to the system.

  • Measure the state and compare it to the baseline state.

  • Document the process and the observations.

  • Identify and act on the result.
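The measure-and-compare steps above can be sketched in Python as follows. The thresholds (`max_error_increase`, `max_latency_ratio`) and the synthetic samples are assumptions chosen for illustration; in practice the samples would come from your telemetry system and the limits from your SLOs.

```python
def error_rate(samples):
    """Fraction of failed requests; samples is a list of (ok, latency_ms) tuples."""
    return sum(1 for ok, _ in samples if not ok) / len(samples)


def p95_latency(samples):
    """95th-percentile latency in milliseconds, using the nearest-rank method."""
    latencies = sorted(latency for _, latency in samples)
    rank = (95 * len(latencies) + 99) // 100  # integer ceil(0.95 * n)
    return latencies[rank - 1]


def evaluate_hypothesis(baseline, experiment, max_error_increase=0.01, max_latency_ratio=1.5):
    """Compare the fault window with the steady-state baseline."""
    error_delta = error_rate(experiment) - error_rate(baseline)
    latency_ratio = p95_latency(experiment) / p95_latency(baseline)
    return {
        "error_rate_delta": error_delta,
        "p95_latency_ratio": latency_ratio,
        # The hypothesis holds only if both deviations stay within limits.
        "hypothesis_holds": error_delta <= max_error_increase
        and latency_ratio <= max_latency_ratio,
    }


if __name__ == "__main__":
    # Synthetic measurements standing in for real telemetry.
    baseline = [(True, 100 + i) for i in range(100)]
    experiment = [(i % 50 != 0, 150 + i) for i in range(100)]
    print(evaluate_hypothesis(baseline, experiment))
```

Recording the returned dictionary alongside the experiment description covers the documentation step and gives a concrete number to act on when the hypothesis does not hold.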

Fault injection testing in Kubernetes

With the adoption of Kubernetes (k8s) as an infrastructure platform, fault injection testing in Kubernetes has become necessary to ensure that the system behaves reliably in the event of a fault or failure. A k8s cluster may run different types of workloads written in different languages, for example a microservice, a web app, and/or a scheduled job, so you need a mechanism to inject faults into any kind of workload running within the cluster. In addition, Kubernetes clusters are managed differently from traditional infrastructure, so the tools used for fault injection testing should be compatible with k8s infrastructure. These are the main characteristics required:

  • Ease of injecting faults into Kubernetes pods.

  • Support for fast tool installation within the cluster.

  • Support for YAML-based configurations, which work well with Kubernetes.

  • Ease of customization to add custom resources.

  • Support for workflows to deploy various workloads and faults.

  • Ease of maintainability of the tool.

  • Ease of integration with telemetry.
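Declarative configuration also makes experiments easy to generate from code. The sketch below builds a Chaos Mesh `PodChaos` manifest as a plain Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name `pod_kill_experiment` and the namespace and label values are hypothetical.

```python
import json


def pod_kill_experiment(name, namespace, target_namespace, labels, mode="one"):
    """Return a Chaos Mesh PodChaos manifest as a plain dict."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": mode,  # "one" selects a single matching pod
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": dict(labels),
            },
        },
    }


if __name__ == "__main__":
    manifest = pod_kill_experiment(
        name="payment-api-pod-kill",
        namespace="chaos-testing",
        target_namespace="payment-services",
        labels={"app": "payment-gateway"},
    )
    # Write this output to a file and apply it with kubectl.
    print(json.dumps(manifest, indent=2))
```

Generating manifests this way keeps experiment definitions reviewable in source control and lets a pipeline parameterize the target per environment.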

Best Practices and Advice

Experimenting in production has the benefit of running tests against a live system with real user traffic, ensuring its health, or building confidence in its ability to handle errors gracefully. However, it has the potential to cause unnecessary customer pain. A test can either succeed or fail. In the event of failure, there will likely be some impact on the production environment. Thinking about the Blast Radius of the effect, should the test fail, is a crucial step to conduct beforehand. The following practices may help minimize such risk:

  • Run tests in a non-production environment first. Understand how the system behaves in a safe environment, using synthetic workload, before introducing potential risk to customer traffic.

  • Use fault injection as gates in various stages through the CD pipeline.

  • Strive to achieve a balance between collecting actual result data while affecting as few production users as possible.

  • Use defensive design principles such as the circuit breaker and bulkhead patterns.

  • Agree on a budget (in terms of Service Level Objective (SLO)) as an investment in chaos and fault injection.

  • Grow the risk incrementally - Start with hardening the core and expand out in layers. At each point, progress should be locked in with automated regression tests.

  • Deploy and test on Blue/Green and Canary deployments. Use methods such as traffic shadowing (a.k.a. Dark Traffic) to get customer traffic to the staging slot.

Fault Injection Testing Frameworks and Tools

Fuzzing

  • OneFuzz - A Microsoft open-source, self-hosted fuzzing-as-a-service platform that is easy to integrate into CI pipelines.

  • AFL and WinAFL - Popular fuzzing tools from Google's Project Zero team, used locally to target binaries on Linux or Windows.

  • WebScarab - A web-focused fuzzer owned by OWASP, which can be found in Kali Linux distributions.

Chaos

  • Azure Chaos Studio - A tool for orchestrating controlled fault injection experiments on Azure resources.

  • Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit.

  • Kraken - An OpenShift-specific chaos tool, maintained by Red Hat.

  • Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure out of the box).

  • Simmy - A .NET library for chaos testing and fault injection, integrated with the Polly library for resilience engineering.

  • Litmus - A CNCF open source tool for chaos testing and fault injection for Kubernetes clusters.

  • This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing.

Conclusion

Fault injection techniques increase resilience and confidence in the products we ship. They are used across the industry to validate applications and platforms before and while they are delivered to customers. Fault injection is a powerful tool and should be used with caution. Cases such as the Cloudflare 30 minute global outage, which was caused by a deployment of code that was meant to be "dark launched", underscore the importance of curtailing the blast radius in the system during experiments.

From the principles of chaos: "The harder it is to disrupt the steady-state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large".

Fault-Error-Failure cycle - A key mechanism in dependability: A fault may cause an error. An error may cause further errors within the system boundary; therefore each new error acts as a fault. When error states are observed at the system boundary, they are termed failures. (Modeled by Laprie/Avizienis)

Fault injection is an effective way to find security bugs in software, so much so that the Microsoft Security Development Lifecycle requires fuzzing at every untrusted interface of every product, as well as penetration testing that includes introducing faults to the system, to uncover potential vulnerabilities resulting from coding errors, system configuration faults, or other operational deployment weaknesses.

Automated fault injection coverage in a CI pipeline promotes a Shift-Left approach of testing earlier in the lifecycle for potential issues.