DevOps help for Cloud Platform Engineers
  • Welcome!
  • Quick Start Guide
  • About Me
  • CV
  • 🧠DevOps & SRE Foundations
    • DevOps Overview
      • Engineering Fundamentals
      • Implementing DevOps Strategy
      • DevOps Readiness Assessment
      • Lifecycle Management
      • The 12 Factor App
      • Design for Self Healing
      • Incident Management Best Practices (2025)
    • SRE Fundamentals
      • Toil Reduction
      • System Simplicity
      • Real-world Scenarios
        • AWS VM Log Monitoring API
    • Agile Development
      • Team Agreements
        • Definition of Done
        • Definition of Ready
        • Team Manifesto
        • Working Agreement
    • Industry Scenarios
      • Finance and Banking
      • Public Sector (UK/EU)
      • Energy Sector Edge Computing
  • DevOps Practices
    • Platform Engineering
    • FinOps
    • Observability
      • Modern Practices
  • 🚀Modern DevOps Practices
    • Infrastructure Testing
    • Modern Development
    • Database DevOps
  • 🛠️Infrastructure as Code (IaC)
    • Terraform
      • Getting Started - Installation and initial setup [BEGINNER]
      • Cloud Integrations - Provider-specific implementations
        • Azure Scenarios
        • AWS Scenarios
        • GCP Scenarios
      • Testing and Validation - Ensuring infrastructure quality
        • Unit Testing
        • Integration Testing
        • End-to-End Testing
        • Terratest Guide
      • Best Practices - Production-ready implementation strategies
        • State Management
        • Security
        • Code Organization
        • Performance
      • Tools & Utilities - Enhancing the Terraform workflow
        • Terraform Docs
        • TFLint
        • Checkov
        • Terrascan
      • CI/CD Integration - Automating infrastructure deployment
        • GitHub Actions - GitHub-based automation workflows
        • Azure Pipelines - Azure DevOps integration
        • GitLab CI - GitLab-based deployment pipelines
    • Bicep
      • Getting Started - First steps with Bicep [BEGINNER]
      • Template Specs
      • Best Practices - Guidelines for effective Bicep implementations
      • Modules - Building reusable components [INTERMEDIATE]
      • Examples - Sample implementations for common scenarios
      • Advanced Features
      • CI/CD Integration - Automating Bicep deployments
        • GitHub Actions
        • Azure Pipelines
  • 💰Cost Management & FinOps
    • Cloud Cost Optimization
  • 🐳Containers & Orchestration
    • Containerization Overview
    • Docker
      • Dockerfile Best Practices
      • Docker Compose
    • Kubernetes
      • CLI Tools - Essential command-line utilities
        • Kubectl
        • Kubens
        • Kubectx
      • Core Concepts
      • Components
      • Best Practices
        • Pod Security
        • Security Monitoring
        • Resource Limits
      • Advanced Features - Beyond the basics [ADVANCED]
        • Service Mesh
        • Ingress Controllers
          • NGINX
          • Traefik
          • Kong
          • Gloo Edge
      • Troubleshooting - Diagnosing and resolving common issues
        • Pod Troubleshooting Commands
      • Enterprise Architecture
      • Health Management
      • Security & Compliance
      • Virtual Clusters
    • OpenShift
  • Service Mesh & Networking
    • Service Mesh Implementation
  • Architecture Patterns
    • Data Mesh
    • Multi-Cloud Networking
    • Disaster Recovery
    • Chaos Engineering
  • Edge Computing
    • Implementation Guide
    • Serverless Edge
    • IoT Edge Patterns
    • Real-Time Processing
    • Edge AI/ML
    • Security Hardening
    • Observability Patterns
    • Network Optimization
    • Storage Patterns
  • 🔄CI/CD & GitOps
    • CI/CD Overview
    • Continuous Integration
    • Continuous Delivery
      • Deployment Strategies
      • Secrets Management
      • Blue-Green Deployments
      • Deployment Metrics
      • Progressive Delivery
      • Release Management for DevOps/SRE (2025)
    • CI/CD Platforms - Tool selection and implementation
      • Azure DevOps
        • Pipelines
          • Stages
          • Jobs
          • Steps
          • Templates - Reusable pipeline components
          • Extends
          • Service Connections - External service authentication
          • Best Practices for 2025
          • Agents and Runners
          • Third-Party Integrations
          • Azure DevOps CLI
        • Boards & Work Items
      • GitHub Actions
      • GitLab
        • GitLab Runner
        • Real-life scenarios
        • Installation guides
        • Pros and Cons
        • Comparison with alternatives
    • GitOps
      • Modern GitOps Practices
      • Flux
        • Overview
        • Progressive Delivery
        • Use GitOps with Flux, GitHub and AKS
  • ☁️Cloud Platforms
    • Cloud Strategy
    • Azure
      • Best Practices
      • Landing Zones
      • Services
      • Monitoring
      • Administration Tools - Platform management interfaces
        • Azure PowerShell
        • Azure CLI
      • Tips & Tricks
    • AWS
      • Authentication
      • Best Practices
      • Tips & Tricks
    • Google Cloud
      • Services
    • Private Cloud
  • 🔐Security & Compliance
    • DevSecOps Overview
    • DevSecOps Pipeline Security
    • DevSecOps
      • Real-life Examples
      • Scanning & Protection - Automated security tooling
        • Dependency Scanning
        • Credential Scanning
        • Container Security Scanning
        • Static Code Analysis
          • Best Practices
          • Tool Integration Guide
          • Pipeline Configuration
      • CI/CD Security
      • Secrets Rotation
    • Supply Chain Security
      • SLSA Framework
      • Binary Authorization
      • Artifact Signing
    • Security Best Practices
      • Threat Modeling
      • Kubernetes Security
    • SecOps
    • Zero Trust Model
    • Cloud Compliance
      • ISO/IEC 27001:2022
      • ISO 22301:2019
      • PCI DSS
      • CSA STAR
    • Security Frameworks
    • SIEM and SOAR
  • Security Architecture
    • Zero Trust Implementation
      • Identity Management
      • Network Security
      • Access Control
  • 🔍Observability & Monitoring
    • Observability Fundamentals
    • Logging
    • Metrics
    • Tracing
    • Dashboards
    • SLOs and SLAs
    • Observability as Code
    • Pipeline Observability
  • 🧪Testing Strategies
    • Testing Overview
    • Modern Testing Approaches
    • End-to-End Testing
    • Unit Testing
    • Performance Testing
      • Load Testing
    • Fault Injection Testing
    • Integration Testing
    • Smoke Testing
  • 🤖AI Integration
    • AIops Overview
      • Workflow Automation
      • Predictive Analytics
      • Code Quality
  • 🧠AI & LLM Integration
    • Overview
    • Claude
      • Installation Guide
      • Project Guides
      • MCP Server Setup
      • LLM Comparison
    • Ollama
      • Installation Guide
      • Configuration
      • Models and Fine-tuning
      • DevOps Usage
      • Docker Setup
      • GPU Setup
      • Open WebUI
    • Copilot
      • Installation Guide
      • VS Code Integration
      • CLI Usage
    • Gemini
      • Installation Guides - Platform-specific setup
        • Linux Installation
        • WSL Installation
        • NixOS Installation
      • Gemini 2.5 Features
      • Roles and Agents
      • NotebookML Guide
      • Cloud Infrastructure Deployment
      • Summary
  • 💻Development Environment
    • Tools Overview
    • DevOps Tools
    • Operating Systems - Development platforms
      • NixOS
        • Installation
        • Nix Language Guide
        • DevEnv with Nix
        • Cloud Deployments
      • WSL2
        • Distributions
        • Terminal Setup
    • Editor Environments
    • CLI Tools
      • Azure CLI
      • PowerShell
      • Linux Commands
      • YAML Tools
  • 📚Programming Languages
    • Python
    • Go
    • JavaScript/TypeScript
    • Java
    • Rust
  • 📖Documentation Best Practices
    • Documentation Strategy
    • Project Documentation
    • Release Notes
    • Static Sites
    • Documentation Templates
    • Real-World Examples
  • 📋Reference Materials
    • Glossary
    • Tool Comparison
    • Recommended Reading
    • Troubleshooting Guide
  • Platform Engineering
    • Implementation Guide
  • FinOps
    • Implementation Guide
  • AIOps
    • LLMOps Guide
  • Development Setup
    • Development Setup
Powered by GitBook
On this page
  • When To Use
  • How to Use
  • Best Practices and Advice
  • Fault Injection Testing Frameworks and Tools
  • Conclusion
Edit on GitHub
  1. Testing Strategies

Fault Injection Testing

PreviousLoad TestingNextIntegration Testing

Last updated 1 year ago

Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its . The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.

When To Use

Problem Addressed

Systems need to be resilient to the conditions that caused inevitable production disruptions. Modern applications are built with an increasing number of dependencies; on infrastructure, platform, network, 3rd party software or APIs, etc. Such systems increase the risk of impact from dependency disruptions. Each dependent component may fail. Furthermore, its interactions with other components may propagate the failure.

Fault injection methods are a way to increase coverage and validate software robustness and error handling, either at build-time or at run-time, with the intention of "embracing failure" as part of the development lifecycle. These methods assist engineering teams in designing and continuously validating for failure, accounting for known and unknown failure conditions, architect for redundancy, employ retry and back-off mechanisms, etc.

Applicable to

  • Software - Error handling code paths, in-process memory management.

    • Example tests: Edge-case unit/integration tests and/or load tests (i.e. stress and soak).

  • Protocol - Vulnerabilities in communication interfaces such as command line parameters or APIs.

    • Example tests: provides invalid, unexpected, or random data as input we can assess the level of protocol stability of a component.

  • Infrastructure - Outages, networking issues, hardware failures.

    • Example tests: Using different methods to cause fault in the underlying infrastructure such as Shut down virtual machine (VM) instances, crash processes, expire certificates, introduce network latency, etc. This level of testing relies on statistical metrics observations over time and measuring the deviations of its observed behavior during fault, or its recovery time.

How to Use

Architecture

Terminology

  • Fault - The adjudged or hypothesized cause of an error.

  • Error - That part of the system state that may cause a subsequent failure.

  • Failure - An event that occurs when the delivered service deviates from correct state.

Fault Injection Testing Basics

Fault injection is an advanced form of testing where the system is subjected to different failure modes, and where the testing engineer may know in advance what is the expected outcome, as in the case of release validation tests, or in an exploration to find potential issues in the product, which should be mitigated.

Fault Injection and Chaos Engineering

Fault injection testing is a specific approach to testing one condition. It introduces a failure into a system to validate its robustness. Chaos engineering, coined by Netflix, is a practice for generating new information. There is an overlap in concerns and often in tooling between the terms, and many times chaos engineering uses fault injection to introduce the required effects to the system.

High-level Step-by-step

Fault injection testing in the development cycle

  • Using fuzzing tools in CI.

  • Execute existing end-to-end scenario tests (such as integration or stress tests), which are augmented with fault injection.

  • Write regression and acceptance tests based on issues that were found and fixed or based on resolved service incidents.

  • Ad-hoc (manual) validations of fault in the dev environment for new features.

Fault injection testing in the release cycle

Much like Synthetic Monitoring Tests, fault injection testing in the release cycle is a part of Shift-Right testing approach, which uses safe methods to perform tests in a production or pre-production environment. Given the nature of distributed, cloud-based applications, it is very difficult to simulate the real behavior of services outside their production environment. Testers are encouraged to run tests where it really matters, on a live system with customer traffic.

Fault injection tests rely on metrics observability and are usually statistical; The following high-level steps provide a sample of practicing fault injection and chaos engineering:

  • Measure and define a steady (healthy) state for the system's interoperability.

  • Create hypotheses based on predicted behavior when a fault is introduced.

  • Introduce real-world fault-events to the system.

  • Measure the state and compare it to the baseline state.

  • Document the process and the observations.

  • Identify and act on the result.

Fault injection testing in kubernetes

With the advancement of kubernetes (k8s) as the infrastructure platform, fault injection testing in kubernetes has become inevitable to ensure that system behaves in a reliable manner in the event of a fault or failure. There could be different type of workloads running within a k8s cluster which are written in different languages. For eg. within a K8s cluster, you can run a micro service, a web app and/or a scheduled job. Hence you need to have mechanism to inject fault into any kind of workloads running within the cluster. In addition, kubernetes clusters are managed differently from traditional infrastructure. The tools used for fault injection testing within kubernetes should have compatibility with k8s infrastructure. These are the main characteristics which are required:

  • Ease of injecting fault into kubernetes pods.

  • Support for faster tool installation within the cluster.

  • Support for YAML based configurations which works well with kubernetes.

  • Ease of customization to add custom resources.

  • Support for workflows to deploy various workloads and faults.

  • Ease of maintainability of the tool

  • Ease of integration with telemetry

Best Practices and Advice

Experimenting in production has the benefit of running tests against a live system with real user traffic, ensuring its health, or building confidence in its ability to handle errors gracefully. However, it has the potential to cause unnecessary customer pain. A test can either succeed or fail. In the event of failure, there will likely be some impact on the production environment. Thinking about the Blast Radius of the effect, should the test fail, is a crucial step to conduct beforehand. The following practices may help minimize such risk:

  • Run tests in a non-production environment first. Understand how the system behaves in a safe environment, using synthetic workload, before introducing potential risk to customer traffic.

  • Use fault injection as gates in various stages through the CD pipeline.

  • Strive to achieve a balance between collecting actual result data while affecting as few production users as possible.

  • Use defensive design principles such as circuit breaking and the bulkhead patterns.

  • Agreed on a budget (in terms of Service Level Objective (SLO)) as an investment in chaos and fault injection.

  • Grow the risk incrementally - Start with hardening the core and expand out in layers. At each point, progress should be locked in with automated regression tests.

Fault Injection Testing Frameworks and Tools

Fuzzing

Chaos

Conclusion

From the principals of chaos: "The harder it is to disrupt the steady-state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large".

Fault-Error-Failure cycle - A key mechanism in dependability: A fault may cause an error. An error may cause further errors within the system boundary; therefore each new error acts as a fault. When error states are observed at the system boundary, they are termed failures. (Modeled by )

Fault injection is an effective way to find security bugs in software, so much so that the requires fuzzing at every untrusted interface of every product and penetration testing which includes introducing faults to the system, to uncover potential vulnerabilities resulting from coding errors, system configuration faults, or other operational deployment weaknesses.

Automated fault injection coverage in a CI pipeline promotes a approach of testing earlier in the lifecycle for potential issues. Examples of performing fault injection during the development lifecycle:

Deploy and test on Blue/Green and Canary deployments. Use methods such as traffic shadowing (a.k.a. ) to get customer traffic to the staging slot.

- is a Microsoft open-source self-hosted fuzzing-as-a-service platform which is easy to integrate into CI pipelines.

and - Popular fuzz tools by Google's project zero team which is used locally to target binaries on Linux or Windows.

- A web-focused fuzzer owned by OWASP which can be found in distributions.

- An in-preview tool for orchestrating controlled fault injection experiments on Azure resources.

- A declarative, modular chaos platform with many extensions, including the .

- An Openshift-specific chaos tool, maintained by Redhat.

- The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB).

- A .NET library for chaos testing and fault injection integrated with the library for resilience engineering.

- A CNCF open source tool for chaos testing and fault injection for kubernetes cluster.

provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing.

Fault injection techniques increase resilience and confidence in the products we ship. They are used across the industry to validate applications and platforms before and while they are delivered to customers. Fault injection is a powerful tool and should be used with caution. Cases such as the , which was caused due to a deployment of code that was meant to be “dark launched”, entail the importance of curtailing the blast radius in the system during experiments.

🧪
stability and reliability
Fuzzing
Laprie/Avizienis
Microsoft Security Development Lifecycle
Shift-Left
Dark Traffic
OneFuzz
AFL
WinAFL
WebScarab
Kali linux
Azure Chaos Studio
Chaos toolkit
Azure actions and probes kit
Kraken
Chaos Monkey
Simmy
Polly
Litmus
This ISE dev blog post
Cloudflare 30 minute global outage