Fault Isolation – Core Concepts
🔹 Control Plane vs Data Plane
- Control Plane:
- Create, update, delete, list resources.
- Complex orchestration, many dependencies.
- Lower volume.
- Data Plane:
- Executes day-to-day business of the resource.
- Simpler, fewer dependencies.
- Higher volume.
flowchart TD
U[User] --> CP[Control Plane<br/>Create/Update/Delete]
U --> DP[Data Plane<br/>Execute Requests]
CP -.-> Res[Resources]
DP --> Res
Action: Architect workloads to rely on the data plane, not continuous control plane availability.
🔹 Example: EC2
- Control Plane: launching new EC2 instances.
- Data Plane: running EC2 instances.
flowchart TD
U[User] --> CP[EC2 Control Plane<br/>Launch Instance]
CP --> NewEC2[New EC2 Instance]
U -.-> DP[EC2 Data Plane<br/>Running Instances]
DP --> Running[Existing EC2 Instances Running]
Action: Design for EC2 workloads to keep running if the control plane is unavailable.
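As a rough illustration (boto3 shown; the region, AMI ID, and instance type are placeholder assumptions), the launch path goes through the EC2 control plane, while steady-state serving happens on the instance itself and needs no EC2 API at all:
```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Control plane: creating a resource (complex orchestration, lower volume).
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)

# Data plane: the launched instance serving its workload. Steady-state
# request handling runs on the instance and makes no EC2 API calls, so it
# keeps working through a control plane outage.
```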
🔹 Additional Service Examples
%% two-row layout for better readability
flowchart TB
%% row containers
subgraph Row1[ ]
direction LR
subgraph RDS["RDS"]
direction TB
RDSCP["CP:<br/>CreateDatabaseInstance"]
RDSDP["DP:<br/>Database Queries"]
end
subgraph IAM["IAM"]
direction TB
IAMCP["CP:<br/>CreateRole<br/>CreatePolicy"]
IAMDP["DP:<br/>AuthN/AuthZ"]
end
subgraph R53["Route 53"]
direction TB
R53CP["CP:<br/>CreateHostedZone<br/>UpdateResourceSet"]
R53DP["DP:<br/>DNS Resolution<br/>Health Checks"]
end
end
subgraph Row2[ ]
direction LR
subgraph ELB["Elastic Load Balancer (ELB)"]
direction TB
ELBCP["CP:<br/>CreateLoadBalancer<br/>CreateTargetGroup"]
ELBDP["DP:<br/>Forward Traffic"]
end
subgraph DDB["DynamoDB"]
direction TB
DDBCP["CP:<br/>CreateTable<br/>UpdateTable"]
DDBDP["DP:<br/>GetItem<br/>PutItem<br/>Scan<br/>Query"]
end
subgraph S3["S3"]
direction TB
S3CP["CP:<br/>CreateBucket<br/>PutBucketPolicy"]
S3DP["DP:<br/>GetObject<br/>PutObject"]
end
end
Action: Map your services into CP vs DP operations. Depend on DP ops for resiliency.
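For example, a minimal DynamoDB sketch of that mapping (table and key names are hypothetical): table creation is a control-plane operation you run at deploy time, while the item-level calls on the request path are data-plane operations.
```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # region is an assumption

# Control plane (deploy time): CreateTable / UpdateTable.
dynamodb.create_table(
    TableName="orders",  # hypothetical table
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)

# Data plane (request path): GetItem / PutItem / Query / Scan.
dynamodb.put_item(TableName="orders", Item={"order_id": {"S": "o-123"}})
dynamodb.get_item(TableName="orders", Key={"order_id": {"S": "o-123"}})
```
Keeping the hot path on the data-plane calls means the workload can keep serving even when the table-management APIs are degraded.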
🔹 Failure Likelihood
- CP is more complex → fails more often.
- DP is simpler β less failure-prone.
Action: Expect control plane disruptions. Build apps so DP continues.
🔹 Criticality
- More critical: Data plane.
- Because it runs the day-to-day business logic.
Action: Protect data plane availability above all else.
🔹 Unchaining Availability
- CP outages should not break workloads.
- DP maintains existing state independently.
Action: Decouple workloads from CP during steady-state operations.
🔹 Data Plane Maintains State
- CP pushes config → DP.
- DP continues running even if CP unavailable.
flowchart LR
CP[Control Plane] -- Push Config --> DP1[Data Plane 1]
CP -- Push Config --> DP2[Data Plane 2]
CP -- Push Config --> DP3[Data Plane 3]
DP1 -.-> Running[Workloads Continue]
DP2 -.-> Running
DP3 -.-> Running
Action: Limit real-time reliance on CP. Push config early, then run steady.
🔹 CP vs DP Takeaways
- CP = more dependencies, more likely to fail.
- DP = better dependency for resilience.
- Separation unchains availability domains.
Action: Always prefer DP dependencies when architecting.
🔹 Static Stability
- System can keep operating without changes during dependency outages.
- Dependency ≠ Destiny: you're not forced to fail just because a dependency did.
flowchart LR
Dep1[Dependency Fails] -->|Still Runs| Sys[System Up]
Dep2[Another Dependency Fails] -->|Still Runs| Sys
Action: Design for steady-state survivability.
🔹 Static Stability Approaches
- Prevent circular dependencies.
- Pre-provision capacity/resources.
- Maintain state locally.
- Eliminate synchronous dependencies (replace with async/indirect sync).
Action: Static Stability Playbook
- Async > Sync: Replace cross-service sync RPCs on the hot path with events/queues. Consumers process later; callers don't block.
- DAG-only dependencies: Enforce a one-way graph (no A↔B). Add a graph linter in CI to fail PRs that introduce cycles.
- Split responsibilities: If B "needs" A for an unrelated check, move that check to a third service or a replicated, read-only dataset shared by both.
- Maintain state locally: Keep last-known-good config/keys/catalog in memory or on disk; refresh in the background; serve-stale within a safe TTL if upstreams are down (see the sketch after this list).
- Examples: cache JWKS/OIDC keys; snapshot feature flags/config; materialize a read-only price/catalog view (SQLite/RocksDB/Redis).
- Pre-provision capacity: Set min capacity / warm pools / provisioned concurrency so cold starts or control-plane lag donβt stall traffic.
- Per-AZ locality: Keep caches/queues per AZ; avoid cross-AZ calls on the steady-state read path.
- Graceful degradation: On dependency trouble, read-only mode, queue writes, disable non-critical features automatically.
- Operational checks: Run a dependency graph audit, "kill the upstream" drills, cold-start tests, and track cache-hit SLOs.
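A minimal sketch of the "maintain state locally" item (the refresh cadence and staleness budget are assumptions, and `fetch_config_from_upstream` is a stand-in for whatever control-plane or config API you actually depend on): refresh in the background, and keep serving the last-known-good copy if the refresh fails.
```python
import threading
import time

REFRESH_INTERVAL_S = 60        # assumed refresh cadence
MAX_STALENESS_S = 6 * 3600     # assumed safe serve-stale window

_config = {"value": None, "fetched_at": 0.0}
_lock = threading.Lock()

def fetch_config_from_upstream() -> dict:
    """Stand-in for the real config/control-plane call."""
    raise NotImplementedError

def _refresh_loop() -> None:
    while True:
        try:
            fresh = fetch_config_from_upstream()
            with _lock:
                _config["value"] = fresh
                _config["fetched_at"] = time.time()
        except Exception:
            # Upstream is down: keep the last-known-good copy and retry later.
            pass
        time.sleep(REFRESH_INTERVAL_S)

def get_config() -> dict:
    """Hot path: never calls upstream, only reads the local copy."""
    with _lock:
        age = time.time() - _config["fetched_at"]
        if _config["value"] is None or age > MAX_STALENESS_S:
            raise RuntimeError("no usable local config (exceeded staleness budget)")
        return _config["value"]

# Background refresh; the request path only ever calls get_config().
threading.Thread(target=_refresh_loop, daemon=True).start()
```
The key property is that the hot path reads only local state; the upstream dependency is consulted off the request path and its failure only ages the data, it does not stop serving.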
🔹 EC2 Dependency ≠ Destiny Example
- Even if the EC2 control plane fails, running instances (data plane) keep working.
Action: Assume data plane survival in your recovery design.
🔹 Lambda Dependency ≠ Destiny
- Lambda depends on EC2, but uses warm pools of instances to reduce cold start risk.
Action: Use warm pools, buffers, or caching layers to decouple from single dependencies. When a request arrives and a warm environment is available, Lambda skips cold init and runs the handler immediately.
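If you want to guarantee warm capacity yourself rather than rely on Lambda's internal pooling, provisioned concurrency is the usual knob; a minimal sketch (function name, alias, and count are placeholders):
```python
import boto3

lam = boto3.client("lambda", region_name="us-east-1")  # region is an assumption

# Keep N execution environments initialized ahead of traffic so requests
# skip cold init even if scaling up new environments is slow.
lam.put_provisioned_concurrency_config(
    FunctionName="checkout-handler",        # placeholder function
    Qualifier="live",                       # placeholder alias/version
    ProvisionedConcurrentExecutions=50,     # placeholder capacity
)
```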
🔹 EC2 Anti-Pattern
❌ One Auto Scaling group stretched across multiple AZs → shared failure risk.
flowchart TD
SG[One Scaling Group] --> AZ1[AZ1]
SG --> AZ2[AZ2]
SG --> AZ3[AZ3]
note[If SG fails → All AZs impacted ❌]
Action: Never stretch a single scaling group across AZs.
🔹 EC2 Best Practice
✅ Separate scaling groups per AZ, with distribution across multiple regions.
flowchart TD
SG1[Scaling Group AZ1] --> AZ1[AZ1]
SG2[Scaling Group AZ2] --> AZ2[AZ2]
SG3[Scaling Group AZ3] --> AZ3[AZ3]
note[Failure in one AZ does not cascade ✅]
Action: Architect independent scaling capacity per AZ.
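A minimal sketch of the per-AZ layout (launch template, subnet IDs, and sizes are placeholders): one Auto Scaling group per AZ, each pinned to that AZ's subnet, so a problem with one group or one AZ cannot drain capacity everywhere.
```python
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")  # region is an assumption

# Hypothetical per-AZ subnets; each group is pinned to exactly one AZ.
subnets_by_az = {
    "us-east-1a": "subnet-aaa11111",
    "us-east-1b": "subnet-bbb22222",
    "us-east-1c": "subnet-ccc33333",
}

for az, subnet_id in subnets_by_az.items():
    asg.create_auto_scaling_group(
        AutoScalingGroupName=f"web-{az}",                 # one group per AZ
        LaunchTemplate={"LaunchTemplateName": "web-lt",   # placeholder template
                        "Version": "$Latest"},
        MinSize=2,
        MaxSize=10,
        VPCZoneIdentifier=subnet_id,                      # single-AZ subnet
    )
```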
🔹 DNS Serve-Stale (RFC 8767)
- Recursive resolvers return cached results when authoritative DNS is unavailable.
- Adds resilience against outages and DoS attacks.
flowchart LR
U[User Query] --> R[Recursive Resolver]
R -->|Cache Hit| C[Cached Record]
R -->|Query| A[Authoritative DNS Server]
A --x Fail[Unavailable]
C --> U
Action: Enable Serve-Stale DNS where supported to maintain service continuity.
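In practice this is a resolver setting (for example, Unbound's serve-expired option or BIND's stale-answer-enable). The sketch below only mirrors the decision flow from the diagram, with `query_authoritative` as a hypothetical stand-in for the upstream lookup: answer from fresh cache, otherwise try upstream, and fall back to an expired cached record if upstream is down.
```python
import time

cache: dict[str, tuple[str, float]] = {}   # name -> (record, expires_at)

def query_authoritative(name: str) -> str:
    """Hypothetical stand-in for the upstream/authoritative lookup."""
    raise TimeoutError("authoritative server unavailable")

def resolve(name: str, ttl: float = 300.0) -> str:
    entry = cache.get(name)
    now = time.time()
    if entry and entry[1] > now:
        return entry[0]                     # fresh cache hit
    try:
        record = query_authoritative(name)
        cache[name] = (record, now + ttl)
        return record
    except Exception:
        if entry:
            return entry[0]                 # serve-stale: expired but usable
        raise
```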
🔹 Availability Zone Independence (AZI)
- Each AZ has its own control plane + data plane.
- Regional services rely on multiple AZs, but workloads should run AZ-local when possible.
flowchart TD
subgraph Region
subgraph AZ1
CP1[Control Plane]
DP1[Data Plane]
end
subgraph AZ2
CP2[Control Plane]
DP2[Data Plane]
end
subgraph AZ3
CP3[Control Plane]
DP3[Data Plane]
end
end
User[User Requests] --> DP1 & DP2 & DP3
Action: Deploy across independent AZs to avoid shared-fate failures.
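One concrete AZI lever, as a sketch (the load balancer ARN is a placeholder): Network Load Balancers default to zonal routing, and keeping cross-zone load balancing disabled keeps steady-state traffic AZ-local so a zonal failure stays contained.
```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")  # region is an assumption

# Each NLB node sends traffic only to targets in its own AZ, preserving
# AZ independence on the steady-state request path.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...:loadbalancer/net/example/123",  # placeholder ARN
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "false"}],
)
```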
🔹 Customer AZI Example
- Web + DB spread across 3 AZs.
- Aurora primary in one AZ, replicas in others.
- Network Load Balancers distribute traffic.
flowchart TD
subgraph Region
subgraph AZ1
NLB1[NLB]
WS1[Web Servers]
DB1[Aurora Read Replica]
end
subgraph AZ2
NLB2[NLB]
WS2[Web Servers]
DB2[Aurora Primary]
end
subgraph AZ3
NLB3[NLB]
WS3[Web Servers]
DB3[Aurora Read Replica]
end
end
User[User] --> NLB1 & NLB2 & NLB3
Action: Always spread critical resources across multiple AZs.
🔹 AZI Takeaways
- Easier to evacuate workloads from a failing AZ.
- Isolates impact of single-AZ failures.
- Improves performance and cost: AZ-local traffic has lower latency and avoids cross-AZ data transfer charges.
Action: Build DR and scaling plans that assume per-AZ isolation.
🔹 Regional Services
- Some AWS services are region-scoped (not AZ-scoped).
- Example: S3, DynamoDB → built-in regional fault tolerance.
Action: Know which services are regional vs. AZ-bound and design accordingly.
🔹 Fault Isolation Recap
- Separate control vs data planes.
- Use data plane for resilience.
- Apply static stability: survive without changes.
- Exploit AZI for stronger fault boundaries.
- Combine with regional services for end-to-end resilience.
Action: Architect for layered isolation: DP > AZ > Region.