Fault Isolation – Core Concepts


🔹 Control Plane vs Data Plane

flowchart TD
  U[User] --> CP[Control Plane<br/>Create/Update/Delete]
  U --> DP[Data Plane<br/>Execute Requests]
  CP -.-> Res[Resources]
  DP --> Res

Action: Architect workloads to rely on the data plane, not continuous control plane availability.


🔹 Example: EC2

flowchart TD
  U[User] --> CP[EC2 Control Plane<br/>Launch Instance]
  CP --> NewEC2[New EC2 Instance]
  U -.-> DP[EC2 Data Plane<br/>Running Instances]
  DP --> Running[Existing EC2 Instances Running]

Action: Design EC2 workloads to keep running when the control plane is unavailable.


🔹 Additional Service Examples

%% two-row layout for better readability
flowchart TB
  %% row containers
  subgraph Row1[ ]
    direction LR
    subgraph RDS["RDS"]
      direction TB
      RDSCP["CP:<br/>CreateDBInstance"]
      RDSDP["DP:<br/>Database Queries"]
    end

    subgraph IAM["IAM"]
      direction TB
      IAMCP["CP:<br/>CreateRole<br/>CreatePolicy"]
      IAMDP["DP:<br/>AuthN/AuthZ"]
    end

    subgraph R53["Route 53"]
      direction TB
      R53CP["CP:<br/>CreateHostedZone<br/>ChangeResourceRecordSets"]
      R53DP["DP:<br/>DNS Resolution<br/>Health Checks"]
    end
  end

  subgraph Row2[ ]
    direction LR
    subgraph ELB["Elastic Load Balancer (ELB)"]
      direction TB
      ELBCP["CP:<br/>CreateLoadBalancer<br/>CreateTargetGroup"]
      ELBDP["DP:<br/>Forward Traffic"]
    end

    subgraph DDB["DynamoDB"]
      direction TB
      DDBCP["CP:<br/>CreateTable<br/>UpdateTable"]
      DDBDP["DP:<br/>GetItem<br/>PutItem<br/>Scan<br/>Query"]
    end

    subgraph S3["S3"]
      direction TB
      S3CP["CP:<br/>CreateBucket<br/>PutBucketPolicy"]
      S3DP["DP:<br/>GetObject<br/>PutObject"]
    end
  end

Action: Map your services into CP vs DP operations. Depend on DP ops for resiliency.
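
The mapping exercise can be made mechanical. The following is a minimal sketch, assuming a hand-maintained lookup table: the operation names come from the diagram above, while `OPERATION_PLANE` and `control_plane_dependencies` are hypothetical names introduced here for illustration. It flags any call on a request path that is not a known data plane operation.

```python
from enum import Enum


class Plane(Enum):
    CONTROL = "control"
    DATA = "data"


# Operation names are taken from the diagram; the table itself is illustrative
# and would need to be extended for a real workload.
OPERATION_PLANE = {
    ("dynamodb", "CreateTable"): Plane.CONTROL,
    ("dynamodb", "UpdateTable"): Plane.CONTROL,
    ("dynamodb", "GetItem"): Plane.DATA,
    ("dynamodb", "PutItem"): Plane.DATA,
    ("dynamodb", "Query"): Plane.DATA,
    ("dynamodb", "Scan"): Plane.DATA,
    ("s3", "CreateBucket"): Plane.CONTROL,
    ("s3", "PutBucketPolicy"): Plane.CONTROL,
    ("s3", "GetObject"): Plane.DATA,
    ("s3", "PutObject"): Plane.DATA,
    ("elb", "CreateLoadBalancer"): Plane.CONTROL,
    ("elb", "CreateTargetGroup"): Plane.CONTROL,
}


def control_plane_dependencies(critical_path):
    """Return the calls on a request path that are not known data plane calls.

    Unknown operations are flagged as well, so they get classified by hand.
    """
    flagged = []
    for call in critical_path:
        if OPERATION_PLANE.get(call) is not Plane.DATA:
            flagged.append(call)
    return flagged


checkout_path = [("dynamodb", "GetItem"), ("s3", "PutObject"), ("dynamodb", "CreateTable")]
print(control_plane_dependencies(checkout_path))  # [('dynamodb', 'CreateTable')]
```

A non-empty result marks a steady-state dependency on a control plane that is worth moving out of the request path.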


🔹 Failure Likelihood

Action: Expect control plane disruptions. Build apps so DP continues.


🔹 Criticality

Action: Protect data plane availability above all else.


🔹 Unchaining Availability

Action: Decouple workloads from CP during steady-state operations.


🔹 Data Plane Maintains State

flowchart LR
  CP[Control Plane] -- Push Config --> DP1[Data Plane 1]
  CP -- Push Config --> DP2[Data Plane 2]
  CP -- Push Config --> DP3[Data Plane 3]
  DP1 -.-> Running[Workloads Continue]
  DP2 -.-> Running
  DP3 -.-> Running

Action: Limit real-time reliance on CP. Push config early, then run steady.
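
One way to limit real-time reliance is for each data plane node to hold its own copy of the last configuration it received. The sketch below is an assumption about how that might look, not a description of any AWS service. It uses a periodic pull with a cached copy where the diagram shows a push, and `ConfigCache` and its `fetch` callable are hypothetical.

```python
import time


class ConfigCache:
    """Keeps the last configuration obtained from a control plane."""

    def __init__(self, fetch, refresh_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch  # hypothetical callable that asks the control plane
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._config = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        due = (
            self._fetched_at is None
            or now - self._fetched_at >= self._refresh_seconds
        )
        if due:
            try:
                self._config = self._fetch()
                self._fetched_at = now
            except Exception:
                # Control plane unreachable: keep serving the last known good copy.
                if self._config is None:
                    raise  # nothing cached yet, so there is nothing safe to serve
        return self._config
```

After a failed refresh the timestamp is left unchanged, so the next call retries. A production version would likely add backoff so that many nodes do not hammer a recovering control plane at once.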


🔹 CP vs DP Takeaways

Action: Always prefer DP dependencies when architecting.


🔹 Static Stability

flowchart LR
  Dep1[Dependency Fails] -->|Still Runs| Sys[System Up]
  Dep2[Another Dependency Fails] -->|Still Runs| Sys

Action: Design for steady-state survivability.
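
A common reading of static stability is that enough capacity is already running to absorb a failure, so no scaling action is needed during the event. That reading is assumed here; the slide does not state it. Under that assumption the per-AZ capacity is simple arithmetic, sketched below with a hypothetical helper `instances_per_az`.

```python
import math


def instances_per_az(peak_instances, az_count, az_failures_tolerated=1):
    """Instances to keep running in each AZ so that peak load still fits
    after losing `az_failures_tolerated` zones, with no scaling action."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return math.ceil(peak_instances / surviving)


# Peak load needs 12 instances, spread over 3 AZs, tolerating the loss of one.
print(instances_per_az(12, 3))  # 6 per AZ, 18 in total
```

With three zones this means running 50% more than peak load strictly requires. That extra capacity is the price of not depending on a control plane during recovery.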


🔹 Static Stability Approaches

Action: Static Stability Playbook


🔹 EC2 Dependency ≠ Destiny Example

Action: Assume data plane survival in your recovery design.


🔹 Lambda Dependency ≠ Destiny

Action: Use warm pools, buffers, or caching layers to decouple from single dependencies. When a request arrives and a warm execution environment is available, Lambda skips cold initialization and runs the handler immediately.
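
A caching layer can live inside the function itself. The sketch below assumes a Python Lambda handler and relies on module-level state surviving between invocations in the same warm environment. `_load_reference_data` is a hypothetical stand-in for an expensive call to a dependency.

```python
import time

# Module scope is executed once per execution environment, during cold init.
_LOAD_COUNT = 0
_CACHE = {}


def _load_reference_data():
    """Hypothetical expensive call to a dependency (database, parameter store)."""
    global _LOAD_COUNT
    _LOAD_COUNT += 1
    return {"feature_enabled": True, "loaded_at": time.time()}


def handler(event, context):
    # Warm invocations find the cached value and skip the dependency call.
    if "reference" not in _CACHE:
        _CACHE["reference"] = _load_reference_data()
    return {
        "enabled": _CACHE["reference"]["feature_enabled"],
        "loads": _LOAD_COUNT,
    }
```

Repeated calls in one environment report `loads` as 1, so a brief outage of the dependency does not affect warm invocations. The cache has no expiry here; whether stale data is acceptable depends on the workload.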


🔹 EC2 Anti-Pattern

❌ One Auto Scaling group stretched across multiple AZs → shared failure risk.

flowchart TD
  SG[One Scaling Group] --> AZ1[AZ1]
  SG --> AZ2[AZ2]
  SG --> AZ3[AZ3]
  note[If SG fails → All AZs impacted ❌]

Action: Never stretch a single scaling group across AZs.


🔹 EC2 Best Practice

✅ Separate scaling groups per AZ, with distribution across multiple regions.

flowchart TD
  SG1[Scaling Group AZ1] --> AZ1[AZ1]
  SG2[Scaling Group AZ2] --> AZ2[AZ2]
  SG3[Scaling Group AZ3] --> AZ3[AZ3]
  note[Failure in one AZ does not cascade ✅]

Action: Architect independent scaling capacity per AZ.
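
The per-AZ layout can be expressed as one group definition per zone. The sketch below only builds plain dictionaries and calls no AWS API. The keys mirror parameter names of boto3's `create_auto_scaling_group`, while the naming scheme, the launch template name, and the fixed group size are assumptions made for illustration.

```python
def per_az_group_specs(app_name, azs, instances_per_az):
    """Build one scaling-group definition per AZ (nothing is sent to AWS)."""
    specs = []
    for az in azs:
        specs.append({
            "AutoScalingGroupName": f"{app_name}-{az}",
            "AvailabilityZones": [az],  # exactly one zone per group
            "MinSize": instances_per_az,
            "MaxSize": instances_per_az,  # assumed fixed size, pre-provisioned
            "DesiredCapacity": instances_per_az,
            "LaunchTemplate": {
                "LaunchTemplateName": f"{app_name}-template",  # hypothetical name
                "Version": "$Latest",
            },
        })
    return specs


for spec in per_az_group_specs("web", ["us-east-1a", "us-east-1b", "us-east-1c"], 6):
    print(spec["AutoScalingGroupName"], spec["AvailabilityZones"])
```

Because each group names a single zone, a problem with one group or one zone leaves the capacity in the others untouched.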


🔹 DNS Serve-Stale (RFC 8767)

flowchart LR
  U[User Query] --> R[Recursive Resolver]
  R -->|Cache Hit| C[Cached Record]
  R -->|Query| A[Authoritative DNS Server]
  A --x Fail[Unavailable]
  C --> U

Action: Enable Serve-Stale DNS where supported to maintain service continuity.
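
The serve-stale idea can be illustrated with a small cache that falls back to an expired record when the upstream lookup fails. This is a simplified sketch, not a resolver implementation. `ServeStaleCache`, the `resolve` callable, and the default stale window are hypothetical, and the other timers that RFC 8767 describes are omitted.

```python
import time


class ServeStaleCache:
    def __init__(self, resolve, max_stale_seconds=86400.0, clock=time.monotonic):
        self._resolve = resolve  # hypothetical callable: name -> (address, ttl_seconds)
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._entries = {}  # name -> (address, expires_at)

    def lookup(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0], "fresh"
        try:
            address, ttl = self._resolve(name)
        except Exception:
            # Upstream failed: fall back to an expired entry if it is not too old.
            if entry is not None and now < entry[1] + self._max_stale:
                return entry[0], "stale"
            raise
        self._entries[name] = (address, now + ttl)
        return address, "refreshed"
```

The second element of the return value makes the fallback visible to the caller, which is useful for metrics. Once the stale window has passed, the failure is surfaced, because answers that old are more likely to be wrong than helpful.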


🔹 Availability Zone Independence (AZI)

flowchart TD
  subgraph Region
    subgraph AZ1
      CP1[Control Plane]
      DP1[Data Plane]
    end
    subgraph AZ2
      CP2[Control Plane]
      DP2[Data Plane]
    end
    subgraph AZ3
      CP3[Control Plane]
      DP3[Data Plane]
    end
  end
  User[User Requests] --> DP1 & DP2 & DP3

Action: Deploy across independent AZs to avoid shared-fate failures.


🔹 Customer AZI Example

flowchart TD
  subgraph Region
    subgraph AZ1
      NLB1[NLB]
      WS1[Web Servers]
      DB1[Aurora Read Replica]
    end
    subgraph AZ2
      NLB2[NLB]
      WS2[Web Servers]
      DB2[Aurora Primary]
    end
    subgraph AZ3
      NLB3[NLB]
      WS3[Web Servers]
      DB3[Aurora Read Replica]
    end
  end
  User[User] --> NLB1 & NLB2 & NLB3

Action: Always spread critical resources across multiple AZs.
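
AZ independence is usually taken to mean that a request entering one zone is served by resources in that same zone. That reading is an assumption here, since the slide only shows the layout. A minimal sketch of such zone-local routing follows; `pick_backend` and the backend names are hypothetical.

```python
def pick_backend(entry_az, backends_by_az, healthy_azs, request_id=0):
    """Choose a web server in the AZ where the request entered.

    Returns None when the entry AZ is unhealthy or has no backends. The
    caller is then expected to shift traffic away at the front door
    (for example via DNS) instead of forwarding across zones.
    """
    if entry_az not in healthy_azs:
        return None
    backends = backends_by_az.get(entry_az, [])
    if not backends:
        return None
    # Simple rotation across the servers of the entry AZ only.
    return backends[request_id % len(backends)]


backends = {"az1": ["ws1-a", "ws1-b"], "az2": ["ws2-a"], "az3": ["ws3-a"]}
print(pick_backend("az1", backends, {"az1", "az2", "az3"}, request_id=1))  # ws1-b
```

In the layout above the Aurora primary sits in a single AZ, so writes from the other zones still cross AZ boundaries. Zone-local routing as sketched applies to the load balancer and web tiers.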


🔹 AZI Takeaways

Action: Build DR and scaling plans that assume per-AZ isolation.


🔹 Regional Services

Action: Know which services are regional vs. AZ-bound and design accordingly.


🔹 Fault Isolation Recap

Action: Architect for layered isolation: DP > AZ > Region.