Fault Isolation – Core Concepts
🔹 Control Plane vs Data Plane
- Control Plane:
- Create, update, delete, list resources.
- Complex orchestration, many dependencies.
- Lower volume.
- Data Plane:
- Executes day-to-day business of the resource.
- Simpler, fewer dependencies.
- Higher volume.
flowchart TD
U[User] --> CP[Control Plane<br/>Create/Update/Delete]
U --> DP[Data Plane<br/>Execute Requests]
CP -.-> Res[Resources]
DP --> Res
Action: Architect workloads to rely on the data plane, not continuous control plane availability.
🔹 Example: EC2
- Control Plane: launching new EC2 instances.
- Data Plane: running EC2 instances.
flowchart TD
U[User] --> CP[EC2 Control Plane<br/>Launch Instance]
CP --> NewEC2[New EC2 Instance]
U -.-> DP[EC2 Data Plane<br/>Running Instances]
DP --> Running[Existing EC2 Instances Running]
Action: Design for EC2 workloads to keep running if the control plane is unavailable.
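As a rough illustration (boto3 shown; the region, AMI ID, and instance type are placeholder assumptions), the launch path goes through the EC2 control plane, while steady-state serving happens on the instance itself and needs no EC2 API at all:
```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Control plane: creating a resource (complex orchestration, lower volume).
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)

# Data plane: the launched instance serving its workload. Steady-state
# request handling runs on the instance and makes no EC2 API calls, so it
# keeps working through a control plane outage.
```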
🔹 Additional Service Examples
%% two-row layout for better readability
flowchart TB
%% row containers
subgraph Row1[ ]
direction LR
subgraph RDS["RDS"]
direction TB
RDSCP["CP:<br/>CreateDatabaseInstance"]
RDSDP["DP:<br/>Database Queries"]
end
subgraph IAM["IAM"]
direction TB
IAMCP["CP:<br/>CreateRole<br/>CreatePolicy"]
IAMDP["DP:<br/>AuthN/AuthZ"]
end
subgraph R53["Route 53"]
direction TB
R53CP["CP:<br/>CreateHostedZone<br/>UpdateResourceSet"]
R53DP["DP:<br/>DNS Resolution<br/>Health Checks"]
end
end
subgraph Row2[ ]
direction LR
subgraph ELB["Elastic Load Balancer (ELB)"]
direction TB
ELBCP["CP:<br/>CreateLoadBalancer<br/>CreateTargetGroup"]
ELBDP["DP:<br/>Forward Traffic"]
end
subgraph DDB["DynamoDB"]
direction TB
DDBCP["CP:<br/>CreateTable<br/>UpdateTable"]
DDBDP["DP:<br/>GetItem<br/>PutItem<br/>Scan<br/>Query"]
end
subgraph S3["S3"]
direction TB
S3CP["CP:<br/>CreateBucket<br/>PutBucketPolicy"]
S3DP["DP:<br/>GetObject<br/>PutObject"]
end
end
Action: Map your services into CP vs DP operations. Depend on DP ops for resiliency.
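For example, a minimal DynamoDB sketch of that mapping (table and key names are hypothetical): table creation is a control-plane operation you run at deploy time, while the item-level calls on the request path are data-plane operations.
```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # region is an assumption

# Control plane (deploy time): CreateTable / UpdateTable.
dynamodb.create_table(
    TableName="orders",  # hypothetical table
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)

# Data plane (request path): GetItem / PutItem / Query / Scan.
dynamodb.put_item(TableName="orders", Item={"order_id": {"S": "o-123"}})
dynamodb.get_item(TableName="orders", Key={"order_id": {"S": "o-123"}})
```
Keeping the hot path on the data-plane calls means the workload can keep serving even when the table-management APIs are degraded.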
🔹 Failure Likelihood
- CP is more complex → fails more often.
- DP is simpler β less failure-prone.
Action: Expect control plane disruptions. Build apps so DP continues.
🔹 Criticality
- More critical: Data plane.
- Because it runs the day-to-day business logic.
Action: Protect data plane availability above all else.
🔹 Unchaining Availability
- CP outages should not break workloads.
- DP maintains existing state independently.
Action: Decouple workloads from CP during steady-state operations.
🔹 Data Plane Maintains State
- CP pushes config → DP.
- DP continues running even if CP unavailable.
flowchart LR
CP[Control Plane] -- Push Config --> DP1[Data Plane 1]
CP -- Push Config --> DP2[Data Plane 2]
CP -- Push Config --> DP3[Data Plane 3]
DP1 -.-> Running[Workloads Continue]
DP2 -.-> Running
DP3 -.-> Running
Action: Limit real-time reliance on CP. Push config early, then run steady.
🔹 CP vs DP Takeaways
- CP = more dependencies, more likely to fail.
- DP = better dependency for resilience.
- Separation unchains availability domains.
Action: Always prefer DP dependencies when architecting.
🔹 Static Stability
- System can keep operating without changes during dependency outages.
- Dependency ≠ Destiny: you're not forced to fail just because a dependency did.
flowchart LR
Dep1[Dependency Fails] -->|Still Runs| Sys[System Up]
Dep2[Another Dependency Fails] -->|Still Runs| Sys
Action: Design for steady-state survivability.
🔹 Static Stability Approaches
- Prevent circular dependencies.
- Pre-provision capacity/resources.
- Maintain state locally.
- Eliminate synchronous dependencies (replace with async/indirect sync).
Action: Static Stability Playbook
- Async > Sync: Replace cross-service sync RPCs on the hot path with events/queues. Consumers process later; callers don't block.
- DAG-only dependencies: Enforce a one-way graph (no A↔B). Add a graph linter in CI to fail PRs that introduce cycles.
- Split responsibilities: If B "needs" A for an unrelated check, move that check to a third service or a replicated, read-only dataset shared by both.
- Maintain state locally: Keep last-known-good config/keys/catalog in memory or on disk; refresh in the background; serve-stale within a safe TTL if upstreams are down (see the sketch after this list).
- Examples: cache JWKS/OIDC keys; snapshot feature flags/config; materialize a read-only price/catalog view (SQLite/RocksDB/Redis).
- Pre-provision capacity: Set min capacity / warm pools / provisioned concurrency so cold starts or control-plane lag donβt stall traffic.
- Per-AZ locality: Keep caches/queues per AZ; avoid cross-AZ calls on the steady-state read path.
- Graceful degradation: On dependency trouble, read-only mode, queue writes, disable non-critical features automatically.
- Operational checks: Run a dependency graph audit, "kill the upstream" drills, cold-start tests, and track cache-hit SLOs.
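A minimal sketch of the "maintain state locally" item (the refresh cadence and staleness budget are assumptions, and `fetch_config_from_upstream` is a stand-in for whatever control-plane or config API you actually depend on): refresh in the background, and keep serving the last-known-good copy if the refresh fails.
```python
import threading
import time

REFRESH_INTERVAL_S = 60        # assumed refresh cadence
MAX_STALENESS_S = 6 * 3600     # assumed safe serve-stale window

_config = {"value": None, "fetched_at": 0.0}
_lock = threading.Lock()

def fetch_config_from_upstream() -> dict:
    """Stand-in for the real config/control-plane call."""
    raise NotImplementedError

def _refresh_loop() -> None:
    while True:
        try:
            fresh = fetch_config_from_upstream()
            with _lock:
                _config["value"] = fresh
                _config["fetched_at"] = time.time()
        except Exception:
            # Upstream is down: keep the last-known-good copy and retry later.
            pass
        time.sleep(REFRESH_INTERVAL_S)

def get_config() -> dict:
    """Hot path: never calls upstream, only reads the local copy."""
    with _lock:
        age = time.time() - _config["fetched_at"]
        if _config["value"] is None or age > MAX_STALENESS_S:
            raise RuntimeError("no usable local config (exceeded staleness budget)")
        return _config["value"]

# Background refresh; the request path only ever calls get_config().
threading.Thread(target=_refresh_loop, daemon=True).start()
```
The key property is that the hot path reads only local state; the upstream dependency is consulted off the request path and its failure only ages the data, it does not stop serving.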
🔹 EC2 Dependency ≠ Destiny Example
- Even if the EC2 control plane fails, running instances (data plane) keep working.
Action: Assume data plane survival in your recovery design.
🔹 Lambda Dependency ≠ Destiny
- Lambda depends on EC2, but uses warm pools of instances to reduce cold start risk.
Action: Use warm pools, buffers, or caching layers to decouple from single dependencies. When a request arrives and a warm environment is available, Lambda skips cold init and runs the handler immediately.
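If you want to guarantee warm capacity yourself rather than rely on Lambda's internal pooling, provisioned concurrency is the usual knob; a minimal sketch (function name, alias, and count are placeholders):
```python
import boto3

lam = boto3.client("lambda", region_name="us-east-1")  # region is an assumption

# Keep N execution environments initialized ahead of traffic so requests
# skip cold init even if scaling up new environments is slow.
lam.put_provisioned_concurrency_config(
    FunctionName="checkout-handler",        # placeholder function
    Qualifier="live",                       # placeholder alias/version
    ProvisionedConcurrentExecutions=50,     # placeholder capacity
)
```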
🔹 EC2 Anti-Pattern
❌ One Auto Scaling group stretched across multiple AZs → shared failure risk.
flowchart TD
SG[One Scaling Group] --> AZ1[AZ1]
SG --> AZ2[AZ2]
SG --> AZ3[AZ3]
note[If SG fails → All AZs impacted ❌]
Action: Never stretch a single scaling group across AZs.
🔹 EC2 Best Practice
✅ Separate scaling groups per AZ, with distribution across multiple regions.
flowchart TD
SG1[Scaling Group AZ1] --> AZ1[AZ1]
SG2[Scaling Group AZ2] --> AZ2[AZ2]
SG3[Scaling Group AZ3] --> AZ3[AZ3]
note[Failure in one AZ does not cascade ✅]
Action: Architect independent scaling capacity per AZ.
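A minimal sketch of the per-AZ layout (launch template, subnet IDs, and sizes are placeholders): one Auto Scaling group per AZ, each pinned to that AZ's subnet, so a problem with one group or one AZ cannot drain capacity everywhere.
```python
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")  # region is an assumption

# Hypothetical per-AZ subnets; each group is pinned to exactly one AZ.
subnets_by_az = {
    "us-east-1a": "subnet-aaa11111",
    "us-east-1b": "subnet-bbb22222",
    "us-east-1c": "subnet-ccc33333",
}

for az, subnet_id in subnets_by_az.items():
    asg.create_auto_scaling_group(
        AutoScalingGroupName=f"web-{az}",                 # one group per AZ
        LaunchTemplate={"LaunchTemplateName": "web-lt",   # placeholder template
                        "Version": "$Latest"},
        MinSize=2,
        MaxSize=10,
        VPCZoneIdentifier=subnet_id,                      # single-AZ subnet
    )
```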
🔹 DNS Serve-Stale (RFC 8767)
- Recursive resolvers return cached results when authoritative DNS is unavailable.
- Adds resilience against outages and DoS attacks.
flowchart LR
U[User Query] --> R[Recursive Resolver]
R -->|Cache Hit| C[Cached Record]
R -->|Query| A[Authoritative DNS Server]
A --x Fail[Unavailable]
C --> U
Action: Enable Serve-Stale DNS where supported to maintain service continuity.
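In practice this is a resolver setting (for example, Unbound's serve-expired option or BIND's stale-answer-enable). The sketch below only mirrors the decision flow from the diagram, with `query_authoritative` as a hypothetical stand-in for the upstream lookup: answer from fresh cache, otherwise try upstream, and fall back to an expired cached record if upstream is down.
```python
import time

cache: dict[str, tuple[str, float]] = {}   # name -> (record, expires_at)

def query_authoritative(name: str) -> str:
    """Hypothetical stand-in for the upstream/authoritative lookup."""
    raise TimeoutError("authoritative server unavailable")

def resolve(name: str, ttl: float = 300.0) -> str:
    entry = cache.get(name)
    now = time.time()
    if entry and entry[1] > now:
        return entry[0]                     # fresh cache hit
    try:
        record = query_authoritative(name)
        cache[name] = (record, now + ttl)
        return record
    except Exception:
        if entry:
            return entry[0]                 # serve-stale: expired but usable
        raise
```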
🔹 Availability Zone Independence (AZI)
- Each AZ has its own control plane + data plane.
- Regional services rely on multiple AZs, but workloads should run AZ-local when possible.
flowchart TD
subgraph Region
subgraph AZ1
CP1[Control Plane]
DP1[Data Plane]
end
subgraph AZ2
CP2[Control Plane]
DP2[Data Plane]
end
subgraph AZ3
CP3[Control Plane]
DP3[Data Plane]
end
end
User[User Requests] --> DP1 & DP2 & DP3
Action: Deploy across independent AZs to avoid shared-fate failures.
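One concrete AZI lever, as a sketch (the load balancer ARN is a placeholder): Network Load Balancers default to zonal routing, and keeping cross-zone load balancing disabled keeps steady-state traffic AZ-local so a zonal failure stays contained.
```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")  # region is an assumption

# Each NLB node sends traffic only to targets in its own AZ, preserving
# AZ independence on the steady-state request path.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...:loadbalancer/net/example/123",  # placeholder ARN
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "false"}],
)
```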
🔹 Customer AZI Example
- Web + DB spread across 3 AZs.
- Aurora primary in one AZ, replicas in others.
- Network Load Balancers distribute traffic.
flowchart TD
subgraph Region
subgraph AZ1
NLB1[NLB]
WS1[Web Servers]
DB1[Aurora Read Replica]
end
subgraph AZ2
NLB2[NLB]
WS2[Web Servers]
DB2[Aurora Primary]
end
subgraph AZ3
NLB3[NLB]
WS3[Web Servers]
DB3[Aurora Read Replica]
end
end
User[User] --> NLB1 & NLB2 & NLB3
Action: Always spread critical resources across multiple AZs.
🔹 AZI Takeaways
- Easier to evacuate workloads from a failing AZ.
- Isolates impact of single-AZ failures.
- Improves performance and cost: AZ-local traffic has lower latency and avoids cross-AZ data transfer charges.
Action: Build DR and scaling plans that assume per-AZ isolation.
🔹 Regional Services
- Some AWS services are region-scoped (not AZ-scoped).
- Example: S3, DynamoDB → built-in regional fault tolerance.
Action: Know which services are regional vs. AZ-bound and design accordingly.
🔹 Fault Isolation Recap
- Separate control vs data planes.
- Use data plane for resilience.
- Apply static stability: survive without changes.
- Exploit AZI for stronger fault boundaries.
- Combine with regional services for end-to-end resilience.
Action: Architect for layered isolation: DP > AZ > Region.