Resilient Systems
Actionable guidance for designing reliable, resilient, and well-operated systems.
Patterns over products. Actions over theory.
Why this exists
For 15+ years I saved hard-won notes to external drives and left them in desk drawers. I rarely revisited them, and no one else learned from them.
This site is my public notebook so others can use (and improve) the checklists, patterns, and diagrams I wish I’d had.
What you’ll find (topics)
- Cloud Resilience — fault isolation, static stability, AZ independence, recovery patterns.
- System Design — scalability, consistency & state, backpressure, latency budgets.
- Reliability Engineering (SRE) — SLOs, error budgets, incident response, post-incident learning.
- Architecture Patterns — queues vs pub/sub, idempotency, circuit-breakers, bulkheads, retries/backoff.
- Testing & Chaos — load tests, dependency failure drills, game days, DR runbooks.
Everything is action-oriented: each section ends with clear Action: lines you can paste into runbooks and design reviews.
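As a minimal sketch of pulling those lines out for a runbook, the Python below collects every line that starts with `Action:` and renders the results as a checklist. It assumes pages are plain text or Markdown and that each action sits on its own line, optionally behind a list bullet. The helper names `extract_actions` and `to_checklist` and the sample text are hypothetical, used only for illustration.

```python
import re

# Assumption: an action is a single line beginning with "Action:",
# optionally preceded by whitespace and a "-" or "*" list bullet.
ACTION_RE = re.compile(r"^\s*(?:[-*]\s*)?Action:\s*(.+?)\s*$")


def extract_actions(text: str) -> list[str]:
    """Return the text of every 'Action:' line, in document order."""
    actions = []
    for line in text.splitlines():
        match = ACTION_RE.match(line)
        if match:
            actions.append(match.group(1))
    return actions


def to_checklist(actions: list[str]) -> str:
    """Render actions as unchecked checklist items, one per line."""
    return "\n".join(f"- [ ] {action}" for action in actions)


if __name__ == "__main__":
    # Illustrative sample only; real pages would be read from files.
    sample = (
        "Connect & updates\n"
        "Action: Open an Issue/PR on GitHub with context and what changed.\n"
        "Prose that is not an action.\n"
        "- Action: Schedule a quick DR/chaos drill.\n"
    )
    print(to_checklist(extract_actions(sample)))
```

The checklist output can be pasted into a runbook or design review template as-is; adjust the pattern if your pages format actions differently.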
Who this is for
Primary
- Cloud / Solution Architects
- Site Reliability Engineers (SREs)
- Platform / DevOps Engineers
Also useful for
- Backend & Feature Engineers (owning services in prod)
- Tech Leads & Engineering Managers (reviewing designs/runbooks)
- Security/Compliance & Risk (failure modes, controls)
- Data/ML Platform Engineers (pipelines, state, recovery)
- Incident Commanders & On-call Responders
- QA / Performance Engineers (failure & load test planning)
How to use by role
- Architects: adopt patterns, fault boundaries, and Action checklists in design reviews.
- SRE/Platform: turn Action lines into runbooks, SLOs, and chaos & DR tests.
- Developers: map diagrams to your service, swap AWS terms for your cloud's equivalents, and add the result to your service READMEs.
- Leads/Managers: use sections as acceptance criteria for design docs and launch reviews.
Prereqs
- Working knowledge of cloud primitives (compute, storage, networking) and basic IaC.
- AWS terms are used for examples; map them to the equivalent services in your cloud.
What this is / is not
This is
- Cloud-agnostic principles, often illustrated with AWS terms (because they’re widely recognized).
- Diagrams (Mermaid), checklists, and review prompts you can reuse.
This is not
- Vendor docs or copy-paste Terraform/CloudFormation.
- A replacement for product documentation. Use these patterns alongside it to guide design and operations.
Start here
- Cloud Resilience
Fault isolation, static stability, AZ independence, and recovery actions.
Connect & updates
- GitHub: ahmadalkhaldi2013-star/Resilient-Systems
- LinkedIn: Connect with me
Action: Open an Issue/PR on GitHub with context and what changed.
Action: If something helped you, share it or open a PR with a better diagram, example, or checklist.
How to use this site
- Skim a topic, then copy the Action lines into your runbooks or design reviews.
- Adapt the Mermaid diagrams to your context (swap AWS terms for your cloud).
- Schedule a quick DR/chaos drill and validate the assumptions.