Resilient Systems

Actionable guidance for designing reliable, resilient, and well-operated systems.
Patterns over products. Actions over theory.


Why this exists

For 15+ years I saved hard-won notes to external drives and left them in desk drawers, where I rarely revisited them and no one else could learn from them.
This site is my public notebook so others can use (and improve) the checklists, patterns, and diagrams I wish I’d had.


What you’ll find (topics)

Everything is action-oriented: each section ends with clear Action: lines you can paste into runbooks and design reviews.


Who this is for

Primary

Also useful for

How to use by role

Prereqs


What this is / is not

This is

This is not


Start here


Connect & updates

Action: Open an Issue or PR on GitHub with context and a summary of what changed.

Action: If something helped you, share it or open a PR with a better diagram, example, or checklist.


How to use this site

  1. Skim a topic, then copy the Action lines into your runbooks or design reviews.
  2. Adapt the Mermaid diagrams to your context (swap AWS terms for your cloud).
  3. Schedule a quick DR/chaos drill to validate the assumptions behind your design.
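
Step 1 can be partly automated. The following is a minimal sketch, assuming you have saved a topic page as plain text and that each action sits on its own line beginning with "Action:" (optionally behind a list bullet). The function names `extract_actions` and `to_checklist`, and the `[ ]` checklist format, are illustrative choices rather than anything this site provides.

```python
import re

# Matches a line that begins with "Action:", optionally preceded by
# whitespace and a "-" or "*" list bullet. Captures the action text.
ACTION_RE = re.compile(r"^\s*(?:[-*]\s*)?Action:\s*(.+?)\s*$")


def extract_actions(text: str) -> list[str]:
    """Return the text of every 'Action:' line, in document order."""
    actions = []
    for line in text.splitlines():
        match = ACTION_RE.match(line)
        if match:
            actions.append(match.group(1))
    return actions


def to_checklist(actions: list[str]) -> str:
    """Render actions as unchecked checklist items, one per line."""
    return "\n".join(f"[ ] {action}" for action in actions)


if __name__ == "__main__":
    # Sample text reuses two Action lines from this page.
    sample = (
        "Connect & updates\n"
        "Action: Open an Issue/PR on GitHub with context and what changed.\n"
        "Some unrelated prose.\n"
        "Action: If something helped you, share it or open a PR.\n"
    )
    print(to_checklist(extract_actions(sample)))
```

Paste the resulting checklist into a runbook or design-review template, then edit each item to match your own system.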