Post-mortem 15 Jan 2026

"It's the cert" — a 4 AM outage in three acts

Composite story from real SRE war rooms (names changed).

Act I. Pager fires: "API latency". Health checks are green. Someone says TLS. The load balancer is still serving yesterday's cert: not expired, but its SAN list does not cover a canary hostname. Customers are hitting a hostname ops forgot was in the rotation.
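A check like the one below could have caught the Act I mismatch before customers did. It is a minimal sketch of SAN matching, assuming the DNS names have already been read out of the cert; the function name and the example.com hostnames are hypothetical, and the wildcard rule is the simplified "one leftmost label only" form rather than a full implementation of the matching rules.

```python
def san_covers(hostname: str, san_dns_names: list[str]) -> bool:
    """Return True if hostname is covered by any DNS name in the SAN list.

    Simplified matching: exact match, or a wildcard that stands in for
    exactly one leftmost label (so *.example.com does not cover a.b.example.com).
    """
    host = hostname.lower().rstrip(".")
    host_labels = host.split(".")
    for name in san_dns_names:
        pattern = name.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            suffix_labels = pattern[2:].split(".")
            # The wildcard consumes one label, so the host must be one label longer.
            if (len(host_labels) == len(suffix_labels) + 1
                    and host_labels[1:] == suffix_labels):
                return True
    return False


# Hypothetical SAN list and hostnames, for illustration only.
sans = ["api.example.com", "*.example.com"]
main_ok = san_covers("api.example.com", sans)
canary_ok = san_covers("canary.api.example.com", sans)
```

Here the main hostname is covered, but the canary is not: the wildcard only reaches one label deep, which is the kind of gap that stays invisible when health checks probe only the main hostname.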

Act II. Fixing the cert is fast; redeploying it to four nodes and reloading two different daemons is not. The runbooks disagree on reload vs restart. Someone runs nginx -s reload on a box still running OpenSSL 1.0-era defaults, and it fails silently: nothing changes until the HUP actually applies.
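One way to avoid trusting a silent reload is to compare what each node actually serves against the cert that was meant to be deployed. The sketch below assumes the DER bytes served by each node have already been collected (how they are fetched is left out); the function name, node names, and byte strings are placeholders, not real certs.

```python
import hashlib


def stale_nodes(expected_der: bytes, served: dict[str, bytes]) -> list[str]:
    """Return the nodes whose served cert differs from the expected one.

    Certs are compared by SHA-256 fingerprint of their DER encoding.
    """
    expected_fp = hashlib.sha256(expected_der).hexdigest()
    return sorted(
        node
        for node, der in served.items()
        if hashlib.sha256(der).hexdigest() != expected_fp
    )


# Placeholder bytes standing in for DER-encoded certs.
new_cert = b"new-cert-der"
old_cert = b"old-cert-der"
served_by_node = {
    "node-1": new_cert,
    "node-2": new_cert,
    "node-3": old_cert,  # the reload did not take effect here
    "node-4": new_cert,
}
still_stale = stale_nodes(new_cert, served_by_node)
```

Run after every reload, a comparison like this turns "the reload probably worked" into a list of nodes that still need attention.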

Act III. After the incident, the team ships better expiry alerts (at 30 / 14 / 7 days), auto-renewal that deploys to every node, and a single audit log that answers "who deployed what, when". That's the workflow ManageMyCert is built around.
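The 30 / 14 / 7 day alerting can be sketched in a few lines. This is an illustration of the idea only, not ManageMyCert's implementation; the function name and the dates are made up, and the cert's notAfter time is assumed to be available as a timezone-aware datetime.

```python
from datetime import datetime, timezone
from typing import Optional

THRESHOLDS_DAYS = (30, 14, 7)


def tightest_threshold(not_after: datetime, now: datetime,
                       thresholds: tuple[int, ...] = THRESHOLDS_DAYS) -> Optional[int]:
    """Return the smallest threshold (in days) the cert has crossed, or None.

    An already expired cert has crossed every threshold, so it reports
    the smallest one.
    """
    days_left = (not_after - now).days
    crossed = [t for t in thresholds if days_left <= t]
    return min(crossed) if crossed else None


# Made-up dates for illustration.
now = datetime(2026, 1, 15, tzinfo=timezone.utc)
far_off = tightest_threshold(datetime(2026, 3, 15, tzinfo=timezone.utc), now)
ten_days = tightest_threshold(datetime(2026, 1, 25, tzinfo=timezone.utc), now)
five_days = tightest_threshold(datetime(2026, 1, 20, tzinfo=timezone.utc), now)
```

Reporting only the tightest threshold crossed keeps each cert to one alert level at a time, so the 7-day page is not buried under the 30-day notice.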

Moral: expiry dates are easy; coordination across servers is what keeps you sleeping through the night.