Most failures are predictable.
Most deployment failures are predictable. After analyzing 200+ production deployments across our portfolio, we identified 47 specific factors that separate successful deployments from disasters. This assessment helps you catch issues before they become outages.
of disasters had a known signal in the readiness review that was either missed or dismissed.
median revenue + recovery cost of a 1-hour outage at the mid-market SaaS scale.
fewer 'all-hands' incident calls among teams running this checklist pre-deploy.
Before you push the button.
- Change is behind a feature flag
- Rollback path tested in staging within the last 24 h
- Migration runs forward and backward
- Canary plan documented (size, duration, signals)
- On-call engineer is awake and identified
- Incident comms channel created
- Customer-impact statement drafted
- Observability dashboards open in tabs
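
The list above can be turned into a gate that a pipeline evaluates rather than a page someone skims. The following is a minimal sketch, not a prescribed implementation: the item keys, the `Readiness` type, and the `blockers` function are hypothetical names, and it assumes "within 24h" means the rollback test ran in the 24 hours before the deploy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical keys, one per checklist item above (rollback is handled separately).
REQUIRED_ITEMS = (
    "feature_flag",
    "migration_reversible",
    "canary_plan_documented",
    "on_call_identified",
    "incident_channel_created",
    "customer_impact_drafted",
    "dashboards_open",
)

# Assumption: the rollback test must be no older than 24 hours at deploy time.
ROLLBACK_TEST_MAX_AGE = timedelta(hours=24)


@dataclass
class Readiness:
    confirmed: set[str]                      # items a human has ticked off
    rollback_tested_at: Optional[datetime]   # last staging rollback test, if any


def blockers(r: Readiness, now: datetime) -> list[str]:
    """Return the unmet checklist items; an empty list means the deploy may proceed."""
    missing = [item for item in REQUIRED_ITEMS if item not in r.confirmed]
    if r.rollback_tested_at is None:
        missing.append("rollback_tested")
    elif now - r.rollback_tested_at > ROLLBACK_TEST_MAX_AGE:
        missing.append("rollback_test_stale")
    return missing


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    state = Readiness(confirmed={"feature_flag"}, rollback_tested_at=None)
    print(blockers(state, now))
```

In a CI job, a non-empty result would typically fail the step, so skipping the checklist requires changing the pipeline and not just ignoring a document.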
What to watch in flight.
| Signal | Abort Threshold | Why |
|---|---|---|
| Error rate (5xx) | +50% vs baseline | Below this is noise; above is real |
| P95 latency (hot flow) | +30% vs baseline | Mechanical; users feel the tail |
| Auth failure rate | +10% vs baseline | Auth issues compound across the funnel |
| Saturation (CPU / DB) | > 75% | Headroom is required for natural variance |
| Customer-reported incidents | ≥ 1 unique | Treat the first report as the canary |
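
One way to keep these thresholds from becoming a judgment call mid-deploy is to encode the table directly. The sketch below is illustrative only: the `Metrics` fields and `abort_reasons` are hypothetical names, the relative thresholds are read as "strictly above baseline by this fraction", and the handling of a zero baseline is an assumption the table does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate_5xx: float       # fraction of requests returning 5xx
    p95_latency_ms: float       # P95 latency of the hot flow
    auth_failure_rate: float    # fraction of failed auth attempts
    saturation: float           # highest of CPU / DB utilisation, 0.0 to 1.0
    customer_reports: int = 0   # unique customer-reported incidents


# Values copied from the table; relative limits are fractions over baseline.
MAX_RELATIVE_INCREASE = {
    "error_rate_5xx": 0.50,
    "p95_latency_ms": 0.30,
    "auth_failure_rate": 0.10,
}
MAX_SATURATION = 0.75
MAX_CUSTOMER_REPORTS = 0  # the first report already triggers an abort


def abort_reasons(baseline: Metrics, canary: Metrics) -> list[str]:
    """Return the names of every signal that crossed its abort threshold."""
    reasons = []
    for name, limit in MAX_RELATIVE_INCREASE.items():
        base, current = getattr(baseline, name), getattr(canary, name)
        # Assumption: with a zero baseline, any non-zero value counts as a breach.
        breached = current > base * (1 + limit) if base > 0 else current > 0
        if breached:
            reasons.append(name)
    if canary.saturation > MAX_SATURATION:
        reasons.append("saturation")
    if canary.customer_reports > MAX_CUSTOMER_REPORTS:
        reasons.append("customer_reports")
    return reasons


if __name__ == "__main__":
    baseline = Metrics(0.002, 200.0, 0.01, 0.55)
    canary = Metrics(0.0031, 250.0, 0.0105, 0.60)
    print(abort_reasons(baseline, canary))
```

Because the limits live in version control, changing one goes through review like any other change, which is the point of pre-approving them.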
After the green light.
- Held canary for full duration before scale-out
- Dashboards green for ≥ 30 min at 100%
- Customer-comms updated (if applicable)
- Flag retired or marked permanent
- Documentation updated
- Short post-deploy retro committed within 48 h
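
The "green for ≥ 30 min at 100%" item is easy to fudge by eye, so it may help to compute it. This is a rough sketch under stated assumptions: `soak_complete` is a hypothetical helper, the samples are assumed to be periodic health checks taken after reaching 100% traffic, and any single red sample is assumed to restart the clock.

```python
from datetime import datetime, timedelta

# From the checklist: dashboards must stay green for at least 30 minutes.
SOAK = timedelta(minutes=30)


def soak_complete(samples: list[tuple[datetime, bool]], now: datetime) -> bool:
    """Check whether the current unbroken green streak has lasted at least SOAK.

    samples: (timestamp, all_dashboards_green) pairs, oldest first,
    recorded only while the deploy is at 100% traffic.
    """
    green_since = None
    for timestamp, green in samples:
        if not green:
            green_since = None          # a red sample restarts the clock
        elif green_since is None:
            green_since = timestamp     # start of a new green streak
    return green_since is not None and now - green_since >= SOAK


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    checks = [(start + timedelta(minutes=5 * i), True) for i in range(7)]
    print(soak_complete(checks, start + timedelta(minutes=30)))
```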
What separates smooth deploys.
✓ DO
- Use the same checklist every deploy, even small ones
- Treat the canary phase as load-bearing, not theater
- Pre-approve abort thresholds as code, not opinion
- Run the post-deploy retro within 48 hours
- Build the checklist into CI so deploys can't bypass it
✗ DON'T
- Skip the rollback test 'because it's a small change'
- Hide canary signal behind manual dashboard refreshes
- Allow emergency hotfixes to bypass the checklist
- Defer the retro until 'when we have time'
- Treat the checklist as a one-time exercise