Most SLOs are aspirations, not budgets.
A budget you can't meet isn't a budget; it's a wish. We tracked SLO survival across 10,000 services for 12 months and found one dominant pattern: when the SLO was set far beyond what the team was already delivering, it was quietly abandoned within a quarter. When it was set slightly tighter than the historical P95, it became a real operational tool.
Targets that hold, by service type.
| Service Class | P95 Target | P99 Target | Recovery Window |
|---|---|---|---|
| Edge / CDN read | 120 ms | 350 ms | 5 min |
| Gateway / Auth | 200 ms | 600 ms | 10 min |
| Read API (cached) | 180 ms | 500 ms | 10 min |
| Write API (idempotent) | 450 ms | 1.2 s | 30 min |
| Reporting / Analytics | 1.5 s | 4 s | 1 hr |
| Background job (P95 of completion) | 30 s | 120 s | 1 hr |
The biggest mistake teams make is setting one SLO across the whole platform. A 200 ms P95 makes sense for a gateway and is suicidal for a reporting service. Class your services first, set bands second.
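One way to make "class first, band second" concrete is to keep the bands in a lookup keyed by service class. The sketch below copies the targets from the table above; the class keys, the `Band` type, and the `within_band` helper are hypothetical names chosen for illustration, not part of any tooling described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    p95_ms: float      # P95 latency target, milliseconds
    p99_ms: float      # P99 latency target, milliseconds
    recovery_min: int  # recovery window, minutes


# Targets copied from the table above. The dictionary keys are
# hypothetical identifiers, not names used by the source.
BANDS = {
    "edge": Band(120, 350, 5),
    "gateway_auth": Band(200, 600, 10),
    "read_api": Band(180, 500, 10),
    "write_api": Band(450, 1200, 30),
    "reporting": Band(1500, 4000, 60),
    "background_job": Band(30_000, 120_000, 60),
}


def within_band(service_class: str, p95_ms: float, p99_ms: float) -> bool:
    """Return True if measured latencies meet the band for this class."""
    band = BANDS[service_class]
    return p95_ms <= band.p95_ms and p99_ms <= band.p99_ms
```

A service measured at a 900 ms P95 and a 2.5 s P99 passes `within_band("reporting", 900, 2500)` but fails `within_band("gateway_auth", 900, 2500)`, which is the point of banding per class rather than per platform.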
SLO adherence by company stage.
| Segment | EDGE | AUTH | READ | WRITE | ASYNC |
|---|---|---|---|---|---|
| Seed / Series A | ● | ● | ◐ | ◐ | ◐ |
| Series B / Growth | ■ | ● | ● | ◐ | ◐ |
| Late stage | ■ | ■ | ● | ● | ● |
| Public / Scale | ■ | ■ | ■ | ● | ● |
The biggest jump in adherence happens between Series A and Series B — typically when the org adds a dedicated SRE function and starts treating SLOs as a product surface, not a metric.
The numbers behind the budget.
About 43 minutes of allowed downtime per month at three 9s (99.9% of a 30-day month leaves 43.2 minutes). Most teams overrun this within their first incident, not because of failures, but because of planned changes that consumed budget unnoticed.
Only 8% of services with SLOs set above last-quarter baseline still met them after 90 days. The other 92% silently dropped them.
of services with SLOs set 10–15% tighter than historical P95 maintained adherence after 90 days — and improved measurably the following quarter.
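The downtime figure at three 9s is plain arithmetic, and it helps to see how quickly planned work eats into it. A minimal sketch, assuming a 30-day month and an availability-style SLO; the function names and the maintenance durations in the usage note are made up for illustration.

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime in a window of `days` for an availability SLO.

    `slo` is a fraction, e.g. 0.999 for three 9s.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_left_minutes(slo: float, consumed_minutes: list[float], days: int = 30) -> float:
    """Budget remaining after subtracting every consumer, planned or not."""
    return downtime_budget_minutes(slo, days) - sum(consumed_minutes)
```

With `slo=0.999` the budget is roughly 43.2 minutes. Two hypothetical 15-minute maintenance windows plus one 20-minute incident total 50 minutes, so `budget_left_minutes(0.999, [15, 15, 20])` is negative: the planned changes alone used about 70% of the budget before anything failed.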
What moves adherence.
| Lever | Median Lift | Implementation Note |
|---|---|---|
| Tighten by 10–15%, not 2× | +44% | Realistic ceilings hold |
| Class services before banding | +31% | Auth ≠ Reporting; band separately |
| Burn-rate alerts (multi-window) | +22% | Catches drift before exhaustion |
| SLO as a release-gate | +18% | Block deploys when budget is < 20% remaining |
| Quarterly SLO review ritual | +12% | Keeps SLOs honest as the system evolves |
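The release-gate lever in the table can be sketched as a single check run before a deploy. This is an illustration under stated assumptions: `deploy_allowed` is a hypothetical helper, the budget is measured in whatever unit the team already tracks (minutes, bad requests), and the 20% floor is the threshold from the table.

```python
def deploy_allowed(budget_total: float, budget_consumed: float,
                   min_remaining: float = 0.20) -> bool:
    """Return False when less than `min_remaining` of the error budget is left.

    `budget_total` and `budget_consumed` share a unit (e.g. minutes of
    downtime, or count of bad requests) over the same SLO window.
    """
    if budget_total <= 0:
        raise ValueError("budget_total must be positive")
    remaining_fraction = (budget_total - budget_consumed) / budget_total
    return remaining_fraction >= min_remaining
```

In a CI pipeline this would typically run as an early step and fail the job when it returns `False`. Where the consumed figure comes from (a metrics query, a budget service) is left open here because the source does not specify it.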
Do this. Don't do that.
✓ DO
- Set SLOs 10–15% tighter than last quarter's P95
- Class services before assigning bands
- Use multi-window burn-rate alerts (1h + 6h)
- Treat SLO budget exhaustion as a release-blocking event
- Run a quarterly SLO review with the team that owns the service
✗ DON'T
- Aim for four-9s on a service that's never broken three
- Apply a single SLO across the whole platform
- Use only threshold alerts on raw metrics
- Treat error budget as 'free downtime'
- Set SLOs from the top down without team buy-in
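The first DO item, setting the target 10–15% tighter than last quarter's P95, can be sketched as follows. It assumes latency samples in milliseconds, reads "tighter" as a lower latency target, and uses a nearest-rank percentile; `percentile` and `propose_p95_target` are hypothetical helpers, and a real system would more likely read the P95 from its metrics store.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def propose_p95_target(samples_ms: list[float], tighten: float = 0.10) -> float:
    """Propose a P95 target `tighten` (10-15%) below the historical P95."""
    if not 0.10 <= tighten <= 0.15:
        raise ValueError("tighten should stay within 10-15%")
    return percentile(samples_ms, 95) * (1.0 - tighten)
```

The guard on `tighten` is deliberate: it turns the "not 2×" advice into something the code refuses to do, so an over-ambitious target has to be argued for rather than typed in.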
A six-step SLO audit.
1. List all customer-facing services
   Anything outside this list is internal, and different rules apply. Document the list and refresh it quarterly.
2. Class each service
   Edge / Auth / Read / Write / Reporting / Async. Use the table above as a starting matrix.
3. Pull last-quarter P95 / P99
   If you don't have 90 days of data, your SLO is a guess. Wait until you do.
4. Set the band
   10–15% tighter than baseline. If the team can't defend the number, loosen it until they can.
5. Wire burn-rate alerts
   Two windows minimum: 1h fast burn, 6h slow burn. Page on-call only for fast burn; open a ticket for slow burn.
6. Schedule the review
   Quarterly retrospective: did the SLO hold, why, and what changed. Adjust the band; don't drop it.
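The burn-rate step in the audit can be sketched as a two-window check. Burn rate here is the observed bad-request ratio divided by the budgeted ratio (1 minus the SLO), so a rate of 1 spends the budget exactly over the SLO window. The thresholds of 14.4 for the 1h window and 6 for the 6h window are assumptions borrowed from common multi-window practice for a 30-day SLO, not numbers from this article, and routing fast burn to a page and slow burn to a ticket is likewise an assumption of this sketch.

```python
def burn_rate(bad_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is burning."""
    budget_ratio = 1.0 - slo
    if budget_ratio <= 0:
        raise ValueError("slo must be below 1.0")
    return bad_ratio / budget_ratio


def evaluate_burn(bad_ratio_1h: float, bad_ratio_6h: float, slo: float,
                  fast_threshold: float = 14.4,  # assumed, not from the source
                  slow_threshold: float = 6.0    # assumed, not from the source
                  ) -> str:
    """Return 'page' on fast burn, 'ticket' on slow burn, otherwise 'ok'."""
    if burn_rate(bad_ratio_1h, slo) >= fast_threshold:
        return "page"
    if burn_rate(bad_ratio_6h, slo) >= slow_threshold:
        return "ticket"
    return "ok"
```

For a latency SLO, a "bad" request would be one slower than the band's target; the ratios would come from the team's metrics system over the trailing 1h and 6h windows.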
- Customer-facing service list current
- Each service is classed
- Bands documented in repo, not in a wiki
- P95 + P99 alerts wired with burn-rate windows
- SLO budget feeds release-gate logic
- Quarterly SLO review on the calendar
- On-call playbook references SLO targets
- Customer-facing status page reflects SLO posture