From Incident Counting to SLIs: How DigitalOcean Rethought Availability
By Miguel Carrera
Published: April 23, 2026 11 min read
(14.4 * 0.0005) and 1 - sli:global:control_plane_services:availability:rate5m > (14.4 * 0.0005) ) or
Medium burn: 6x
( 1 - sli:global:control_plane_services:availability:rate6h > (6 * 0.0005) and 1 - sli:global:control_plane_services:availability:rate30m > (6 * 0.0005) ) )
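The multi-window condition above can be read as a small predicate: a burn-rate tier fires only when both its long window and its short window show an error rate above the burn factor times the error budget. The following is a minimal Python sketch of that logic, not the production rule. The function names are hypothetical, and the availability values passed in are assumed to come from the corresponding recording rules.

```python
# Error budget for a 99.95% SLO, matching the 0.0005 constant in the expression.
ERROR_BUDGET = 0.0005


def window_pair_burning(long_availability, short_availability, factor,
                        budget=ERROR_BUDGET):
    """Return True when BOTH windows exceed factor * budget.

    Requiring the short window as well means the alert stops firing
    soon after the error rate recovers.
    """
    threshold = factor * budget
    long_error = 1 - long_availability
    short_error = 1 - short_availability
    return long_error > threshold and short_error > threshold


def should_alert(pairs):
    """pairs: iterable of (long_availability, short_availability, factor).

    Tiers are OR-ed together, as in the alert expression.
    """
    return any(window_pair_burning(l, s, f) for l, s, f in pairs)


# Medium burn tier (6x): the 6h and 30m windows are both above 0.3% errors.
medium_firing = window_pair_burning(0.996, 0.995, factor=6)

# Same long window, but the short window has recovered: no alert.
medium_recovered = window_pair_burning(0.996, 0.9999, factor=6)
```

Only the medium tier is exercised here because its two windows (6h and 30m) are fully visible in the expression.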
Error Budgets as Engineering Policy
With SLIs and multi-window alerting in place, we began strictly tracking error budgets. The error budget is the complement of the SLO: if our target is 99.95% availability, we have a 0.05% budget for failures over the measurement window.
We use a rolling 30-day window rather than a fixed calendar month. Customer pain is cumulative. Their trust in our platform doesn’t reset on the first day of the month, and neither should our budget.
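To make the arithmetic concrete, here is a small Python sketch of a 99.95% target over a rolling 30-day window. The function names and the per-day `(good, total)` input shape are assumptions made for illustration. The article does not describe how the budget is actually computed internally.

```python
SLO_TARGET = 0.9995
WINDOW_DAYS = 30


def budget_minutes(slo_target=SLO_TARGET, window_days=WINDOW_DAYS):
    """Total allowed 'bad' minutes in the window (about 21.6 for 99.95% / 30 days)."""
    return (1 - slo_target) * window_days * 24 * 60


def hours_to_exhaust(burn_rate, window_days=WINDOW_DAYS):
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed(daily_counts, slo_target=SLO_TARGET, window_days=WINDOW_DAYS):
    """Fraction of the budget consumed over the most recent window.

    daily_counts: list of (good, total) tuples, oldest first. Only the last
    `window_days` entries count, so old incidents age out gradually instead
    of resetting on a calendar boundary.
    """
    recent = daily_counts[-window_days:]
    good = sum(g for g, _ in recent)
    total = sum(t for _, t in recent)
    if total == 0:
        return 0.0
    error_rate = 1 - good / total
    return error_rate / (1 - slo_target)


# 30 days at 99.99% success uses roughly a fifth of a 0.05% budget.
steady = budget_consumed([(9999, 10000)] * 30)
```

At a 14.4x burn rate the 30-day budget would last about 50 hours, and at 6x about 120 hours.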
Tracking
Burn rate (which we covered in the previous section): how fast we’re consuming the budget right now. This catches incidents.
Remaining budget: the absolute percentage left. As consumption crosses specific thresholds, the policy response escalates.
Policy
The error budget is not just a reporting metric. It directly influences what teams can ship and how they allocate their time.
We define four zones:
| Area | Green (0-60%) | Yellow (61-80%) | Orange (81-100%) | Red (>100%) |
| --- | --- | --- | --- | --- |
| Changes | Operate normally. | Caution. Verify no impact on dependencies. | Increased risk. Pause large rollouts. Standard maintenance and fixes only. | Critical risk. Pause rollouts. Low-impact maintenance and fixes only. |
| Approvals | Standard | Team Lead or Senior IC review | Staff Eng review | Principal Eng review |
| Resourcing | Normal sprint allocation | Allocate ~50% of sprint to reliability | Allocate ~80% of sprint to reliability | 100% allocation to reliability and debt |
This makes the error budget a decision-making tool rather than just a dashboard metric. When a team is in the green, they have room to ship fast and take risks. When they’re in orange, large rollouts stop, and most of the sprint shifts to reliability work. When they hit red, everything stops except stabilization.
Each product line follows these guidelines. When a high-severity incident impacts multiple products and burns through the budget across the board, the policy makes the response automatic rather than a debate about whether to slow down.
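The four zones can be expressed as a simple lookup from budget consumption to policy. This is an illustrative Python sketch only: the `ZonePolicy` type and `zone_for` function are hypothetical names. Treating each upper bound as inclusive (so exactly 60% is still green) is also an assumption, since the policy lists whole-number ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZonePolicy:
    name: str
    changes: str
    approvals: str
    resourcing: str


# (inclusive upper bound in percent of budget consumed, policy)
ZONES = [
    (60, ZonePolicy("Green", "Operate normally",
                    "Standard", "Normal sprint allocation")),
    (80, ZonePolicy("Yellow", "Caution; verify no impact on dependencies",
                    "Team Lead or Senior IC review", "~50% of sprint to reliability")),
    (100, ZonePolicy("Orange", "Pause large rollouts; standard maintenance and fixes only",
                     "Staff Eng review", "~80% of sprint to reliability")),
]

RED = ZonePolicy("Red", "Pause rollouts; low-impact maintenance and fixes only",
                 "Principal Eng review", "100% to reliability and debt")


def zone_for(consumed_pct):
    """Map percent of error budget consumed to the matching policy zone."""
    if consumed_pct < 0:
        raise ValueError("budget consumption cannot be negative")
    for upper_bound, policy in ZONES:
        if consumed_pct <= upper_bound:
            return policy
    return RED  # more than 100% of the budget has been spent


policy = zone_for(85)
summary = f"{policy.name}: {policy.approvals}"
```

Encoding the policy as data keeps the thresholds in one place, so a dashboard and a deploy gate could both call the same lookup.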
From Core to Inference Cloud
Everything described so far was built for core infrastructure products: CPU Droplets, Spaces, Managed Databases. But the same framework applies directly to newer product lines, including GPU Droplets, our Inference platform, and AI agents.
This was intentional. We built a framework with clear principles (control/data plane split, magnitude weighting, recording rules, multi-window alerting, error budget policy), and applied them to core products first. Once the patterns were proven, extending to new products became a matter of defining the right SLIs, not rebuilding the infrastructure.
GPU Droplets follow the same model as CPU Droplets: the Control Plane SLI tracks API request success for GPU instance lifecycle operations, and the Data Plane SLI tracks GPU instance availability using the same resource-minutes approach. GPU Droplets already have a published SLA.
For the inference platform and AI agents, we’ve started applying the same framework. For example: Serverless Inference availability is request-based at the serving layer: non-5xx responses as a percentage of total requests to the inference endpoint. AI agents follow the same pattern, measuring request success rate for agent-hosted endpoints.
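As a rough illustration of the request-based definition, the Python sketch below computes availability as the share of non-5xx responses. The function name and the list-of-status-codes input are assumptions. In practice the ratio would more likely be derived from request counters than from individual responses.

```python
def request_availability(status_codes):
    """Non-5xx responses as a fraction of all responses.

    Returns None when there were no requests, since availability is
    undefined without traffic. 4xx responses count as successes here
    because the definition only excludes 5xx.
    """
    total = len(status_codes)
    if total == 0:
        return None
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return good / total


# Three of four requests avoided a server error.
sample = request_availability([200, 200, 404, 503])
```

Counting 4xx responses as successful follows from the "non-5xx" wording: a client error means the serving layer responded correctly to a bad request.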
Availability numbers are easy to publish. What’s harder is building a measurement framework that you actually trust, where the numbers reflect what customers truly experience rather than how you chose to count incidents. That’s what this system gives us: a precise and weighted view of platform health that doesn’t flatter us when things are partially broken and doesn’t punish us for being honest about failures. When we publish SLAs for the Inference Cloud, our internal operational framework will already be in place.