What does this writing signal mean?

DigitalOcean (GradientAI) published "From Incident Counting to SLIs: How DigitalOcean Rethought Availability". This writing signal gives public context for research themes, product direction, policy, or launch framing. High-signal details: Technical blog post on SLIs, not a model release. onlylabs links this event to 1 captured evidence page and 6 related writing signals.

DigitalOcean (GradientAI) Writing: From Incident Counting to SLIs: How DigitalOcean Rethought Availability

Captured source

source ↗

digitalocean.com/digitalocean.com/blog/sli-based-availability-framework

From Incident Counting to SLIs: How DigitalOcean Rethought Availability


published Apr 23, 2026 · seen Jun 5 · captured Jun 7 · http 200 · method plain


By Miguel Carrera

Published: April 23, 2026 · 11 min read

(14.4 * 0.0005) and 1 - sli:global:control_plane_services:availability:rate5m > (14.4 * 0.0005) ) or

Medium burn: 6x

( 1 - sli:global:control_plane_services:availability:rate6h > (6 * 0.0005) and 1 - sli:global:control_plane_services:availability:rate30m > (6 * 0.0005) ) )
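The medium-burn condition above reads as a two-window check: the alert fires only when both the long window (6h) and the short window (30m) are consuming budget faster than 6x. A minimal Python sketch of that logic, assuming the availability values have already been computed as fractions between 0 and 1; the function name `burn_alert` and its parameters are hypothetical and not taken from the post.

```python
BUDGET = 0.0005  # error budget fraction used in the expression above (99.95% SLO)


def burn_alert(long_avail: float, short_avail: float, factor: float,
               budget: float = BUDGET) -> bool:
    """Return True when BOTH windows exceed `factor` times the budget.

    long_avail and short_avail are availability ratios in [0, 1], for
    example the values of the rate6h and rate30m series for the 6x alert.
    """
    threshold = factor * budget
    return (1 - long_avail) > threshold and (1 - short_avail) > threshold


# Medium burn (6x): the error rate must exceed 0.3% on both windows.
print(burn_alert(long_avail=0.996, short_avail=0.995, factor=6))   # True
print(burn_alert(long_avail=0.996, short_avail=0.9999, factor=6))  # False
```

Requiring the short window as well means the alert stops firing soon after the error rate recovers, even though the long window still reflects the earlier failures.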

Error Budgets as Engineering Policy

With SLIs and multi-window alerting in place, we began strictly tracking error budgets. The error budget is the inverse of the SLO: if our target is 99.95% availability, we have a 0.05% budget for failures over the measurement window.

We use a rolling 30-day window rather than a fixed calendar month. Customer pain is cumulative. Their trust in our platform doesn’t reset on the first day of the month, and neither should our budget.
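To make the size of that budget concrete, the arithmetic can be sketched in a few lines of Python. The 99.95% target and 30-day window come from the text; expressing the budget as "minutes of full outage" is an illustrative simplification, since a request-based SLI spends budget per failed request rather than per minute.

```python
SLO = 0.9995       # 99.95% availability target (from the text)
WINDOW_DAYS = 30   # rolling window length (from the text)

budget_fraction = 1 - SLO               # share of the window allowed to fail
window_minutes = WINDOW_DAYS * 24 * 60  # 43,200 minutes in 30 days
budget_minutes = budget_fraction * window_minutes

print(f"budget: {budget_fraction:.4%} of the window")
print(f"window: {window_minutes} minutes")
print(f"equivalent full-outage time: {budget_minutes:.1f} minutes")
```

Under these assumptions the budget works out to roughly 21.6 minutes of total unavailability across any 30-day span.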

Tracking

Burn rate (which we covered in the previous section): how fast we’re consuming the budget right now. This catches incidents.

Remaining budget: the absolute percentage left. As consumption crosses specific thresholds, the policy response escalates.
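The two tracked quantities can be sketched as follows. This is a simplified, request-count version with hypothetical function names; it assumes good and bad events are simple counts, whereas the post's data plane SLIs use a resource-minutes approach that is not reproduced here.

```python
BUDGET = 0.0005  # 0.05% error budget for a 99.95% SLO


def burn_rate(bad: int, total: int, budget: float = BUDGET) -> float:
    """Error rate over a recent window, as a multiple of the budget.

    1.0 means the budget would be spent exactly at the end of the window.
    """
    if total == 0:
        return 0.0
    return (bad / total) / budget


def remaining_budget(bad: int, total: int, budget: float = BUDGET) -> float:
    """Fraction of the budget left over the full rolling window.

    Negative values mean the budget is overspent (more than 100% consumed).
    """
    if total == 0:
        return 1.0
    return 1.0 - (bad / total) / budget


# 30 failures in the last 10,000 requests: a 0.3% error rate.
print(f"{burn_rate(30, 10_000):.1f}x")             # 6.0x
# 300 failures in 1,000,000 requests over the 30-day window.
print(f"{remaining_budget(300, 1_000_000):.0%}")   # 40%
```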

Policy

The error budget is not just a reporting metric. It directly influences what teams can ship and how they allocate their time.

We define four zones:

| Area | Green (0-60%) | Yellow (61-80%) | Orange (81-100%) | Red (>100%) |
| --- | --- | --- | --- | --- |
| Changes | Operate normally. | Caution. Verify no impact on dependencies. | Increased risk. Pause large rollouts. Standard maintenance and fixes only. | Critical risk. Pause rollouts. Low-impact maintenance and fixes only. |
| Approvals | Standard | Team Lead or Senior IC review | Staff Eng review | Principal Eng review |
| Resourcing | Normal sprint allocation | Allocate ~50% of sprint to reliability | Allocate ~80% of sprint to reliability | 100% allocation to reliability and debt |

This makes the error budget a decision-making tool rather than just a dashboard metric. When a team is in the green, they have room to ship fast and take risks. When they’re in orange, large rollouts stop, and most of the sprint shifts to reliability work. When they hit red, everything stops except stabilization.
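The zone lookup itself is simple enough to sketch. The thresholds and policy text below are transcribed from the four zones defined above; the function and dictionary names are hypothetical, and treating each upper bound as inclusive (so 60.5% counts as yellow) is an assumption, since fractional values between the listed ranges are not specified.

```python
# Policy per zone, transcribed from the zone definitions above.
POLICY = {
    "green":  {"approval": "Standard",
               "resourcing": "Normal sprint allocation"},
    "yellow": {"approval": "Team Lead or Senior IC review",
               "resourcing": "~50% of sprint to reliability"},
    "orange": {"approval": "Staff Eng review",
               "resourcing": "~80% of sprint to reliability"},
    "red":    {"approval": "Principal Eng review",
               "resourcing": "100% to reliability and debt"},
}


def zone(consumed_pct: float) -> str:
    """Map error-budget consumption (in percent) to a policy zone.

    Assumption: upper bounds are inclusive, so 60 is green and 60.5 is yellow.
    """
    if consumed_pct < 0:
        raise ValueError("consumption cannot be negative")
    if consumed_pct <= 60:
        return "green"
    if consumed_pct <= 80:
        return "yellow"
    if consumed_pct <= 100:
        return "orange"
    return "red"


for pct in (45, 72, 100, 130):
    z = zone(pct)
    print(pct, z, "->", POLICY[z]["approval"])
```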

Each product line follows these guidelines. When a high-severity incident impacts multiple products and burns through the budget across the board, the policy makes the response automatic rather than a debate about whether to slow down.

From Core to Inference Cloud

Everything described so far was built for core infrastructure products: CPU Droplets, Spaces, Managed Databases. But the same framework applies directly to newer product lines, including GPU Droplets, our Inference platform, and AI agents.

This was intentional. We built a framework with clear principles (control/data plane split, magnitude weighting, recording rules, multi-window alerting, error budget policy) and applied them to core products first. Once the patterns were proven, extending them to new products became a matter of defining the right SLIs, not rebuilding the infrastructure.

GPU Droplets follow the same model as CPU Droplets: the Control Plane SLI tracks API request success for GPU instance lifecycle operations, and the Data Plane SLI tracks GPU instance availability using the same resource-minutes approach. GPU Droplets already have a published SLA.

For the inference platform and AI agents, we’ve started applying the same framework. For example: Serverless Inference availability is request-based at the serving layer: non-5xx responses as a percentage of total requests to the inference endpoint. AI agents follow the same pattern, measuring request success rate for agent-hosted endpoints.
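The request-based definition can be sketched directly: count every response whose status is not in the 5xx range as good, and divide by the total. The function name and the sample status codes are hypothetical; only the "non-5xx over total requests" rule comes from the text.

```python
from typing import Iterable, Optional


def request_availability(status_codes: Iterable[int]) -> Optional[float]:
    """Non-5xx responses as a fraction of all requests.

    Returns None when there were no requests, since the ratio is undefined.
    """
    codes = list(status_codes)
    if not codes:
        return None
    good = sum(1 for code in codes if not 500 <= code <= 599)
    return good / len(codes)


# Ten sample requests: two 5xx failures. The 429 counts as good because
# it is a non-5xx response under this definition.
sample = [200, 200, 429, 500, 503, 200, 200, 200, 200, 200]
print(request_availability(sample))  # 0.8
```

One consequence of this definition is that client-side errors and rate limiting (4xx) do not consume error budget; only server-side failures do.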

Availability numbers are easy to publish. What’s harder is building a measurement framework that you actually trust, where the numbers reflect what customers truly experience rather than how you chose to count incidents. That’s what this system gives us: a precise and weighted view of platform health that doesn’t flatter us when things are partially broken and doesn’t punish us for being honest about failures. When we publish SLAs for the Inference Cloud, our internal operational framework will already be in place.



Notability

notability 5.0/10

Technical blog post on SLIs, not a model release.