{"schema_version":"onlylabs.public_analysis_evidence.v1","title":"DigitalOcean (GradientAI) analysis evidence pack","description":"Public onlylabs evidence pack for cited agent analysis: captured pages, ranked public signals, and stored web-search provenance used by the background analysis workflow.","url":"https://onlylabs.fyi/labs/digitalocean","json_url":"https://onlylabs.fyi/analysis/digitalocean/evidence.json","generated_at":"2026-06-11T18:06:14.049Z","org":{"slug":"digitalocean","name":"DigitalOcean (GradientAI)","category":"neocloud","category_label":"Neocloud","dossier_url":"https://onlylabs.fyi/labs/digitalocean"},"analysis":null,"workflow":{"version":"onlylabs-deepagents-analysis-v3","provider":null,"model":null,"agent":null,"public_pack_mode":"local-pages-and-events","live_web_fetches":false,"note":"Public evidence exports do not trigger live Exa calls; stored Exa provenance is included when analysis metadata contains it."},"stats":{"pages":28,"events":140,"web":0,"evidence":88,"signal_desks":{"hiring":32,"forks":0,"releases":17,"talking":11,"repos":0},"data_radar_lanes":null,"data_radar_matches":null,"stored_analysis_evidence":null,"stored_analysis_web":null,"stored_analysis_signal_desks":null,"stored_analysis_data_radar_lanes":null,"stored_analysis_data_radar_matches":null},"stored_web_provenance":null,"evidence":[{"ref":"P1","kind":"page","title":"Financial Reporting Manager","date":"2026-06-11T04:12:10.487+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7843889","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nFinancial Reporting Manager | DigitalOcean"},{"ref":"P2","kind":"page","title":"Financial Reporting Manager","date":"2026-06-11T04:12:06.873+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7843891","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nFinancial Reporting Manager | DigitalOcean"},{"ref":"P3","kind":"page","title":"Financial Reporting Manager","date":"2026-06-11T04:12:04.227+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7843887","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nFinancial Reporting Manager | DigitalOcean"},{"ref":"P4","kind":"page","title":"Financial Reporting Manager","date":"2026-06-11T04:12:02.217+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7843888","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nFinancial Reporting Manager | DigitalOcean"},{"ref":"P5","kind":"page","title":"Financial Reporting Manager","date":"2026-06-11T04:12:02.099+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7843890","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nFinancial Reporting Manager | DigitalOcean"},{"ref":"P6","kind":"page","title":"Staff Forward Deployed Engineer","date":"2026-06-11T07:04:30.791+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7904376","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nStaff Forward Deployed Engineer | DigitalOcean"},{"ref":"P7","kind":"page","title":"Data Center Technician","date":"2026-06-11T07:04:29.169+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7743662","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nData Center Technician | DigitalOcean"},{"ref":"P8","kind":"page","title":"Director of Investor Relations","date":"2026-06-11T07:04:28.72+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7712614","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nDirector of Investor Relations | DigitalOcean"},{"ref":"P9","kind":"page","title":"Staff AI Product Manager","date":"2026-06-11T07:04:28.161+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7497252","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nStaff AI Product Manager | DigitalOcean"},{"ref":"P10","kind":"page","title":"Talent Success Business Partner (8 Months Contract)","date":"2026-06-11T07:04:27.586+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7823426","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nTalent Success Business Partner (8 Months Contract) | DigitalOcean"},{"ref":"P11","kind":"page","title":"Senior Technical Account Manager","date":"2026-06-11T07:04:26.901+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7273062","signal_url":null,"signal_json_url":null,"text":"Apply | DigitalOcean\n\n# Loading position...\n\n## Get started for free\n\nSign up and get $200 in credit for your first 60 days with DigitalOcean.*\n\nGet started\n\n*This promotional offer applies to new accounts only.\n\n© 2026 DigitalOcean, LLC. Sitemap."},{"ref":"P12","kind":"page","title":"Senior Engineer II, GPU Kernel and Performance","date":"2026-06-11T07:04:26.758+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7714480","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSenior Engineer II, GPU Kernel and Performance | DigitalOcean"},{"ref":"P13","kind":"page","title":"Director of Investor Relations","date":"2026-06-11T07:04:26.613+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7690606","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nDirector of Investor Relations | DigitalOcean"},{"ref":"P14","kind":"page","title":"Senior Software Engineer I (Storage)","date":"2026-06-11T07:04:25.22+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7986250","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSenior Software Engineer I (Storage) | DigitalOcean"},{"ref":"P15","kind":"page","title":"Principal Software Engineer","date":"2026-06-11T07:04:24.039+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7975301","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nPrincipal Software Engineer | DigitalOcean"},{"ref":"P16","kind":"page","title":"Senior Data Center Engineer","date":"2026-06-11T07:04:24.013+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7430422","signal_url":null,"signal_json_url":null,"text":"Apply | DigitalOcean\n\n# Loading position...\n\n## Get started for free\n\nSign up and get $200 in credit for your first 60 days with DigitalOcean.*\n\nGet started\n\n*This promotional offer applies to new accounts only.\n\n© 2026 DigitalOcean, LLC. 
Sitemap."},{"ref":"P17","kind":"page","title":"Software Engineer II (Storage)","date":"2026-06-11T07:04:23.735+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7986238","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSoftware Engineer II (Storage) | DigitalOcean"},{"ref":"P18","kind":"page","title":"Principal Engineer (Managed Database Services)","date":"2026-06-11T07:04:23.668+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7165138","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nPrincipal Engineer (Managed Database Services) | DigitalOcean"},{"ref":"P19","kind":"page","title":"Principal Software Engineer (PaaS)","date":"2026-06-11T07:04:22.722+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7307191","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nPrincipal Software Engineer (PaaS) | DigitalOcean"},{"ref":"P20","kind":"page","title":"digitalocean/godo v1.195.0-beta.1","date":"2026-06-11T07:03:57.970378+00:00","date_source":null,"source_url":"https://github.com/digitalocean/godo/releases/tag/v1.195.0-beta.1","signal_url":null,"signal_json_url":null,"text":"# v1.195.0-beta.1\n\nRepository: digitalocean/godo\n\nTag: v1.195.0-beta.1\n\nPublished: 2026-06-11T05:04:44Z\n\nPrerelease: yes\n\nRelease notes:\nThis is a beta pre-release. `go get ...@latest` will **not** install it — it always resolves to the latest stable release.\n\nTo install this version, pin it explicitly:\n\n```\ngo get github.com/digitalocean/godo@v1.195.0-beta.1\n```\n\n## What's new in this beta\n- #1021 - @SSharma-10 - Add Hosted Agents (OHS) endpoints\n- #1029 - @logwolvy - Hosted agents: decode SPI event stream and surface nested API errors"},{"ref":"P21","kind":"page","title":"digitalocean/doctl v1.161.0-beta.1","date":"2026-06-11T07:03:57.752963+00:00","date_source":null,"source_url":"https://github.com/digitalocean/doctl/releases/tag/v1.161.0-beta.1","signal_url":null,"signal_json_url":null,"text":"# v1.161.0-beta.1\n\nRepository: digitalocean/doctl\n\nTag: v1.161.0-beta.1\n\nPublished: 2026-06-11T05:59:32Z\n\nPrerelease: yes\n\nRelease notes:\n## Changelog\n\n* 871860c188c2b09bdca27c10d467ae394b8b826a Open Harness Server subcommands"},{"ref":"P22","kind":"page","title":"Senior FP&A Analyst - Global Lease","date":"2026-06-11T04:13:21.63+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7808392","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSenior FP&A Analyst - Global Lease | DigitalOcean"},{"ref":"P23","kind":"page","title":"Senior FP&A Analyst - Global Lease","date":"2026-06-11T04:12:34.229+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7808394","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSenior FP&A Analyst - Global Lease | DigitalOcean"},{"ref":"P24","kind":"page","title":"Senior FP&A Analyst - Global Lease","date":"2026-06-11T04:12:34.192+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7808393","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSenior FP&A Analyst - Global Lease | DigitalOcean"},{"ref":"P25","kind":"page","title":"Senior Software Engineer, Security Products","date":"2026-06-11T04:12:33.037+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7455110","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSenior Software Engineer, Security Products | DigitalOcean"},{"ref":"P26","kind":"page","title":"Senior Software Engineer, Security Products","date":"2026-06-11T04:12:32.503+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7612843","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSenior Software Engineer, Security Products | DigitalOcean"},{"ref":"P27","kind":"page","title":"Senior FP&A Analyst - Global Lease","date":"2026-06-11T04:12:31.395+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7808391","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nSenior FP&A Analyst - Global Lease | DigitalOcean"},{"ref":"P28","kind":"page","title":"Strategic Technical Account Manager, AI/ML","date":"2026-06-11T04:12:30.678+00:00","date_source":null,"source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7176634","signal_url":null,"signal_json_url":null,"text":"## Start building today\n\nFrom GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.\n\n© 2026 DigitalOcean, LLC. 
Sitemap.\n\nDark mode is coming soon.\n\n# Loading position...\n\nApply | DigitalOcean"},{"ref":"E1","kind":"event","title":"digitalocean/doctl v1.161.0-beta.1","date":"2026-06-11T05:59:32+00:00","date_source":"source","source_url":"https://github.com/digitalocean/doctl/releases/tag/v1.161.0-beta.1","signal_url":"https://onlylabs.fyi/signals/46867883-fdd2-4c30-a51f-cc84b3a7a2f2","signal_json_url":"https://onlylabs.fyi/signals/46867883-fdd2-4c30-a51f-cc84b3a7a2f2/signal.json","text":"release · digitalocean/doctl v1.161.0-beta.1 · signal_desk=releases · occurred_at=2026-06-11T05:59:32+00:00 · url=https://github.com/digitalocean/doctl/releases/tag/v1.161.0-beta.1 · raw={\"repo\":\"digitalocean/doctl\"}"},{"ref":"E2","kind":"event","title":"digitalocean/godo v1.195.0-beta.1","date":"2026-06-11T05:04:44+00:00","date_source":"source","source_url":"https://github.com/digitalocean/godo/releases/tag/v1.195.0-beta.1","signal_url":"https://onlylabs.fyi/signals/6f5d1141-196e-4beb-9f7d-64cba85032f1","signal_json_url":"https://onlylabs.fyi/signals/6f5d1141-196e-4beb-9f7d-64cba85032f1/signal.json","text":"release · digitalocean/godo v1.195.0-beta.1 · signal_desk=releases · occurred_at=2026-06-11T05:04:44+00:00 · url=https://github.com/digitalocean/godo/releases/tag/v1.195.0-beta.1 · raw={\"repo\":\"digitalocean/godo\"}"},{"ref":"E3","kind":"event","title":"Object Storage - Sr. Engineer I","date":"2026-06-11T02:47:07+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7671124","signal_url":"https://onlylabs.fyi/signals/31ed4106-de93-4651-9a69-7ae26d4734b7","signal_json_url":"https://onlylabs.fyi/signals/31ed4106-de93-4651-9a69-7ae26d4734b7/signal.json","text":"job_opened · Object Storage - Sr. 
Engineer I · signal_desk=hiring · occurred_at=2026-06-11T02:47:07+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7671124 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E4","kind":"event","title":"Principal Software Engineer","date":"2026-06-11T02:42:20+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7975301","signal_url":"https://onlylabs.fyi/signals/b9d87597-1bb6-4fbb-aa30-f4066d0540aa","signal_json_url":"https://onlylabs.fyi/signals/b9d87597-1bb6-4fbb-aa30-f4066d0540aa/signal.json","text":"job_opened · Principal Software Engineer · signal_desk=hiring · occurred_at=2026-06-11T02:42:20+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7975301 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E5","kind":"event","title":"Senior Software Engineer, Security Products","date":"2026-06-11T02:32:23+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7455112","signal_url":"https://onlylabs.fyi/signals/5cbd50bc-9ae9-470a-9716-1f23f513122b","signal_json_url":"https://onlylabs.fyi/signals/5cbd50bc-9ae9-470a-9716-1f23f513122b/signal.json","text":"job_opened · Senior Software Engineer, Security Products · signal_desk=hiring · occurred_at=2026-06-11T02:32:23+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7455112 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E6","kind":"event","title":"Senior Software Engineer, Security Products","date":"2026-06-11T02:32:23+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7455110","signal_url":"https://onlylabs.fyi/signals/317c91c5-de2c-49ba-a20e-ac1eed402060","signal_json_url":"https://onlylabs.fyi/signals/317c91c5-de2c-49ba-a20e-ac1eed402060/signal.json","text":"job_opened · Senior Software Engineer, Security 
Products · signal_desk=hiring · occurred_at=2026-06-11T02:32:23+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7455110 · raw={\"location\":\"Boston\",\"ats\":\"greenhouse\"}"},{"ref":"E7","kind":"event","title":"Principal Software Engineer (PaaS)","date":"2026-06-11T02:32:16+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7307191","signal_url":"https://onlylabs.fyi/signals/70d31bf1-cf7c-4e0e-a693-249d8c485846","signal_json_url":"https://onlylabs.fyi/signals/70d31bf1-cf7c-4e0e-a693-249d8c485846/signal.json","text":"job_opened · Principal Software Engineer (PaaS) · signal_desk=hiring · occurred_at=2026-06-11T02:32:16+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7307191 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E8","kind":"event","title":"Principal Software Engineer (PaaS)","date":"2026-06-11T02:32:16+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7307195","signal_url":"https://onlylabs.fyi/signals/a5166db4-77fb-43ef-bed9-1a7b4d8cbf39","signal_json_url":"https://onlylabs.fyi/signals/a5166db4-77fb-43ef-bed9-1a7b4d8cbf39/signal.json","text":"job_opened · Principal Software Engineer (PaaS) · signal_desk=hiring · occurred_at=2026-06-11T02:32:16+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7307195 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E9","kind":"event","title":"Principal Engineer, Managed Agents","date":"2026-06-11T02:32:13+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7184352","signal_url":"https://onlylabs.fyi/signals/8e8807f0-cc93-47ba-a935-830121d176e6","signal_json_url":"https://onlylabs.fyi/signals/8e8807f0-cc93-47ba-a935-830121d176e6/signal.json","text":"job_opened · Principal Engineer, Managed Agents · 
signal_desk=hiring · occurred_at=2026-06-11T02:32:13+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7184352 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E10","kind":"event","title":"Principal Engineer (Managed Database Services)","date":"2026-06-11T02:32:10+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7165138","signal_url":"https://onlylabs.fyi/signals/979001ad-1ae6-46d7-a31c-325eb7051d88","signal_json_url":"https://onlylabs.fyi/signals/979001ad-1ae6-46d7-a31c-325eb7051d88/signal.json","text":"job_opened · Principal Engineer (Managed Database Services) · signal_desk=hiring · occurred_at=2026-06-11T02:32:10+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7165138 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E11","kind":"event","title":"Principal Engineer (Managed Database Services)","date":"2026-06-11T02:32:10+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7165140","signal_url":"https://onlylabs.fyi/signals/0a207112-6c10-4d62-a3a6-339dfc20bcea","signal_json_url":"https://onlylabs.fyi/signals/0a207112-6c10-4d62-a3a6-339dfc20bcea/signal.json","text":"job_opened · Principal Engineer (Managed Database Services) · signal_desk=hiring · occurred_at=2026-06-11T02:32:10+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7165140 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E12","kind":"event","title":"Senior Manager, Security Products","date":"2026-06-10T17:32:35+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7960771","signal_url":"https://onlylabs.fyi/signals/91176371-41df-4b3d-a7db-b680c1716b6d","signal_json_url":"https://onlylabs.fyi/signals/91176371-41df-4b3d-a7db-b680c1716b6d/signal.json","text":"job_opened · Senior 
Manager, Security Products · signal_desk=hiring · occurred_at=2026-06-10T17:32:35+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7960771 · raw={\"location\":\"Bengaluru\",\"ats\":\"greenhouse\"}"},{"ref":"E13","kind":"event","title":"Engineering Manager, Users & Account Systems","date":"2026-06-10T17:32:21+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7960745","signal_url":"https://onlylabs.fyi/signals/a12cf859-8c4d-40b2-97e8-6e902afbdc99","signal_json_url":"https://onlylabs.fyi/signals/a12cf859-8c4d-40b2-97e8-6e902afbdc99/signal.json","text":"job_opened · Engineering Manager, Users & Account Systems · signal_desk=hiring · occurred_at=2026-06-10T17:32:21+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7960745 · raw={\"location\":\"Bengaluru\",\"ats\":\"greenhouse\"}"},{"ref":"E14","kind":"event","title":"Software Network Engineering Manager","date":"2026-06-10T17:17:11+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7913965","signal_url":"https://onlylabs.fyi/signals/2643d91b-2fab-4e6b-8a7c-c39465e697a5","signal_json_url":"https://onlylabs.fyi/signals/2643d91b-2fab-4e6b-8a7c-c39465e697a5/signal.json","text":"job_opened · Software Network Engineering Manager · signal_desk=hiring · occurred_at=2026-06-10T17:17:11+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7913965 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E15","kind":"event","title":"The Inference Alpha: Maximizing Frontier Models on AMD","date":"2026-06-10T14:27:49.137+00:00","date_source":"rss.item_date","source_url":"https://www.digitalocean.com/blog/maximize-frontier-models","signal_url":"https://onlylabs.fyi/signals/25b8e4e3-b310-4018-a498-42e0c4f8993a","signal_json_url":"https://onlylabs.fyi/signals/25b8e4e3-b310-4018-a498-42e0c4f8993a/signal.json","text":"post_published 
· The Inference Alpha: Maximizing Frontier Models on AMD · signal_desk=talking · occurred_at=2026-06-10T14:27:49.137+00:00 · url=https://www.digitalocean.com/blog/maximize-frontier-models · raw={\"excerpt\":\"At DigitalOcean, we’re committed to providing high-performance infrastructure for the next generation of AI, which is why we’ve been focused on hosting frontier Large Language Models (LLMs) on frontier GPUs—including AMD GPUs.\\nWe see inference performance as an intricate systems-level challenge. For frontier open-weight models, achieving peak output speed is not just about the raw hardware. It also depends on a complex interaction between model architecture, runtime execution, memory systems, scheduling, and decoding strategy.\\nWe believe there’s a significant “performance alpha” found in specialized inference engineering. Optimizing for both speed and cost-efficiency requires a much deeper approach than standard configuration sweeps. By taking a custom approach to the software stack, we can demonstrate that achieving performance parity with more expensive hardware is entirely possible.\\nWhile the current software ecosystem often presents non-obvious hurdles, deep engineering allows us to deliver stronger inference economics on high-performance AMD infrastructure relative to conventional flagship deployments.\\nThe Proof is in the Throughput\\nTo ground our “Performance Alpha” theory in reality, DO worked with Wafer to achieve high performance on specific frontier models on AMD GPUs through various optimizations. By utilizing Wafer’s Agent to identify inefficiencies and apply appropriate fixes, we were able to move beyond marginal gains toward order-of-magnitude improvements that change how these models are used in production.*\\nKimi 2.5 (High-Speed Single Stream)\\n\\nOn a standard 10k input / 1.5k output workload, a stock configuration on 8x MI350X/MI355x hardware delivered a baseline of 22.5 tok/s. 
Through deep kernel optimization and a customized inference framework, we increased this to 255.2 tok/s - representing an 11.33x speedup with zero trade-offs in accuracy.\\nDeepSeek V3.2 (Full-Stack Scaling)\\n\\nWhile stock frameworks achieved 38.5 tok/s for single-request output speed, our optimized stack pushed this to 200.8 tok/s. More importantly, at a concurrency of 64, we saw a 7.32x improvement in per-request output speed and boosted aggregate throughput from 548 tok/s to 2,165 tok/s.\\nGLM-5 (Flagship Efficiency)\\n\\nGLM-5 is a massive 774B parameter flagship model. By optimizing the deployment topology and specializing the decode path, we enabled a single 8-GPU MI350X node to serve this model with a mean throughput of 151.1 tok/s and an inter-token latency (ITL) of just 17.8 ms.\\nThe Economic Thesis\\nBeyond the technical achievement, these results represent a fundamental shift in the economics of frontier inference. Our work demonstrates that fully optimized AMD infrastructure can achieve elite performance levels while remaining more cost-effective than traditional flagship hardware deployments.\\nThe takeaway is straightforward: inference performance is increasingly a systems problem. Delivering both high performance and sustainable economics requires a deep, custom approach to the software stack that maximizes every cycle of the underlying silicon.\\n*****Performance results are based on internal testing using the configurations described. Results in customer environments may vary depending on hardware, specific workload characteristics, implementation, and utilization.\\nWhy “Out-of-the-Box” Software Leaves Performance on the Table\\nIn our research, we define “stock” frameworks as unmodified, out-of-the-box versions of inference engines or standard kernel libraries. 
While these tools are the fastest way to get a model running, they carry several architectural taxes that can hinder frontier model performance.\\nThe Generality Trade-off\\n\\nStock kernels are generically written to support many model shapes, which often leaves them unoptimized for the specific dimensions of frontier architectures.\\nPrefill Bias\\n\\nStandard kernels are sized for large prefill batches and carry large-scale tile dimensions, register pressure and launch machinery calibrated for that regime. At single-stream decode, this is a fundamental mismatch—workload is memory-bandwidth-bound, yet the kernel continues to schedule compute resources at prefill scale, leaving the bulk of matrix cores idle.\\nThe “Launch Tax”\\n\\nStock setups often dispatch operations like all-reduce, residual add, and RMSNorm as three separate kernels. This leads to microsecond-level administrative overhead for each call and forces unnecessary data “round-trips” to High Bandwidth Memory (HBM).\\nRigid Software Constraints\\n\\nStandard libraries frequently contain hard-coded assertions—such as requiring head counts to be multiples of 16 - that can cause immediate incompatibilities with the unique configurations of new frontier models.\\nProblem Primer: Defining the Levers of High-Speed Inference\\nTo understand how these gains were achieved, we must understand the systems-level concepts that govern frontier model performance. Inference engineering at its core is about mastering the interactions between hardware execution, memory hierarchy, the software dispatch layer, and knowing precisely where each lever sits.\\nMXFP4 (Microscaling Formats)\\nMXFP4 is an open-standard 4-bit floating point format jointly developed by AMD, NVIDIA, Microsoft, and others under the OCP Microscaling specification[1]. 
Unlike per-tensor or per-channel quantization schemes, MXFP4 operates at the block level: a group of 16 or 32 values shares a single 8-bit scaling exponent, giving an effective storage cost of approximately 4.25 bits per weight rather than a clean 4.0.\\nThis shared-exponent design is the key insight - it preserves the dynamic range needed for numerically sensitive operations like expert routing and attention projection, while still achieving the memory footprint of a 4-bit format.\\nCompression\\n\\nBF16 costs 16 bits per weight. MXFP4 at 4.25 bits is a ~3.8× reduction. For a model with hundreds of billions of parameters distributed across routed expert FFN layers, this is the difference between requiring multi-node serving and fitting comfortably on a single 8×GPU node. For GLM-5 at 774B parameters, the majority of parameters reside in expert weights that are sparsely activated - MXFP4 compresses precisely the weights that are memory-resident but rarely hot, making this an exceptionally well-targeted optimization.\\nKV Cache Headroom\\n\\nThe compression benefit is not purely about model weights. By reducing the static weight footprint, the GPU’s HBM budget is freed for a larger KV cache allocation. This directly improves throughput on long-context requests, where KV cache eviction is otherwise the binding constraint.\\nMLA (Multi-Head Latent Attention)\\nStandard Multi-Head Attention (MHA) caches full K and V tensors for every layer, every head, and every token in the sequence. At long context lengths with large batch sizes, this KV cache becomes the dominant consumer of HBM - often exceeding the model weights themselves. MLA, introduced in DeepSeek-V2 [2] and carried forward through DeepSeek-V3, Kimi K2.5, and others, addresses this by changing what is stored.\\nLow-Rank Compression\\n\\nRather than caching the full K and V matrices, MLA projects the attention input down to a low-rank latent vector c_KV of dimension d_c << d_model. 
At inference time, the full K and V heads are reconstructed from this latent vector on-the-fly via learned up-projection matrices. The KV cache now stores only c_KV per token which is a reduction of roughly 5 to 13x depending on the model’s head configuration at the cost of additional GEMM operations during decode.\\nDecoupled RoPE\\n\\nMLA pairs the compressed KV cache with a decoupled rotary positional embedding scheme. Standard RoPE is applied to a separate d_r-dimensional key component, which is cached alongside c_KV. This avoids the numerical issue of applying position-dependent transformations to vectors that will later be linearly projected, preserving attention correctness.\\nThe Kernel Challenge\\n\\nThe reconstruction path introduces a non-trivial compute pattern. During decode, each step must: (1) load c_KV from cache, (2) apply the up-projection GEMM to materialize K and V, (3) run attention across all cached latent vectors. This cannot be naively fused into a standard FlashAttention kernel. Efficient MLA execution requires a fused kernel that absorbs the up-projection into the attention computation itself — merging what would otherwise be a separate GEMM + attention dispatch into a single pass. Without this fusion, the projection GEMMs are too small to be compute-efficient at batch-1, and latency suffers significantly[3].\\nMoE (Mixture of Experts)\\nA standard dense transformer activates every parameter for every token. An MoE layer replaces the FFN block with a collection of parallel expert FFNs and a learned router[4]. For each token, the router computes a softmax over all experts and selects the top-K highest-scoring ones. Only those K experts execute; the rest remain dormant[5]. 
The result is a model with a very large parameter count but a modest activated parameter count per token—GLM-5 has 774B total parameters but activates roughly 50–60B per token[6].\\nRouting Mechanics\\n\\nIn GLM-5’s 256-expert configuration with top-K routing, each token activates a fixed number of experts (typically K=8 or K=16 depending on the layer). The router output is a sparse selection vector; the token embedding is then dispatched to the selected expert FFNs, processed, and the outputs are weighted and summed. At scale across a batch, this produces a highly irregular all-to-all communication pattern — tokens from the same batch are routed to different experts, potentially on different GPUs in a tensor-parallel deployment.\\nExpert Parallelism\\n\\nIn multi-GPU serving, expert FFN weights are sharded across devices. Each GPU hosts a subset of experts. The routing step triggers an all-to-all dispatch where tokens are physically moved to the GPU that holds their assigned expert, processed, and returned. This all-to-all is the dominant latency bottleneck in MoE serving and does not diminish with tensor parallelism — it is irreducible given the routing structure.\\nThe Empty Work Problem\\n\\nStock MoE kernels are written for batched operation. They sort and permute tokens into expert-ordered buffers before dispatching to expert FFNs, then reverse-permute the outputs. At batch-1 single-stream decode, most experts receive zero tokens. The kernel still executes the sort, allocates the permutation buffers, and iterates over the expert list — pure overhead on empty bins. Well-engineered MoE kernels for decode add an occupancy check to skip empty experts entirely, converting O(num_experts) overhead to O(K) where K is the active expert count.\\nKernel Fusion\\nEvery GPU kernel launch carries a fixed administrative cost. The driver must validate arguments, schedule the kernel onto a Streaming Multiprocessor and synchronize before the next kernel can begin. 
On ROCm/HIP, this overhead is approximately 2 to 5µs per launch and for a transformer layer executing at batch-1, the total compute time for a small operation like RMSNorm may itself be only 5 to 10µs. This means launch overhead is the same order of magnitude as the useful work.\\nThe Round-Trip Problem:\\n\\nBeyond launch overhead, unfused kernels force intermediate results to be written to HBM and read back.\\nAfter an all-reduce across TP ranks, the sequence is-\\n\\nReduced result is written to global memory\\nResidual add reads it back\\nRMSNorm result is written to global memory again\\nThe next operation reads it back again\\nEach round-trip traverses HBM at 6 TB/s — but even at that bandwidth, repeated materialization of intermediate tensors adds latency proportional to tensor size × number of round-trips\\nFor a hidden dimension of 7168 (DeepSeek-V3/Kimi K2.5) at BF16 across a batch, this is measurable at decode latency timescales.\\nFused Kernels in Practice\\n\\nA fused all_reduce + residual_add + rms_norm kernel keeps the reduced tensor in registers or L1/L2 cache throughout[7]. The all-reduce result never touches HBM — it flows directly into the add and normalization logic within the same warp execution. AMD’s AITER library provides fused_add_rms_norm as a primitive that targets exactly this pattern; additional fusion opportunities include fused QKV projection + rotary embedding and fused gating + activation in MoE FFN layers.\\nCUDA Graph Capture\\n\\nAn orthogonal but complementary approach is CUDA/HIP graph capture, which records the full sequence of kernel launches for a decode step and replays them as a single driver submission, eliminating per-kernel launch overhead entirely. 
This is particularly effective at batch-1 where the graph structure is static and the individual kernels are short.\\nSpeculative Decoding\\nAutoregressive decode is inherently serial: each token depends on all previous tokens, and the full model must execute once per generated token. The wall-clock cost per token is dominated by the memory bandwidth required to load all activated weights at batch-1, this is a pure bandwidth-bound problem regardless of how fast the matrix cores are[8]. Speculative decoding breaks this serialization by parallelizing verification[9].\\nThe Drafter-Verifier Contract\\n\\nA small, fast draft model (or a lightweight MTP head attached to the main model) proposes a sequence of K candidate tokens in K fast forward passes. The large target model then processes all K candidates simultaneously in a single forward pass, verifying each against its own distribution using a rejection sampling criterion. If the drafter’s token matches the target’s distribution, it is accepted at zero additional cost. If it diverges, the sequence is truncated at the first mismatch and the target’s correction is used.\\nMTP (Multi-Token Prediction) Heads\\n\\nRather than a separate draft model, MTP attaches parallel prediction heads directly to the main model’s hidden states. Each head predicts the token N steps ahead. This eliminates the KV cache mismatch and memory overhead of maintaining a separate draft model and keeps the drafter tightly coupled to the target’s representations, improving acceptance rates[10].\\nAcceptance Rate Sensitivity\\n\\nThe 1.6x throughput figure is highly conditional. It assumes an acceptance rate above roughly 80% on typical user prompts. For code completion or highly predictable continuations, acceptance rates can reach 90%+ and the speedup exceeds 2x. For open-ended generation with high entropy, acceptance rates drop to 50–60% and the overhead of running the drafter erodes the gain. 
In production, speculative decoding should be gated on request type or dynamically disabled when measured acceptance rate falls below a threshold.\\nTensor Parallelism (TP)\\nTensor Parallelism distributes the weight matrices of each layer across multiple GPUs. For a linear projection Y = XW, the weight matrix W is column-split across N GPUs; each GPU computes a partial result, and an all-reduce synchronizes the outputs before the next layer[11]. This allows a model that would not fit on a single GPU to be served across a node, and reduces the per-GPU memory requirement proportionally.\\nThe All-Reduce Cost\\n\\nEvery layer boundary requires an all-reduce across all TP ranks. On an 8-GPU node over NVLink or AMD Infinity Fabric, a single all-reduce for a hidden dimension of 7168 at BF16 costs approximately 5 to 15µs depending on utilization and collective implementation. A 60-layer model executing at TP=8 therefore incurs 60 all-reduces per decode step — on the order of 300 to 900µs of pure synchronization overhead per token, independent of compute. This is the irreducible “TP tax.”\\nThe TP=4 × 2 Replica Insight\\n\\nThe all-reduce cost scales with the number of participating ranks. At TP=4, each all-reduce involves 4 GPUs instead of 8 - roughly halving synchronization latency. If the model fits within the memory budget of 4 GPUs (feasible for 70 to 130B parameter models with aggressive quantization), running two independent TP=4 replicas on a single 8-GPU node doubles request throughput while maintaining lower per-token latency than a single TP=8 instance. This is not a universally applicable strategy - for 400B+ models that require TP=8 to fit, there is no choice - but it is a significant architectural decision for mid-size frontier models[12].\\nTP vs. Pipeline Parallelism\\n\\nFor very large models that exceed the memory capacity of a single node, pipeline parallelism (PP) partitions layers across nodes. 
PP introduces bubble overhead from the pipeline fill/drain cycle but avoids the all-reduce cost of TP. In single-node inference, TP is almost always preferred; PP becomes relevant only when the model cannot fit on one node even with quantization.\\nReclaiming the Hardware Potential\\nThe performance gaps we’ve identified - “Launch Tax”, prefill bias and rigid software constraints are not limitations of the underlying silicon. Rather, they are symptoms of a software ecosystem that has prioritized generality over peak efficiency.\\nBy identifying these bottlenecks and mastering the “levers” of the modern inference stack from MXFP4 quantization to custom kernel fusion - our team has shown that it is possible to achieve significant performance gains on high-performance AMD infrastructure. These optimizations don’t just result in faster tokens, they rewrite the economic reality of hosting frontier models at scale.\\nThe Roadmap Ahead\\nThis is only the beginning of our deep dive into inference engineering. In the coming weeks, we will release three technical “surgeries,” each focusing on a different frontier model and the specific optimizations used to unlock its potential:\\nPart 2: The Kimi 2.5 Deep-Dive\\n\\nHow we achieved an 11x speedup by bypassing kernel incompatibilities and engineering custom MLA and MoE kernels.\\nPart 3: Scaling DeepSeek V3.2\\n\\nA look at full-stack serving optimizations, FP8 KV cache support, and high-concurrency throughput gains.\\nPart 4: Optimizing the 774B GLM-5\\n\\nA breakdown of TP=4 deployment topologies, specialized batched GEMV kernels and fine-tuning speculative decoding for maximum efficiency.\\nStay tuned as we move from the high-level anatomy of these bottlenecks to the low-level code that helps solve them. 
Keep in mind that results will depend on your specific configuration, hardware, and usage patterns.\"}"},{"ref":"E16","kind":"event","title":"Senior Software Engineer I (Storage)","date":"2026-06-10T09:39:51+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7986250","signal_url":"https://onlylabs.fyi/signals/291474c7-b606-49dd-85d8-8ff6379d635b","signal_json_url":"https://onlylabs.fyi/signals/291474c7-b606-49dd-85d8-8ff6379d635b/signal.json","text":"job_opened · Senior Software Engineer I (Storage) · signal_desk=hiring · occurred_at=2026-06-10T09:39:51+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7986250 · raw={\"location\":\"Bengaluru\",\"ats\":\"greenhouse\"}"},{"ref":"E17","kind":"event","title":"Software Engineer II (Storage)","date":"2026-06-10T09:38:44+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7986238","signal_url":"https://onlylabs.fyi/signals/4d282e5f-0b75-4231-a865-cd010d5e2cf2","signal_json_url":"https://onlylabs.fyi/signals/4d282e5f-0b75-4231-a865-cd010d5e2cf2/signal.json","text":"job_opened · Software Engineer II (Storage) · signal_desk=hiring · occurred_at=2026-06-10T09:38:44+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7986238 · raw={\"location\":\"Bengaluru\",\"ats\":\"greenhouse\"}"},{"ref":"E18","kind":"event","title":"What We Learned Hiring 33 Engineers in Two Weeks","date":"2026-06-09T22:58:20.214+00:00","date_source":"rss.item_date","source_url":"https://www.digitalocean.com/blog/ai-native-engineering-interview","signal_url":"https://onlylabs.fyi/signals/e65c0e02-7f63-4b27-a436-22182756b105","signal_json_url":"https://onlylabs.fyi/signals/e65c0e02-7f63-4b27-a436-22182756b105/signal.json","text":"post_published · What We Learned Hiring 33 Engineers in Two Weeks · signal_desk=talking · occurred_at=2026-06-09T22:58:20.214+00:00 · 
url=https://www.digitalocean.com/blog/ai-native-engineering-interview · raw={\"excerpt\":\"Earlier this year, we needed to hire a cohort of engineers in Seattle, fast. We had a product launching at our marquee conference, Deploy, a hard deadline, and a clear picture of what the work would actually require. What we didn’t want was an interview process designed for a world that no longer exists.\\nSo we rebuilt it from scratch and opened a brand-new office in Bellevue for everyone we hired. Here’s what we did, why we did it, and what we heard from the engineers who went through it.\\nThe problem with the standard loop\\nThe traditional engineering interview loop (recruiter screen, hiring manager screen, technical phone screen, take-home, onsite) was designed for a different era of software development. It tests for pattern recognition and syntax recall. It stages information rather than creating genuine signal. And it takes weeks.\\nMore importantly, it doesn’t reflect how engineers actually work today. Production environments are collaborative. Most engineers entering the field right now have been working with AI tools since they were in school, not reluctantly adopting them, but building with them naturally. A hand-implemented sorting algorithm on a whiteboard tells you almost nothing about how someone thinks through a real system.\\nWe wanted an interview that did.\\nWhat we changed\\nWe made the work the interview.\\nThe centerpiece of our on-site was a three-hour build session. Candidates chose from a short list of assigned prompts and were asked to design, build, and deploy a working prototype on DigitalOcean by the end of the session. 
They could use whatever AI tools they wanted: Claude Code, Codex, ours, whatever they were fastest with.\\nThree hours is the minimum window in which you can actually watch someone make real engineering decisions: what to scaffold versus what to write by hand, how they prompt, what they verify versus what they trust, how they handle the moment when the AI confidently produces something that doesn’t work. That moment always comes.\\nWe shifted from hiring niche specialists to sourcing strong, well-rounded software development engineers. We weren’t evaluating whether candidates could produce code. We assumed that. We were evaluating whether they had the judgment to productionize an idea.\\nShivani Shirolkar, one of the engineers we hired for our model catalog team, had never been in an AI-assisted interview. She prepared by using Cursor to work through sample problems herself. “I ended up creating a spec for myself that I could feed into Cursor during the interview,” she said. “For the first three hours, it was just heads-down building. They gave us the space to fully focus, which made it so much more engaging than a typical technical interview.”\\nWe followed the build with a real conversation.\\nAfter the coding session, candidates walked interviewers through what they’d built: design choices, trade-offs, and what they’d do differently with more time. Then the conversation went further, into hypotheticals about scaling, business constraints, what it would look like if traffic spiked, and there were windows of downtime to work with.\\nEric Daniel, now on our serverless inference team, found that the discussion after the build was where he could really show his thinking. “I was able to explain the architectural decisions I’d made and why they’d make future improvements easier to build on,” he said. 
“Then the conversation shifted into territory I’d never covered in an interview before: product trade-offs, business constraints, how the system would behave under real-world pressure. That’s the kind of problem-solving I actually do every day.”\\nThat’s exactly the kind of thinking we were looking for.\\nWe cut the steps that weren’t adding signals.\\nFor candidates with directly relevant backgrounds, we removed the recruiter and hiring manager screens from the loop. For candidates with strong but adjacent experience, the technical assessment was enough to qualify them for the on-site without a separate call.\\nEvery step we removed was either duplicating something we already knew or generating information we weren’t going to use. Redundant steps don’t just slow things down; they signal to candidates that the process wasn’t designed thoughtfully.\\nWe made decisions the same day.\\nAt the end of each interview day, the full panel (recruiters, hiring managers, technical program managers, and engineering leadership) debriefed all of the candidates together. Because we had aligned on decision criteria and pre-approved compensation bands ahead of time, we could move the moment we found the right fit.\\nOur fastest offers went out the next morning. Both Eric and Shivani had offers within 24 hours of their interviews, and both started within two weeks. Eric was shipping PRs on his second day. Shivani had a database-related PR in review before her first week was out.\\nWho we hired, and why that was a deliberate choice\\nMost of the engineers in this cohort are early in their careers. That was intentional.\\nThere’s been a quiet assumption in parts of the industry that early-career hiring is less strategic in an AI era, that AI tools reduce the need for foundational engineering roles. We see it the other way around. Engineers entering the field today don’t think of AI as a tool they’ve had to adopt. It’s simply how they build. 
That fluency isn’t something you retrofit into someone; it’s something you hire for directly.\\nWe also know that senior engineering judgment—the ability to own a system, lead a review, make the call on architecture—is something that develops in seat, with the right environment and the right challenges. What we can’t manufacture is the disposition to build with modern tools naturally. That’s what we were looking for, and that’s what we found.\\n\\nWhere they work\\nThe timing of the blitz also gave us a reason to do something we’d been working toward for a while. The Greater Seattle area is already home to more than 90 DigitalOcean employees, including our new CPTO. With a cohort of new engineers joining at once, we opened a new office at 10700 Northup Way in Bellevue to give the team a home base.\\n\\nThe space is designed for exactly the kind of work this cohort was hired to do: co-working, team sprints, and the in-person collaboration that accelerates ramp-up. For a group that hit the ground running, shipping in their first days together, having a dedicated place to build together contributed to their success.\\nWhat the cohort has shipped\\nThe timing of this hire wasn’t incidental. Both Shivani, Eric, and Andre Hernandez joined in mid-March, with Deploy on April 28. There was no gentle ramp; all 3 and the rest of the team in their cohort jumped right in. Vinay Kumar, Chief Product and Technology Officer, set expectations that this next phase for DigitalOcean means we needed to operate differently—make different choices, work in different ways. And we did.\\n“In a matter of weeks, we came up with an execution plan, built and shipped as many services as we possibly could as a small team, and delivered a product vision our customers could start testing immediately. The feedback from our customers, developers, and investors was incredible. 
That’s how powerful what we’re building really is.”\\nShivani has been predominantly on Model Catalog, working on the SI and DI model launches that were part of Deploy. “From day one, I was building,” she said. “It wasn’t about owning a particular stack; it was about jumping into whatever was the highest priority and making an impact. That kind of dexterity was actually really energizing.” Since Deploy, she’s carried that same mindset into more focused work on custom metrics for model evaluation.\\nAndre shipped the Spaces MCP, Block Volumes MCP, and NFS MCP servers, in a short amount of time, which was a great way for him to hit the ground running. “Working on MCP servers really connected the dots for me early on. It forced me to think deeply about how DigitalOcean’s customers interact with Block Volumes in new, agentic ways. I was learning our internal storage operations while simultaneously understanding our shift to an AI-Native cloud. Moving from task breakdown to production deployment so quickly has made all the difference to my onboarding.”\\nEric joined a two-person team on serverless inference, helping migrate models from internal hosting to the dedicated inference platform. The work involved automating configuration steps across roughly 30 models while ensuring zero customer downtime on existing hosted models. Post-Deploy, he’s been working across the team, adding autoscaling to the multimodal cluster, improving error logging, and bringing the infrastructure back into compliance with code standards after the sprint.\\nWhat we’d tell other teams\\nTwo things have made this work.\\nFirst, evaluate the work as it actually exists today. The most important change we made was philosophical. When you design an interview that mirrors how engineers actually build right now, you start asking better questions, attracting better fits, and making decisions you’re more confident in.\\nSecond, treat hiring as a design problem first. 
Fast teams move fast because the upstream design work is done. When every interaction has a clear purpose, the process gets faster, and the experience improves, for candidates and for the people doing the hiring.\\nWe’re planning to run this again. Eric, Shivani, and Andre are part of the first cohort; there will be more. And if you’re a builder who wants to work on infrastructure that’s genuinely at the edge of what AI-native cloud looks like, we’d love to talk.\"}"},{"ref":"E19","kind":"event","title":"digitalocean/droplet-agent 1.3.4","date":"2026-06-09T16:49:40+00:00","date_source":"source","source_url":"https://github.com/digitalocean/droplet-agent/releases/tag/1.3.4","signal_url":"https://onlylabs.fyi/signals/df1a44d5-c704-404b-a090-19410b9d1575","signal_json_url":"https://onlylabs.fyi/signals/df1a44d5-c704-404b-a090-19410b9d1575/signal.json","text":"release · digitalocean/droplet-agent 1.3.4 · signal_desk=releases · occurred_at=2026-06-09T16:49:40+00:00 · url=https://github.com/digitalocean/droplet-agent/releases/tag/1.3.4 · raw={\"repo\":\"digitalocean/droplet-agent\"}"},{"ref":"E20","kind":"event","title":"digitalocean/terraform-provider-digitalocean v2.89.0","date":"2026-06-09T13:10:57+00:00","date_source":"source","source_url":"https://github.com/digitalocean/terraform-provider-digitalocean/releases/tag/v2.89.0","signal_url":"https://onlylabs.fyi/signals/e5d4aab5-2528-4cf3-a5f4-431ef7a7fcf1","signal_json_url":"https://onlylabs.fyi/signals/e5d4aab5-2528-4cf3-a5f4-431ef7a7fcf1/signal.json","text":"release · digitalocean/terraform-provider-digitalocean v2.89.0 · signal_desk=releases · occurred_at=2026-06-09T13:10:57+00:00 · url=https://github.com/digitalocean/terraform-provider-digitalocean/releases/tag/v2.89.0 · raw={\"repo\":\"digitalocean/terraform-provider-digitalocean\"}"},{"ref":"E21","kind":"event","title":"digitalocean/pydo 
v0.36.0","date":"2026-06-09T11:49:17+00:00","date_source":"source","source_url":"https://github.com/digitalocean/pydo/releases/tag/v0.36.0","signal_url":"https://onlylabs.fyi/signals/68619dfd-fed1-4dfa-8245-31993ddd449d","signal_json_url":"https://onlylabs.fyi/signals/68619dfd-fed1-4dfa-8245-31993ddd449d/signal.json","text":"release · digitalocean/pydo v0.36.0 · signal_desk=releases · occurred_at=2026-06-09T11:49:17+00:00 · url=https://github.com/digitalocean/pydo/releases/tag/v0.36.0 · raw={\"repo\":\"digitalocean/pydo\"}"},{"ref":"E22","kind":"event","title":"digitalocean/terraform-provider-digitalocean v2.88.0","date":"2026-06-09T11:44:04+00:00","date_source":"source","source_url":"https://github.com/digitalocean/terraform-provider-digitalocean/releases/tag/v2.88.0","signal_url":"https://onlylabs.fyi/signals/88f0707f-c4f4-415f-b23d-38149702bd09","signal_json_url":"https://onlylabs.fyi/signals/88f0707f-c4f4-415f-b23d-38149702bd09/signal.json","text":"release · digitalocean/terraform-provider-digitalocean v2.88.0 · signal_desk=releases · occurred_at=2026-06-09T11:44:04+00:00 · url=https://github.com/digitalocean/terraform-provider-digitalocean/releases/tag/v2.88.0 · raw={\"repo\":\"digitalocean/terraform-provider-digitalocean\"}"},{"ref":"E23","kind":"event","title":"digitalocean/pydo v0.35.0","date":"2026-06-09T11:38:03+00:00","date_source":"source","source_url":"https://github.com/digitalocean/pydo/releases/tag/v0.35.0","signal_url":"https://onlylabs.fyi/signals/7247c7cc-a7eb-4da5-a1d4-9a7a9cf968a1","signal_json_url":"https://onlylabs.fyi/signals/7247c7cc-a7eb-4da5-a1d4-9a7a9cf968a1/signal.json","text":"release · digitalocean/pydo v0.35.0 · signal_desk=releases · occurred_at=2026-06-09T11:38:03+00:00 · url=https://github.com/digitalocean/pydo/releases/tag/v0.35.0 · raw={\"repo\":\"digitalocean/pydo\"}"},{"ref":"E24","kind":"event","title":"digitalocean/godo 
v1.195.0","date":"2026-06-09T11:17:19+00:00","date_source":"source","source_url":"https://github.com/digitalocean/godo/releases/tag/v1.195.0","signal_url":"https://onlylabs.fyi/signals/7c4ce03e-b4cb-4778-978b-c894e145dfff","signal_json_url":"https://onlylabs.fyi/signals/7c4ce03e-b4cb-4778-978b-c894e145dfff/signal.json","text":"release · digitalocean/godo v1.195.0 · signal_desk=releases · occurred_at=2026-06-09T11:17:19+00:00 · url=https://github.com/digitalocean/godo/releases/tag/v1.195.0 · raw={\"repo\":\"digitalocean/godo\"}"},{"ref":"E25","kind":"event","title":"digitalocean/doctl v1.161.0","date":"2026-06-09T11:12:41+00:00","date_source":"source","source_url":"https://github.com/digitalocean/doctl/releases/tag/v1.161.0","signal_url":"https://onlylabs.fyi/signals/23f0be28-f2ab-48aa-9360-68e67e0fa73a","signal_json_url":"https://onlylabs.fyi/signals/23f0be28-f2ab-48aa-9360-68e67e0fa73a/signal.json","text":"release · digitalocean/doctl v1.161.0 · signal_desk=releases · occurred_at=2026-06-09T11:12:41+00:00 · url=https://github.com/digitalocean/doctl/releases/tag/v1.161.0 · raw={\"repo\":\"digitalocean/doctl\"}"},{"ref":"E26","kind":"event","title":"digitalocean/ShellPort v2.1.4","date":"2026-06-09T04:15:30+00:00","date_source":"source","source_url":"https://github.com/digitalocean/ShellPort/releases/tag/v2.1.4","signal_url":"https://onlylabs.fyi/signals/ed2015a6-4914-4ec4-946c-b5e42fc59366","signal_json_url":"https://onlylabs.fyi/signals/ed2015a6-4914-4ec4-946c-b5e42fc59366/signal.json","text":"release · digitalocean/ShellPort v2.1.4 · signal_desk=releases · occurred_at=2026-06-09T04:15:30+00:00 · url=https://github.com/digitalocean/ShellPort/releases/tag/v2.1.4 · raw={\"repo\":\"digitalocean/ShellPort\"}"},{"ref":"E27","kind":"event","title":"Director, Startup 
Ecosystem","date":"2026-06-08T19:47:25+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7579090","signal_url":"https://onlylabs.fyi/signals/a26076fd-2832-48b2-a86a-b2b24288c602","signal_json_url":"https://onlylabs.fyi/signals/a26076fd-2832-48b2-a86a-b2b24288c602/signal.json","text":"job_opened · Director, Startup Ecosystem · signal_desk=hiring · occurred_at=2026-06-08T19:47:25+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7579090 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E28","kind":"event","title":"Office Coordinator, Bellevue","date":"2026-06-08T19:45:03+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7982761","signal_url":"https://onlylabs.fyi/signals/d138c203-840f-48c4-b28a-fbe9aa405081","signal_json_url":"https://onlylabs.fyi/signals/d138c203-840f-48c4-b28a-fbe9aa405081/signal.json","text":"job_opened · Office Coordinator, Bellevue · signal_desk=hiring · occurred_at=2026-06-08T19:45:03+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7982761 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E29","kind":"event","title":"Strategy Director","date":"2026-06-08T19:17:13+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7826696","signal_url":"https://onlylabs.fyi/signals/90c69856-0273-4cfc-9e88-1555a36b3beb","signal_json_url":"https://onlylabs.fyi/signals/90c69856-0273-4cfc-9e88-1555a36b3beb/signal.json","text":"job_opened · Strategy Director · signal_desk=hiring · occurred_at=2026-06-08T19:17:13+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7826696 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E30","kind":"event","title":"Corporate Development Senior Manager 
","date":"2026-06-08T19:17:11+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7739997","signal_url":"https://onlylabs.fyi/signals/b8e24b73-dbb1-4e6a-8b75-5bd8e70a31f6","signal_json_url":"https://onlylabs.fyi/signals/b8e24b73-dbb1-4e6a-8b75-5bd8e70a31f6/signal.json","text":"job_opened · Corporate Development Senior Manager  · signal_desk=hiring · occurred_at=2026-06-08T19:17:11+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7739997 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E31","kind":"event","title":"digitalocean/ShellPort v2.1.3","date":"2026-06-07T08:28:55+00:00","date_source":"source","source_url":"https://github.com/digitalocean/ShellPort/releases/tag/v2.1.3","signal_url":"https://onlylabs.fyi/signals/a482d2e4-1f85-4bde-93cf-b664d39b6848","signal_json_url":"https://onlylabs.fyi/signals/a482d2e4-1f85-4bde-93cf-b664d39b6848/signal.json","text":"release · digitalocean/ShellPort v2.1.3 · signal_desk=releases · occurred_at=2026-06-07T08:28:55+00:00 · url=https://github.com/digitalocean/ShellPort/releases/tag/v2.1.3 · raw={\"repo\":\"digitalocean/ShellPort\"}"},{"ref":"E32","kind":"event","title":"Senior Engineer I, Managed Kubernetes","date":"2026-06-06T07:00:37+00:00","date_source":"source","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7184027","signal_url":"https://onlylabs.fyi/signals/650aabee-d782-48ed-8c5c-3d48e4a3bd37","signal_json_url":"https://onlylabs.fyi/signals/650aabee-d782-48ed-8c5c-3d48e4a3bd37/signal.json","text":"job_opened · Senior Engineer I, Managed Kubernetes · signal_desk=hiring · occurred_at=2026-06-06T07:00:37+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7184027 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E33","kind":"event","title":"Senior Engineer I, Managed 
Kubernetes","date":"2026-06-06T07:00:37+00:00","date_source":"source","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7184023","signal_url":"https://onlylabs.fyi/signals/8d4b30c7-d4b4-4553-b4c2-3f244778e397","signal_json_url":"https://onlylabs.fyi/signals/8d4b30c7-d4b4-4553-b4c2-3f244778e397/signal.json","text":"job_opened · Senior Engineer I, Managed Kubernetes · signal_desk=hiring · occurred_at=2026-06-06T07:00:37+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7184023 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E34","kind":"event","title":"Senior Product Manager II, Security Products - IAM","date":"2026-06-06T04:01:44+00:00","date_source":"source","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7617112","signal_url":"https://onlylabs.fyi/signals/84785274-8fd4-405a-9531-4f4fcaa0414a","signal_json_url":"https://onlylabs.fyi/signals/84785274-8fd4-405a-9531-4f4fcaa0414a/signal.json","text":"job_opened · Senior Product Manager II, Security Products - IAM · signal_desk=hiring · occurred_at=2026-06-06T04:01:44+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7617112 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E35","kind":"event","title":"Senior Product Manager II, Security Products - IAM","date":"2026-06-06T04:01:44+00:00","date_source":"source","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7617110","signal_url":"https://onlylabs.fyi/signals/88bb0141-303e-4c22-a188-2d77394b4274","signal_json_url":"https://onlylabs.fyi/signals/88bb0141-303e-4c22-a188-2d77394b4274/signal.json","text":"job_opened · Senior Product Manager II, Security Products - IAM · signal_desk=hiring · occurred_at=2026-06-06T04:01:44+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7617110 · raw={\"location\":\"Boston\",\"ats\":\"greenhouse\"}"},{"ref":"E36","kind":"event","title":"digitalocean/ShellPort 
v2.1.2","date":"2026-06-05T21:56:31+00:00","date_source":"source","source_url":"https://github.com/digitalocean/ShellPort/releases/tag/v2.1.2","signal_url":"https://onlylabs.fyi/signals/3b301808-2146-4763-ac6e-f4b1def66ce9","signal_json_url":"https://onlylabs.fyi/signals/3b301808-2146-4763-ac6e-f4b1def66ce9/signal.json","text":"release · digitalocean/ShellPort v2.1.2 · signal_desk=releases · occurred_at=2026-06-05T21:56:31+00:00 · url=https://github.com/digitalocean/ShellPort/releases/tag/v2.1.2 · raw={\"repo\":\"digitalocean/ShellPort\"}"},{"ref":"E37","kind":"event","title":"digitalocean/ShellPort v2.1.1","date":"2026-06-05T09:47:11+00:00","date_source":"source","source_url":"https://github.com/digitalocean/ShellPort/releases/tag/v2.1.1","signal_url":"https://onlylabs.fyi/signals/060655b8-285b-4899-85fc-169e55605f15","signal_json_url":"https://onlylabs.fyi/signals/060655b8-285b-4899-85fc-169e55605f15/signal.json","text":"release · digitalocean/ShellPort v2.1.1 · signal_desk=releases · occurred_at=2026-06-05T09:47:11+00:00 · url=https://github.com/digitalocean/ShellPort/releases/tag/v2.1.1 · raw={\"repo\":\"digitalocean/ShellPort\"}"},{"ref":"E38","kind":"event","title":"digitalocean/ShellPort v2.1.0","date":"2026-06-05T08:24:41+00:00","date_source":"source","source_url":"https://github.com/digitalocean/ShellPort/releases/tag/v2.1.0","signal_url":"https://onlylabs.fyi/signals/d9e6a3b3-f65f-4d25-9bdb-251813f1bfd9","signal_json_url":"https://onlylabs.fyi/signals/d9e6a3b3-f65f-4d25-9bdb-251813f1bfd9/signal.json","text":"release · digitalocean/ShellPort v2.1.0 · signal_desk=releases · occurred_at=2026-06-05T08:24:41+00:00 · url=https://github.com/digitalocean/ShellPort/releases/tag/v2.1.0 · raw={\"repo\":\"digitalocean/ShellPort\"}"},{"ref":"E39","kind":"event","title":"Model Evaluations: Prove Your Routing Policy Actually 
Works","date":"2026-06-04T19:52:49.377+00:00","date_source":"rss.item_date","source_url":"https://www.digitalocean.com/blog/model-evaluation-public-preview","signal_url":"https://onlylabs.fyi/signals/445ef83b-93e8-4b66-b72d-c0e34d590700","signal_json_url":"https://onlylabs.fyi/signals/445ef83b-93e8-4b66-b72d-c0e34d590700/signal.json","text":"post_published · Model Evaluations: Prove Your Routing Policy Actually Works · signal_desk=talking · occurred_at=2026-06-04T19:52:49.377+00:00 · url=https://www.digitalocean.com/blog/model-evaluation-public-preview · raw={\"excerpt\":\"Most teams running inference at scale do not fail because they cannot find a “good” model. They fail because they ship a routing policy that looks fine in a playground, but drifts the moment it sees real prompts, real latency tails, and real per-token cost. The routing policy breaks on the prompts you never tested and your users find out before you do.\\nNow you can use Model Evaluations, available in Public Preview on the DigitalOcean Inference Engine, to evaluate models available on the platform, or models that you have imported from Hugging Face or DigitalOcean Spaces. Model Evaluations helps you make comparable, reproducible decisions across models, routing strategies, cost, latency, and output quality.\\nIn this guide, we walk through setting up, running, and interpreting a Model Evaluation across three inference strategies: using a single frontier model for every request, deploying a task-specific fine-tuned model, or using the Inference Router with a cost- or latency-optimized policy. The goal is simple: determine which approach performs best on your workload before you change production traffic.\\nThe scenario\\nLet’s say you are running a legal-adjacent assistant (think contract summarization, clause extraction, policy Q&A). You currently call one expensive frontier model for every request as you believe it is the most accurate. 
Your CFO sees inference as COGS whereas your users see latency and p95 as key metrics on long documents. The Inference Router is attractive: it can send “easy” work to a cheaper or faster model and keep the heavy lifter for edge cases, if the routing policy is aligned with  your use case.\\nYour evaluation job is to compare these three candidates on the same dataset, using the same judge and metrics, so the results are directly comparable:\\n\\n\\n\\nEndpoint\\nCandidate\\nWhat you are really testing\\n\\n\\n\\n\\nServerless Inference\\nanthropic-claude-4.6-sonnet\\nSingle “always frontier” model (your baseline)\\n\\n\\nInference Router\\nmodel-eval-blog-legal\\nAn Inference Router configuration with 3 tasks that uses Claude Haiku 4.5, DeepSeek R1 Distill Llama 70B, and Gemma 4\\n\\n\\nBYOM on Dedicate Inference (DI)\\nOntario/qwen3-0.6b-en-law-qa\\nA Bring Your Own Model (BYOM) imported from Hugging Face deployed on a Dedicated Inference endpoint\\n\\n\\n\\n\\nNote that the Inference Router offers a playground and a quickstart router evaluation feature with predefined metrics and judge model. This is useful for users looking to get a jumpstart on router performance. We will be focusing on using the Model Evaluation since it allows for customization of datasets, metrics and judge models.\\nPrerequisites\\nEnable the feature (Public Preview): In the DigitalOcean Console, navigate to the Feature Preview page and enable Model Evaluations, then go to the Inference tab in the left navigation and click → Model Evaluations\\nCreate endpoints for candidates:\\nFor candidate 2 (Inference Router), you need to configure a router. For this example, create a cost-optimization router focused on the three types of tasks being evaluated in this example: contract_summarization, clause_extraction, policy_qa. 
Choose models for these tasks based on your workload needs.\\nFor candidate 3 (BYOM on DI), import the suggested model from Hugging Face and deploy onto a dedicated inference endpoint for this evaluation test.\\nEach of the candidates requires separate evaluation runs. So run 1 will be on the SI model, and run 2 and 3 will be on the Inference Router and BYOM model, respectively.\\nStep 1: Define the decision and the “star” metric\\nBefore you start the process, answer three questions:\\nWhat is a “good” answer for you? Correctness, completeness, faithfulness to ground truth?\\nWhat is non-negotiable? For example, PII leakage, toxicity, and bias in sensitive domains.\\nWhat is the one number leadership will look at first? Designate a star metric alongside a bundle of criteria.\\nEvery later argument should point back to your star metric, or you will drown in multidimensional scores. Note that sometimes there might be multiple criteria that are important—you may care about correctness but your compliance partner cares about PII leakage and bias.\\nStep 2: Add dataset\\nCreate a dataset that matches your production use cases, and meets the following criteria\\nFormats: CSV or JSONL .\\nSize: up to ~1K rows / 1 GB for a job.\\nContains prompt and optionally, ground truth.\\n\\ninput, ground_truth\\n<prompt 1>, <ground_truth 1>\\n<prompt 2>, <ground_truth 2>\\n\\nTry to avoid using only cherry-picked prompts. We recommend mixing typical ‘happy-path’ prompts along with convoluted long-context prompts that are challenging. Also include examples of edge cases that expose safety risks.\\nFor this example, you can utilize this simulated dataset for your use case legal_eval_simulated_dataset.csv.\\n\\nUpload the created dataset in the new evaluation run dialog. The upload process validates the schema and blocks obvious breakage. In the next screen, provide a concise name for the evaluation run. 
You can also upload datasets using cURL\\n\\ncurl -X POST \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -H \\\"Authorization: Bearer $DIGITALOCEAN_TOKEN\\\" \\\\\\n  \\\"https://api.digitalocean.com/v2/gen-ai/model_evaluation/datasets/file_upload_presigned_urls\\\" \\\\\\n  -d '{\\n    \\\"files\\\": [{\\n      \\\"file_name\\\": \\\"legal_eval_simulated_dataset.csv\\\",\\n      \\\"file_size\\\": 77564\\n    }]\\n  }'\\n\\nStep 3: Configure candidates\\n\\nFor apples-to-apples comparison of the three candidates, ensure that you use the same evaluation configuration as below. This includes setting the same system prompt, temperature and max tokens to values that mirror your production use case. If your app injects a system prompt in code, paste the same prompt here. Otherwise, you will be measuring a different product than the one you ship.\\nFor the first run, select a frontier model such as anthropic-claude-4.6-sonnet, or a DigitalOcean-hosted model such as glm-5 from the candidate model dropdown. (Note that to access commercial models, your account will need to be an appropriate tier. You can request access to higher tier at this link). For the second run, choose your router config from the dropdown (model-eval-blog-legal  in this example). For the third run, choose the dedicated inference endpoint where Ontario/qwen3-0.6b-en-law-qa has been deployed. For the system prompt for the candidate, you can create your own system prompt suitable for evaluation, or tweak the following example: Legal_system_prompt.txt.\\nStep 4: Choose the judge and the rubric\\nChoose an appropriate judge for evaluating the candidates. Remember to use the same judge for all three candidates. For this example, we recommend using OpenAI GPT-5.5 (or DeepSeek R1 Distill Llama 70B if you do not have access to commercial models).\\n\\nChoose all the evaluation metrics available: correctness, completeness, ground truth faithfulness, and safety metrics (PII, toxicity and bias). 
Since the dataset has ground truth faithfulness, let’s choose that as the star metric.\\nRun the job. Monitor status in the same model evaluation landing page.\\nCode snippet for setting the evaluation configuration using cURL is provided below:\\n\\ncurl -X POST \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -H \\\"Authorization: Bearer $DIGITALOCEAN_TOKEN\\\" \\\\\\n  \\\"https://api.digitalocean.com/v2/gen-ai/model_evaluation_runs\\\" \\\\\\n  -d '{\\n    \\\"name\\\": \\\"my-evaluation-run\\\",\\n    \\\"candidate_model_uuid\\\": \\\"123e4567-e89b-12d3-a456-426614174000\\\",\\n    \\\"judge_model_uuid\\\": \\\"223e4567-e89b-12d3-a456-426614174001\\\",\\n    \\\"dataset_uuid\\\": \\\"323e4567-e89b-12d3-a456-426614174002\\\",\\n    \\\"metric_uuids\\\": [\\n      \\\"423e4567-e89b-12d3-a456-426614174003\\\"\\n    ]\\n  }'\\n\\nInformation about model and metric UUIDs are available in the API Reference.\\nStep 5: Interpret results like a PM, not like a leaderboard\\n\\nWhen the run finishes, you are looking for three layers of evidence (aligned to your dashboard requirements):\\nAggregate: per-metric and overall health and the star metric for the exec readout.\\nPerformance economics on the same rows: end-to-end latency, total evaluation time, token counts, estimated cost. This is how you answer “best accuracy per dollar?” without merging two spreadsheets.\\nItem-level drill-down: input, model output, judge rationale, per-criterion scores. This is where you see routing decisions: by evaluating the tradeoff between completeness or correctness score against latency and cost.\\n\\nIf you are comparing a router vs. 
a static model, scan for segmented behavior:\\nOn easy prompts, did the router preserve quality and improve cost/latency?\\nOn hard prompts, did the router keep safety scores in range or did PII/toxicity/bias tick up?\\n\\nFinally, download the results for exhaustive analysis.\\nStep 6: The decision and the iteration loop\\nYou are not looking for a philosophical winner; you are looking for a go / no-go with a tuning path:\\nShip the router if star metric and safety bars hold (or only regress within agreed tolerance) and you gain meaningful cost or latency headroom on your representative mix.\\nKeep iterating if the router’s deficiencies cluster in specific task types. Then adjust routing policy (natural-language policy for tasks + model pool, per Router positioning), not the judge, and re-run the same dataset to see if the performance changes.\\nRelease narrative you can use internally: “We didn’t A/B in production. We simulated production endpoints, captured judge + latency + cost in one run, and re-ran on the same dataset as the router policy evolved.” Also, share context on relying on a public leaderboard number as a substitute for your workload, and treat quality in one tool and $ / token in another.\\nTurning evaluation into an operational workflow\\nOver time, model evaluation is moving closer to real-world production workloads, giving teams near-real-time visibility into performance, cost, latency, and output quality. We are continuing to expand DigitalOcean Model Evaluations with support for custom metrics, multimodal models, standardized benchmarks, and richer workload analysis so teams can make production decisions with greater confidence*.\\nFor you, this means spending less time second-guessing model decisions and more time shipping confidently. 
Every evaluation run brings you closer to a production stack you can justify with data instead of intuition.\\n*The above reflects our current plans and product direction, and is subject to change without notice. It is provided for informational purposes only and is not a commitment to deliver any material, feature, or functionality.\"}"},{"ref":"E40","kind":"event","title":"Principal Engineer - Security Products, Security Visibility","date":"2026-06-04T19:33:07+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7975063","signal_url":"https://onlylabs.fyi/signals/9efee210-9cad-4e7f-946d-7cf00d0293f0","signal_json_url":"https://onlylabs.fyi/signals/9efee210-9cad-4e7f-946d-7cf00d0293f0/signal.json","text":"job_opened · Principal Engineer - Security Products, Security Visibility · signal_desk=hiring · occurred_at=2026-06-04T19:33:07+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7975063 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E41","kind":"event","title":"Principal Engineer - Security Products, Security Visibility","date":"2026-06-04T19:32:16+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7975061","signal_url":"https://onlylabs.fyi/signals/874d2263-e45c-4e21-9cf1-c8f551800bfe","signal_json_url":"https://onlylabs.fyi/signals/874d2263-e45c-4e21-9cf1-c8f551800bfe/signal.json","text":"job_opened · Principal Engineer - Security Products, Security Visibility · signal_desk=hiring · occurred_at=2026-06-04T19:32:16+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7975061 · raw={\"location\":\"Boston\",\"ats\":\"greenhouse\"}"},{"ref":"E42","kind":"event","title":"digitalocean/ShellPort 
v2.0.2","date":"2026-06-04T17:32:45+00:00","date_source":"source","source_url":"https://github.com/digitalocean/ShellPort/releases/tag/v2.0.2","signal_url":"https://onlylabs.fyi/signals/336d5444-a3af-4e07-a960-e04203aa8471","signal_json_url":"https://onlylabs.fyi/signals/336d5444-a3af-4e07-a960-e04203aa8471/signal.json","text":"release · digitalocean/ShellPort v2.0.2 · signal_desk=releases · occurred_at=2026-06-04T17:32:45+00:00 · url=https://github.com/digitalocean/ShellPort/releases/tag/v2.0.2 · raw={\"repo\":\"digitalocean/ShellPort\"}"},{"ref":"E43","kind":"event","title":"Principal Software Engineer","date":"2026-06-04T17:32:08+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7975125","signal_url":"https://onlylabs.fyi/signals/b3c423db-74ce-4c2d-95e3-4c341bdf55fb","signal_json_url":"https://onlylabs.fyi/signals/b3c423db-74ce-4c2d-95e3-4c341bdf55fb/signal.json","text":"job_opened · Principal Software Engineer · signal_desk=hiring · occurred_at=2026-06-04T17:32:08+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7975125 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E44","kind":"event","title":"Principal Software Engineer","date":"2026-06-04T17:32:01+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7975123","signal_url":"https://onlylabs.fyi/signals/bf71929d-76b6-4c9f-b15b-5251ea1d435e","signal_json_url":"https://onlylabs.fyi/signals/bf71929d-76b6-4c9f-b15b-5251ea1d435e/signal.json","text":"job_opened · Principal Software Engineer · signal_desk=hiring · occurred_at=2026-06-04T17:32:01+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7975123 · raw={\"location\":\"Boston\",\"ats\":\"greenhouse\"}"},{"ref":"E45","kind":"event","title":"digitalocean/ShellPort 
v2.0.1","date":"2026-06-04T17:28:48+00:00","date_source":"source","source_url":"https://github.com/digitalocean/ShellPort/releases/tag/v2.0.1","signal_url":"https://onlylabs.fyi/signals/a63a5887-eab7-4e0e-a714-a00f050e32c0","signal_json_url":"https://onlylabs.fyi/signals/a63a5887-eab7-4e0e-a714-a00f050e32c0/signal.json","text":"release · digitalocean/ShellPort v2.0.1 · signal_desk=releases · occurred_at=2026-06-04T17:28:48+00:00 · url=https://github.com/digitalocean/ShellPort/releases/tag/v2.0.1 · raw={\"repo\":\"digitalocean/ShellPort\"}"},{"ref":"E46","kind":"event","title":"Senior Engineer, Inference Data Plane","date":"2026-06-04T14:02:17+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7722807","signal_url":"https://onlylabs.fyi/signals/5fa5e05e-3ae6-416f-b77c-ca0ee2f4cd40","signal_json_url":"https://onlylabs.fyi/signals/5fa5e05e-3ae6-416f-b77c-ca0ee2f4cd40/signal.json","text":"job_opened · Senior Engineer, Inference Data Plane · signal_desk=hiring · occurred_at=2026-06-04T14:02:17+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7722807 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E47","kind":"event","title":"Senior Engineer, Inference Data Plane","date":"2026-06-04T14:02:17+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7722811","signal_url":"https://onlylabs.fyi/signals/8693180b-fb01-401a-9858-b554bd3081b0","signal_json_url":"https://onlylabs.fyi/signals/8693180b-fb01-401a-9858-b554bd3081b0/signal.json","text":"job_opened · Senior Engineer, Inference Data Plane · signal_desk=hiring · occurred_at=2026-06-04T14:02:17+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7722811 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E48","kind":"event","title":"Senior Engineer II, AI Inference Engine 
Systems","date":"2026-06-04T14:02:15+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7714471","signal_url":"https://onlylabs.fyi/signals/50b2d343-081d-41cf-9e0c-783ddbb5946d","signal_json_url":"https://onlylabs.fyi/signals/50b2d343-081d-41cf-9e0c-783ddbb5946d/signal.json","text":"job_opened · Senior Engineer II, AI Inference Engine Systems · signal_desk=hiring · occurred_at=2026-06-04T14:02:15+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7714471 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E49","kind":"event","title":"Senior Engineer II, AI Inference Engine Systems","date":"2026-06-04T14:02:15+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7714467","signal_url":"https://onlylabs.fyi/signals/c5b77c2d-c6ea-46d5-933c-57a23712740f","signal_json_url":"https://onlylabs.fyi/signals/c5b77c2d-c6ea-46d5-933c-57a23712740f/signal.json","text":"job_opened · Senior Engineer II, AI Inference Engine Systems · signal_desk=hiring · occurred_at=2026-06-04T14:02:15+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7714467 · raw={\"location\":\"San Francisco\",\"ats\":\"greenhouse\"}"},{"ref":"E50","kind":"event","title":"digitalocean/pydo v0.34.0","date":"2026-06-04T08:52:26+00:00","date_source":"source","source_url":"https://github.com/digitalocean/pydo/releases/tag/v0.34.0","signal_url":"https://onlylabs.fyi/signals/5ab26f21-3624-4bf2-9817-9214619eac02","signal_json_url":"https://onlylabs.fyi/signals/5ab26f21-3624-4bf2-9817-9214619eac02/signal.json","text":"release · digitalocean/pydo v0.34.0 · signal_desk=releases · occurred_at=2026-06-04T08:52:26+00:00 · url=https://github.com/digitalocean/pydo/releases/tag/v0.34.0 · raw={\"repo\":\"digitalocean/pydo\"}"},{"ref":"E51","kind":"event","title":"Director of Engineering - Security 
Products","date":"2026-06-04T07:02:40+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7975094","signal_url":"https://onlylabs.fyi/signals/b88890ad-207a-4893-ab9f-2223f475322b","signal_json_url":"https://onlylabs.fyi/signals/b88890ad-207a-4893-ab9f-2223f475322b/signal.json","text":"job_opened · Director of Engineering - Security Products · signal_desk=hiring · occurred_at=2026-06-04T07:02:40+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7975094 · raw={\"location\":\"Bengaluru\",\"ats\":\"greenhouse\"}"},{"ref":"E52","kind":"event","title":"Senior Manager, Security Products","date":"2026-06-03T21:37:46+00:00","date_source":"greenhouse.updated_at","source_url":"https://www.digitalocean.com/careers/position/apply/?gh_jid=7975100","signal_url":"https://onlylabs.fyi/signals/d52396f7-f0a5-4c05-a1bb-30fd612d1c22","signal_json_url":"https://onlylabs.fyi/signals/d52396f7-f0a5-4c05-a1bb-30fd612d1c22/signal.json","text":"job_opened · Senior Manager, Security Products · signal_desk=hiring · occurred_at=2026-06-03T21:37:46+00:00 · url=https://www.digitalocean.com/careers/position/apply/?gh_jid=7975100 · raw={\"location\":\"Seattle\",\"ats\":\"greenhouse\"}"},{"ref":"E53","kind":"event","title":"The Team Behind Deploy: Shipping AI, the DigitalOcean Way","date":"2026-06-03T19:38:43.949+00:00","date_source":"rss.item_date","source_url":"https://www.digitalocean.com/blog/behind-deploy-2026","signal_url":"https://onlylabs.fyi/signals/7357e257-b304-455a-a67c-0dcaa8fce3bd","signal_json_url":"https://onlylabs.fyi/signals/7357e257-b304-455a-a67c-0dcaa8fce3bd/signal.json","text":"post_published · The Team Behind Deploy: Shipping AI, the DigitalOcean Way · signal_desk=talking · occurred_at=2026-06-03T19:38:43.949+00:00 · url=https://www.digitalocean.com/blog/behind-deploy-2026 · raw={\"excerpt\":\"Deploy 2026 came and went, and we’re still buzzing. 
For one day at Convene 100 Stockton in San Francisco, developers, startup founders, customers, and partners filled the room to talk about a shared challenge: how to build and scale AI products without unnecessary complexity. Conversations moved from infrastructure to inference costs, production workloads, vector databases, and what teams actually need to get AI applications from prototype to production. We’re grateful to everyone who showed up and made it what it was.\\nDigitalOcean took the covers off the AI-Native Cloud, a five-layer stack purpose-built for AI-native companies, with more than 15 product launches in a single keynote. That included Inference Router, Dedicated Inference, Managed Weaviate, Knowledge Bases, expanded GPU and model capabilities, and a new Kansas City data center with liquid-cooled B300s. The event had seven sponsors, including NVIDIA, AMD, Weaviate, OpenRouter, MongoDB, and others. Customers like Hippocratic AI, Character AI, and Higgsfield shared how they are building on DigitalOcean.\\nDeploy 2026 was a cross-company effort, and a reflection of how DigitalOcean works every day.\\nView YouTube video\\n\\nThese three team members capture what our culture of true ownership looked like up close:\\nMeghan Grady, Senior Director of Marketing and Communications. She and her team led the planning and execution behind Deploy, including keynote production, customer programming, live streaming, social coverage, and event communications.\\nMitchell Mocchi, Sales Account Executive, and the team worked alongside sales, solutions architecture, support, and engineering teams. He met directly with customers on the Deploy floor and helped them through the AI infrastructure decisions in real-time.\\nTyler Gillam, Senior Software Engineer II, and his team helped build the Inference Router, DigitalOcean’s AI routing product. 
He worked with engineering and product teams to demo it live during the keynote.\\nA career high for the marketing team\\nWith a launch as big as the AI-Native cloud, our team knew we had to go big. Registrations came in so strongly that even the venue had room to grow. The customer experience was front-of-mind as we added a second speaker track and introduced a startup showcase, giving five companies from DigitalOcean’s startup program a stage to talk about their journey. On top of that, the marketing and design teams launched a brand new website and brand identity alongside the event to showcase the company’s evolution.\\nThis was also the first Deploy with paid sponsorships. Meghan, who previously worked on DigitalOcean’s partnerships team, watched those relationships come to life on the event floor. “Seeing all of our sponsors there, with their huge teams, and observing how those relationships have evolved and how much value they see in our partnerships and our ecosystem was really exciting,” she says.\\nThere was no shortage of memorable moments throughout the day. The one Meghan keeps coming back to came mid-keynote, when the audience raised their phones to photograph the AI-Native Cloud stack. Months of work behind a single slide, and people wanted to hold onto it.\\nWorking on events like Deploy is a highlight for any employee who is involved. “DigitalOcean is a very special place. What makes it is the people. Both the people who work here and our customers. The company is growing exponentially right now, and it’s a true opportunity for growth,” Meghan says.\\nBuilding and launching products side-by-side with our customers\\nMitch has been at DigitalOcean for over four years. As an Account Executive, his job is managing customer relationships, which is much closer to the infrastructure than that title suggests. 
They’re the first call when something isn’t working, and the ones sitting across from customers, figuring out what it takes to make a workload succeed on the platform.\\nAt Deploy, that meant a full day on the floor. On one end, he was talking to established customers like Hippocratic AI, running large inference clusters on H200 and B300 infrastructure. On the other front, brand-new startups are looking to get their hands on any GPU they can.\\n“The product release cycle over the past few months is acting as empirical data to our customers of where the company is headed,” Mitch says. “Customers were really excited to hear how much we were building.”\\nOne conversation stuck with him. A week before Deploy, DigitalOcean had connected with Metamorphic, a San Francisco startup. At the event, Mitch found himself sitting around a table with their lead engineer, talking through their H100 cluster deployment live. “Just seeing how excited they were really stood out to me,” he says. “Knowing that DigitalOcean is acting as the infrastructure layer for this project is just fantastic.”\\nWhat surprises customers most, Mitch says, is how far the team is willing to go during a proof of concept. Sales, support, solutions architecture, and forward-deployed engineering all engage with a single customer at once to make sure they succeed.\\nWhen a customer hits a limit or a missing feature, the answer today isn’t to walk away. “We need to change the product,” Mitch says. “Because if this customer is experiencing this issue, there’s probably a group of others experiencing it too.”\\nFrom writing code to the keynote stage and booth conversations\\nFor Deploy, the engineering team built the Inference Router, a product that automatically routes each AI request to the right model based on cost, latency, and quality. Tyler, a Senior Software Engineer focused on AI infrastructure and inference products, helped build it, then demoed it live during the keynote. 
He worked closely with product and engineering leaders, the CTO, and the CEO to make sure the story landed.\\n“Standing on stage showing something our team had actually built. It wasn’t just me showing something that I had done. It was me representing the entire team and what they’d worked on for the past two to three months,” Tyler says.\\nAfter the keynote, developers started walking up to the booth to ask about the Inference Router. They immediately understood the problem, and they wanted to use it. “It wasn’t just something we thought was valuable internally,” he says. “It actually proved itself.”\\nThat kind of feedback loop isn’t unique to Deploy. It’s how the engineering team works day to day. Engineers at DigitalOcean are directly connected to the people using what they build. “I get to meet with existing customers, prospective customers, and AI-native companies to understand what they’re building and where they’re getting stuck. We hear directly from developers, carry those problems into product ideas, build them, and get them into customer hands quickly.”\\nThe result is a pace and scope of work that’s hard to find elsewhere. “I’ve shipped more interesting and meaningful features here than anywhere else I’ve worked,” he says.\\nTyler says the engineers who thrive at DigitalOcean are builders with strong ownership. People who care about the customer, care about the details, and want to move fast. That means the role extends well beyond writing code. “You’re not just going to work on a tiny piece of something that may or may not ship. You can help shape the product. 
You can build it, launch it, and talk to customers who are using it.”\\nFor anyone weighing whether to join us at DigitalOcean, Tyler puts it simply: “If you want ownership, if you want to build products that developers actually use, and if you want to work on infrastructure at a moment where AI is changing incredibly fast, DigitalOcean is a great place to do that.”\\nJoin the team building the AI-Native Cloud\\nAs Tyler put it, Deploy reflects DigitalOcean’s culture:\\n“We care about builders. We care about simplicity. We care about making powerful infrastructure accessible. Deploy gives us a chance to show that through more than slides or announcements, but also through real demos, real conversations, and real products.”\\nThe company is growing fast, and the work is changing just as quickly. Today, DigitalOcean is shipping a full-stack inference cloud. For those who work here, that means there’s room to shape products from the ground up, and work directly with the customers using them.\\nIf that sounds like the kind of work you want to do, explore our open roles.\"}"},{"ref":"E54","kind":"event","title":"Powering the Inference Era: Inside the DigitalOcean Data & Learning Layer","date":"2026-06-03T19:23:28.774+00:00","date_source":"rss.item_date","source_url":"https://www.digitalocean.com/blog/dataandlearning","signal_url":"https://onlylabs.fyi/signals/c7bea94e-3fcc-4de2-814e-414aec3a9037","signal_json_url":"https://onlylabs.fyi/signals/c7bea94e-3fcc-4de2-814e-414aec3a9037/signal.json","text":"post_published · Powering the Inference Era: Inside the DigitalOcean Data & Learning Layer · signal_desk=talking · occurred_at=2026-06-03T19:23:28.774+00:00 · url=https://www.digitalocean.com/blog/dataandlearning · raw={\"excerpt\":\"Building an AI-native application requires a data layer that can do two things at once: handle the structured, transactional queries your application runs on, and understand meaning well enough to power semantic search across unstructured 
content. An AI application needs both — precise SQL for account balances and transaction records, and vector search to surface conceptually related patterns, anomalies, or past cases that a keyword query would never find.\\nMost teams end up stitching these together across different environments, where every query crosses a boundary. Latency compounds and costs grow with the complexity of the glue, not the value of the data. What holds together in a prototype starts to fracture under production load.\\nThe DigitalOcean Data & Learning layer is designed to close that gap by giving you structured, vector, and retrieval layers that work together in the same ecosystem.\\nReal-Time Inference and Learning\\nAt the heart of any sophisticated AI application is the need for grounded, context-aware inference. DigitalOcean now supports a unified set of tools across the data layer:\\nManaged PostgreSQL Advanced and MySQL Advanced Edition (Public Preview) for the structured, transactional data your application runs on\\nKnowledge Bases (General Availability) to handle the full retrieval pipeline from ingestion to answer\\nManaged Weaviate (Private Preview) for vector search on unstructured data\\nTogether, this unified platform allows developers to build real-time multimodal pipelines and manage enterprise knowledge bases with ease. Every retrieval your application or agent makes flows through this layer. When the data and retrieval layer is fully managed and scales with the application, your agent’s answers stay grounded and your service stays available.\\nThese services run on the same platform as DigitalOcean’s Inference Engine and Managed Agent infrastructure. This means zero egress between the data layer and inference, one billing relationship instead of three, and no auth glue to write just to connect an agent to its knowledge. We don’t try to rebuild what the open-source ecosystem already does well. 
Knowledge Bases, Managed Weaviate, and our managed PostgreSQL and MySQL run on the same open standards developers already trust, integrated into one platform so the path to scale gets shorter.\\nPostgreSQL and MySQL Advanced Edition: Scale the foundation, Faster\\nEvery transactional application needs a database. It’s the layer everything else depends on, and when it goes down, the application goes with it. For AI-native companies, the stakes are even higher. If you are deploying autonomous agents that rely on structured data, or building an app with AI from scratch, your database is the single source of truth. If it goes down, your entire application goes down with it.\\nTo support these critical workloads as they scale, we are launching PostgreSQL and MySQL Advanced Edition (now in Public Preview). This new tier is purpose-built for larger customers and high-growth AI startups that require a more resilient, purpose-built platform. When your agent needs to pull a month of transaction logs before generating an analysis, Advanced Edition is the layer that has to stay up, stay fast, and stay consistent. That’s the foundation everything else in this stack sits on.\\nAdvanced Edition retains the critical features you love from our Standard tier, like automatic disk scaling, but has been entirely re-engineered for maximum speed and minimal disruption:\\nScale in as little as minutes: When you need to scale database capacity, the operation now takes mere minutes instead of hours.\\nReliable operations: Benefit from highly resilient high-availability (HA) clusters with proxy-based failover which keeps your application alive even during node failures.\\nObserve with confidence: Get deeper insights into your query performance and system health to understand, debug, or dive deep into your database’s performance.\\n\\nEven before the introduction of our Advanced Edition, customers were achieving massive scale on our Standard tier. 
For example, Picap, a rideshare platform supporting over 1 million rides per day, achieved 4x cost savings by moving Postgres and Kafka to DigitalOcean. With Advanced Edition, we are giving you a stronger foundation to support your app as it scales.\\nEnroll in the MySQL and PostgreSQL Public Previews and get early access today.\\nKnowledge Bases: Simplifying RAG\\nWhile robust databases handle your structural foundation, DigitalOcean Knowledge Bases (now in GA) takes the Data & Learning layer a step further by simplifying unstructured data management. Building a traditional Retrieval-Augmented Generation (RAG) stack typically requires stitching together multiple vendors for vector storage, embedding models, and retrieval logic; Knowledge Bases replaces that complexity by turning the entire subsystem into a built-in platform primitive.\\nKey benefits include:\\nZero-Config Lifecycle: Simply upload your documents and pick an embedding model. DigitalOcean handles the entire pipeline—ingest, chunk, embed, and hybrid retrieval—automatically.\\nThe RAG Playground: Test your strategies interactively in the console. You can refine chunking and retrieval quality without writing throwaway scripts—all in a single pane.\\nAgent-Ready with MCP: With one line of config, any MCP-compatible agent, including AI Platform Agents, can use your Knowledge Base as its retrieval tool.\\nAffordable Pricing: Production-ready infrastructure starting at $19.60, with embedding tokens at just $0.02/1M. No minimum commitment required.\\nUsing DigitalOcean Knowledge Bases, LawVo was able to accelerate their path to production:\\n\\\"Before DigitalOcean Knowledge Bases, we were looking at weeks of work to stand up a production RAG pipeline behind our LawvoAI offering—vector DB, embeddings, chunking, the whole stack. With DigitalOcean, we had a citation-backed knowledge base running in a day. 
That lets our team focus on the product, not the retrieval plumbing.”\\nHovsep Seraydarian, Co-founder & CTO LawVo\\nManaged Weaviate: Relief from Vector Ops\\nWhile Knowledge Bases offer a seamless, turnkey experience, some teams need a more complete, hands-on control over their schema, chunking, and retrieval. However, self-hosting a vector database eventually leads to an operational wall and proprietary alternatives often result in unpredictable billing spikes. Managed Weaviate (now in Private Preview) offers a solution: the unmodified open-source engine you already know, delivered as a managed service.\\nHow it works:\\nZero Lock-in: Use the same Python, JavaScript, and Go clients with no proprietary SDKs or code changes required for migration.\\nPlatform Native: One invoice, one API token. Your vector store now lives directly alongside your Droplets, Managed Databases and Serverless Inference—no separate vendor relationship to manage.\\nPredictable Pricing: Paired with DigitalOcean Serverless Inference, you can keep your data embeddings, retrieval, and generation co-located with zero egress fees. Plus, you can bring your own embedding provider if you prefer. Be the first to try Managed Weaviate on DigitalOcean.\\nDigitalOcean Data & Learning Layer: Built for Modern AI\\nThe Data & Learning layer is built so the data and the inference don’t live in different houses. For AI-native companies building toward production, whether the workload is petabytes of transaction logs, an enterprise knowledge base, or real-time multimodal pipelines, the result is a faster path from prototype to scale on a stack the team doesn’t have to assemble.\\nReady to dive deeper? 
Check out the Data & Learning segment of the Deploy Keynote or watch the Product Session on the Data Layer.\\nJoin the over 85,000 customer accounts already running on our Managed Databases, production-tested for the most demanding AI workloads.\"}"},{"ref":"E55","kind":"event","title":"DigitalOcean Serverless Inference: A Deep Dive","date":"2026-06-03T00:00:00.000Z","date_source":"page.visible_date","source_url":"https://www.digitalocean.com/blog/serverless-inference-deep-dive","signal_url":"https://onlylabs.fyi/signals/0cb0359e-bd7d-40dd-ac01-a99745272505","signal_json_url":"https://onlylabs.fyi/signals/0cb0359e-bd7d-40dd-ac01-a99745272505/signal.json","text":"post_published · DigitalOcean Serverless Inference: A Deep Dive · signal_desk=talking · occurred_at=2026-06-03T00:00:00.000Z · url=https://www.digitalocean.com/blog/serverless-inference-deep-dive · raw={\"excerpt\":\"The Problem: Inference Gets Hard at Scale\\nIf you’ve shipped an AI feature to production, you already know: the hard part isn’t making a model respond to a prompt. The hard part is making it respond more reliably, at scale, across multiple models, without burning through your budget.\\nThe moment real users show up, you’re dealing with GPU resource contention, traffic unpredictability (a single enterprise customer can 10x your request volume overnight), latency-cost tradeoffs that shift constantly, and multi-model orchestration across text, vision, image, video, and audio — each with different API contracts and failure characteristics.\\nMost teams spend months just getting the infrastructure stable. 
We built DigitalOcean Serverless Inference so you don’t have to.\\nWhat Serverless Inference Is\\nDigitalOcean Serverless Inference is a fully managed, API-first inference platform — 30+ foundation models across text, code, vision, image generation, video generation, and speech, all through a single API key, a single base URL, and pay-per-token pricing with no minimum commitments.\\nThe core idea: Serverless Inference separates model consumption from infrastructure management. It automatically scales to handle incoming requests. Because it does not maintain sessions, each request must include the full context needed by the model. You interact with models through an API surface. We handle GPU allocation, scaling, and model lifecycle underneath.\\nSingle Endpoint, Every Mode\\n\\n\\nNone\\nhttps://inference.do-ai.run\\n\\nAuthenticate with a Model Access Key (recommended — scoped to specific models, VPC-restrictable)\\nOpenAI and Anthropic Compatible\\nThe API is OpenAI-compatible. If you have existing code that calls OpenAI, switch to DigitalOcean by changing two lines — the base URL and the key:\\n\\n\\nPython\\nfrom openai import OpenAI\\nimport os\\n\\nclient = OpenAI(\\n    base_url=\\\"https://inference.do-ai.run/v1/\\\",\\n    api_key=os.getenv(\\\"MODEL_ACCESS_KEY\\\"),\\n)\\nresponse = client.chat.completions.create(\\n    model=\\\"deepseek-v3.2\\\",\\n    messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"Explain the CAP theorem.\\\"}],\\n)\\n\\nWe also support Anthropic-compatible patterns through the /v1/messages endpoint, so Claude Code and other agentic workflows work directly through DigitalOcean without vendor lock-in.\\nIntelligent Routing, Built-in Tools, and More\\nBeyond basic inference, the platform includes an Inference Router for automatic multi-model routing, built-in tools for knowledge retrieval, MCP, and web search, prompt caching for cost reduction on repeated contexts, and reasoning for step-by-step thinking traces. 
We’ll cover each of these in detail later in this post.\\nColocated with Your Cloud\\nUnlike standalone inference providers, Serverless Inference is part of the DigitalOcean platform. Your inference workloads sit alongside databases, object storage, Kubernetes clusters, and VPCs — all under unified billing and access control. This is a structural advantage where DigitalOcean is unmatched.\\nArchitecture: How Requests Flow\\nHere’s what happens under the hood when your application sends a request:\\n\\n\\nNone\\nClient Request\\n    → Cloudflare (edge proxying, DDoS protection, TLS)\\n        → Load Balancer(auth, validation)\\n            → Traefik (ingress routing on DOKS)\\n                → Intelligent Inference API (routing, billing)\\n                    → Model Executor Service (provider translation)\\n                        → Model Backend (Ray + vLLM for open-source,\\n                           or provider API for OpenAI/Anthropic)\\n                            → Streaming Response → Client\\n                                → Kafka (billing events, telemetry)\\n\\nLoad Balancer\\nDistributes traffic across the inference cluster and serves as the policy enforcement point. Every request is validated against the Model Access Key or DigitalOcean personal access token, with the load balancer resolving tenant identity and confirming the caller is authorized to use the requested model; VPC-bound keys are enforced, rejecting requests originating outside the restricted network. Per-account and per-model rate limits are also enforced via a regional Redis cache to protect platform stability and help prevent single-tenant resource exhaustion. 
Finally, before a request reaches any backend, it is validated against the model’s contract in the Model Catalog, the centralized source of truth, so that requests with unsupported parameters or incorrect endpoint shapes are rejected deterministically with a clear error rather than surfacing as cryptic provider errors deep in the stack.\\nIntelligent Inference API\\nThe customer-facing entry point and policy enforcement layer. Every request passes through this service, where it handles authentication (validating Model Access Keys), request validation against the model’s contract, rate limiting (via Redis), billing metering (token counts dispatched to Kafka), and SSE streaming. It also orchestrates the Inference Router and built-in tool execution. Deployed as a stateless service on DigitalOcean Kubernetes Service (DOKS) with autoscaling.\\nModel Executor Service — The Translation Layer\\nThis is one of the most important — and least visible — parts of the platform.\\nEvery model provider has quirks. Anthropic structures tool calls differently from OpenAI. Streaming event formats vary. Some providers support parameters that others silently ignore. Request schemas, error shapes, and response normalization all differ in subtle ways. If you’ve ever tried to build a multi-model application yourself, you’ve felt this pain — it’s a constant stream of provider-specific edge cases that break your code in production.\\nThe Model Executor Service helps solve this by sitting between the Intelligent Inference API and every model backend. It translates a standardized request envelope into whatever provider-native format the backend expects, executes the call, then normalizes the response back into a consistent shape. 
Provider-specific quirks — streaming differences, parameter mappings, error translations — are absorbed here to prevent leaks to your application.\\nThis is why every model on the platform works identically across the API endpoints (/chat/completions, /responses, and /messages (Anthropic Models Only)), regardless of who built the model or where it runs. You don’t need to know whether the model behind a request is hosted on our GPUs, served by OpenAI, or running on Anthropic’s infrastructure. The translation layer presents a single, consistent API contract for all of them.\\nModel Runtime: Ray + vLLM\\nFor DigitalOcean-hosted open-source models, the runtime layer is built on Ray (orchestration and scheduling across NVIDIA H100 GPU nodes) and vLLM (KV cache management, continuous batching, token generation). Ray multiplexes multiple models across shared GPU pools, so you pay for tokens consumed rather than GPU-hours reserved.\\nCommercial Model Routing\\nFor commercial models from OpenAI and Anthropic, the Model Executor Service handles the provider translation and forwards to the provider’s API. The response is normalized back through the same pipeline. This means you can call Claude Sonnet and DeepSeek V3.2 from the same application with the same key, the same endpoint, and get back identically structured responses.\\nBilling Pipeline\\nEvery request generates a usage event (model ID, token counts, metadata) written to regional Kafka. The billing pipeline consumes these asynchronously — it never sits in the critical request path, so billing latency doesn’t affect inference latency.\\nGetting Started: From Zero to Inference\\nIf you’ve used the OpenAI SDK before, you’ll be productive in minutes. Here’s a walkthrough from first API key to streaming responses.\\nStep 1: Get Your Model Access Key\\nIn the DigitalOcean Control Panel, click INFERENCE → Manage → Model Access Keys. 
Create a key and export it:\\n\\n\\nShell\\nexport MODEL_ACCESS_KEY=\\\"your-model-access-key-here\\\"\\n\\nStep 2: Chat Completions API\\nThe most common endpoint. Send a POST to /v1/chat/completions with a model ID, messages, temperature, and token limit:\\n\\n\\nShell\\ncurl -X POST https://inference.do-ai.run/v1/chat/completions \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"llama3.3-70b-instruct\\\",\\n    \\\"messages\\\": [{\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"What is the capital of France?\\\"}],\\n    \\\"temperature\\\": 0.7,\\n    \\\"max_completion_tokens\\\": 256\\n  }'\\n\\nThe same request using the Python OpenAI SDK:\\n\\n\\nPython\\nfrom openai import OpenAI\\nfrom dotenv import load_dotenv\\nimport os\\n\\nload_dotenv()\\n\\nclient = OpenAI(\\n    base_url=\\\"https://inference.do-ai.run/v1/\\\",\\n    api_key=os.getenv(\\\"MODEL_ACCESS_KEY\\\"),\\n)\\n\\nresp = client.chat.completions.create(\\n    model=\\\"llama3.3-70b-instruct\\\",\\n    messages=[\\n        {\\\"role\\\": \\\"system\\\", \\\"content\\\": \\\"You are a helpful assistant.\\\"},\\n        {\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"Tell me a fun fact about octopuses.\\\"}\\n    ],\\n)\\nprint(resp.choices[0].message.content)\\n\\nTo switch models, change one parameter — “model”: “llama-4-maverick”. No SDK, endpoint, or auth changes needed.\\nStep 3: Responses API\\nFor newer integrations and multi-step tool use, use the Responses API at /v1/responses. 
It takes a single input field instead of a messages array:\\n\\n\\nShell\\ncurl -sS -X POST https://inference.do-ai.run/v1/responses \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"openai-gpt-oss-20b\\\",\\n    \\\"input\\\": \\\"What is the capital of France?\\\",\\n    \\\"max_output_tokens\\\": 50,\\n    \\\"temperature\\\": 0.7,\\n    \\\"stream\\\": false\\n  }'\\n\\nStep 4: Streaming\\nFor chatbots, code assistants, and interactive tools, set stream: true to receive tokens via Server-Sent Events as they’re generated. Using the same client from Step 2:\\n\\n\\nPython\\nstream = client.responses.create(\\n    model=\\\"openai-gpt-oss-20b\\\",\\n    input=\\\"What is the capital of France?\\\",\\n    max_output_tokens=50,\\n    temperature=0.7,\\n    stream=True,\\n)\\nfor event in stream:\\n    # The Responses API streams typed events; text arrives as output_text deltas\\n    if event.type == \\\"response.output_text.delta\\\":\\n        print(event.delta, end=\\\"\\\", flush=True)\\n\\nAll API Endpoints\\n\\n\\n\\nAPI\\nType\\nEndpoint\\nDescription\\n\\n\\n\\n\\nModels\\nSync\\n/v1/models\\nList available models\\n\\n\\nChat Completions\\nSync\\n/v1/chat/completions\\nChat-style prompts (text + VLM)\\n\\n\\nResponses\\nSync\\n/v1/responses\\nMulti-step tool use, stateful interactions\\n\\n\\nMessages\\nSync\\n/v1/messages\\nAnthropic-compatible (Claude Code, agentic workflows)\\n\\n\\nImage Generation\\nSync\\n/v1/images/generations\\nText-to-image, up to 1 megapixel\\n\\n\\nText-to-Speech\\nSync\\n/v1/audio/speech\\nText-to-speech (binary audio)\\n\\n\\nEmbeddings\\nSync\\n/v1/embeddings\\nDense vector representations for search/RAG\\n\\n\\nVideo\\nAsync\\n/v1/video/generations\\nText-to-video (MP4, 480p or 720p)\\n\\n\\nfal Models\\nAsync\\n/v1/async-invoke\\nImage/TTS via fal models\\n\\n\\n\\n\\nFor async video generation, results expire 2 hours after job completion.\\nPrompt Caching\\nThe sections above cover the basics of 
sending requests. Now let’s look at the platform features that can make production inference faster and cheaper — starting with prompt caching.\\nPrompt caching lets you cache context across requests so repeated prefixes are billed at a lower rate. This is critical for agentic workflows where 80–97% of input tokens repeat across sequential requests.\\nAnthropic Models\\nUse cache_control with type: ephemeral and a ttl of 5m or 1h:\\n\\n\\nShell\\ncurl -X POST https://inference.do-ai.run/v1/chat/completions \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"anthropic-claude-4.6-sonnet\\\",\\n    \\\"messages\\\": [{\\n      \\\"role\\\": \\\"developer\\\",\\n      \\\"content\\\": [{\\n        \\\"type\\\": \\\"text\\\",\\n        \\\"text\\\": \\\"You are a helpful coding assistant with extensive knowledge of Python and cloud infrastructure.\\\",\\n        \\\"cache_control\\\": {\\\"type\\\": \\\"ephemeral\\\", \\\"ttl\\\": \\\"1h\\\"}\\n      }]\\n    },\\n    {\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"Write a Python function to validate email addresses.\\\"}],\\n    \\\"max_completion_tokens\\\": 1024\\n  }'\\n\\nThe response shows cache_created_input_tokens on the first call; subsequent calls show cache_read_input_tokens at reduced cost.\\nOpenAI Models\\nFor prompts with 1,024+ tokens, use prompt_cache_retention set to in_memory or 24h:\\n\\n\\nJSON\\n{\\n  \\\"model\\\": \\\"gpt-4o-mini\\\",\\n  \\\"prompt_cache_retention\\\": \\\"24h\\\",\\n  \\\"messages\\\": [...],\\n  \\\"temperature\\\": 0.2\\n}\\n\\nOpen-Source Models\\nPrompt caching for DigitalOcean-hosted open-source models is not currently supported. It is available only for Anthropic and OpenAI models at this time. 
Open-source model caching is on the roadmap as a high-priority investment.\\nReasoning\\nFor models that support it, you can enable step-by-step thinking traces — useful for math, logic, coding, and complex analytical tasks.\\nAnthropic format — use a reasoning object with effort and optional max_tokens:\\n\\n\\nShell\\ncurl -X POST https://inference.do-ai.run/v1/chat/completions \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"anthropic-claude-opus-4.5\\\",\\n    \\\"messages\\\": [{\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"What is 27 * 453? Think step by step.\\\"}],\\n    \\\"max_completion_tokens\\\": 1192,\\n    \\\"reasoning\\\": {\\\"effort\\\": \\\"high\\\", \\\"max_tokens\\\": 1024}\\n  }'\\n\\nThe response includes reasoning_content (the thinking trace) alongside content (the final answer). If you omit max_tokens, the reasoning budget defaults to a percentage of max_completion_tokens: 20% for low, 50% for medium, 80% for high, 95% for max.\\nOpenAI format — use reasoning_effort directly (none, low, medium, high, max):\\n\\n\\nJSON\\ncurl -X POST https://inference.do-ai.run/v1/chat/completions \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"anthropic-claude-4.6-sonnet\\\",\\n    \\\"messages\\\": [\\n      {\\n        \\\"role\\\": \\\"user\\\",\\n        \\\"content\\\": \\\"What is 27 * 453? Think step by step.\\\"\\n      }\\n    ],\\n    \\\"max_completion_tokens\\\": 8192,\\n    \\\"reasoning_effort\\\": \\\"high\\\"\\n  }'\\n\\nMultimodal Inference\\nServerless Inference isn’t text-only. 
We support vision-language models, image generation, video generation, text-to-speech, and vector embeddings — all through the same API key and base URL.\\nVision-Language Models\\nVLMs accept text + image inputs (PNG, JPG, JPEG, WEBP as base64 or HTTPS URLs) and return text:\\n\\n\\nShell\\ncurl https://inference.do-ai.run/v1/chat/completions \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"nemotron-nano-12b-v2-vl\\\",\\n    \\\"messages\\\": [{\\n      \\\"role\\\": \\\"user\\\",\\n      \\\"content\\\": [\\n        {\\\"type\\\": \\\"text\\\", \\\"text\\\": \\\"What is shown in this image?\\\"},\\n        {\\\"type\\\": \\\"image_url\\\", \\\"image_url\\\": {\\\"url\\\": \\\"https://example.com/sample.jpg\\\"}}\\n      ]\\n    }],\\n    \\\"max_tokens\\\": 512\\n  }'\\n\\nImage Generation\\nGenerate images up to 1 megapixel. Always specify n and size:\\n\\n\\nShell\\ncurl https://inference.do-ai.run/v1/images/generations \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"stable-diffusion-3.5-large\\\",\\n    \\\"prompt\\\": \\\"A sunset over mountains\\\",\\n    \\\"n\\\": 1,\\n    \\\"size\\\": \\\"1024x1024\\\",\\n    \\\"quality\\\": \\\"auto\\\",\\n    \\\"response_format\\\": \\\"b64_json\\\",\\n    \\\"background\\\": \\\"auto\\\",\\n    \\\"output_format\\\": \\\"png\\\"\\n  }'\\n\\nText-to-Video (Asynchronous)\\nSubmit a job, poll for status, download MP4 when complete. Output is 480p (9 seconds) or 720p (5 seconds). 
Videos expire 2 hours after completion:\\n\\n\\nShell\\ncurl -X POST https://inference.do-ai.run/v1/video/generations \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"wan2.2-t2v-a14b\\\",\\n    \\\"prompt\\\": \\\"A drone shot flying over a lush green valley at golden hour\\\",\\n    \\\"size\\\": \\\"1280x720\\\",\\n    \\\"fps\\\": 16\\n  }'\\n\\nThe request returns a job ID and job status:\\n\\n{\\n  \\\"id\\\": \\\"job_abc123\\\",\\n  \\\"status\\\": \\\"processing\\\"\\n}\\n\\nNext, poll the result using the job ID:\\n\\ncurl https://inference.do-ai.run/v1/video/generations/job_abc123 \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\"\\n\\nYou can see the following when the job completes\\n\\n{\\n  \\\"created_at\\\": 1777003604,\\n  \\\"error\\\": null,\\n  \\\"id\\\": \\\"video_abc\\\",\\n  \\\"model\\\": \\\"wan2.2-t2v-a14b\\\",\\n  \\\"object\\\": \\\"video\\\",\\n  \\\"output\\\": null,\\n  \\\"status\\\": \\\"completed\\\",\\n  \\\"x_request_id\\\": null\\n}\\n\\nText-to-Speech\\n\\n\\nShell\\ncurl -sS https://inference.do-ai.run/v1/audio/speech \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"qwen3-tts-voicedesign\\\",\\n    \\\"input\\\": \\\"Welcome to DigitalOcean.\\\",\\n    \\\"voice\\\": \\\"alloy\\\",\\n    \\\"response_format\\\": \\\"mp3\\\",\\n    \\\"instructions\\\": \\\"Speak naturally.\\\"\\n  }' -o speech.mp3\\n\\nDigitalOcean is one of the few platforms where all three are available serverlessly through the same API key and endpoint.\\nBuilt-in Tools\\nBuilt-in tools are server-side integrations that extend what models can do during inference — without you managing tool orchestration. Add tool definitions to your API request, and the platform handles discovery, execution, and response integration automatically. 
They work with the Chat Completions, Responses, and Messages APIs.\\nKnowledge Base Retrieval (RAG)\\nLet the model query your private data sources during inference:\\n\\n\\nShell\\ncurl -X POST https://inference.do-ai.run/v1/chat/completions \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"openai-gpt-4o\\\",\\n    \\\"messages\\\": [{\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"What features does DigitalOcean Inference offer?\\\"}],\\n    \\\"tools\\\": [{\\\"type\\\": \\\"knowledge_base_retrieval\\\", \\\"knowledge_base_id\\\": \\\"<your-kb-id>\\\"}],\\n    \\\"max_tokens\\\": 1024\\n  }'\\n\\nModel Context Protocol (MCP)\\nConnect to remote MCP servers — authenticated or unauthenticated — for live data access:\\n\\n\\nShell\\ncurl -X POST https://inference.do-ai.run/v1/chat/completions \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"openai-gpt-4o\\\",\\n    \\\"messages\\\": [{\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"Fetch my DigitalOcean account info.\\\"}],\\n    \\\"tools\\\": [{\\n      \\\"type\\\": \\\"mcp\\\",\\n      \\\"server_label\\\": \\\"digitalocean\\\",\\n      \\\"server_url\\\": \\\"https://accounts.mcp.digitalocean.com/mcp\\\",\\n      \\\"authorization\\\": \\\"Bearer $DIGITALOCEAN_API_TOKEN\\\",\\n      \\\"allowed_tools\\\": [\\\"account-get-information\\\"]\\n    }],\\n    \\\"tool_choice\\\": \\\"required\\\",\\n    \\\"max_tokens\\\": 512\\n  }'\\n\\nWeb Search\\nGive models access to real-time web content:\\n\\n\\nShell\\ncurl -X POST https://inference.do-ai.run/v1/responses \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"openai-gpt-4o\\\",\\n    \\\"input\\\": \\\"What are the latest DigitalOcean Droplet pricing 
changes?\\\",\\n    \\\"tools\\\": [{\\\"type\\\": \\\"web_search\\\", \\\"max_uses\\\": 3, \\\"max_results\\\": 5}],\\n    \\\"max_output_tokens\\\": 1024\\n  }'\\n\\nAgentic Workflows (Claude Code)\\nWe offer full Anthropic tool-use compatibility through /v1/messages. Set ANTHROPIC_BASE_URL to https://inference.do-ai.run/v1/messages to run Claude Code and other agentic workflows on DigitalOcean:\\n\\n\\nShell\\ncurl https://inference.do-ai.run/v1/messages \\\\\\n  -H \\\"x-api-key: $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"anthropic-version: 2023-06-01\\\" \\\\\\n  -H \\\"content-type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"anthropic-claude-4.6-sonnet\\\",\\n    \\\"max_tokens\\\": 4096,\\n    \\\"tools\\\": [{\\n      \\\"name\\\": \\\"read_file\\\",\\n      \\\"description\\\": \\\"Read a file from the local filesystem.\\\",\\n      \\\"input_schema\\\": {\\\"type\\\": \\\"object\\\", \\\"properties\\\": {\\\"path\\\": {\\\"type\\\": \\\"string\\\"}}, \\\"required\\\": [\\\"path\\\"]}\\n    }],\\n    \\\"messages\\\": [{\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"Refactor the authentication logic in src/auth.ts.\\\"}]\\n  }'\\n\\nPricing\\n(current as of May 2026)\\nKnowledge base retrieval and MCP incur no additional charges beyond standard per-token inference costs. Web search is $10 per 1,000 requests.\\nInference Router\\nWe mentioned the Inference Router earlier as a key differentiator. Here’s how it works in practice.\\nThe Inference Router classifies each incoming request against your configured tasks, then selects the best model from a pool. 
Each task has up to 3 models and a selection policy: Cost Efficiency (cheapest by token cost), Speed Optimization (fastest by TTFT), Manual Ranking (your specified order), or Optimal (DigitalOcean’s benchmarking, for pre-configured tasks).\\nUsing it is a one-line change — prefix the router name with router: in the model field:\\n\\n\\nShell\\ncurl -X POST https://inference.do-ai.run/v1/chat/completions \\\\\\n  -H \\\"Authorization: Bearer $MODEL_ACCESS_KEY\\\" \\\\\\n  -H \\\"Content-Type: application/json\\\" \\\\\\n  -d '{\\n    \\\"model\\\": \\\"router:my-support-router\\\",\\n    \\\"messages\\\": [{\\\"role\\\": \\\"user\\\", \\\"content\\\": \\\"What are your support hours?\\\"}],\\n    \\\"stream\\\": true\\n  }'\\n\\nThe response’s model field tells you which model was selected, and the x-model-router-selected-route header shows which task matched. If no task matches, fallback models handle the request. If a model is unavailable, the router fails over automatically.\\nFor a complete walkthrough of building a real support bot with the Inference Router, see Inference Routing: Matching Models to Tasks, Not Just Requests.\\nProduction Operations\\nThe production lifecycle is anchored by the seamless handoff between Serverless Inference (SI) and Dedicated Inference (DI), where the SI layer orchestrates calls to vLLM endpoints that serve as the high-performance Inference Engine.\\nTo maintain stability during high-concurrency spikes, vLLM’s internal scheduler manages request queuing and batching, ensuring compute resources are saturated without being overwhelmed. 
This infrastructure is underpinned by a robust reliability framework.\\nHigh Service Level Objectives (SLOs) are enforced via auto-pod healing, a capability that automatically detects and recovers from node failures to ensure the system remains resilient and available at scale.\\nWith the APIs and platform features covered, here’s how the system behaves in production.\\nObservability\\nEvery request generates telemetry automatically — no instrumentation required. View metrics in the Control Panel under INFERENCE → Serverless Inference → Analyze:\\n\\n\\n\\nCategory\\nMetrics\\n\\n\\n\\n\\nReliability\\nError rates (4xx/5xx), success rates, RPM\\n\\n\\nLatency\\nTime to first token (TTFT), end-to-end\\n\\n\\nCost\\nPer-invocation, per-model spend\\n\\n\\nUsage\\nToken consumption by model\\n\\n\\nRate Limiting\\nThrottled request/token counts\\n\\n\\nMultimodal\\nImage count, audio duration, cost by modality\\n\\n\\n\\n\\nFailure Recovery\\nModel failures: Ray reschedules on a healthy GPU node; clients should retry.\\nRate limits: HTTP 429 — implement exponential backoff.\\nProvider failures: Standardized error responses for commercial model outages.\\nStream drops: Partial response delivered; retry the full request.\\nRouter failover: Automatic reroute to fallback models when primary is unavailable.\\nContent Safety\\nInput guardrails block policy-violating prompts before inference begins. Output guardrails withhold violating content after generation.\\nEconomics\\nPay-Per-Token\\nA dedicated GPU costs the same at 5% utilization or 95%. 
Serverless Inference pools GPU capacity across all customers — you pay only for tokens consumed, not GPU-hours reserved.\\nPricing (per 1M tokens)\\n\\n\\n\\nModel\\nDigitalOcean (Input/Output)\\nCompetitor Range\\n\\n\\n\\n\\nLlama 4 Maverick\\n$0.250 / $0.800\\nBedrock $0.24/$0.97\\n\\n\\nDeepSeek V3.2\\n$0.300 / $1.000\\nBedrock $0.62/$1.85\\n\\n\\nQwen3.5 397B\\n$0.550 / $3.500\\nTogether AI $0.60/$3.60\\n\\n\\nSD 3.5 Large (image)\\n$0.065/image\\nBedrock $0.08\\n\\n\\nWan 2.2 T2V (720p video)\\n$0.31/video\\nTogether AI $0.66\\n\\n\\nQwen3 TTS\\n$0.020/1K tokens\\nReplicate $0.02/1K tokens; ElevenLabs Multilingual V2 $0.10/1K tokens\\n\\n\\n\\n\\n*Pricing reflects publicly listed rates as of May 29, 2026. Competitor pricing is subject to change.\\n*Off-peak discounts of 5–10% apply during 10 PM – 4 AM PT on eligible open-source models.\\nSecurity and Data Privacy\\nZero data retention: Synchronous outputs are never stored. Async video outputs expire after 2 hours. Inputs are never stored for training or logging.\\nVPC support with model access keys restrictable to VPC networks.\\nKey scoping to specific models and batch inference.\\nTiered billing: Spend caps by tier. Tier 1 and Tier 2 (new users) get open-source models only; Tier 3+ unlocks commercial models from OpenAI and Anthropic.\\nWhat We Learned\\nBuilding this platform taught us things that no architecture diagram captures:\\nServerless Inference traffic is unpredictable. A single agentic integration can 10x volume overnight. A 200K-token request consumes more GPU memory than 1,000 short ones. Traditional autoscaling heuristics don’t apply.\\nGPU utilization beats raw hardware. 10 nodes at 80% utilization outperform 20 at 30%. Scheduling intelligence (Ray) and serving optimization in the inference engine (such as vLLM) yield more effective capacity than adding hardware.\\nCold starts are infrastructure problems. 
They are dominated by weight pre-staging, memory availability, and scheduling speed — not just model loading time.\\nStreaming shapes everything. It affects load balancer timeouts, proxy design, billing metering, and error handling. It’s not a feature — it’s an architectural constraint.\\nDevelopers want reliability, not infrastructure. The most consistent feedback we hear: make the API work, give clear errors, make billing transparent. The best inference infrastructure is the kind you never think about.\\nWhat’s Next\\nOpen-source prompt caching — caching is live for Anthropic/OpenAI today; we plan to extend caching to open-source models.\\nExpanded model catalog — we plan to expand the catalog to include next-gen DeepSeek, Qwen, and Llama families, plus speech-to-text.\\nMulti-region — we plan to add EU regions to support General Data Protection Regulation (GDPR)-compliant workloads.\\nEnhanced observability — we plan to add cache hit rates, latency histograms, and SI logs.\\nRouter enhancements — we’re exploring router support for popular coding agents.\\nThe above reflects our current plans and product direction, and is subject to change without notice. 
It is provided for informational purposes only and is not a commitment to deliver any material, feature, or functionality.\\nGet Started\\nCreate an account at cloud.digitalocean.com\\nCreate a Model Access Key under INFERENCE → Manage → Model Access Keys\\nMake your first request using the examples above\\nExplore the Model Playground to test models interactively\"}"},{"ref":"E56","kind":"event","title":"Open by Design: How NVIDIA and DigitalOcean Are Building the Stack for the Always-On Agentic Era","date":"2026-06-02T18:29:57.287+00:00","date_source":"rss.item_date","source_url":"https://www.digitalocean.com/blog/open-by-design-tech","signal_url":"https://onlylabs.fyi/signals/3183ed38-b620-40aa-a6e2-b4f7ae2bb291","signal_json_url":"https://onlylabs.fyi/signals/3183ed38-b620-40aa-a6e2-b4f7ae2bb291/signal.json","text":"post_published · Open by Design: How NVIDIA and DigitalOcean Are Building the Stack for the Always-On Agentic Era · signal_desk=talking · occurred_at=2026-06-02T18:29:57.287+00:00 · url=https://www.digitalocean.com/blog/open-by-design-tech · raw={\"excerpt\":\"The growth of generative AI isn’t driven solely by AI companies with proprietary models. Open-source AI is reshaping the developer ecosystem, fueled by a growing community of builders. 
But what does it take to go from open models to production-ready agentic AI, and what do developers need to know to get there?\\nThis question was the focus of the DigitalOcean Deploy session, “Open by Design: How NVIDIA and DigitalOcean Are Building the Stack for the Always-On Agentic Era.” During this 30-minute chat, Kari Briski, VP Gen AI at NVIDIA, and Salman Paracha, SVP AI at DigitalOcean, discuss why AI-native teams are demanding openness, model flexibility, and infrastructure built for agents that never sleep—and what NVIDIA and DigitalOcean are doing to build support for this next generation of AI development.\\nWatch the full recorded session from Deploy 2026:\\nView YouTube video\\n\\n\\n\\n\\nOpen-Source Models Need Commitment, Not Just a Launch\\nThere are many open models in the ecosystem, but having great models doesn’t guarantee they will be consistently improved or regularly updated. NVIDIA noticed a potential gap in this space for its enterprise customers, who regularly wanted access to open-source models that are launched and then left untouched.\\nThis spurred the development of open models such as NVIDIA Nemotron. Released in March 2026, it serves as a family of multi-modal models designed for agentic AI. Having access to these open models enables developers to create agentic applications that require advanced reasoning, high compute efficiency, and open source standards. With Nemotron models and NVIDIA software libraries, developers can evolve their projects over time and receive regular updates and expanded support.\\nRunning open-weight LLMs locally gives you more control over performance, privacy, and customization. 
This NVIDIA Nemotron 3 tutorial walks through deploying NVIDIA’s Nemotron 3 Nano on a DigitalOcean GPU Droplet, helping you experiment with efficient open models on dedicated GPU infrastructure without relying entirely on hosted AI APIs.\\n“We’ve been building these models for ourselves because we want to build great systems,” Briski says. “We’re treating [these models] like a library and are committed just like we are with our GPUs and [CUDA] libraries and our stack that we’ll improve upon.”\\nBeyond the models themselves, there’s also a proliferation of harnesses—the orchestration frameworks that wrap around models to manage agent lifecycle, memory, tool calling, and scaling—which are just as important for building agentic systems.\\nAgents Are Only as Good as Their Evaluations\\nParacha highlighted that most developers building AI-native applications are still facing a high hurdle and admission rate in determining whether it’s possible to build something as durable as OpenClaw or Claude Code.\\nFiguring out true evaluations and observability becomes a challenge, and these developers are left wondering whether they can truly compete with AI companies that have funding for research and top-of-the-line hardware. So what does lowering that barrier to entry (and creating developer confidence) look like?\\nEvaluation is where it starts, according to Briski. While there are many test cases and verifications for specific use cases (such as coding), other applications lack readily available benchmarks, and academic options don’t necessarily effectively evaluate real-world models or optimize performance.\\nWithout these standards, it becomes harder for developers to gauge the viability of their idea. For broader development, more test cases need to be created and data pulled from, which requires human knowledge and labeling. 
For industries like electronic design automation, NVIDIA is currently working with Synopsys and Cadence to develop these test cases and benchmarks to encourage development and agent creation.\\nSub-Agents Only Scale When You Can Trace Them\\nDevelopers running AI-native applications have adopted sub-agent workflows that break a problem into subtasks and delegate them to a single agent. Paracha has seen developers do this, but is curious about how this subset of AI development might shape up over the next few years, and what engineering principles still apply.\\nIf you’re curious about what a sub-agent (or multi-agent) system can do, read about the TradingAgents LLM, which is designed to function as a simulation for financial trading through specialized agents.\\n“There’s a thread in engineering right now where you still have to understand how the system was built, even though the agent is writing the code. So when sub-agents are going off, you are able to test them, you are able to verify, and break it down to where something might be going wrong, so that you, as the architect, can understand the system,” Briski explains.\\nThis philosophy also pairs well with adding traceability throughout the system, so you have references during troubleshooting rather than only an end product that leaves you with a black box. While there is a newer approach of feeding a system a whole bunch of information and having it develop an answer, the “divide and conquer” approach still seems to be the standard.\\nGood Tokenomics Starts With Outcomes, Not Token Count\\nScaling AI comes with a new problem: token usage. How can developers run AI systems that are consistently generating tokens and simultaneously build an effective business around them? What it really comes down to is the product’s value: the items delivered and the workflow efficiencies created.\\n“We’re in a stage right now where tokens are going to be counted differently as model architectures change. 
[But] we have to evolve our way of thinking because the way we count tokens generated with diffusion models and the latent spaces of tokens could all change. So I think instead of spinning out on how many tokens are being generated, it’s more about the value,” Briski explains.\\nBut organizations do need to consider cost, especially with the larger models. NVIDIA is taking technical measures to improve the efficiency of token use. This includes using a hybrid state-space transformer in the latest Nemotron model, rather than combining a dense model with mixture-of-experts (MoEs).\\nModel architectures are fundamental to token economics, and there’s been a general shift: from a very model-dense view of the world (megamodels with 8 billion or a trillion parameters) to a sparser proliferation of MoEs and the use of state-space models (SSMs) backed by NVIDIA. These SSMs, Briski says, remove some of the attention layers for pre-processing and reduce the compute you need for the data prefill.\\nOpen Source Forces the Whole AI Ecosystem to Move Faster\\nBeyond using SSMs, NVIDIA’s applied research team is consistently reviewing academic papers, testing new models, exploring new architectures, and collaborating with the open-source community.\\n“We actually put out a paper about the hybrid Mamba architecture in [early] 2025. What was interesting was that the Qwen model adopted it before our Nemotron product did. The point of how important open source is to share these ideas and learn from each other. We’re not just putting [our tech] out there. We’re also picking up ideas from other open source projects,” Briski says.\\nAt DigitalOcean, one of Paracha’s focuses is expanding the ecosystem around open-source projects. There’s Plano (DigitalOcean’s data-plane technology), along with a push for research on small action models (SAMs). 
These models use context compression (instead of reasoning tokens) to perform specific tasks more efficiently, without requiring long context windows. Paracha’s team is also looking into AI system harnesses and how DigitalOcean can use open source to empower developers with the freedom to choose the harness they want to run their models.\\n“The open harness is a zero-instrumentation plug-in architecture where you can bring in open code or a LangChain or LangGraph type of agent, and we help you manage and scale it. There are a lot of lifecycle events of an agent that still have to be solved for. With the open harness story, the real mantra is, ‘how do we enable choice and freedom and support the ecosystem versus create our own?’” Paracha asks.\\nAgent Workloads Are Always Changing Shape\\nGoing forward with these multi-agent and multi-layered systems, Paracha says, there will be a lot of work to be done on context compression and expansion.\\nBriski expands on this idea, saying inference workloads are dynamically changing and that there’s been a shift from long context input to long context output and initial reasoning and long context output; all the in-between steps are still evolving.\\n“Everything in the [Deploy] keynote is heading to ‘how can you optimize for these dynamically changing workloads with routing, keeping the cache and context right, even with compression for really long horizon tasks?’” Briski says.\\nGoing forward, developers will need to become familiar with long-horizon and long-running tasks, as well as self-evolving systems, which are related but ultimately distinct. 
Understanding these tasks helps developers manage their memory, compute power, and model architectures.\\nEvery Legacy Application Is an Agent Opportunity\\nAs the market moves from generative AI to open source AI, organizations and individual developers alike are looking at what might change over time and what won’t when it comes to how we think about and build AI-native applications.\\nBriski says that the need for compute won’t go away. It’s been proven across many scaling laws in pre-training, post-training, interface training, and agents that more computing power means greater intelligence capabilities.\\nWhat she’s most excited to see is more domains beyond coding pick up and create verifiers and reinforcement learning environments to support a wider range of AI-native and agent-based applications across different industries.\\n“There are so many things in our lives where I can’t wait until agents are infused into these applications. And so that’s why when you think about when agentic AI will be integrated into all of these legacy software applications, I get really excited,” she says.\\nDigitalOcean and NVIDIA are building together. 
DigitalOcean’s serverless inference runs on NVIDIA accelerated computing including NVIDIA Blackwell GPUs, NVIDIA Nemotron models are available directly on DigitalOcean’s AI Platform, and builders can prototype on build.nvidia.com before deploying to DigitalOcean GPU Droplets without rebuilding their stack.\\nWith NVIDIA Dynamo 1.0 integrated for production inference scaling and the joint NemoClaw project bringing secure, always-on agent deployment, the collaboration gives developers a direct path from experimentation to production.\\n→ Get started with DigitalOcean’s AI-Native Cloud\"}"},{"ref":"E57","kind":"event","title":"The Inference Tax: How Prefix-Aware Routing Eliminates the Hidden Cost of LLMs at Scale","date":"2026-06-02T00:00:00.000Z","date_source":"page.visible_date","source_url":"https://www.digitalocean.com/blog/reduce-llm-inference-costs-prefix-caching","signal_url":"https://onlylabs.fyi/signals/d88fd1d6-c8ff-4326-8a86-94bf45b96f3c","signal_json_url":"https://onlylabs.fyi/signals/d88fd1d6-c8ff-4326-8a86-94bf45b96f3c/signal.json","text":"post_published · The Inference Tax: How Prefix-Aware Routing Eliminates the Hidden Cost of LLMs at Scale · signal_desk=talking · occurred_at=2026-06-02T00:00:00.000Z · url=https://www.digitalocean.com/blog/reduce-llm-inference-costs-prefix-caching · raw={\"excerpt\":\"Introduction\\nInference demand is growing fast, and it’s only accelerating. By 2030, inference is expected to account for the majority of AI compute globally. But scaling inference isn’t just a hardware problem. Most teams discover too late that a significant portion of their compute spend is avoidable, primarily because their systems are silently repeating work they have already done, recomputing the same prompt prefixes and system instructions over and over again.\\nWe’ve seen this from two vantage points. From the infrastructure layer, the cost curve becomes visible at scale with clusters that look busy but aren’t efficiently utilized. 
From the engine layer, the picture is just as clear. Without the right caching and scheduling primitives, even a well-optimized model wastes cycles on redundant computation. The root cause is the same regardless of where you’re standing. The system lacks the memory and coordination to recognize when it’s already done the hard part.\\nFixing this requires work at every layer of the stack. DigitalOcean has invested in GPU optimization across multiple fronts, from vLLM parallelism and quantization tuning to hardware-level kernel work. But one technique has had an outsized impact on cost efficiency at scale: prefix-aware routing and caching. In this post, we walk through how vLLM enables advanced prefix caching, how DigitalOcean’s inference gateway uses prefix awareness to make smarter routing decisions, and how we plan to make this available to everyone on Serverless Inference in the coming weeks.\\nThe Cost Cliff and the Hidden Culprit\\nInference now accounts for roughly 70% of total AI compute costs. For most teams, a significant share of that is avoidable. It’s not due to hardware limits. Instead, it’s because the system keeps recomputing work it has already done, also known as redundant prefill.\\nEvery LLM inference request has two distinct computational phases. The first phase is prefill, where the model processes the entire input sequence and builds the KV (key-value) cache that represents its state. The second phase is decode, where the model generates output tokens one at a time, attending back to that cached state. Prefill is where the structural inefficiency hides. Its computation scales quadratically with input length: attention computation quadruples with doubling of input length.\\nConsider a real-world customer support workload running on NVIDIA H200 or AMD Instinct™ MI325X GPUs. A typical deployment carries a 2,000-token system prompt (defining persona, policies, response format) that is identical across every request. 
With an average user message of 200 tokens, roughly 91% of every input is shared context.\\nOn AMD Instinct™ MI325X GPUs or NVIDIA H200, prefilling 2,000 tokens takes approximately 45–50ms and costs in the range of 100–120 GFLOPs per request. At 10,000 requests per hour, that’s roughly 1 quadrillion (10^15) redundant FLOPs per hour. That is compute spent reconstructing state the system has already built, discarded, and is now rebuilding from scratch.\\nThe pattern is even more pronounced in coding assistants or document Q&A tools, where the same API documentation or reference material is prepended to nearly every request. A 5,000-token shared context costs roughly 600 GFLOPs to prefill, which is nearly 25× more than a 1,000-token prefix, due to that quadratic relationship. When hundreds of users are querying the same underlying documents, the redundant computation compounds rapidly.\\nThis is precisely the redundant “prefill tax” that the rest of this post focuses on eliminating.\\nHow Prefix Caching Works at the Engine Layer\\nThe redundant prefill problem has a clean structural solution, but landing it at production scale takes several mechanisms working in concert. Here’s what’s happening inside the engine when a cache hit lands.\\nBlock-Based KV Storage\\nDuring prefill, every input token produces a key and value tensor at every attention layer, and storing these per-token would be a memory-management nightmare. The engine instead groups them into fixed-size blocks (16 tokens by default on CUDA, though configurable) allocated out of a pre-reserved GPU memory pool sized at engine startup. Each layer maintains its own pool of blocks. A single block holds the K and V tensors for block_size tokens for one layer’s KV heads, laid out so PagedAttention kernels can read them with coalesced memory accesses. 
A 2,000-token system prompt occupies 125 block positions (allocated per layer under the hood); once those blocks are sitting in the pool, any future request that begins with the same 2,000 tokens can point at them rather than recomputing. PagedAttention is the kernel technique that operates on this block-based layout, and the same memory machinery underlies both prefix caching and paged attention’s batching benefits, described in more detail in the engine anatomy writeup.\\nPrefix Hashing and Cache Lookup\\nRecognizing that two requests share a prefix is a string-matching problem on potentially very long inputs, and doing it naively would defeat the point. The engine instead hashes prefixes block by block, with each block’s hash depending on its own tokens, the hash of the previous block, and any extra inputs that affect the computation, including LoRA adapter IDs, multimodal image hashes, and optional cache salts for multi-tenant isolation. Identical prefixes under identical conditions produce identical hash chains, and the first divergent block is also the first point where the hashes disagree. Only full blocks are hashed and cached, so a partial trailing block at the end of a prefix doesn’t get reused and is recomputed along with the rest of the suffix. These hashes live in a lookup table mapping hash to cached block, and finding “the longest prefix of this request that’s already cached” reduces to walking the request’s block hashes against the table. The lookup is small and cheap, and the KV data itself lives in the GPU memory pool. Memory pressure comes from the pool, not the index.\\nFrom Cache Miss to Cache Hit: The FLOP Savings\\nThe payoff shows up in compute terms. On a full cache miss, the engine runs prefill across the entire input, processing every token across every layer at full quadratic attention cost. 
On a full hit on the prefix, the engine skips nearly all of that work: the KV state for the prefix is already materialized in GPU memory, and only the trailing token needs to run through prefill so the first generated token has somewhere to attend from. Partial hits land in the middle, with prefill running only over the un-cached suffix and cached blocks treated as pre-computed context. On the customer-support workload from Section 1, a partial hit covering the 2,000-token system prompt turns a 45–50ms prefill into something dominated by the much shorter user message, and the structural FLOP savings show up directly as time to first token (TTFT) improvement.\\nA single engine instance can only cache what it has personally seen, which is the routing problem Section 3 picks up. The engine publishes KV cache events (block stored, block removed, with their associated hashes) over ZeroMQ on a PUB/SUB channel, while utilization metrics like batch size and free-block count flow through the StatLogger path. A router consumes both to make decisions. The interface is deliberately neutral: any compatible consumer can subscribe, whether that’s NVIDIA Dynamo, llm-d, or a custom gateway built in-house. Session-affinity routing handles the easy case of sending a user back to the instance that served their previous turn, but the event stream enables much more. A router can build its own global prefix tree from KV block events, balance load against per-instance batch size and cache utilization, and make routing decisions that account for cache locality and instance pressure rather than treating them as separate problems.\\nHardware Headroom: AMD and NVIDIA\\nThese benefits compound on AMD Instinct™ MI325X GPUs. With 256GB of HBM3E per GPU, the KV pool can hold substantially more cached blocks than on comparable hardware, resulting in more cached prefixes, higher hit rates, and longer-lived cache entries before eviction. 
Layered on top, FP8 KV cache quantization roughly doubles effective cache capacity again (though combining FP8 KV cache with prefix caching has historically required specific kernel support, so it’s worth checking compatibility for your vLLM version), and the attention kernels on the read path have been tuned for the AMD Instinct™ MI325X GPU memory hierarchy so a cache hit doesn’t trade prefill cost for a slow cache read. The mechanism works universally, but on AMD Instinct™ MI325X GPUs it has more room to operate, which is what makes the routing layer in the next section worth building.\\nThe picture is similar on NVIDIA Hopper, with different shapes of headroom. NVIDIA H200’s 141GB of HBM3e per GPU expands the KV pool considerably over H100’s 80GB, which translates directly into more cached prefixes and longer-lived entries before eviction. FP8 KV cache lands on Hopper through FlashAttention 3 and FlashInfer kernels. The same caveat about checking prefix-caching compatibility for your vLLM version applies, and the read-path attention kernels have been tuned around TMA loads and the Hopper memory hierarchy, so a cache hit doesn’t trade prefill cost for a slow KV read. Blackwell stretches this further, with 192GB of HBM3e per B200, and on GB200 NVL72 the NVLink domain collapses 72 GPUs into a single shared-memory fabric, opening up cross-instance KV reuse that single-node caching can’t touch. The underlying mechanism is the same across vendors, and what changes is how much room it has to operate, which is exactly what the routing layer in the next section is built to exploit.\\nThe Routing Problem: Why Single-Instance Caching Isn’t Enough\\nOnce the KV state for a shared prefix is computed, subsequent requests that share that prefix can reuse the cached blocks directly, bypassing prefill entirely. But production workloads don’t run on a single instance. 
They run across fleets of GPU workers behind a load balancer, and this is where naive deployments silently destroy the cache hit rate they worked to build. The DigitalOcean Inference Gateway (which is a fork of llm-d) embeds an Endpoint Picker (EPP), a component that intercepts every inference request via Envoy’s external processing (ext_proc) callback before it reaches any vLLM instance. The EPP is where all routing intelligence lives.\\nThe Write Path: Publishing KV Cache Events\\nOn the write path, each vLLM instance is configured with --kv-events-config to publish KVEvent messages over a ZMQ socket (tcp://*:5557) every time a KV cache block is allocated or evicted. Each event carries the block hash - computed using sha256_cbor_64bit over the token IDs in that block, using the same algorithm vLLM uses internally. The EPP subscribes to all instances, consuming this high-throughput stream and continuously updating a KV-Block Index: a low-level map of block_hash → {pod, memory_medium (GPU/CPU)}. memory_medium is the storage tier the block currently lives in on that pod. In our current implementation, KV blocks are always in GPU memory, but this will soon change as we look into multi-tiered storage architecture for KV blocks. From this index, the indexer builds and maintains a per-pod prefix tree - a radix structure of consecutive block hashes that reflects exactly what prefix state is warm in each pod’s GPU memory at this moment.\\n\\nThe Read Path: Prefix-Aware Request Scoring\\nOn the read path, when a new request arrives, the gateway tokenizes the incoming prompt and computes its prefix block hashes using the identical sha256_cbor_64bit algorithm. It then walks the KV-Block Index, querying how many consecutive prefix blocks each pod holds for this request. The result is a cache affinity score per pod: a pod holding 90% of the prompt’s prefix blocks scores 0.9 × 3 = 2.7 on the prefix-cache-scorer (weight 3, the dominant signal). 
This is combined with a kv-cache-utilization-scorer (weight 2) that down-scores pods whose GPU vRAM is near capacity, preventing the router from routing to a pod that would have to evict blocks to accommodate the new request, negating any cache benefit. The max-score-picker selects the highest combined score, and Envoy forwards the request to that pod. As we get into multi-tiered KV cache, we are also looking at “tier-aware” prefix scoring where a GPU resident match scores higher than a CPU-resident or lower tiers. In general, there are multiple cost functions with varying priorities taken into account while making the routing decision.\\nThe result: cache hit rates flip from ~25% under round-robin to 75%+ on workloads with shared prefixes - on the same hardware, with no model changes.\\n\\nThe Math: What Cache Hits Mean at Scale\\nThe impact becomes concrete when you put real numbers behind it. At 1 million requests per day, a modest scale for a production deployment, assume 70% of requests share a common system prompt. Without prefix-aware routing, cache hits are essentially random: roughly 1-in-4 requests land on an instance with that prefix already warm. With prefix-aware routing, that flips to 3-in-4.\\nThat delta, 350,000 additional cache hits per day, doesn’t sound dramatic until you attach the compute cost. Each cache hit skips roughly 350ms of prefill work. Across those 350,000 requests, that’s 34 GPU-hours saved every single day. Scale to 10 million requests per day and you’re recovering 340 GPU-hours daily, compute that was previously being silently wasted on work the system had already done.\\n\\nFor the right workload profile, multi-turn conversations with persistent context, shared system prompts, RAG pipelines querying the same document sets, the economics compound further. The same prefix appears not just frequently, but across long sessions where every turn benefits. 
In these cases, prefix-aware routing can reduce effective compute cost by up to 4x per request on identical hardware.\\nThe Engine Work Inferact Is Building\\nPrefix caching is one piece of a larger problem. Inference engines are evaluated on a lot of dimensions: raw speed, model coverage, and increasingly, how well they handle the messy shape of real production traffic. Closing the gap on all three, on frontier hardware, is the problem Inferact is organized around.\\nInferact is a company built on vLLM, and in practice Inferact’s roadmap and vLLM’s roadmap are virtually identical. The work happens upstream, in the open, in the same repository everyone else uses.\\nThe work falls into a few themes, each building on the last.\\nThe first is pushing vLLM toward SOTA performance on frontier models on frontier hardware. The clearest external signal here is Artificial Analysis, whose independent benchmarks have become a common reference point across engines and providers. vLLM’s recent top rankings on DeepSeek V3.2 and DeepSeek V4 reflect work that is increasingly a community effort: kernel and fusion optimizations, large-scale serving improvements for disaggregated and wide-EP setups, speculative decoding, quantization, and torch.compile integration are all being pushed forward by contributors across vendors.\\nThe second is day-0 model support, which is one of vLLM’s structural strengths. When a new frontier model drops, running it well on vLLM is the default, with recent launches like Gemma 4 and DeepSeek V4 supported on the engine from day one. Our goal is to continue this trend. The bar isn’t just accuracy on day zero. It’s continued accuracy and high performance on day zero.\\nThe third part is optimizing vLLM for areas that benchmarks don’t measure well yet. Top-of-leaderboard token throughput on a single prompt is a real signal, but it isn’t the same as performance on the workloads that actually run in production. 
Real inference traffic, and agentic traffic especially, looks very different from what most benchmarks capture: long shared prefixes from system prompts and tool definitions, multi-turn conversations with rich cache-hit structure, and bursty arrival patterns that don’t resemble a steady stream of independent prompts.\\nOptimizing prefix caching is the clearest example of what this means in practice. On agentic traffic, the bottleneck isn’t raw decode throughput on a fresh prompt — it’s whether the engine recognizes that most of the prompt is identical to something it processed seconds ago, and reuses the KV cache accordingly. Getting this right can be the difference between a model feeling fast and a model feeling unusable in an agent loop, and on a standard benchmark it barely shows up at all. The same pattern holds for the rest of the production-traffic stack: scheduler design under bursty arrival, KV cache layout under high reuse, and the prefill/decode connector path all matter disproportionately on real workloads relative to what benchmarks reward.\\nNone of this is work vLLM does alone, and none of it is work Inferact does alone either. The performance and workload story leans on cross-project, cross-vendor effort, and on the hundreds of vLLM contributors and downstream users who surface real problems and keep the project honest about what production actually looks like.\\nInferact’s role in that ecosystem is to invest deeply in vLLM as a maintainer and contributor, not to fork it or wrap it. 
The bet is on an open, broadly-owned inference engine as the right foundation for the next several years of inference work.\\nThese Optimizations Will Soon Ship to Everyone\\nEverything described in this post was originally built in the context of deep partnerships with large customers on DigitalOcean’s Dedicated Inference platform, but these optimizations will soon ship with every Serverless Inference deployment as well.\\nPrefix-aware routing via the Inference Gateway (live on Serverless Inference now)\\nPrefix caching with cached token pricing (launching on Serverless Inference in the coming weeks)\\nvLLM runtime with optimizations on AMD Instinct™ MI325X GPUs as well as NVIDIA Hopper\\nThe same benchmark performance Simon and Inferact helped achieve\\nPrefix caching and routing are just part of the picture. DigitalOcean’s GPU hardware collaboration goes deeper across the stack, from FP8 quantization to parallelism tuning, and those gains flow through to Serverless Inference customers as well.\\nYou won’t need a custom contract to benefit from these results. You will only need a DigitalOcean Serverless Inference endpoint.\\nConclusion\\nDelivering best-in-class inference performance requires optimization at every layer of the stack, and no single team owns all of it. That’s the foundation of our partnership with Inferact.\\nInferact brings deep expertise at the engine and kernel layer by optimizing vLLM internals, tuning GPU kernels for NVIDIA and AMD hardware, and squeezing maximum throughput out of the compute itself. DigitalOcean brings the infrastructure layer, virtualizing state-of-the-art AMD and NVIDIA GPUs at scale, building large GPU clusters purpose-built for serverless inference, and baking serving optimizations directly into the platform. 
That means auto-scaling, prefix-aware routing through our Inference Gateway, parallelism tuning, KV cache tiering across GPU and CPU memory to maximize effective cache capacity, disaggregated serving over a high-speed RoCE network, dynamic load rebalancing across model endpoints, and fleet-wide utilization optimization that continuously shifts capacity to where demand is highest.\\nTogether, the two layers close the loop. Engine-level efficiency means nothing if the infrastructure routes requests poorly. Infrastructure-level routing means nothing if the engine is leaving performance on the table. This partnership is about making both layers aware of each other, so the gains compound.\\nStart using Serverless Inference today. Prefix caching with cached token pricing launches in the coming weeks. Sign up now to be among the first to benefit.\"}"},{"ref":"E58","kind":"event","title":"AI Disruptors: How the Next Generation of Business is Being Built","date":"2026-05-29T21:30:04.614+00:00","date_source":"rss.item_date","source_url":"https://www.digitalocean.com/blog/ai-disruptors","signal_url":"https://onlylabs.fyi/signals/8c53d198-be3f-494c-bd08-05453c894f29","signal_json_url":"https://onlylabs.fyi/signals/8c53d198-be3f-494c-bd08-05453c894f29/signal.json","text":"post_published · AI Disruptors: How the Next Generation of Business is Being Built · signal_desk=talking · occurred_at=2026-05-29T21:30:04.614+00:00 · url=https://www.digitalocean.com/blog/ai-disruptors · raw={\"excerpt\":\"Getting your hands on a capable AI model is the easy part now. Every team can reach the same frontier models through an API, so a strong model is not what sets a product apart. What separates a working product from a demo is everything around the model. 
You have to measure whether the agent is actually doing its job, then keep grinding on reliability until it stops making expensive mistakes in front of real users.\\nI moderated a panel on exactly that at DigitalOcean’s Deploy 2026 conference in San Francisco, a forty-minute conversation with four founders on what they’ve learned shipping AI products that people depend on:\\nAngela Hoover, co-founder and CEO of Andi AI, an ad-free consumer search engine that pairs generative AI with live web data to give people direct answers instead of a page of ad-heavy links.\\nAlex Mashrabov, co-founder and CEO of Higgsfield AI, a platform that lets creators and agencies produce cinematic video without any physical production.\\nHovsep Seraydarian, co-founder and CTO of LawVo, a Canadian legal platform that pairs hundreds of AI agents trained in specific legal areas with human lawyers who verify their accuracy.\\nPeter Elias, founder of Probably, a data analysis agent that lets non-technical people query their data in plain English and runs calculations on a local engine instead of an LLM so it can decline to answer when the data does not support a clear result.\\nThe discussion got into what each founder underestimated once their agents had to run at scale, how they choose models from a field that keeps growing, what “agentic” actually means in production, and where a real moat comes from when everyone builds on the same foundation.\\nWatch the full session from Deploy 2026:\\nView YouTube video\\n\\n\\n\\nMaking agents work in production\\nWhen the founders were asked what they underestimated once their agents had to run in production, none of them pointed to the model.\\nYou need creative DNA\\nHiggsfield spent a year on R&D without traction. 
What finally moved the product was bringing on people who understood how creative work actually happens, then putting them next to the engineers every day.\\n“We started to see success when we got non-technical people on the team, like ex-creative directors, who now work daily with engineers to wrap this powerful technology and make it accessible for creatives.”\\nAlex Mashrabov, Higgsfield AI\\nIn high-stakes domains, AI plus humans beats AI alone\\nLawVo assumed its agents could handle legal guidance with little human involvement. That did not survive contact with real users.\\n“We need human lawyers to verify the data and test these agents every day.”\\nHovsep Seraydarian, LawVo\\nWhen asked whether that human role shrinks as the agents get smarter, Hovsep said the opposite is happening. The team watches what its lawyers do and folds that judgment back into the agents one step at a time, which means leaning on people more, not less.\\nMeasurement infrastructure is unavoidable\\nPeter’s point was that one of the first problems you have to solve when you build an agent is the infrastructure that tells you whether the thing works at all. Probably went through several rounds of this until it built an analytics system that watches everything the agent does, and the product now evaluates its own behavior.\\n“As long as we record everything the AI is doing, this particular AI can now actually aid us in improving its own performance.”\\nPeter Elias, Probably\\nAngela made the same point from the oversight side.\\n“We’re still early on in implementing agents at Andi. We’ve noticed that when you let them run wild, they’ll do anything. You really do have to make sure that they get high quality, accurate, grounded data. 
We keep an eye on the agents that we have running; we haven’t let them be fully autonomous.”\\nAngela Hoover, Andi AI\\nModel selection is a four-variable problem\\nThe number of available models has exploded over the past year, with frontier releases from Anthropic and OpenAI followed within weeks by open-source alternatives. Capable models are everywhere now. It’s challenging to choose among dozens of them when each carries a different mix of cost and capability.\\nPeter broke the decision into four variables he is always trading against each other: cost, latency, intelligence, and capacity. Smaller models tend to be faster and cheaper but give up intelligence, and an agent that fires many parallel calls runs into capacity limits fast.\\n“You want to get to the dumbest model you can get to before you actually go below the product performance that you need.”\\nPeter Elias, Probably\\nHe also warned that users have less patience than founders expect.\\n“People will start to get impatient. We found that users were more latency-sensitive than we wanted them to be.”\\nPeter Elias, Probably\\nAlex runs evaluations every week at Higgsfield because the proprietary data about how users take action has to stay current as models change. He has also moved away from fine-tuning small models toward prompting larger ones, which he finds faster with fewer hallucinations.\\n“The rules of machine learning do not change. Understand the customer, understand the business goal.”\\nAlex Mashrabov, Higgsfield AI\\nHovsep’s rule for any new startup is to start on a frontier model but architect for independence, so only a small slice of the system depends on the LLM and the rest lives in your own application and orchestration. Angela took the lean path from day one, relying on open-source models wherever they were good enough at a lower cost.\\n‘Agentic’ doesn’t mean autonomous\\nNone of these founders treat “agentic” as a synonym for autonomous. 
I asked what happens as agents move from co-pilots toward systems that act on their own, and what guardrails that calls for.\\nHovsep described a legal field run by dinosaurs, where regulation moves slowly and full autonomy is simply not on the table.\\n“Regulations won’t allow you to go fully autonomous. You literally get shut down if you do that in this space.”\\nHovsep Seraydarian, LawVo\\nWhat makes that constraint interesting is that LawVo’s agents already beat human lawyers on accuracy.\\n“We have 92% accuracy on our average agent performance. Your average lawyer has 87%. If you go to a lawyer 100 times, 13 times they’re going to make a mistake. We’re paying for that.”\\nHovsep Seraydarian, LawVo\\nPeter pushed on the word “agent” itself.\\n“Agency is the ability to spontaneously take action with no external input. LLMs are not agents. They don’t have agency. That’s why we have to prompt them.”\\nPeter Elias, Probably\\nHis view is that an LLM never acts on its own. You point it in a direction and keep pushing until it produces what you want.\\n“It’s being poked with a stick in whatever direction you’re trying to get it to do something.”\\nPeter Elias, Probably\\nThat has a practical consequence for anyone building. A model cannot reliably check its own work, and stacking one model to verify another tends to break down, so a human stays in the loop to verify. Peter pointed to the experiment where Claude was put in charge of running a store and lost a remarkable amount of money, the kind of failure that shows up the moment you take human judgment out. His read on the fear that AI will replace everyone is that it is overblown, because these systems are not agents in any real sense. We’re just calling them that.\\nAngela put the actual job of an agent in plain terms. Building one means doing the prompting on the customer’s behalf. 
A task that might take fifty prompts by hand gets compressed into a single step, so the person states the outcome they want and the product runs the prompts behind the scenes and hands back the finished result.\\nThe moat is in the execution\\nAccess to foundational models is getting commoditized. Open-source alternatives trail frontier releases by weeks, and any team can build on the same intelligence. When everyone can reach the same models, what actually sets a company apart?\\nHovsep had a simple test.\\n“There are startups that are science projects, and there are startups solving real-world problems. Are you solving a real-world problem? That’s it. It ends there for me.”\\nHovsep Seraydarian, LawVo\\nPeter described where the value is going right now. Some of it flows to the labs training the models. A lot of it flows to the inference platforms in the middle, which are making enormous money simply by running GPUs. The application layer holds value because getting these products to behave reliably is genuinely hard, and that reliability is the moat.\\n“I could race anybody in building an agent, and it would be: how fast until your agent is as reliable as mine? I will probably win that race because I spent two years getting it to not screw up. That is the moat.”\\nPeter Elias, Probably\\nReliability also compounds. A product that works attracts users, the users put their data into it, and that data makes the next version better in a way competitors cannot copy. Peter also pointed to where he thinks the biggest opening is. Software can finally speak plain English, which means whole categories of tools that were stuck with tiny markets can suddenly reach far more people, because the only thing holding them back was an interface too complicated for a normal person to use.\\nAngela’s own moat is the data underneath Andi. 
It started as a consumer search engine, and building it surfaced something more valuable, which is data accurate enough for other systems to depend on. That data has turned into a business of its own as more AI agent companies look for a trustworthy source to ground their answers.\\n“There’s a lot of AI agent companies now that need access to high quality, accurate, grounded data.”\\nAngela Hoover, Andi AI\\nBoth the product and that insight came from the same place: the work.\\n“When you’re actually in the trenches building, you learn some insightful things, and then you can build out your moat.”\\nAngela Hoover, Andi AI\\nI ended by asking each founder for one piece of parting advice, and the through-line was demand. Alex framed it as a warning, that too many AI companies build for other AI companies and never check whether real customers want what they are selling. Angela put it more directly, that you should talk to your users and test their willingness to pay as early as you can. The most capable agent in the world is still just a demo until a customer pays for it.\\nDigitalOcean’s AI-Native Cloud is built for teams at every stage, from testing early demand to scaling into the enterprise. It’s one integrated stack from silicon to agent runtime. You get more than 70 models on a single endpoint, with an Inference Router that handles model selection for you. 
One API and one bill, with economics that improve as you scale.\\n→ Get started with DigitalOcean’s AI-Native Cloud\"}"},{"ref":"E59","kind":"event","title":"OpenCode Now Supports DigitalOcean Inference Router for Intelligent Model Routing","date":"2026-05-28T21:02:42.819+00:00","date_source":"rss.item_date","source_url":"https://www.digitalocean.com/blog/digitalocean-opencode-inference-routers","signal_url":"https://onlylabs.fyi/signals/890fd86b-a8eb-4f96-addf-a04d50d7b350","signal_json_url":"https://onlylabs.fyi/signals/890fd86b-a8eb-4f96-addf-a04d50d7b350/signal.json","text":"post_published · OpenCode Now Supports DigitalOcean Inference Router for Intelligent Model Routing · signal_desk=talking · occurred_at=2026-05-28T21:02:42.819+00:00 · url=https://www.digitalocean.com/blog/digitalocean-opencode-inference-routers · raw={\"excerpt\":\"Coding agents today have a massive spending problem. Every request, whether you’re designing system architecture or writing a single-line docstring, often gets routed to the same expensive frontier model. The result: unnecessary token usage, higher inference costs, and little awareness of task complexity or budget constraints.\\nThis high cost stems from a “one-size-fits-all” approach to model usage, where premium frontier models are utilized for trivial tasks that don’t require such intensive reasoning effort. In multi-agent workflows, where orchestrators delegate work to specialized subagents, this lack of discrimination frequently leads to runaway costs and opaque failure modes. Without intelligent routing, developers can essentially be forced into closed-provider lock-in and high API usage fees, which quickly escalate during exploratory building phases.\\nDigitalOcean Inference Router, now in Public Preview, was built to solve this problem by dynamically routing requests to the right model for the job. 
As part of DigitalOcean’s AI-Native Cloud, it gives developers a unified way to control, optimize, and evaluate AI inference across models. And as of today, you can access it through OpenCode, the open-source AI coding agent, in as little as a few seconds.\\nWhat is an Inference Router?\\nAn Inference Router is the auto-mode pattern engineers are used to, but with deliberate control over the tradeoffs that matter: latency, cost, and output quality. Rather than statically pointing your coding agent to a single model, an Inference Router can analyze each request and route it to the model best suited for that specific task. Not the most powerful model available, but the right model. That distinction is what drives real savings without compromising on your desired quality of output.\\nTo use DigitalOcean’s Inference Router: Create an Inference Router from the router catalog—pick a preset or build a custom router via the API or UI. No GPU management, no infrastructure to run. Use it by setting “model”: “router:your-router-name” in any OpenAI-compatible API call.\\nWhat Changed for OpenCode\\nOpenCode has become one of the most popular AI coding harnesses on GitHub, earning over 160,000 stars by embracing a simple idea: developers should not be locked into a single model provider. Its rise has shown a demand for provider agnostic AI use cases. At Deploy 2026, Tyler Gillam - a core engineer on Inference Router - demoed our integration live on stage, showing exactly how OpenCode and Inference Router work together to make intelligent model selection decisions in real time. If you want to see it before diving in, the full recording is linked at the bottom of this post.\\nPreviously, integrating DigitalOcean models into OpenCode meant manually editing your opencode.json, adding each model by hand, a list that would be outdated within weeks given the pace of new model launches. 
So, we built a native OpenCode integration that supports Inference Routers and DigitalOcean Serverless Inference models right out of the box.\\nNow you can run the following steps:\\nLaunch OpenCode (desktop, web or TUI) and run /connect\\nSelect Login with DigitalOcean\\nYour Inference Routers are shown in the Model Selection tab\\nThat’s it. You’re plugging directly into a routing layer that’s already helping to make the cost & quality tradeoff decisions for you based on your stated needs — with our purpose-build Software Engineering preset.\\nBeyond Coding Agents\\nThis integration is part of a broader effort to bring DigitalOcean’s Inference Engine into the tools developers already use, while continuing to invest in open source and upstream contributions. OpenCode is one example of that direction.\\nThe goal is to make intelligent, cost-aware model routing the default for coding agents, not something you have to manually configure and hope for the best. As the OSS model landscape keeps improving, routing intelligence will become more valuable, not less. The gap between “frontier” and “good enough” is closing fast, and developers who take advantage of routing will consistently come ahead on both desired quality and cost.\\nIf you’re using OpenCode, try /connect today. 
If you want to dig deeper on what Inference Router is and how it works, the full documentation is available below.\\nInference Router Resources:\\nHow We Built DigitalOcean Inference Router\\nInference Router Documentation\\nOpenCode DigitalOcean Install\\nInferenceRouter OpenCode Deploy 2026 Demo\"}"},{"ref":"E60","kind":"event","title":"Scalable, Cost-Efficient AI: Introducing Unified Batch Inference on DigitalOcean","date":"2026-05-28T00:00:00.000Z","date_source":"page.visible_date","source_url":"https://www.digitalocean.com/blog/introducing-batch-inference","signal_url":"https://onlylabs.fyi/signals/80cfd545-793f-464c-a4ed-3404a0bdbebe","signal_json_url":"https://onlylabs.fyi/signals/80cfd545-793f-464c-a4ed-3404a0bdbebe/signal.json","text":"post_published · Scalable, Cost-Efficient AI: Introducing Unified Batch Inference on DigitalOcean · signal_desk=talking · occurred_at=2026-05-28T00:00:00.000Z · url=https://www.digitalocean.com/blog/introducing-batch-inference · raw={\"excerpt\":\"At Deploy 2026, we introduced the DigitalOcean AI-Native Cloud, built for the inference era. Batch Inference on the DigitalOcean Inference Engine enables high-volume asynchronous workloads. As developers move from AI prototypes to production-scale applications, the challenges of cost and rate limits often become a bottleneck. Batch Inference addresses these hurdles by allowing you to process high-volume workloads asynchronously at a fraction of the cost of synchronous requests.\\nWhether you are performing large-scale data transformation, content generation, building embeddings or offline evaluations, Batch Inference provides a unified, consistent way to leverage the world’s leading models from OpenAI and Anthropic, all through a single DigitalOcean interface.\\nThe AI Scaling Bottleneck\\nReal-time inference is essential for interactive AI applications such as chatbots, copilots, and search-as-you-type experiences. 
However, when the task involves processing 10,000 support tickets for sentiment analysis, generating SEO metadata for an entire product catalog, or benchmarking a new system prompt against a test suite, real-time inference becomes an expensive and inefficient tool for the job.\\nEach of those requests competes for the same rate-limited throughput as your production traffic. Teams spend engineering time writing retry logic, managing backpressure, and monitoring scripts that work through sequential API calls for hours. If you use models from multiple providers, such as OpenAI for embeddings and Anthropic for generation, you are managing separate credentials, separate billing dashboards, and separate error-handling strategies, even though the core workflow is the same: submit requests, wait, retrieve results.\\nProcessing thousands of synchronous requests is not only slow, it is an architectural challenge. At scale, synchronous inference becomes inefficient requiring thousands of open connections, creating constant rate-limit pressure and wasting compute while waiting for responses. It also introduces throughput bottlenecks, retry storms, and inconsistent latency, while pushing complex orchestration logic (queuing, retries, backoff) onto the client. Across multiple providers, this fragmentation only compounds the operational burden.\\nIntroducing DigitalOcean Batch Inference\\nWith Batch Inference, you can submit up to 50k requests for OpenAI or 100k for Anthropic in a single .jsonl file and let DigitalOcean handle the orchestration: queuing, execution, and result delivery.\\nWhat distinguishes this approach is its unified interface. Instead of working with each provider individually, OpenAI and Anthropic models are accessible through a single DigitalOcean API. 
One set of endpoints, one authentication flow, and one billing account allows you to monitor every job in one place, regardless of which provider executes it.\\nThis single control plane manages the operational complexity while preserving full access to each provider’s native model capabilities.\\nDigitalOcean Batch Inference provides a single control plane\\nThe upload, submission, and retrieval workflow is identical regardless of which model you use. By using one set of endpoints and a single authentication flow, you can switch between (or combine) providers without rewriting your orchestration logic or reconciling separate invoices.\\nSignificant Cost Savings\\nBatch requests are billed at a significant discount compared to standard real-time inference rates, for both input, output, and cache tokens. If you are running background workloads at real-time prices today, switching to batch can reduce that cost by up to 50%\\nExample: 50,000 requests using Claude Opus 4.6 Assumes an average of 1,000 input and 500 output tokens per request.**\\n\\n\\n\\nMetric\\nRea-time Inference\\nBatch Inference\\n\\n\\n\\n\\nInput Cost (50M tokens @ $5/M)\\n$250.00\\n$125.00\\n\\n\\nOutput Cost (25M tokens @ $25/M)\\n$625.00\\n$312.50\\n\\n\\nTotal Cost\\n$875.00\\n$437.50\\n\\n\\nPricing information current as of May 2026\\n\\n\\n\\n\\n\\n\\nBy switching to Batch in this example, you save $437.50 on a single run. This enables you to use top-tier intelligence for massive data processing tasks that might otherwise be cost-prohibitive, while also creating new opportunities to optimize inference budgets across high-volume workloads.\\nBypass Rate Limits\\nBatch jobs run on a dedicated throughput lane, completely separate from your real-time inference quota. Your production endpoints remain healthy while a 40,000-request batch job processes in the background across either provider. 
This helps reduce 429 Too Many Requests errors in your data pipelines.\\nAsynchronous Processing\\nSubmit a job and continue with other work. DigitalOcean manages the queue, retries, and delivery. You can poll for results when the job completes, or configure a webhook to receive notifications automatically (webhook delivery is coming soon).\\nDeeply Integrated with DigitalOcean\\nBatch inferencing  is built into the DigitalOcean platform. Every part of the workflow, from file storage to job monitoring to usage analytics, runs on infrastructure you already use.\\nPowered by DigitalOcean Spaces\\nInput files (up to 200 MB) are uploaded directly to DigitalOcean Spaces via presigned URLs. There is no external storage to configure, no S3 buckets to provision, and no cross-account IAM policies to manage. The API generates a presigned upload URL, you PUT your .jsonl file, and Spaces handles the rest.\\nResults are delivered the same way. When a job completes, the results endpoint returns a presigned Spaces download URL. Result files are retained up to 30 days, so you can retrieve them on your own schedule.\\nThis is the same Spaces object storage that powers the rest of the DigitalOcean ecosystem, now integrated into your AI batch pipeline.\\nJob Queue: Track Every Job in Real Time\\nThe Batch Inference Job Queue in the DigitalOcean Control Panel provides a live view of every batch job, with OpenAI and Anthropic jobs displayed side by side in a single list. For each job, you can view:\\nStatus:  awaiting_processing, in_progress, completed, failed, cancelled\\nProgress: total requests, completed, and failed counts, updated as the job runs\\nTimestamps: when the job was submitted, started, and completed\\nProvider: which provider is executing the batch\\nThis eliminates the need to poll the API during development. 
You can monitor your jobs directly from the same Control Panel you use for Droplets, Databases, and Kubernetes.\\n\\nInsights: Understand Your Usage\\nThe Batch Inference Insights page provides a centralized view of batch usage across both providers. You can track token consumption, job volumes, and completion trends over time, all in one place rather than across separate OpenAI and Anthropic dashboards.\\nUse Batch Inference Insights to understand cost patterns, identify peak usage periods, and plan capacity for your batch pipelines.\\n\\nUnified Billing\\nToken usage and job costs for both OpenAI and Anthropic batch workloads appear on a single DigitalOcean invoice. There are no separate bills to reconcile across providers and no additional payment methods to manage.\\nMCP Server Support\\nBatch Inference is also available as an MCP (Model Context Protocol) server, enabling seamless integration with AI-powered IDEs, agent frameworks, and any MCP-compatible client. This allows developers to create batch jobs, monitor their status, and retrieve results directly within their existing workflows.\\nAgents can be instructed to operate on input files, such as JSONL files for batch inference, by referencing a specified file path. Based on this context, the agent autonomously selects and invokes the appropriate MCP tools to handle file upload and initiate batch job creation. It can monitor status and upon completion, users can prompt the agent to retrieve the final job results and corresponding download URL, providing a seamless, end-to-end workflow with minimal manual intervention.\\nHow It Works\\nThe workflow is the same whether you are targeting OpenAI or Anthropic: prepare, upload, submit, and retrieve. All requests are sent to https://inference.do-ai.run/v1 and authenticate with your Model Access Key.\\nPrepare your input file. Create a .jsonl file where each line is a single inference request in the provider’s native format. 
OpenAI lines include custom_id, method, url,  and body. Anthropic lines include custom_id and params. The model is specified per request inside the file, giving you full flexibility within a single batch.\\nUpload your file. Call POST /v1/batches/files with your file name to get a file_id and a presigned Spaces upload URL. Then PUT your .jsonl file to that URL. The presigned URL is valid for 15 minutes.\\nCreate the batch job. Call POST /v1/batches with your file_id, provider (openai or anthropic), and completion_window (24h). The endpoint, authentication, and response shape are identical for both providers. The only difference is the provider field.\\nMonitor and retrieve results. Poll GET /v1/batches/{batch_id} for status, or monitor progress through the Job Queue in the Control Panel. Once the job reaches completed status, call GET /v1/batches/{batch_id}/results to get presigned download URLs for your output and error files. Result files are retained for 30 days.\\nYou can also list all jobs with GET /v1/batches and cancel a running job with POST /v1/batches/{batch_id}/cancel.\\nFor full API details, code samples (cURL and Python), and input file format examples, see the Batch Inference documentation.\\nUse Cases\\nBatch Inference is well-suited for any high-volume, non-latency-sensitive workload. The following examples are some of the most common patterns.\\nE-Commerce Catalog Enrichment\\nAn e-commerce platform with 50,000 products needs SEO-friendly titles, marketing descriptions, and metadata tags for each listing. Rather than processing them through sequential API calls over several days, the entire catalog can be submitted as a single batch. 
You can use gpt-4o-mini for the English copy, then run a second batch through Claude for localized translations, all through the same pipeline with a different provider field.\\nSupport Ticket Classification and Triage\\nOrganizations can process a year’s worth of support tickets in a single batch, classifying them by category, urgency, and sentiment while extracting structured fields like product name, issue type, and customer tier. The output is a clean .jsonl file ready to import into an analytics pipeline or CRM.\\nContent Moderation at Scale\\nPlatforms with user-generated content, such as marketplaces, forums, and review sites, often need to scan thousands of posts, images, and listings for policy violations. Batch Inference allows you to process the entire backlog overnight without competing with your production moderation endpoint’s rate limits.\\nModel Evaluation and Prompt Engineering\\nWhen developing a new system prompt, you can benchmark it against thousands of test cases by running the same evaluation suite against both OpenAI and Anthropic models through the same API. This enables side-by-side comparison of results at batch pricing, which is significantly lower than running the same evaluation in real time.\\nDocument Processing and Data Extraction\\nBatch Inference can summarize thousands of legal contracts, research papers, or financial filings. It can also extract structured data such as dates, amounts, parties, and clauses from unstructured documents, or classify a backlog of invoices and receipts. These jobs can be large in volume but are rarely time-sensitive.\\nGetting Started\\nBatch Inference is available now on the DigitalOcean AI Platform.\\nPolling for job status is currently supported, with webhook notifications arriving soon for automated workflows. As the platform grows, expect expanded provider and model support.\\nThe Bigger Picture\\nInference has become the center of gravity in modern AI systems. 
Applications no longer make a single model call. They orchestrate multiple models, retrieve and synthesize data, execute tools, and repeat the cycle in production. This is a stack problem, not a model problem.\\nDigitalOcean’s AI-Native Cloud was built to address exactly this. Five layers, one platform, open at every layer: GPU compute, inference, data and storage, agents, and the tools to connect them. Batch Inference is the latest addition to the inference layer, sitting alongside real-time Serverless Inference, the new Inference Router, Dedicated Inference, and a catalog of 25+ models across text, image, audio, and video.\\nWhere real-time inference powers interactive experiences, Batch Inference handles the heavy lifting that happens behind the scenes. Together with GPU Droplets, Knowledge Bases, and Managed Databases (including Managed Weaviate (Private Preview) for vector workloads), they form a complete system for building production AI without stitching together services from multiple vendors.\\nThe goal is to simplify the stack so you can focus on building.\\nReady to get started? Launch your first batch job or visit the product documentation to learn more.\"}"}]}