cloudflare/agents @cloudflare/ai-chat@0.8.2
Repository: cloudflare/agents
Tag: @cloudflare/ai-chat@0.8.2
Published: 2026-06-05T10:52:26Z
Prerelease: no
Release notes:
Patch Changes
- #1684 `ab6dd95` Thanks @threepointone! - warn when `chatRecovery` is configured in `onStart()` (applied too late for wake recovery)
On every Durable Object wake the SDK evaluates chat-recovery budgets — and may seal an interrupted turn, firing onExhausted — before the user's onStart() runs (_checkRunFibers() is ordered ahead of onStart()). A chatRecovery config produced inside onStart() is therefore read as the built-in defaults at the moment recovery decides, so a configured maxRecoveryWork / shouldKeepRecovering / onExhausted silently never applies to the recovery that matters.
This is now documented on ChatRecoveryConfig and the chatRecovery fields of Think / AIChatAgent, and the SDK logs a one-time warning if it detects chatRecovery being reassigned during onStart(). The warning fires both for a custom config object and for chatRecovery = true (enabling recovery / its defaults too late); assigning false (disabling) in onStart() is intentionally not warned, since recovery already ran with the pre-onStart() value and disabling it afterward is a benign no-op for that wake. The fix is to assign chatRecovery as a class field or in the constructor.
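The ordering problem can be reproduced in a toy model. The sketch below is not the SDK's code: `ToyAgent`, `budgetSeenByRecovery`, and the default value `3` are hypothetical stand-ins, and only the ordering (recovery check first, `onStart()` second) is taken from the note above.

```typescript
// Toy model of the wake ordering. ToyAgent is hypothetical and is NOT the
// SDK's Agent class; only the "recovery before onStart()" order is sourced.
type RecoveryConfig = { maxRecoveryWork?: number } | boolean;

const DEFAULT_MAX_RECOVERY_WORK = 3; // made-up default for this sketch

class ToyAgent {
  chatRecovery: RecoveryConfig = true;
  budgetSeenByRecovery = -1; // records what the recovery check actually read

  // Stand-in for a Durable Object wake: recovery runs first, then onStart().
  wake(): void {
    this.checkRunFibers();
    this.onStart();
  }

  private checkRunFibers(): void {
    const cfg = this.chatRecovery;
    this.budgetSeenByRecovery =
      typeof cfg === "object" && cfg.maxRecoveryWork !== undefined
        ? cfg.maxRecoveryWork
        : DEFAULT_MAX_RECOVERY_WORK;
  }

  onStart(): void {}
}

// Too late: the assignment happens after the recovery check has already run.
class LateConfig extends ToyAgent {
  onStart(): void {
    this.chatRecovery = { maxRecoveryWork: 10 };
  }
}

// In time: a class field is initialized during construction, before any wake.
class FieldConfig extends ToyAgent {
  chatRecovery: RecoveryConfig = { maxRecoveryWork: 10 };
}
```

In this model `new LateConfig()` followed by `wake()` leaves the recovery check reading the default, while `FieldConfig` has its configured value in place when the check runs. Assigning in the constructor would behave the same way as the class field here.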
- #1684 `ab6dd95` Thanks @threepointone! - fix(chat-recovery): don't seal a human-in-the-loop turn that is waiting on a pending client tool call
A turn parked on a pending CLIENT interaction — an input-available client-tool part (no server execute) or an approval-requested part, as detected by hasPendingInteraction() — is _waiting on the human_, not stuck. After a mid-turn Durable Object restart (e.g. a deploy), the in-memory pending-interaction promise is gone, so waitUntilStable() repeatedly times out until the client reconnects and replays the tool-result/approval. That replay drives a fresh continuation via the auto-continuation barrier independently of recovery — but the recovery loop was treating those timeouts as deploy churn:
- each stable-state timeout burned a recovery attempt, eventually sealing a perfectly healthy turn with `reason="stable_timeout"`, and
- the no-progress window (which never advances while no content is produced) could seal it with `reason="no_progress_timeout"` once it elapsed.
The net effect: an interrupted human-in-the-loop turn whose user simply took longer than the configured noProgressTimeoutMs / attempt budget to answer a tool prompt was terminalized with a "session interrupted" banner, even though nothing had actually failed.
While a client interaction is pending the turn is now budget-free:
- `_beginChatRecoveryIncident` suppresses the no-progress window, attempt cap, work budget, and `shouldKeepRecovering` predicate, and keeps the no-progress clock fresh so the turn gets a full window once the human finally answers.
- `_chatRecoveryContinue` / `_chatRecoveryRetry` park (mark the incident `skipped` with `reason="awaiting_client_interaction"`, resolving the live "recovering…" indicator) instead of rescheduling or exhausting; the client's eventual replay resumes the turn. A client that never returns is reclaimed by the incident TTL sweep and DO idle-eviction.
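The budget-free rule amounts to checking for a pending client interaction before any budget is consulted. The following is a simplified sketch of that decision order, not the engine's implementation: `IncidentState`, `decideRecovery`, and `maxAttempts` are hypothetical names, and the work budget and `shouldKeepRecovering` predicate are omitted for brevity.

```typescript
// Hypothetical, simplified decision function. Only the reason strings and the
// "pending client interaction wins" ordering come from the release note.
type Decision =
  | { action: "park"; reason: "awaiting_client_interaction" }
  | { action: "seal"; reason: "stable_timeout" | "no_progress_timeout" }
  | { action: "reschedule" };

interface IncidentState {
  pendingClientInteraction: boolean;
  attempts: number;
  maxAttempts: number; // assumed name for the attempt cap
  msSinceProgress: number;
  noProgressTimeoutMs: number;
}

function decideRecovery(s: IncidentState): Decision {
  // Waiting on the human: no budget applies, so park instead of sealing.
  if (s.pendingClientInteraction) {
    return { action: "park", reason: "awaiting_client_interaction" };
  }
  if (s.msSinceProgress >= s.noProgressTimeoutMs) {
    return { action: "seal", reason: "no_progress_timeout" };
  }
  if (s.attempts >= s.maxAttempts) {
    return { action: "seal", reason: "stable_timeout" };
  }
  return { action: "reschedule" };
}
```

With every budget exceeded, the same state parks when `pendingClientInteraction` is true and seals when it is false, which is the behavioral difference the fix introduces.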
In @cloudflare/think, a submitMessages-backed turn additionally has its durable submission row completed at park time. The recovery loop is that row's sole completion driver after a restart, and the client's replay resumes the conversation as an independent auto-continuation that never touches the submission — so parking without completing would leave the row running, and the next restart's _recoverSubmissionsOnStart would sweep it to error (a false "session recovery error"). The park condition is a fully-materialized client tool call in the leaf, which is exactly the terminal state a non-interrupted submission reaches when its step emits a client tool call (the model does not block on client tools), so completed is the correct, consistent outcome.
SERVER-tool orphans are deliberately excluded (their execute() died with the isolate and nothing will resolve them), so they still recover normally via the transcript-repair pass.
Both @cloudflare/think and @cloudflare/ai-chat (which carries its own copy of the recovery engine) are fixed. In @cloudflare/think the client/server distinction already lived in hasPendingInteraction(). @cloudflare/ai-chat's hasPendingInteraction() (used by waitUntilStable) does not distinguish client from server tools, so a new, narrower client-only predicate hasPendingClientInteraction() was added there and gates the exemption — leaving waitUntilStable's existing behavior untouched so server-tool orphans keep reschedule/exhaust semantics.
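To illustrate why a narrower predicate is needed, here is a sketch contrasting a broad check with a client-only one. The `ToolPart` shape and both function names are assumptions made for this example; the real `hasPendingInteraction()` and `hasPendingClientInteraction()` may inspect messages differently.

```typescript
// Hypothetical part shape; the state names are the ones quoted in the note.
interface ToolPart {
  toolName: string;
  state: "input-available" | "approval-requested" | "output-available";
}

// Broad check (assumed behavior): any unresolved tool part counts,
// whether the tool runs on the client or on the server.
function pendingAnyInteractionSketch(parts: ToolPart[]): boolean {
  return parts.some(
    (p) => p.state === "input-available" || p.state === "approval-requested",
  );
}

// Narrow check: approval requests always count; an input-available part
// counts only when the tool is one of the request's client tools.
function pendingClientInteractionSketch(
  parts: ToolPart[],
  clientToolNames: Set<string>,
): boolean {
  return parts.some(
    (p) =>
      p.state === "approval-requested" ||
      (p.state === "input-available" && clientToolNames.has(p.toolName)),
  );
}
```

A server-tool orphan (an `input-available` part whose tool is not a client tool) satisfies the broad check but not the narrow one, so in this sketch it would keep the ordinary reschedule/exhaust path.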
The exemption depends on knowing the request's client tools. @cloudflare/ai-chat restores them in its constructor, so they are available when boot recovery evaluates budgets. @cloudflare/think restored them in onStart(), which the base Agent runs _after_ the boot-recovery path (_handleInternalFiberRecovery -> _beginChatRecoveryIncident) — so on a fresh wake the in-memory cache was still empty and a client-tool input-available orphan re-detected past the no-progress window was misread as "stuck" and wrongly sealed. _beginChatRecoveryIncident now re-hydrates _lastClientTools from the durable think_config store before evaluating the budget, closing that hibernation-ordering hole (approval-requested turns were never affected, since that branch does not depend on the client tool set).
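The hibernation-ordering hole and its fix can be modeled with an in-memory cache backed by a durable store. Everything in this sketch is hypothetical (`ToyThink`, the `Map` standing in for the `think_config` store, the `clientTools` key, the `rehydrateFirst` flag); it only shows why re-hydrating before the budget check changes the outcome.

```typescript
// Hypothetical stand-in for the durable think_config store.
const durableConfig = new Map<string, string>([
  ["clientTools", JSON.stringify(["pickFile"])],
]);

class ToyThink {
  // In-memory cache: undefined on a fresh wake, before anything restores it.
  lastClientTools: string[] | undefined;

  // Load the client tool names from the durable store if not already cached.
  rehydrate(): void {
    if (this.lastClientTools !== undefined) return;
    const raw = durableConfig.get("clientTools");
    this.lastClientTools = raw === undefined ? [] : (JSON.parse(raw) as string[]);
  }

  // Stand-in for the boot-recovery budget check, which runs before onStart().
  // rehydrateFirst = false models the old behavior, true models the fix.
  beginIncident(orphanToolName: string, rehydrateFirst: boolean): "exempt" | "budgeted" {
    if (rehydrateFirst) this.rehydrate();
    const known = this.lastClientTools ?? [];
    return known.includes(orphanToolName) ? "exempt" : "budgeted";
  }
}
```

On a fresh instance, the same client-tool orphan is classified as budgeted without re-hydration and as exempt with it.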
- #1672 `f96a2ba` Thanks @threepointone! - fix(chat-recovery): a turn making forward progress now survives unbounded deploy churn; add a work budget + `shouldKeepRecovering` runaway guard
Durable chat recovery used to bound a single incident with a non-resetting 15-minute wall-clock ceiling…