RepoMicrosoftMicrosoftpublished Feb 5, 2026seen 2d

microsoft/interwhen

Python

Open original ↗

Captured source

source ↗
published Feb 5, 2026seen 2dcaptured 2dhttp 200method plain

microsoft/interwhen

Description: A framework for verifiable reasoning with language models.

Language: Python

License: MIT

Stars: 39

Forks: 12

Open issues: 6

Created: 2026-02-05T07:10:10Z

Pushed: 2026-06-24T04:09:18Z

Default branch: main

Fork: no

Archived: no

README: interwhen

A Generalizable Framework for Steering Reasoning Models with Test-time Verification

Paper | Quick Start | Examples | Monitors | Create your own Monitors

interwhen is a test-time verification framework for language models that enforces correctness with respect to a set of verifiers. It is designed to improve *instance*-level reliability in reasoning systems, particularly in high-stakes domains where occasional errors are unacceptable.

This is especially important for agentic workflows, where models make sequences of decisions interleaved with tool calls, database writes, or external API actions. In these settings, verifying only the final answer can miss early policy violations or irreversible mistakes. interwhen instead enables LLM-Process-Modulo execution: the model is steered during the reasoning or action process so that its trace remains compliant with task-specific policies.

interwhen addresses the problem by providing a plug-and-play mechanism to improve instance-level reliability of any language model, which we call *verifier-guided reasoning*. Instead of verifying only the final output, the framework enables verification of intermediate reasoning traces during generation. When a violation is detected, the system can steer, revise, or halt generation. If no output is produced, the system abstains; if an output is produced, it satisfies the specified verifiers.

The framework has two parts. Offline, interwhen can synthesize code-based verifiers from natural-language policy documents, including provably correct verifiers in Lean or Z3. Online, interwhen periodically polls the reasoning trace and forks inference of the reasoning model to recover intermediate states. Verifiers are run asynchronously alongside generation, adding negligible overhead on correct executions and intervening only when violations occur.

From a research perspective, interwhen makes the following contributions:

  • A New Axis for Test-Time Scaling — Introduces verifier compute as an additional dimension of scaling at inference time. Rather than scaling model size or sampling alone, performance can be improved by allocating compute to structured verification.
  • Automatic Verifier Synthesis — Provides a method for generating verifiers automatically from a given natural-language policy. We also propose a Lean-based variant that produces formal specifications, corresponding verifiers, and machine-checked proofs of soundness and completeness of the verifiers.
  • A Testbed for Verifier Development — Enables systematic evaluation of verifier designs at inference time before incorporating them into training objectives (e.g., as reward models or critics).

A detailed discussion of interwhen, including how it was developed and tested, can be found in our paper.

A demo on the Telecom domain in Tau2-Bench, operating in solo mode. The verifiers are first generated from the rules defined in the policy. As the agent's execution progresses, each tool call is checked against the applicable policy verifiers, with feedback returned when a violation is detected. The demo shows how interwhen steers the same trajectory toward policy-compliant execution without restarting the agent.

A demo on the Maze dataset. The task is to find the number of right and left turns on the path from the starting to the ending position. The colour green indicates steps that pass verification, while red marks those that failed. The text stream on the right is the verifier output. A higher res mp4 can be found here.

A demo on the ZebraLogic dataset. The task is to find the correct assignments given the constraints. The colour green indicates steps that pass verification, while red marks those that failed. The text stream on the right is the verifier output. A higher res mp4 can be found here.

Table of Contents

  • [Key Features](#key-features)
  • [Installation](#installation)
  • [Verifiable Reasoning in Three Lines](#verifiable-reasoning-in-three-lines)
  • [Examples](#examples)
  • [Available Monitors and Verifiers](#available-monitors-and-verifiers)
  • [Creating Custom Verifiers and Monitors](#creating-custom-verifiers-and-monitors)
  • [How It Works](#how-it-works)
  • [Evaluation](#evaluation)
  • [Intended Uses](#intended-uses)
  • [Limitations](#limitations)
  • [License](#license)
  • [Contact](#contact)

Key Features

interwhen changes the inference pipeline of a language model by creating an auxiliary Monitor that runs alongside the main model and interacts with the model’s output to improve its quality. The Monitor agent reads the output of a language model in real time and calls necessary verifiers to check its validity.

1. Policy Compliant Agentic Reasoning. interwhen verifies intermediate reasoning states, tool-use decisions, and tool-responses before the model reaches a final answer, with the aim of ensuring that the actions taken by the agent are compliant with the policy provided. This is useful for agentic workflows where early mistakes can propagate into irreversible tool calls or invalid task outcomes, and hence process verification is essential.

2. Verification During Generation. interwhen verifies reasoning traces as they are produced, without requiring external step extraction or structured decomposition. This allows the model to retain flexible reasoning strategies while remaining subject to correctness constraints.

3. Asynchronous and Efficient Execution. Verifiers are executed asynchronously and intervene only when violations are detected, minimizing inference overhead while preserving responsiveness.

4. Unified Model–Verifier Interface. The framework provides a general API for interaction between language models and different kind of verifiers. Based on the objectivity of a domain, verifiers can be symbolic, neuro-symbolic or even fully neural verifiers. They can operate on partial outputs, final answers, or both.

----------------

At a conceptual level, interwhen reframes reliability in language models:

> Instead of asking whether a model is accurate on average, we ask whether a particular output complies with explicit, verifiable constraints derived...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low-traction new repo from Microsoft