What does this writing signal mean?

Anthropic published Claude Think Tool. This talking signal gives public context for research themes, product direction, policy, or launch framing. High-signal details: Anthropic's new feature with moderate HN traction. · The "think" tool: Enabling Claude to stop and think \ Anthropic Engineering at Anthropic The "think" tool: Enabling Claude to stop and think in complex tool use.... onlylabs links this event to 1 captured evidence page and 6 related writing signals.

Anthropic Writing: Claude Think Tool

Captured source

source ↗

anthropic.com/anthropic.com/engineering/claude-think-tool

Claude Think Tool

Source ↗

published Mar 20, 2025seen Jun 9captured Jun 11http 200method plain

The "think" tool: Enabling Claude to stop and think \ Anthropic Engineering at Anthropic The "think" tool: Enabling Claude to stop and think in complex tool use situations

Published Mar 20, 2025 A new tool that improves Claude's complex problem-solving performance

Extended thinking update

Extended thinking capabilities have improved since its initial release, such that we recommend using that feature instead of a dedicated think tool in most cases. Extended thinking provides similar benefits—giving Claude space to reason through complex problems—with better integration and performance. See our extended thinking documentation for implementation details.

As we continue to enhance Claude's complex problem-solving abilities, we've discovered a particularly effective approach: a "think" tool that creates dedicated space for structured thinking during complex tasks. This simple yet powerful technique—which, as we’ll explain below, is different from Claude’s new “ extended thinking ” capability (see here for extended thinking implementation details )—has resulted in remarkable improvements in Claude's agentic tool use ability. This includes following policies, making consistent decisions, and handling multi-step problems, all with minimal implementation overhead. In this post, we'll explore how to implement the “think” tool on different applications, sharing practical guidance for developers based on verified benchmark results. What is the "think" tool? With the "think" tool, we're giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer. While it sounds similar to extended thinking, it's a different concept. Extended thinking is all about what Claude does before it starts generating a response. With extended thinking, Claude deeply considers and iterates on its plan before taking action. The "think" tool is for Claude, once it starts generating a response, to add a step to stop and think about whether it has all the information it needs to move forward. This is particularly helpful when performing long chains of tool calls or in long multi-step conversations with the user. This makes the “think” tool more suitable for cases where Claude does not have all the information needed to formulate its response from the user query alone, and where it needs to process external information (e.g. information in tool call results). The reasoning Claude performs with the “think” tool is less comprehensive than what can be obtained with extended thinking, and is more focused on new information that the model discovers. We recommend using extended thinking for simpler tool use scenarios like non-sequential tool calls or straightforward instruction following. Extended thinking is also useful for use cases, like coding, math, and physics, when you don’t need Claude to call tools. The “think” tool is better suited for when Claude needs to call complex tools, analyze tool outputs carefully in long chains of tool calls, navigate policy-heavy environments with detailed guidelines, or make sequential decisions where each step builds on previous ones and mistakes are costly. Here's a sample implementation using the standard tool specification format that comes from τ-Bench : { "name": "think", "description": "Use the tool to think about something. It will not obtain new information or change the database, but just append the thought to the log. Use it when complex reasoning or some cache memory is needed.", "input_schema": { "type": "object", "properties": { "thought": { "type": "string", "description": "A thought to think about." } }, "required": ["thought"] } } Copy

Performance on τ-Bench We evaluated the "think" tool using τ-bench (tau-bench), a comprehensive benchmark designed to test a model’s ability to use tools in realistic customer service scenarios, where the "think" tool is part of the evaluation’s standard environment. τ-bench evaluates Claude's ability to: Navigate realistic conversations with simulated users Follow complex customer service agent policy guidelines consistently Use a variety of tools to access and manipulate the environment database

The primary evaluation metric used in τ-bench is pass^ k , which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@ k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^ k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential. Performance Analysis Our evaluation compared several different configurations: Baseline (no "think" tool, no extended thinking mode) Extended thinking mode alone "Think" tool alone "Think" tool with optimized prompt (for airline domain)

The results showed dramatic improvements when Claude 3.7 effectively used the "think" tool in both the “airline” and “retail” customer service domains of the benchmark: Airline domain : The "think" tool with an optimized prompt achieved 0.570 on the pass^1 metric, compared to just 0.370 for the baseline—a 54% relative improvement; Retail domain : The "think" tool alone achieves 0.812, compared to 0.783 for the baseline.

Claude 3.7 Sonnet's performance on the "airline" domain of the Tau-Bench eval under four different configurations. Claude 3.7 Sonnet's performance on the "Airline" domain of the Tau-Bench eval Configuration k =1 k =2 k =3 k =4 k =5 "Think" + Prompt 0.584 0.444 0.384 0.356 0.340 "Think" 0.404 0.254 0.186 0.140 0.100 Extended thinking 0.412 0.290 0.232 0.192 0.160 Baseline 0.332 0.206 0.148 0.116 0.100

Evaluation results across four different configurations. Scores are proportions.

The best performance in the airline domain was achieved by pairing the “think” tool with an optimized prompt that gives examples of the type of reasoning approaches to use when analyzing customer requests. Below is an example of the optimized prompt:

Using the think tool

Before taking any action or responding to the user after receiving tool results, use the think tool as a scratchpad to:

List the specific rules that apply to the current request
Check if all required information is collected
Verify that the planned action complies with all policies
Iterate over tool results for correctness

Here are some examples...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Anthropic's new feature with moderate HN traction.