RepoTencent HunyuanTencent Hunyuanpublished Nov 6, 2025seen 5d

Tencent-Hunyuan/Thinking-Free_Policy_Initialization

Python

Open original ↗

Captured source

source ↗

Tencent-Hunyuan/Thinking-Free_Policy_Initialization

Description: The official code of [ICLR 2026] TFPI: Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

Language: Python

License: NOASSERTION

Stars: 103

Forks: 12

Open issues: 0

Created: 2025-11-06T07:38:13Z

Pushed: 2026-01-27T11:59:26Z

Default branch: main

Fork: no

Archived: no

README:

1Hunyuan LLM Department, Tencent 

2The Hong Kong University of Science and Techology 

3The University of Hong Kong 

Overview

Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to Reinforcement Learning with Verifiable Reward (RLVR) that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple *ThinkingFree* operation, explicitly discarding the thinking content via a direct append, to reduce token usage during inference. Training with *ThinkingFree*-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we can train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench with extremely low training compute.

📝 News

  • [2026/01/26] Our paper is accepted to ICLR 2026.
  • [2025/12/22] We released the codes.
  • [2025/11/7] We released the model checkpoints.
  • [2025/9/30] We released the paper!

🚀 Quick Start

Installation

1. Environment setup

conda create -n TFPI python=3.10 -y
conda activate TFPI

2. Requirements installation

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5.post1
pip install -e .
pip install vertexai
pip install sentence_transformers
pip install flash-attn==2.7.4.post1 --no-build-isolation

Run Training

The training dataset is the training prompts in Polaris-53K.

First, download and transform the format of training data using the following Python script:

python scripts/download_train.py

The training data is saved in ./data/train/tfpi-polaris53k.parquet

Next, adapt the training script in "./scripts/train/qwen3-4b-tfpi.sh" by setting the WandB key, model path and dataset path.

Finally, run the following commands at the master node:

bash ./scripts/ray_start.sh # start ray
bash ./scripts/train/qwen3-4b-tfpi.sh # submit training

Run Evaluation

First, download the evaluation datasets using

hf download xx18/TFPI-EVA --repo-type=dataset --local-dir ./data/eval

All test datasets are downloaded to the folder data/eval.

for evaluation, use:

bash ./scripts/ray_start.sh # start ray, use pssh to run on multiple nodes if necessary
bash scripts/eval/start_generate.sh

The resulted metrics and evaluation outputs will be saved under the folder your_model_path/eval_results

For IFEval, please refer to the official repo IFEval evaluation.

🤗 Datasets and Models

we are open-sourcing our complete codes, and training details for the research community. All our resulted checkpoints can be found in TFPI Collection.

| Name | Link | Remarks | | - | - | - | | Evaluation Sets | TFPI-EVA | All evaluation datasets used in the TFPI paper, including AIME24, AIME25, BeyondAIME, LiveCodeBench, GPQA, and IFEval | | Training set | Polaris-53K | - | | 1.5B TFPI Stage 1 | TFPI-DeepSeek-Qwen-1.5B-Stage1 | Results in Table 1; Training Response Length 2048 | | 1.5B TFPI Stage 2 | TFPI-DeepSeek-Qwen-1.5B-Stage2 | Results in Table 1; Training Response Length 4096 | | 1.5B TFPI Stage 3 | TFPI-DeepSeek-Qwen-1.5B-Stage3 | Results in Table 1; Training Response Length 8192 | | 1.5B TFPI Stage 3 + DAPO | TFPI-DeepSeek-Qwen-1.5B-Stage3_then_RL | Results in Table 7; Training Response Length 16K; | | 1.5B Direct RL checkpoint 1 | DirectRL_DeepSeek-Qwen-1.5B_baseline1 | Results in Table 1; Training Response Length 16K; Traning Time = 3 stages of TFPI | | 1.5B Direct RL checkpoint 2 | DirectRL_DeepSeek-Qwen-1.5B_baseline2 | Results in Table 7; Training Response Length 16K; Traning Time = ''TFPI+RL'' | | Qwen3-4B TFPI Stage 1 | TFPI-Qwen3-4B-Stage1 | Results in Table 1; Training Response Length 4096 | | Qwen3-4B TFPI Stage 2 | TFPI-Qwen3-4B-Stage2 | Results in Table 1; Training Response Length 8192 | | Qwen3-4B TFPI Stage 3 | TFPI-Qwen3-4B-Stage3 | Results in Table 1; Training Response Length 16K | | Qwen3-4B TFPI Stage 3 + DAPO | TFPI-Qwen3-4B-Stage3_then_RL | Results in Table 2; Training Response Length 32K | | Qwen3-4B Direct RL checkpoint 1 | DirectRL_Qwen3-4B_baseline1 | Results in Table 1; Training Response Length 32K; Traning Time = 3 stages of TFPI | | Qwen3-4B Direct RL checkpoint 2 | DirectRL_Qwen3-4B_baseline2 | Results in Table 2; Training Response Length 32K; Traning Time = ''TFPI+RL'' | | Qwen3-4B-Thinking-2507 Stage 3 | TFPI-Qwen3-4B-Thinking-2507-Stage3 | Results in Table 2; Training Response Length 16K |

🤝 Acknowledgement

We are deeply grateful for the following GitHub…

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

New repo, moderate traction.