Tencent-Hunyuan/Thinking-Free_Policy_Initialization
Python
Captured source
source ↗Tencent-Hunyuan/Thinking-Free_Policy_Initialization
Description: The official code of [ICLR 2026] TFPI: Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
Language: Python
License: NOASSERTION
Stars: 103
Forks: 12
Open issues: 0
Created: 2025-11-06T07:38:13Z
Pushed: 2026-01-27T11:59:26Z
Default branch: main
Fork: no
Archived: no
README:
1Hunyuan LLM Department, Tencent 
2The Hong Kong University of Science and Techology 
3The University of Hong Kong 
Overview
Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to Reinforcement Learning with Verifiable Reward (RLVR) that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple *ThinkingFree* operation, explicitly discarding the thinking content via a direct append, to reduce token usage during inference. Training with *ThinkingFree*-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we can train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench with extremely low training compute.
📝 News
- [2026/01/26] Our paper is accepted to ICLR 2026.
- [2025/12/22] We released the codes.
- [2025/11/7] We released the model checkpoints.
- [2025/9/30] We released the paper!
🚀 Quick Start
Installation
1. Environment setup
conda create -n TFPI python=3.10 -y conda activate TFPI
2. Requirements installation
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124 pip install vllm==0.8.5.post1 pip install -e . pip install vertexai pip install sentence_transformers pip install flash-attn==2.7.4.post1 --no-build-isolation
Run Training
The training dataset is the training prompts in Polaris-53K.
First, download and transform the format of training data using the following Python script:
python scripts/download_train.py
The training data is saved in ./data/train/tfpi-polaris53k.parquet
Next, adapt the training script in "./scripts/train/qwen3-4b-tfpi.sh" by setting the WandB key, model path and dataset path.
Finally, run the following commands at the master node:
bash ./scripts/ray_start.sh # start ray bash ./scripts/train/qwen3-4b-tfpi.sh # submit training
Run Evaluation
First, download the evaluation datasets using
hf download xx18/TFPI-EVA --repo-type=dataset --local-dir ./data/eval
All test datasets are downloaded to the folder data/eval.
for evaluation, use:
bash ./scripts/ray_start.sh # start ray, use pssh to run on multiple nodes if necessary bash scripts/eval/start_generate.sh
The resulted metrics and evaluation outputs will be saved under the folder your_model_path/eval_results
For IFEval, please refer to the official repo IFEval evaluation.
🤗 Datasets and Models
we are open-sourcing our complete codes, and training details for the research community. All our resulted checkpoints can be found in TFPI Collection.
| Name | Link | Remarks | | - | - | - | | Evaluation Sets | TFPI-EVA | All evaluation datasets used in the TFPI paper, including AIME24, AIME25, BeyondAIME, LiveCodeBench, GPQA, and IFEval | | Training set | Polaris-53K | - | | 1.5B TFPI Stage 1 | TFPI-DeepSeek-Qwen-1.5B-Stage1 | Results in Table 1; Training Response Length 2048 | | 1.5B TFPI Stage 2 | TFPI-DeepSeek-Qwen-1.5B-Stage2 | Results in Table 1; Training Response Length 4096 | | 1.5B TFPI Stage 3 | TFPI-DeepSeek-Qwen-1.5B-Stage3 | Results in Table 1; Training Response Length 8192 | | 1.5B TFPI Stage 3 + DAPO | TFPI-DeepSeek-Qwen-1.5B-Stage3_then_RL | Results in Table 7; Training Response Length 16K; | | 1.5B Direct RL checkpoint 1 | DirectRL_DeepSeek-Qwen-1.5B_baseline1 | Results in Table 1; Training Response Length 16K; Traning Time = 3 stages of TFPI | | 1.5B Direct RL checkpoint 2 | DirectRL_DeepSeek-Qwen-1.5B_baseline2 | Results in Table 7; Training Response Length 16K; Traning Time = ''TFPI+RL'' | | Qwen3-4B TFPI Stage 1 | TFPI-Qwen3-4B-Stage1 | Results in Table 1; Training Response Length 4096 | | Qwen3-4B TFPI Stage 2 | TFPI-Qwen3-4B-Stage2 | Results in Table 1; Training Response Length 8192 | | Qwen3-4B TFPI Stage 3 | TFPI-Qwen3-4B-Stage3 | Results in Table 1; Training Response Length 16K | | Qwen3-4B TFPI Stage 3 + DAPO | TFPI-Qwen3-4B-Stage3_then_RL | Results in Table 2; Training Response Length 32K | | Qwen3-4B Direct RL checkpoint 1 | DirectRL_Qwen3-4B_baseline1 | Results in Table 1; Training Response Length 32K; Traning Time = 3 stages of TFPI | | Qwen3-4B Direct RL checkpoint 2 | DirectRL_Qwen3-4B_baseline2 | Results in Table 2; Training Response Length 32K; Traning Time = ''TFPI+RL'' | | Qwen3-4B-Thinking-2507 Stage 3 | TFPI-Qwen3-4B-Thinking-2507-Stage3 | Results in Table 2; Training Response Length 16K |
🤝 Acknowledgement
We are deeply grateful for the following GitHub…
Excerpt shown — open the source for the full document.
Notability
notability 5.0/10New repo, moderate traction.