QwQ-32B: Embracing the Power of Reinforcement Learning
March 6, 2025 · Qwen Team
Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods. Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models. For instance, DeepSeek R1 has achieved state-of-the-art performance by integrating cold-start data and multi-stage training, enabling deep thinking and complex reasoning.

Our research explores the scalability of Reinforcement Learning (RL) and its impact on enhancing the intelligence of large language models. We are excited to introduce QwQ-32B, a model with 32 billion parameters that achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated). This remarkable outcome underscores the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge. Furthermore, we have integrated agent-related capabilities into the reasoning model, enabling it to think critically while utilizing tools and adapting its reasoning based on environmental feedback. These advancements not only demonstrate the transformative potential of RL but also pave the way for further innovations in the pursuit of artificial general intelligence.

QwQ-32B is available as an open-weight model on Hugging Face and ModelScope under the Apache 2.0 license and is accessible via Qwen Chat.

Performance

QwQ-32B is evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results below highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.

Reinforcement Learning

We began with a cold-start checkpoint and implemented a reinforcement learning (RL) scaling approach driven by outcome-based rewards.
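An outcome-based reward scores only the final result of a rollout, not its intermediate steps. The sketch below illustrates the idea with two binary rewards: one checks a final math answer against a reference, the other runs a candidate program against test cases. The function names (`math_reward`, `code_reward`), the exact-match rule, and the stdin/stdout test format are all assumptions made for illustration; this is not the verifier or execution setup used to train QwQ-32B.

```python
import subprocess
import sys
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Hypothetical rule: compare as exact rationals when both strings parse,
    otherwise fall back to a plain string comparison.
    """
    try:
        return float(Fraction(model_answer.strip()) == Fraction(reference.strip()))
    except (ValueError, ZeroDivisionError):
        return float(model_answer.strip() == reference.strip())


def code_reward(source: str, tests: list[tuple[str, str]], timeout: float = 5.0) -> float:
    """Return 1.0 only if the program passes every (stdin, expected stdout) case.

    Note: this runs the code in a local subprocess with no sandboxing, which
    is only acceptable for trusted toy inputs.
    """
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return 0.0
    return 1.0
```

Both functions return a single scalar per rollout, which is the shape of signal an RL loop would consume. A real setup would need answer normalization well beyond exact matching, plus isolated execution for generated code.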
In the initial stage, we scaled RL specifically for math and coding tasks. Rather than relying on traditional reward models, we used an accuracy verifier for math problems to ensure the correctness of final solutions and a code execution server to assess whether the generated code passes predefined test cases. As training episodes progressed, performance in both domains improved continuously.

After the first stage, we added another stage of RL for general capabilities. This stage is trained with rewards from a general reward model and some rule-based verifiers. We find that a small number of training steps in this stage can improve other general capabilities, such as instruction following, alignment with human preference, and agent performance, without a significant performance drop in math and coding.

Use QwQ-32B

Below are brief examples demonstrating how to use QwQ-32B via Hugging Face Transformers and Alibaba Cloud DashScope API.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# Load the model and tokenizer; dtype and device placement are chosen automatically
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a single-turn chat prompt
prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Strip the prompt tokens so only the newly generated tokens remain
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
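With the Transformers example, the decoded `response` is one string that may contain both the model's reasoning and its final answer. The helper below is a minimal sketch for separating the two. It assumes the reasoning ends with a closing `</think>` tag, which is an assumption about the output format and is not stated in this post, so check it against your actual output. The name `split_reasoning` is hypothetical.

```python
def split_reasoning(response: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded text into (reasoning, answer) at the first closing tag.

    Assumption: reasoning precedes `close_tag`; an opening "<think>" may or
    may not be present, depending on how the prompt was templated.
    """
    reasoning, sep, answer = response.partition(close_tag)
    if not sep:
        # No closing tag: return everything as the answer. If generation was
        # cut off mid-reasoning, this text is actually unfinished reasoning.
        return "", response.strip()
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()
```

For example, `reasoning, answer = split_reasoning(response)` lets you print or log only `answer`.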

from openai import OpenAI
import os

# Initialize OpenAI client
client = OpenAI(
    # If the environment variable is not configured, replace with your API Key: api_key="sk-xxx"
    # How to get an API Key: https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""
content = ""

is_answering = False

completion = client.chat.completions.create(
    model="qwq-32b",
    messages=[
        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
    ],
    stream=True,
    # Uncomment the following lines to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + "reasoning content" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print reasoning content
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        # Skip empty or None content so the string concatenation below cannot fail
        elif delta.content:
            if not is_answering:
                print("\n" + "=" * 20 + "content" + "=" * 20 + "\n")
                is_answering = True
            # Print content
            print(delta.content, end='', flush=True)
            content += delta.content
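
The streaming loop above mixes accumulation with printing. As a sketch, the accumulation can be pulled into a small function that is testable offline without an API key. `collect_stream` and `fake_chunk` are hypothetical names; `fake_chunk` only imitates the attributes the loop reads (`choices`, `delta.reasoning_content`, `delta.content`, `usage`) and is not part of any SDK.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning, answer, usage)."""
    reasoning_parts, answer_parts, usage = [], [], None
    for chunk in chunks:
        # A chunk with no choices carries only usage information
        if not chunk.choices:
            usage = getattr(chunk, "usage", None)
            continue
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        elif getattr(delta, "content", None):
            answer_parts.append(delta.content)
    return "".join(reasoning_parts), "".join(answer_parts), usage


def fake_chunk(reasoning=None, content=None):
    """Hypothetical stand-in for a streamed chunk, for offline testing only."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)
```

With a live client, `reasoning, answer, usage = collect_stream(completion)` would replace the manual loop, at the cost of losing the incremental printing.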

Future Work

This marks Qwen’s initial step in scaling Reinforcement Learning (RL) to enhance reasoning capabilities. Through this journey, we have not only witnessed the immense potential of scaled RL but also recognized the untapped possibilities within pretrained language models. As we work towards developing the next generation of Qwen, we are confident that combining stronger foundation models with RL powered by scaled computational resources will propel us closer to achieving Artificial General Intelligence (AGI). Additionally, we are actively exploring the integration of agents…