Hello Qwen2
Captured source
source ↗Hello Qwen2 | Qwen
We have a new blog! View this page at qwen.ai . This page will automatically redirect in 5 seconds. If you are not redirected automatically, please click the button below. Go Now
Hello Qwen2 June 7, 2024 · 15 min · 3119 words · Qwen Team | Translations: 简体中文
GITHUB HUGGING FACE MODELSCOPE DEMO DISCORD Introduction # After months of efforts, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you: Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B ; Having been trained on data in 27 additional languages besides English and Chinese; State-of-the-art performance in a large number of benchmark evaluations; Significantly improved performance in coding and mathematics; Extended context length support up to 128K tokens with Qwen2-7B-Instruct and Qwen2-72B-Instruct.
We have opensourced the models in Hugging Face and ModelScope to you and we are looking forward to hearing from you! Model Information # The Qwen2 series include base and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, Qwen2-72B. We illustrate the key information of the models in the following table: Models Qwen2-0.5B Qwen2-1.5B Qwen2-7B Qwen2-57B-A14B Qwen2-72B
Params 0.49B 1.54B 7.07B 57.41B 72.71B
Non-Emb Params 0.35B 1.31B 5.98B 56.32B 70.21B
GQA True True True True True Tie Embedding True True False False False Context Length 32K 32K 128K 64K 128K Specifically, previously in Qwen1.5, only Qwen1.5-32B and Qwen1.5-110B have adopted Group Query Attention (GQA). This time, for all model sizes, we apply GQA so that they can enjoy the benefits of faster speed and less memory usage in model inference. For small models, we prefer the application of tying embedding as the large sparse embeddings take up a large proportion of the total model parameters. In terms of the context length, all base language models have been pretrained on data of the context length of 32K tokens, and we observe satisfactory extrapolation capabilities up to 128K in PPL evaluation. However, for instruction-tuned models, we are not satisfied with merely PPL evaluation; we need the models to be capable of correctly understanding long context and completing tasks. In the table, we list the context length capabilities of instruction-tuned models, as assessed through the evaluation of the Needle in a Haystack task. Notably, when augmented with YARN, both Qwen2-7B-Instruct and Qwen2-72B-Instruct models demonstrate an impressive capacity to handle context lengths extending up to 128K tokens. Significant efforts were directed towards augmenting both the volume and quality of pretraining and instruction-tuning datasets across a diverse linguistic spectrum, beyond English and Chinese, to bolster its multilingual competencies. Although large language models possess an inherent capacity to generalize to other languages, we explicitly highlight the inclusion of 27 additional languages in our training: Regions Languages Western Europe German, French, Spanish, Portuguese, Italian, Dutch Eastern & Central Europe Russian, Czech, Polish Middle East Arabic, Persian, Hebrew, Turkish Eastern Asia Japanese, Korean South-Eastern Asia Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog Southern Asia Hindi, Bengali, Urdu Additionally, we have devoted significant effort to addressing code-switching, a frequent occurrence in multilingual evaluation. Consequently, our models’ proficiency in handling this phenomenon have notably enhanced. Evaluations using prompts that typically induce code-switching across languages confirm a substantial reduction in associated issues. Performance # Comparative assessments reveal substantial enhancements in performance for large-scale models (70B+ parameters) relative to Qwen1.5. Here our evaluation centers on the large-size model Qwen2-72B. In terms of base language models, Qwen2-72B and state-of-the-art open models are evaluated for different capbilities including natural language understanding, knowledge acquisition, coding proficiency, mathematical skills, and multilingual abilities. Benefiting from meticulously curated datasets and optimized training methods, Qwen2-72B exhibits superior performance compared to leading models such as Llama-3-70B. Notably, it surpasses the performance of its predecessor, Qwen1.5-110B, despite having fewer parameters. After extensive large-scale pre-training, we conduct post-training to further enhance Qwen’s intelligence, bringing it closer to human. This process further improves the model’s capabilities in areas such as coding, mathematics, reasoning, instruction following, multilingual understanding, and more. Additionally, it aligns the model’s output with human values, ensuring that it is helpful, honest, and harmless. Our post-training phase is designed with the principle of scalable training with minimal human annotation. Specifically, we investigate how to obtain high-quality, reliable, diverse and creative demonstration data and preference data with various automated alignment strategies, such as rejection sampling for math, execution feedback for coding and instruction-following, back-translation for creative writing, scalable oversight for role-play, etc. As for training, we apply a combination of supervised fine-tuning, reward model training and online DPO training. We also employ a novel Online Merging Optimizer to minimize the alignment tax. These collective efforts have significantly boosted the capabilities and intelligence of our models, as illustrated in the following table. We comprehensively evaluate Qwen2-72B-Instruct on 16 benchmarks across various domains. Qwen2-72B-Instruct strikes a balance between obtaining better capabilities and aligning well with human values. Specifically, Qwen2-72B-Instruct significantly surpasses Qwen1.5-72B-Chat across all benchmarks, and also reaches competitive performance compared with Llama-3-70B-Instruct. 1 In terms of smaller models, our Qwen2 models also outcompete the SOTA models of similar or even larger sizes. In comparison with the very recently released SOTA models, Qwen2-7B-Instruct can still demonstrate advantages across benchmarks, showing specifically outstanding performance on coding and Chinese-related metrics. 1 Highlights # Coding & Mathematics # We have persistently dedicated our efforts to enhance the advanced…
Excerpt shown — open the source for the full document.
Notability
Impressive Qwen2 performance under Apache 2.0, but censorship and 'open-washing' concerns persist.
Qwen (Alibaba Cloud) has a writing signal matching data demand, evals and quality, product and customer.