WritingQwen (Alibaba Cloud)Qwen (Alibaba Cloud)published Feb 4, 2024seen 6d

Introducing Qwen1.5

Open original ↗

Captured source

source ↗
published Feb 4, 2024seen 6dcaptured 3dhttp 200method plain

Introducing Qwen1.5 | Qwen

We have a new blog! View this page at qwen.ai . This page will automatically redirect in 5 seconds. If you are not redirected automatically, please click the button below. Go Now

Introducing Qwen1.5 February 4, 2024 · 14 min · 2895 words · Qwen Team | Translations: 简体中文

GITHUB HUGGING FACE MODELSCOPE DEMO DISCORD Introduction # In recent months, our focus has been on developing a “good” model while optimizing the developer experience. As we progress towards Qwen1.5 , the next iteration in our Qwen series, this update arrives just before the Chinese New Year. With Qwen1.5, we are open-sourcing base and chat models across six sizes: 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B, and also an MoE model (see blog for more information). In line with tradition, we’re also providing quantized models, including Int4 and Int8 GPTQ models, as well as AWQ and GGUF quantized models. To enhance the developer experience, we’ve merged Qwen1.5’s code into Hugging Face transformers, making it accessible with transformers>=4.37.0 without needing trust_remote_code . We’ve collaborated with frameworks like vLLM , SGLang for deployment, AutoAWQ , AutoGPTQ for quantization, Axolotl , LLaMA-Factory for finetuning, and llama.cpp for local LLM inference, all of which now support Qwen1.5. The Qwen1.5 series is available on platforms such as Ollama and LMStudio . Additionally, API services are offered not only on DashScope but also on together.ai , with global accessibility. Visit here to get started, and we recommend trying out Qwen1.5-72B-chat . This release brings substantial improvements to the alignment of chat models with human preferences and enhanced multilingual capabilities. All models now uniformly support a context length of up to 32768 tokens. There have also been minor improvements in the quality of base language models that may benefit your finetuning endeavors. This step represents a small stride toward our objective of creating a truly “good” model. Performance # To provide a better understanding of the performance of Qwen1.5, we have conducted a comprehensive evaluation of both base and chat models on different capabilities, including basic capabilities such as language understanding, coding, reasoning, multilingual capabilities, human preference, agent, retrieval-augmented generation (RAG), etc. Basic Capabilities # To assess the basic capabilities of language models, we have conducted evaluations on traditional benchmarks, including MMLU (5-shot), C-Eval, Humaneval, GS8K, BBH, etc. Model MMLU C-Eval GSM8K MATH HumanEval MBPP BBH CMMLU GPT-4 86.4 69.9 92.0 45.8 67.0 61.8 86.7 71.0 Llama2-7B 46.8 32.5 16.7 3.3 12.8 20.8 38.2 31.8 Llama2-13B 55.0 41.4 29.6 5.0 18.9 30.3 45.6 38.4 Llama2-34B 62.6 - 42.2 6.2 22.6 33.0 44.1 - Llama2-70B 69.8 50.1 54.4 10.6 23.7 37.7 58.4 53.6 Mistral-7B 64.1 47.4 47.5 11.3 27.4 38.6 56.7 44.7 Mixtral-8x7B 70.6 - 74.4 28.4 40.2 60.7 - - Qwen1.5-7B 61.0 74.1 62.5 20.3 36.0 37.4 40.2 73.1 Qwen1.5-14B 67.6 78.7 70.1 29.2 37.8 44.0 53.7 77.6 Qwen1.5-32B 73.4 83.5 77.4 36.1 37.2 49.4 66.8 82.3 Qwen1.5-72B 77.5 84.1 79.5 34.1 41.5 53.4 65.5 83.5 At every model size, Qwen1.5 demonstrates strong performance across the diverse evaluation benchmarks. In particular, Qwen1.5-72B outperforms Llama2-70B across all benchmarks, showcasing its exceptional capabilities in language understanding, reasoning, and math. In light of the recent surge in interest for small language models, we have compared Qwen1.5 with sizes smaller than 7 billion parameters, against the most outstanding small-scale models within the community. The results are shown below: Model Non-Emb Params MMLU C-Eval GSM8K MATH HumanEval MBPP BBH CMMLU Tinyllama-1.1B 1.1B 24.3 25.0 2.3 0.7 6.7 19.9 28.8 24.0 Gemini-Nano-3B - - - 22.8 - - 27.2 42.4 - StableLM-Zephyr-3B 2.7B 45.9 30.3 52.5 12.5 35.4 31.9 37.7 30.9 Phi-2 2.5B 52.7 23.4 57.2 3.5 47.6 55.0 43.4 24.2 MiniCPM-2B 2.4B 53.5 51.1 53.8 10.2 50.0 47.3 36.9 51.1 Gemma-2B 2.0B 42.3 - 17.7 11.8 22.0 29.2 35.2 - Qwen1.5-0.5B 0.3B 39.2 50.5 22.0 3.1 12.2 6.8 18.3 46.6 Qwen1.5-1.8B 1.2B 46.8 59.7 38.4 10.1 20.1 18.0 24.2 57.8 Qwen1.5-4B 3.1B 56.1 67.6 57.0 10.0 25.6 29.2 32.5 66.7 Qwen1.5-MoE-A2.7B 2.0B 62.5 79.2 61.5 21.9 34.2 36.6 39.1 79.2 We can confidently assert that Qwen1.5 base models under 7 billion parameters are highly competitive with the leading small-scale models in the community. In the future, we will continue to improve the quality of small models and exploring methods for effectively transferring the advanced capabilities inherent in larger models into the smaller ones. Aligning with Human Preference # Alignment aims to enhance instruction-following capabilities of LLMs and help provide responses that are closely aligned with human preferences. Recognizing the significance of integrating human preferences into the learning process, we effectively employed techniques such as Direct Policy Optimization (DPO) and Proximal Policy Optimization (PPO) in aligning the latest Qwen series. However, assessing the quality of such chat models poses a significant challenge. Admittedly, while comprehensive human evaluation is the optimal approach, it faces significant challenges pertaining to scalability and reproducibility. Therefore, we initially evaluate our models on two widely-used benchmarks, utilizing advanced LLMs as judges: MT-Bench and Alpaca-Eval. The results are presented below: We notice there are non-negligible variance in the scores on MT-Bench. So we have three runs with different seeds in our results and we report the average score with standard deviation. Models MT-Bench AlpacaEval 2.0 Avg. Score Win Rate Length Qwen1.5-72B-Chat 8.61 0.04 (8.67/8.61/8.56) 27.18 1.30 1600 Qwen1.5-14B-Chat 7.91 0.11 (7.99/7.99/7.77) 19.7 1.12 1608 Qwen1.5-7B-Chat 7.60 0.05 (7.58/7.55/7.66) 13.20 1.43 1606 Despite still significantly trailing behind GPT-4-Turbo, the largest open-source Qwen1.5 model, Qwen1.5-72B-Chat, exhibits superior performance, surpassing Claude-2.1, GPT-3.5-Turbo-0613, Mixtral-8x7b-instruct, and TULU 2 DPO 70B, being on par with Mistral Medium, on both MT-Bench and Alpaca-Eval v2. Furthermore, although the scoring of LLM Judges may seemingly correlate with the lengths of responses, our observations indicate that our models do not generate lengthy responses to manipulate the bias of LLM judges. The average length of Qwen1.5-Chat on AlpacaEval 2.0 is only 1618, which…

Excerpt shown — open the source for the full document.