Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters
March 28, 2024 · Qwen Team
Introduction

Since the surge in interest sparked by Mixtral, research on mixture-of-experts (MoE) models has gained significant momentum. Both researchers and practitioners want to understand how to train such models effectively and how efficient and effective they are in practice. Today, we introduce Qwen1.5-MoE-A2.7B, a small MoE model with only 2.7 billion activated parameters that nonetheless matches the performance of state-of-the-art 7B models such as Mistral 7B and Qwen1.5-7B. Compared to Qwen1.5-7B, which contains 6.5 billion non-embedding parameters, Qwen1.5-MoE-A2.7B contains only 2.0 billion activated non-embedding parameters, approximately one-third of Qwen1.5-7B’s size. Notably, it achieves a 75% decrease in training expenses and accelerates inference speed by a factor of 1.74, offering substantial improvements in resource utilization without compromising performance.

Architecture

We build the Qwen1.5-MoE models with a specially designed MoE architecture. Typically, as in Mixtral, the MoE layer within each transformer block employs eight experts and uses a top-2 gating strategy for routing. This configuration is straightforward and effective, but it leaves ample room for improvement. Through an extensive series of experiments, we have therefore introduced several modifications to this architecture:

- Fine-grained experts
- Initialization, which we call “upcycling”
- Routing mechanism, with shared and routing experts
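For reference, the conventional setup described above (eight experts, top-2 gating) can be sketched in a few lines. This is a toy illustration rather than code from Mixtral or Qwen1.5-MoE: the function names `top_k_gate` and `moe_forward` are hypothetical, the experts are simple scaling functions, and normalizing with a softmax over only the selected logits is one common formulation that is assumed here.

```python
import math

def top_k_gate(logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1
    # (an assumed normalization; other schemes exist).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    gates = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())

# Eight toy "experts": expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(8)]
# Hypothetical router logits for a single token.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 0.2]

gates = top_k_gate(logits, k=2)          # only experts 1 and 4 are selected
y = moe_forward(1.0, experts, logits, k=2)
```

Only the two selected experts run for this token; the other six contribute no compute, which is where the sparsity savings come from.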
Previous research projects such as DeepSeek-MoE and DBRX have demonstrated the effectiveness of fine-grained experts. Conventionally, when transitioning from a standard FFN layer to a Mixture-of-Experts (MoE) layer, one simply replicates the FFN multiple times to create multiple experts. With fine-grained experts, the goal is instead to obtain a larger number of experts without increasing the parameter count. To accomplish this, we partition a single FFN into several segments, each serving as an individual expert. We found the best configuration to be a total of 64 experts, an eightfold increase over the conventional MoE setup of 8 experts.

The initialization stage of the model is critical. Our initial experiments suggest that training an MoE model from scratch can be inefficient, and that it is difficult to bring such a model up to the expected peak performance. Instead, we start by repurposing our existing Qwen-1.8B, transforming it into Qwen1.5-MoE-A2.7B. A noteworthy finding is that introducing randomness during initialization significantly speeds up convergence and results in better overall performance throughout the pre-training process.

The routing methodology also deserves attention. There is a growing trend towards using both shared experts and routing-specific experts within the MoE layer. Viewed more broadly, this is a generalized MoE routing approach, since having zero shared experts reduces to the conventional MoE routing setup. In Qwen1.5-MoE-A2.7B, we have incorporated 4 shared experts that are always activated, alongside 60 routing experts of which 4 are activated. This configuration offers a more adaptable way to construct the MoE routing mechanism, providing greater flexibility and efficiency.
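The two ideas above, partitioning one FFN into fine-grained experts and combining always-on shared experts with routed ones, can be illustrated with a small sketch. It is not the released implementation: it assumes a plain two-layer ReLU FFN for simplicity, uses toy dimensions (`d_model = 8`, `d_hidden = 128`), draws random weights and router logits, and normalizes the routed weights with a softmax over the selected logits. All function names are hypothetical. Only the expert counts (64 total, 4 shared, 60 routed, 4 activated) come from the text.

```python
import math
import random

def ffn(W_in, W_out, x):
    """ReLU FFN: hidden unit j has input weights W_in[j] and output weights W_out[j]."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_in]
    out = [0.0] * len(x)
    for h, row in zip(hidden, W_out):
        for d, w in enumerate(row):
            out[d] += h * w
    return out

def partition(W_in, W_out, n_segments):
    """Split the hidden units into equal contiguous segments, one per expert."""
    size = len(W_in) // n_segments  # assumes the hidden size divides evenly
    return [(W_in[i * size:(i + 1) * size], W_out[i * size:(i + 1) * size])
            for i in range(n_segments)]

def add_scaled(acc, vec, scale=1.0):
    """In-place acc += scale * vec."""
    for d, v in enumerate(vec):
        acc[d] += scale * v

def shared_routed_forward(x, shared, routed, router_logits, k):
    """Shared experts always run; only the top-k routed experts run."""
    out = [0.0] * len(x)
    for W_in, W_out in shared:
        add_scaled(out, ffn(W_in, W_out, x))
    top = sorted(range(len(routed)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    for i, e in zip(top, exps):
        W_in, W_out = routed[i]
        add_scaled(out, ffn(W_in, W_out, x), e / sum(exps))
    return out, top

rng = random.Random(0)
d_model, d_hidden = 8, 128  # toy sizes, not the real model's
W_in = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
W_out = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
x = [rng.gauss(0, 1) for _ in range(d_model)]

segments = partition(W_in, W_out, 64)          # 64 fine-grained experts
shared, routed = segments[:4], segments[4:]    # 4 shared + 60 routed
router_logits = [rng.gauss(0, 1) for _ in routed]
out, active = shared_routed_forward(x, shared, routed, router_logits, k=4)

# Sanity check: running every segment and summing reproduces the dense FFN,
# so the partition adds experts without adding parameters.
dense = ffn(W_in, W_out, x)
recombined = [0.0] * d_model
for seg_in, seg_out in segments:
    add_scaled(recombined, ffn(seg_in, seg_out, x))
```

Passing an empty `shared` list turns `shared_routed_forward` into plain top-k routing, which mirrors the point that zero shared experts recovers the conventional setup.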
Performance

To assess the capabilities of our new model, we conducted extensive evaluations across various benchmark datasets for both the base and chat models. For the base model, we evaluated performance on 3 benchmarks: MMLU, GSM8K, and HumanEval, covering language understanding, mathematics, and coding respectively. To gauge its multilingual proficiency, we followed the evaluation protocol of Qwen1.5 and tested it on several benchmarks spanning diverse domains such as exams, understanding, math, and translation, reporting an aggregate score in the “Multilingual” column. For the chat model, rather than employing traditional benchmarks, we tested it with MT-Bench.

In this comparison, we evaluated Qwen1.5-MoE-A2.7B against top-performing 7B base models such as Mistral-7B (v0.1 base and v0.2 instruct), Gemma-7B, and Qwen1.5-7B. We also included a comparison with other MoE models of comparable parameter counts, notably DeepSeekMoE 16B. The results are summarized in the table below:

| Model | MMLU | GSM8K | HumanEval | Multilingual | MT-Bench |
|---|---|---|---|---|---|
| Mistral-7B | 64.1 | 47.5 | 27.4 | 40.0 | 7.60 |
| Gemma-7B | 64.6 | 50.9 | 32.3 | - | - |
| Qwen1.5-7B | 61.0 | 62.5 | 36.0 | 45.2 | 7.60 |
| DeepSeekMoE 16B | 45.0 | 18.8 | 26.8 | - | 6.93 |
| Qwen1.5-MoE-A2.7B | 62.5 | 61.5 | 34.2 | 40.8 | 7.17 |

Qwen1.5-MoE-A2.7B performs competitively with the top 7B models across these evaluations. Even so, the MT-Bench result shows that there is still room for improvement in the chat model specifically. We are therefore committed to further research on effective finetuning strategies for MoE models.

Costs and Efficiency

The training costs of MoE models deviate significantly from those of their dense counterparts. Despite a larger parameter count, MoE models’ training expenses can be notably lower due to sparsity.
To better understand this, let’s first look at three key quantities, namely the total number of parameters, the number of activated parameters, and the number of activated non-embedding parameters, and compare them across models (all values in billions):

| Model | #Parameters (B) | #(Activated) Parameters (B) | #(Activated) Non-embedding parameters (B) |
|---|---|---|---|
| Mistral-7B | 7.2 | 7.2 | 7.0 |
| Qwen1.5-7B | 7.7 | 7.7 | 6.4 |
| Gemma-7B | 8.5 | 7.8 | 7.8 |
| DeepSeekMoE 16B | 16.4 | 2.8 | 2.4 |
| Qwen1.5-MoE-A2.7B | 14.3 | 2.7 | 2.0 |

The number of activated non-embedding parameters of our MoE model is much smaller than that of the 7B models. In practice, we observed a 75% reduction in training costs when using Qwen1.5-MoE-A2.7B in comparison to Qwen1.5-7B. Of particular…
Notability
Community criticizes misleading naming and high VRAM usage despite fewer active parameters.