
Global-batch load balance: a free lunch for improving MoE LLM training

Qwen Team Blog · more than 1 year ago · about 4 min read

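Summary

This post proposes a global-batch load-balancing loss to replace the conventional micro-batch balance. With almost no additional computational overhead, it markedly improves MoE model performance and expert specialization, offering a new approach to large-scale MoE training.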

Background

The Mixture-of-Experts (MoE) architecture has become a popular technique for scaling up model parameters. Typically, one MoE layer consists of a router (often parameterized as a single linear layer) and a group of experts (for transformer-based models, each expert is one feedforward layer). Given an input, only a subset of experts is activated, and their outputs are then aggregated based on the scores the router assigns. Specifically,

$$ \mathbf{y}=\sum_{i\in N_E,\, g_i\in\operatorname{topK}} g_i(\mathbf{x})E_i(\mathbf{x}) $$

Load Balancing loss

Load-balancing loss is an essential regularization technique for training MoE-based networks; the high-level intuition is to encourage balanced activation of all experts. It can be calculated as:

$$ L_{\text{balance}}=N_E \sum_{i=1}^{N_E} f_i p_i $$

where $f_i$ is the activation frequency of expert $E_i$ and $p_i$ is the average gating score assigned to expert $E_i$. However, most existing MoE training frameworks (e.g., Megatron-core) implement micro-batch-level balance, which means $L_{\text{balance}}$ is calculated within every micro-batch and then averaged at the global-batch level. Our key point is that this implementation could be problematic if one micro-batch does not contain diverse data.
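To make the formula concrete, here is a minimal NumPy sketch of the load-balancing loss for a single batch of tokens. It is an illustration under stated assumptions, not the code of any particular framework: the function name `load_balance_loss` is hypothetical, and $f_i$ is normalized so that it sums to 1 over experts, which is one common convention.

```python
import numpy as np


def load_balance_loss(gate_probs: np.ndarray, top_k: int) -> float:
    """Compute N_E * sum_i f_i * p_i for one batch of tokens.

    gate_probs has shape (num_tokens, num_experts); each row is a router
    score distribution over experts.
    """
    num_tokens, num_experts = gate_probs.shape
    # Indices of the top-k experts selected for every token.
    top_idx = np.argsort(-gate_probs, axis=1)[:, :top_k]
    # f_i: share of all (token, slot) assignments routed to expert i.
    # Assumption: normalized to sum to 1 over experts.
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    f = counts / (num_tokens * top_k)
    # p_i: average gating score assigned to expert i.
    p = gate_probs.mean(axis=0)
    return float(num_experts * np.sum(f * p))


# Each of 4 tokens prefers a different expert: perfectly balanced usage.
balanced = np.eye(4)
# Every token prefers expert 0: fully collapsed usage.
collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))

loss_balanced = load_balance_loss(balanced, top_k=1)    # 1.0
loss_collapsed = load_balance_loss(collapsed, top_k=1)  # 4.0
```

With this normalization the loss is 1 under uniform expert usage and grows to $N_E$ when all tokens go to one expert, so minimizing it inside a single micro-batch pushes that micro-batch toward uniform routing.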

For instance, imagine a micro-batch that contains only code data: the aforementioned load-balancing loss still pushes the router to distribute these code tokens uniformly across all experts, potentially hurting model performance and preventing expert specialization. This situation is even more common when training MoE-based LLMs: the data in one micro-batch often comes from the same domain. This partially explains why most existing open-source MoE-based LLMs do not achieve notable expert specialization. This drawback motivates us to extend the current method to global-batch-level balance.

From micro-batch balance to global-batch balance

One easy way to calculate the global-batch balance loss is to:

1) Synchronize the expert selection frequency $f_i$ across all parallel groups;
2) Calculate the load-balancing loss in each parallel group (e.g., one GPU);
3) Aggregate the loss across all micro-batches.

Specifically,

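$$ L_{\text{global}}=N_E\sum_{i=1}^{N_E}f_i^{\text{global}}p_i^{\text{global}}=N_E\sum_{i=1}^{N_E}f_i^{\text{global}}\cdot\left(\frac{1}{N_P}\sum_{j=1}^{N_P}p_i^{j}\right)=\frac{1}{N_P}\sum_{j=1}^{N_P}\left(N_E\sum_{i=1}^{N_E}f_i^{\text{global}}\cdot p_i^{j}\right) $$

Here $j$ indexes the $N_P$ parallel groups and $p_i^{j}$ is the average gating score of expert $E_i$ in group $j$. Note that the expert selection frequency is just a vector whose dimension equals the number of experts! It is almost free to synchronize it across micro-batches.

Results: More Performant and Interpretable MoE

We experiment with three MoE configurations (3.4B with 0.6B activated, 15B with 2.54B activated, and 43B with 6.6B activated) and two data regimes (120B tokens and 400B tokens). The results are shown in the figure and table below. In short, compared with the micro-batch-level loss, global-batch balance achieves better performance in all settings (model, data, and task). More importantly, MoE models trained with global-batch balance show notable domain specialization.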
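The difference between the two balance levels can be seen in a toy NumPy sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, $f_i$ is normalized to sum to 1, the micro-batches have equal size, and the cross-group synchronization of the selection counts is simulated with a plain sum rather than a real collective operation.

```python
import numpy as np


def route_stats(gate_probs, top_k):
    """Return (expert selection counts, mean gating score) for one micro-batch."""
    num_experts = gate_probs.shape[1]
    top_idx = np.argsort(-gate_probs, axis=1)[:, :top_k]
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    return counts, gate_probs.mean(axis=0)


def micro_batch_lbl(micro_batches, top_k):
    """Average of per-micro-batch losses, each using its own local f_i."""
    losses = []
    for probs in micro_batches:
        counts, p = route_stats(probs, top_k)
        f_local = counts / counts.sum()
        losses.append(probs.shape[1] * np.sum(f_local * p))
    return float(np.mean(losses))


def global_batch_lbl(micro_batches, top_k):
    """Same average, but every micro-batch uses the synchronized global f_i."""
    stats = [route_stats(probs, top_k) for probs in micro_batches]
    # Stand-in for synchronizing the counts across parallel groups.
    total_counts = np.sum([counts for counts, _ in stats], axis=0)
    f_global = total_counts / total_counts.sum()
    num_experts = f_global.shape[0]
    losses = [num_experts * np.sum(f_global * p) for _, p in stats]
    return float(np.mean(losses))


# Toy setup: 4 experts and 4 single-domain micro-batches,
# where each domain is routed mostly to its own expert.
num_experts, tokens_per_batch = 4, 8
micro_batches = []
for domain in range(num_experts):
    probs = np.full((tokens_per_batch, num_experts), 0.1)
    probs[:, domain] = 0.7
    micro_batches.append(probs)

micro_loss = micro_batch_lbl(micro_batches, top_k=1)    # about 2.8
global_loss = global_batch_lbl(micro_batches, top_k=1)  # about 1.0
```

In this toy case the specialized router is penalized under micro-batch balance (about 2.8) but not under global-batch balance (about 1.0, the value obtained under uniform expert usage), because the experts are used evenly once all micro-batches are counted together.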

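In the left panel of figure (b), almost all experts are activated uniformly regardless of domain. In the right panel of figure (b), by contrast, some experts are activated frequently by specific domains, which demonstrates their specialization.

We further compare model performance against the balance batch size (Balance BSZ) on the 3.4B model with 0.6B activated parameters. The training PPL drops quickly as the Balance BSZ grows from 2 to 128 and gradually saturates beyond 128. In current mainstream MoE frameworks, even with communication across expert-parallel groups, the Balance BSZ for larger models is usually between 8 and 16, which further underlines the significance of our method.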

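Using global-batch balance may reduce balance at the micro-batch level, which can affect the computational efficiency of MoE. We therefore further experiment with adding a micro-batch balance loss on top of the global-batch balance loss (with a constant weight of 0.01 of the global-batch loss). Adding this local balance improves the model's speed (from 1.64 to 1.59 seconds per update step), while the model's effectiveness is almost unaffected.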
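One possible reading of this weighting, written as a short Python sketch: the micro-batch term is scaled by 0.01 relative to the global-batch term. This interpretation, the function name `combined_balance_loss`, and the `global_weight` parameter are all assumptions for illustration; the text does not spell out the exact form.

```python
def combined_balance_loss(global_lbl: float, micro_lbl: float,
                          global_weight: float, micro_ratio: float = 0.01) -> float:
    """Weighted sum of the global-batch and micro-batch balance losses.

    micro_ratio scales the micro-batch term relative to the global-batch term
    (0.01 is the ratio quoted in the text; global_weight is a hypothetical knob).
    """
    return global_weight * (global_lbl + micro_ratio * micro_lbl)
```

The small ratio keeps the global-batch term dominant, so the local term only nudges routing toward even per-micro-batch load.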

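Conclusion

In summary, we study the challenges associated with the load-balancing loss (LBL) when training MoE models. By introducing the global-batch balance loss, we improve performance and foster expert specialization in MoE models. We believe this advance addresses a fundamental limitation in existing MoE training and offers a new perspective on MoE model optimization. Although our experiments focus mainly on language-based tasks, we hope our work can pave the way for training more substantial and specialized MoE models in various domains.

Citation

If you find our work helpful, feel free to cite us.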

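@article{qiu2025demonsdetailimplementingfeed,
  title={Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models},
  author={Zihan Qiu and Zeyu Huang and Bo Zheng and Kaiyue Wen and Zekun Wang and Rui Men and Ivan Titov and Dayiheng Liu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2501.11873},
  year={2025}
}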
