Mixture of Experts (MoE)

Mixture of Experts (MoE) is a neural-network architecture where only a subset of parameters ('experts') activates per input, enabling total parameter counts far beyond dense-model limits while keeping inference compute low. GLM 5.1 (754B total / ~30B active), MiniMax M2.7 (230B total / 10B active), and Mixtral 8x22B all use MoE. The architecture dominates 2025-2026 frontier open-weight models.

A **Mixture of Experts** (MoE) is a neural-network architecture in which a large bank of parameters ('experts') is trained, but only a small subset is activated for any given input. A learned **router** (usually a small linear layer with softmax) selects the top-k experts per token. This lets models have huge total parameter counts while keeping per-token inference compute bounded.

## How it works

In a standard (dense) transformer, every token passes through every parameter of every layer. In an MoE transformer:

1. Standard attention blocks are unchanged.
2. The feedforward network (FFN) layer, normally a single large block, is replaced by **N parallel expert FFNs**.
3. A router (gating network) scores each token's affinity for each expert.
4. Only the **top-k experts** (typically k=2, sometimes k=8) process each token.
5. The output is a weighted sum of the active experts' outputs.

Because only k of N experts are active per token, compute scales with k, not with N. But the model's total parameter capacity (what it can memorize, what skills it can specialize in) scales with N.

## Numbers for current models

| Model | Total Params | Active Params/Token | Total / Active |
|---|---|---|---|
| Mixtral 8x7B | 47B | ~13B | 3.6x |
| Mixtral 8x22B | 141B | ~39B | 3.6x |
| GLM 5.1 | 754B | ~30B | 25x |
| MiniMax M2.7 | 230B | 10B | 23x |
| DeepSeek V3 | 671B | 37B | 18x |
| Rumored GPT-5.4 | ~2T | ~100B | 20x |

Ratios above 10x are characteristic of the 2025-2026 'sparse' MoE generation. A higher ratio means more parameter capacity per FLOP of compute.

## History

- **Jacobs, Jordan, Nowlan, Hinton 1991**: original Mixture of Experts paper for classification.
- **Shazeer et al. 2017**: 'Outrageously Large Neural Networks', the first LSTM-based sparse MoE at scale.
- **Switch Transformer (Google 2021)**: top-1 routing for transformer FFN.
- **GShard, GLaM**: early large-scale MoE transformer LLMs at Google.
- **Mixtral 8x7B** (Mistral, Dec 2023): first widely-used open-weight MoE LLM, with top-2 routing (k=2).
- **DeepSeek-V2, DeepSeek-V3**: introduced multi-head latent attention and fine-grained expert specialization.
- **GLM 4.5, GLM 5.1** (ZAI 2025-2026): ~700B+ MoE at the open-weight frontier.

## Why MoE matters

- **Compute efficiency**: for a given inference cost, MoE models have dramatically more capacity than dense equivalents. Roughly: a 500B MoE with 30B active is comparable in output quality to a 150-250B dense model, at ~1/5 the inference FLOPs.
- **Specialization**: different experts learn different skills (code, math, specific languages, reasoning patterns). The router directs each token to relevant expertise.
- **Scaling frontier**: dense models hit training-efficiency and inference-latency walls around 500B-1T parameters. MoE pushes the usable total-parameter count to multiple trillions of parameters.

## Challenges

- **Memory footprint**: all experts need to be in memory even though only some are active. This is why a 230B MoE model needs roughly 500GB of weight storage at 16-bit precision, even though each token only touches the ~20GB of weights belonging to its ~10B active parameters.
- **Load balancing**: routers must distribute tokens roughly evenly across experts, or some experts get under-trained while others are overloaded. **Load-balancing loss** terms during training address this.
- **Token dropping**: if an expert's batch is full, excess tokens are dropped or re-routed, which can hurt quality.
- **Distributed training complexity**: experts are typically sharded across GPUs, requiring all-to-all communication per layer.
- **Inference infrastructure**: the cost model differs from dense models, and serving frameworks (vLLM, TensorRT-LLM, SGLang) added MoE-specific optimizations through 2024-2025.

## Variants

- **Top-1 routing** (Switch): each token goes to one expert. Simplest, lowest compute.
- **Top-2 routing** (Mixtral): each token activates two experts, whose outputs are combined. More robust, slightly more compute.
- **Top-k with k=8** (MiniMax M2.7, GLM 5.1): more experts per token, finer-grained specialization.
- **Fine-grained experts**: many smaller experts (256+ in MiniMax M2.7), several of which activate at once, instead of a few large experts.
- **Shared experts**: some fraction of experts are always active (shared across all tokens), providing baseline capability, while the rest specialize.
- **Expert dropping / dropout**: used as regularization during training.

## Current (2026) frontier patterns

- Chinese open-weight labs (ZAI, MiniMax, DeepSeek, Alibaba Qwen) have led MoE architecture innovation for open models.
- Closed Western labs (OpenAI, Anthropic, Google) are widely believed to use MoE for their flagship models but don't disclose architecture details.
- Meta's Muse Spark is rumored to be MoE (~500B total).
- The dense/MoE split in the market: Llama 3.x was dense, Llama 4 was MoE, and most of the 2026 frontier is MoE.

MoE is now the dominant architecture at the trillion-parameter frontier, with dense architectures remaining relevant for smaller efficient models and some specialized training regimes.
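
## Routing sketch

To make the routing steps under "How it works" concrete, here is a minimal pure-Python sketch of a single MoE layer acting on one token at a time. It is an illustration under stated assumptions, not any particular model's implementation: the function names (`moe_forward`, `make_expert`, `load_balance_loss`), the toy sizes, and the random weights are all hypothetical. Gate weights are renormalized over the selected top-k experts, which is one common choice. The balance term follows the general shape of the Switch Transformer auxiliary loss, N · Σ f_i · P_i, with a division by k added here as an adaptation for top-k routing and the scaling coefficient omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def make_expert(d_model, d_hidden, rng):
    """Build one toy expert FFN: linear -> ReLU -> linear, random weights."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return expert


def moe_forward(x, router_weights, experts, k=2):
    """Route one token vector to its top-k experts and mix their outputs.

    Returns (output vector, chosen expert indices, router probabilities).
    """
    probs = softmax(matvec(router_weights, x))  # one affinity score per expert
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    top = ranked[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the top-k
    out = [0.0] * len(x)
    for i in top:  # only k of the N experts are evaluated
        gate = probs[i] / norm
        out = [o + gate * y for o, y in zip(out, experts[i](x))]
    return out, top, probs


def load_balance_loss(all_top, all_probs, n_experts):
    """Auxiliary balance term N * sum_i(f_i * P_i); equals 1.0 when uniform.

    f_i is the share of routing slots given to expert i, P_i the mean router
    probability for expert i. Dividing by k is an adaptation for top-k > 1.
    """
    n_tokens, k = len(all_top), len(all_top[0])
    share = [0.0] * n_experts
    for top in all_top:
        for i in top:
            share[i] += 1.0 / (n_tokens * k)
    mean_prob = [sum(p[i] for p in all_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(share, mean_prob))


# Toy configuration (hypothetical sizes): 8 experts, 2 active per token.
rng = random.Random(0)
D_MODEL, D_HIDDEN, N_EXPERTS, TOP_K = 4, 8, 8, 2
experts = [make_expert(D_MODEL, D_HIDDEN, rng) for _ in range(N_EXPERTS)]
router = [[rng.gauss(0, 1.0) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
tokens = [[rng.gauss(0, 1.0) for _ in range(D_MODEL)] for _ in range(16)]

results = [moe_forward(t, router, experts, k=TOP_K) for t in tokens]
aux = load_balance_loss([r[1] for r in results], [r[2] for r in results], N_EXPERTS)
print("experts chosen for first token:", results[0][1])
print("load-balance term:", round(aux, 3))
```

The per-token loop is written for readability. Production systems typically group tokens by expert and run each expert on its batch, which is where the all-to-all communication and token-dropping concerns listed under Challenges arise. With `k=1` the gate weight is exactly 1 and the layer reduces to top-1 routing as in the Switch variant.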
