Sparse Mixture of Experts (SMoE)
An architecture where only a small fraction of all "expert sub-networks" is activated per input – enabling huge model capacity with efficient inference.
Explanation
A gating (router) network scores all N experts for each token and routes it to the top-K (e.g., K=2 of N=8 in Mixtral; larger systems use N=64 or more). Only the selected experts are computed. The model holds roughly N × expert-size parameters but spends only about K × expert-size FLOPs per token.
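The routing step can be sketched in a few lines of pure Python. This is a minimal illustration, not a production implementation (real systems use batched tensor ops and learned gate weights); the function names `top_k_route` and `smoe_forward` are my own.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def top_k_route(gate_logits, k=2):
    """Pick the k experts with the highest gate logits and
    renormalize their gate values so the routing weights sum to 1."""
    idx = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    probs = softmax([gate_logits[i] for i in idx])
    return list(zip(idx, probs))

def smoe_forward(x, experts, gate, k=2):
    """Compute ONLY the k selected experts; combine their outputs
    weighted by the renormalized gate probabilities."""
    routing = top_k_route(gate(x), k)
    return sum(w * experts[i](x) for i, w in routing)

# Toy demo: 4 scalar "experts" and a hand-written gate.
experts = [lambda x, c=c: c * x for c in (1.0, 2.0, 3.0, 4.0)]
gate = lambda x: [0.1, 2.0, 1.5, -1.0]  # experts 1 and 2 win
y = smoe_forward(10.0, experts, gate, k=2)
```

Note that the two unselected experts are never evaluated at all — that skipped computation is exactly the FLOPs saving described above.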
Marketing Relevance
Architecture behind Mixtral and, reportedly, GPT-4 and Gemini. Enables models with hundreds of billions to trillions of parameters at affordable inference cost. A leading path for LLM scaling.
Example
Mixtral 8x7B has 8 expert feed-forward networks per layer. Total: roughly 47B parameters, of which only about 13B are active per token (two experts plus the shared attention and embedding layers). Result: quality matching or beating GPT-3.5 at the inference cost of a ~13B dense model — roughly 3.5× fewer active parameters than the full model.
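The arithmetic behind "47B total, 13B active" can be made explicit. The split between shared and per-expert parameters below is an approximation back-solved from the published totals, not an official breakdown:

```python
# Back-of-the-envelope for a Mixtral-8x7B-like model.
# The per-expert and shared figures are approximations chosen to
# reproduce the published ~46.7B total / ~12.9B active counts.
n_experts, k_active = 8, 2
expert_params = 5.63e9   # all expert FFN params, per expert (approx.)
shared_params = 1.6e9    # attention, embeddings, norms (approx.)

total_params = shared_params + n_experts * expert_params   # ~46.6B stored
active_params = shared_params + k_active * expert_params   # ~12.9B per token
ratio = total_params / active_params                       # ~3.6x
```

This is why "8x7B" does not mean 56B: the attention layers are shared across experts, so the total is well under 8 × 7B, and the speedup over running all experts is ~3.6×, not 8×.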
Common Pitfalls
High memory requirements: all experts must be loaded even though only a few are active per token. Load balancing between experts is critical — without an auxiliary balancing loss, the router can collapse onto a handful of experts. Training is more complex (routing decisions are discrete and can be unstable). Not all tokens benefit equally from expert specialization.
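The load-balancing pitfall is usually addressed with an auxiliary loss in the style of the Switch Transformer: N × Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens whose top-1 expert is i and Pᵢ is the mean gate probability for expert i. A minimal pure-Python sketch (function name `load_balance_loss` is my own):

```python
def load_balance_loss(gate_probs, top1_assignments, n_experts):
    """Switch-Transformer-style auxiliary loss: n_experts * sum(f_i * P_i).
    f_i = fraction of tokens routed (top-1) to expert i,
    P_i = mean gate probability assigned to expert i.
    The minimum value 1.0 is reached when routing is perfectly uniform;
    collapsed routing drives the value up, so adding this term to the
    training loss pushes the router toward balance."""
    n_tokens = len(gate_probs)
    f = [0.0] * n_experts
    for a in top1_assignments:
        f[a] += 1.0 / n_tokens
    P = [sum(p[i] for p in gate_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Uniform routing scores 1.0; collapsed routing scores higher.
uniform_loss = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], n_experts=2)
collapsed_loss = load_balance_loss([[0.9, 0.1], [0.9, 0.1]], [0, 0], n_experts=2)
```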
Origin & History
Mixture of Experts goes back to Jacobs et al.'s "Adaptive Mixtures of Local Experts" (1991). Sparse gating for modern deep learning was introduced by Shazeer et al. (2017, "Outrageously Large Neural Networks"), scaled up in Google's GShard (2020) and Switch Transformer (2021), and popularized for open-weight LLMs by Mistral's Mixtral release in late 2023.