DeepMind’s PEER scales language models with millions of tiny experts
What is the Mixture-of-Experts (MoE) technique?

Mixture-of-Experts (MoE) is a machine learning technique that divides a model into separate sub-networks, or "experts," each specializing in a subset of the input data. By selectively activating only the experts needed for a given input, it lets large-scale models reduce computation costs during pre-training and run faster at inference time. MoE architectures are used in several popular large language models, including Mixtral, DBRX, Grok, and reportedly GPT-4.
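
To make the idea concrete, here is a minimal sketch of an MoE layer with learned top-k routing, written in PyTorch. The class and parameter names (TinyExpert, MoELayer, d_model, num_experts, k) are illustrative assumptions for this example, not code from any of the models named above.

```python
# Minimal sketch of a Mixture-of-Experts layer with top-k routing.
# All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyExpert(nn.Module):
    """One expert: a small feed-forward sub-network."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoELayer(nn.Module):
    """Routes each token to its top-k experts and mixes their outputs."""
    def __init__(self, d_model: int = 64, d_hidden: int = 128,
                 num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            TinyExpert(d_model, d_hidden) for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # one routing score per expert
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                          # (tokens, experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)  # pick k experts per token
        gates = F.softmax(top_vals, dim=-1)              # normalise the selected scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        return out


tokens = torch.randn(16, 64)      # 16 tokens, model width 64
print(MoELayer()(tokens).shape)   # torch.Size([16, 64])
```

Only the k selected experts run for each token; the rest of the expert pool contributes parameters but no compute for that input.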
How does MoE differ from traditional large language model architectures?

MoE differs from traditional large language model architectures by using a sparse approach: only a subset of the model's components, the "experts," is activated for each input. This allows more efficient pre-training and faster inference while supporting a much larger total model size. In contrast, traditional dense models apply the entire model's capacity to every input, which can be less computationally efficient.
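
A rough back-of-the-envelope calculation illustrates the difference. The layer sizes below (d_model, d_hidden, expert counts) are assumptions chosen for the example, not figures from any specific model; they show how a sparse MoE layer can hold many times the parameters of a dense feed-forward block while spending only a small multiple of its per-token compute.

```python
# Approximate per-token FFW compute: dense layer vs. sparse MoE layer.
# Sizes are illustrative assumptions only.
d_model, d_hidden = 4096, 16384
num_experts, k = 64, 2            # MoE: 64 experts, 2 active per token

# A feed-forward block does roughly 2 * d_model * d_hidden FLOPs per matrix
# (up-projection and down-projection).
dense_flops = 2 * (2 * d_model * d_hidden)
dense_params = 2 * d_model * d_hidden

# Each expert is assumed as wide as the dense FFW, so total parameters grow
# by num_experts, but only k experts run for a given token.
moe_params = num_experts * dense_params
moe_flops = k * dense_flops

print(f"dense FFW : {dense_params:,} params, {dense_flops:,} FLOPs/token")
print(f"MoE FFW   : {moe_params:,} params, {moe_flops:,} FLOPs/token")
# The MoE layer holds 64x the parameters for only 2x the per-token compute.
```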
What are some limitations of current MoE techniques?

Current MoE techniques have several limitations. Routers are typically fixed for a specific number of experts and must be readjusted when new experts are added. Experts are also often as large as the feed-forward (FFW) layers they replace, which restricts how far the expert count can scale. Additionally, some successful MoE methods process tokens from different positions in a sequence together, potentially leading to information leakage.
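
The first limitation is visible in how a conventional learned router is usually built: a linear layer whose output dimension equals the number of experts. The sketch below illustrates this common design under that assumption; it is not code from the PEER paper.

```python
# Why a conventional router is tied to a fixed expert count: its output
# dimension equals num_experts, so growing the expert pool changes the
# layer's shape and its learned weights no longer carry over.
import torch.nn as nn

num_experts = 8
router = nn.Linear(512, num_experts)            # one logit per existing expert

# Adding experts later means replacing the router with a larger layer
# (and re-learning its weights) rather than reusing the existing one.
expanded_router = nn.Linear(512, num_experts + 8)
print(router.weight.shape, expanded_router.weight.shape)
# torch.Size([8, 512]) torch.Size([16, 512])
```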