
Mixture-of-Experts (MoE) is a machine learning technique that divides a model into separate sub-networks, or "experts," each specializing in a subset of the input data. It allows large-scale models to reduce computation costs during pre-training and to run inference faster by activating only the experts needed for a given input. MoE architectures are used in several popular large language models, including Mixtral, DBRX, Grok, and reportedly GPT-4.
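
To make the routing idea concrete, here is a minimal PyTorch sketch of an MoE layer with top-k gating. The names (MoELayer, num_experts, top_k) and the loop-based dispatch are illustrative assumptions, not the implementation of any particular model; production systems use batched, load-balanced routing.

```python
# Minimal sketch of a top-k routed MoE layer (illustrative, not a specific model's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is an independent feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                           # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # normalize over the chosen experts only
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; all others are skipped.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

In this sketch, each token is processed by only two of the eight experts, weighted by the router's normalized scores, which is the selective activation described above.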

MoE differs from traditional dense large language model architectures by using a sparse approach: only a subset of the model's components, the "experts," is used for each input. This allows for more efficient pre-training and faster inference while supporting a larger total parameter count. In contrast, a dense model applies its entire capacity to every input, which is less computationally efficient at a comparable size.
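
The sparse-versus-dense trade-off can be seen with some rough parameter arithmetic. The numbers below (8 experts, 2 active per token, and the layer dimensions) are illustrative assumptions, not a published configuration.

```python
# Rough active-parameter arithmetic for the sparse vs. dense contrast above.
d_model, d_hidden = 4096, 14336
num_experts, top_k = 8, 2

dense_ffn_params = 2 * d_model * d_hidden            # one dense FFN block (weights only)
moe_total_params = num_experts * dense_ffn_params    # parameters stored in the MoE layer
moe_active_params = top_k * dense_ffn_params         # parameters actually used per token

print(f"total MoE params:  {moe_total_params / 1e6:.0f}M")
print(f"active per token:  {moe_active_params / 1e6:.0f}M "
      f"({moe_active_params / moe_total_params:.0%} of the layer)")
```

Under these assumptions, the model stores roughly four times more feed-forward parameters than it applies to any single token, which is how MoE decouples model capacity from per-token compute.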

Current MoE techniques have limitations. Routers are typically fixed for a specific number of experts and must be readjusted or retrained when new experts are added. Experts are also often as large as the feed-forward (FFW) layers they replace, which can restrict scalability. Additionally, some successful MoE routing methods process tokens from different positions in a sequence together, which can leak information between positions.
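
The fixed-router limitation follows from the router's shape: its output dimension hard-codes the expert count. The snippet below is a hypothetical illustration of that coupling, continuing the layer sketch above; the grow-and-copy step is an assumption about one way readjustment could be done, not a standard procedure.

```python
# Sketch of why a fixed router ties the model to a set number of experts (illustrative).
import torch
import torch.nn as nn

d_model, num_experts = 4096, 8
router = nn.Linear(d_model, num_experts)   # output dimension hard-codes the expert count

# Adding experts leaves the router's weight matrix the wrong shape: it must be
# re-created (and typically re-trained), e.g. by copying the old rows and
# initializing rows for the new experts.
new_num_experts = 16
new_router = nn.Linear(d_model, new_num_experts)
with torch.no_grad():
    new_router.weight[:num_experts].copy_(router.weight)
    new_router.bias[:num_experts].copy_(router.bias)
```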