
Wider feedforward (FFW) layers in transformer models increase computational cost and activation memory linearly with their hidden width, which poses a scaling challenge: the wider the model, the more expensive it becomes to train and deploy for real-world workloads such as language modeling and other natural language processing tasks.
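To make the linear scaling concrete, the minimal sketch below estimates parameters, per-token FLOPs, and activation elements for a single FFW block. The function name `ffw_cost` and the dimension values are illustrative assumptions, not taken from the original text.

```python
# Minimal sketch (illustrative values only): FFW cost grows linearly with its hidden width d_ff.

def ffw_cost(d_model: int, d_ff: int, batch: int, seq_len: int):
    """Rough parameter / FLOP / activation-memory estimates for one FFW block."""
    params = 2 * d_model * d_ff                  # two projection matrices (biases ignored)
    flops_per_token = 2 * params                 # one multiply-add per weight per token
    activation_elems = batch * seq_len * d_ff    # hidden activations kept for the backward pass
    return params, flops_per_token, activation_elems

for d_ff in (4096, 8192, 16384):                 # doubling d_ff doubles every quantity
    print(d_ff, ffw_cost(d_model=1024, d_ff=d_ff, batch=8, seq_len=2048))
```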

Mixture-of-Experts (MoE) architectures reduce computational cost by activating only a subset of expert modules for each input rather than the entire model [4]. This decoupling of parameter count from per-token compute makes MoE scalable and adaptable to large, diverse datasets [4], and the sparse activation of experts allows computational resources to be used efficiently.
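As a rough illustration of sparse expert activation, here is a minimal top-k routing sketch in PyTorch. The class `TinyMoE`, its hyperparameters, and the looped dispatch are assumptions made for clarity, not any specific system's implementation.

```python
# Minimal top-k MoE routing sketch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        scores = self.router(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # normalize over the selected experts
        out = torch.zeros_like(x)
        # Each expert processes only the tokens routed to it -> sparse activation.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(16, 64)).shape)                # torch.Size([16, 64])
```

Only `top_k` experts run per token, so per-token compute stays roughly constant even as the total number of experts (and therefore parameters) grows.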

Existing MoE models, however, face computational and optimization challenges when scaled beyond a small number of experts. Their efficiency gains often plateau as model size grows under a fixed budget of training tokens [1]. These limitations keep MoEs from reaching their full potential, especially in tasks that require extensive or continual learning.