The primary challenges in scaling up Large Language Models (LLMs) such as Mistral, Gemma, and Llama are computational cost and affordability. These models are dense, meaning every parameter participates in every inference step, which requires significant computational resources and makes broad, affordable deployment difficult.
One potential solution is conditional computation, which avoids unnecessary calculations by activating only the parts of the model that are relevant to the current input. There are two main ways to implement it: the Mixture-of-Experts (MoE) strategy and activation functions with intrinsic sparsity, such as ReLU.
However, many LLMs use activation functions like GELU and Swish, which produce few exact zeros and are therefore harder to accelerate with conditional computation. ReLUfication, a technique that replaces the original activation function with ReLU during pretraining, has been proposed to address this issue. Still, it often fails to reach adequate sparsity levels and may degrade performance.
To overcome these inefficiencies, a new activation function called dReLU has been proposed; it addresses the inefficiency caused by negative activations in the GLU component. Additionally, combining MoE approaches with ReLU-induced sparsity may yield further efficiency gains.
The ReLU (Rectified Linear Unit) activation function contributes to sparsity and efficiency in Large Language Models (LLMs) by outputting zero for all non-positive inputs. This leaves many neurons dormant for a given input, so their contribution to the computation can be skipped, improving inference efficiency.
ReLU-based LLMs have been shown to exhibit sparse activation, where the activation values of most neurons are zero. This property enables efficient inference of LLMs by leveraging sparse computation techniques. Sparse computation refers to the ability to dynamically skip the computation of inactive neurons, reducing computational cost and improving inference efficiency, especially in resource-constrained scenarios.
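To make this concrete, here is a minimal PyTorch sketch (with hypothetical dimensions) of how ReLU-induced sparsity can be exploited: after the ReLU, only the columns of the down-projection that correspond to non-zero activations need to participate in the matrix multiply.

```python
# Minimal sketch of ReLU-induced sparse computation in an FFN.
# Dimensions and weights are illustrative, not from any real model.
import torch

hidden, inter = 1024, 4096           # hypothetical model dimensions
x = torch.randn(hidden)              # one token's hidden state
W_up = torch.randn(inter, hidden)    # up-projection weights
W_down = torch.randn(hidden, inter)  # down-projection weights

a = torch.relu(W_up @ x)             # ReLU zeroes out non-positive pre-activations
active = a.nonzero(as_tuple=True)[0]
sparsity = 1.0 - active.numel() / inter
print(f"inactive neurons: {sparsity:.1%}")

# Dense path: every column of W_down is used.
y_dense = W_down @ a

# Sparse path: gather only the columns for active neurons and skip the rest.
y_sparse = W_down[:, active] @ a[active]

assert torch.allclose(y_dense, y_sparse, atol=1e-3)
```

Real systems typically pair this gather-based skipping with predictors that estimate the active neuron set before the FFN runs, but the indexing trick above captures the basic saving.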
Recent studies have explored ReLU and its variants as the activation function in LLMs to achieve activation sparsity and inference acceleration. One proposed technique, ReLUfication, substitutes ReLU for the original activation function during pretraining. However, performance may suffer, and the approach often falls short of the desired sparsity levels.
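As an illustration, the sketch below (hypothetical module names, PyTorch) shows the mechanical part of ReLUfication: swapping the FFN activation of an existing model for ReLU, after which the model would normally be further pretrained so the weights adapt.

```python
# Minimal sketch of ReLUfication on a toy FFN stack; not a real model.
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, hidden=1024, inter=4096):
        super().__init__()
        self.up = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)
        self.act = nn.SiLU()  # smooth activation (Swish), few exact zeros

    def forward(self, x):
        return self.down(self.act(self.up(x)))

model = nn.Sequential(FFN(), FFN())  # stand-in for a stack of LLM blocks

# ReLUfication: replace every FFN's activation with ReLU in place.
for module in model.modules():
    if isinstance(module, FFN):
        module.act = nn.ReLU()

# From here the model would be continually pretrained to recover quality;
# as noted above, this often still falls short of the desired sparsity.
```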
To address these challenges, researchers have introduced a new activation function called dReLU, which tackles the inefficiency of negative activations in the GLU component. Experiments pretraining small-scale LLMs with both dReLU and SwiGLU have shown that dReLU models perform on par with their SwiGLU counterparts while reaching sparsity levels approaching 90%.
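The sketch below contrasts a SwiGLU FFN with a dReLU FFN, assuming the formulation in which ReLU is applied to both the gate and up projections of the GLU. Even with random weights the dReLU product is already mostly zero; trained models push the sparsity considerably higher.

```python
# Minimal sketch contrasting SwiGLU and dReLU FFNs; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUFFN(nn.Module):
    def __init__(self, hidden=1024, inter=4096, use_drelu=False):
        super().__init__()
        self.gate = nn.Linear(hidden, inter, bias=False)
        self.up = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)
        self.use_drelu = use_drelu

    def forward(self, x):
        if self.use_drelu:
            # dReLU: ReLU on both branches -> many exact zeros in the product
            h = F.relu(self.gate(x)) * F.relu(self.up(x))
        else:
            # SwiGLU: SiLU gate times identity up branch -> mostly non-zero
            h = F.silu(self.gate(x)) * self.up(x)
        return self.down(h)

x = torch.randn(8, 1024)
for name, ffn in [("SwiGLU", GLUFFN()), ("dReLU", GLUFFN(use_drelu=True))]:
    with torch.no_grad():
        if ffn.use_drelu:
            inter = F.relu(ffn.gate(x)) * F.relu(ffn.up(x))
        else:
            inter = F.silu(ffn.gate(x)) * ffn.up(x)
    print(name, "zero fraction:", (inter == 0).float().mean().item())
```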
In summary, the ReLU activation function contributes to sparsity and efficiency in LLMs by producing zero for non-positive inputs, resulting in many dormant neurons and improved inference efficiency. Techniques like ReLUfication and the use of ReLU variants, such as dReLU, have been proposed to enhance activation sparsity and optimize LLM inference.
The Mixture-of-Experts (MoE) strategy is a method used in Large Language Models (LLMs) to implement conditional computation and increase efficiency. It introduces sparsity by selectively activating specific model components based on the input, thus reducing unnecessary calculations.
In the MoE approach, the model is divided into separate sub-networks, called "experts", each specializing in a subset of the input data. A gating network, or router, is trained to activate only the most suitable expert(s) for a given input. This expert routing allows selective activation of model components, so the computation per token grows far more slowly than the total parameter count.
By placing predefined constraints on the model's structure, such as fixing the number of experts activated for each input (top-k routing), MoE effectively implements conditional computation. The model uses only a subset of its parameters for each input, reducing computational requirements while maintaining performance.
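The following is a minimal, unoptimized PyTorch sketch of such top-k routing: a gating network scores the experts for each token, and only the k highest-scoring experts are evaluated. The dimensions, expert architecture, and per-expert loop are illustrative only.

```python
# Minimal sketch of top-k expert routing in an MoE layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, hidden=512, inter=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, inter), nn.ReLU(), nn.Linear(inter, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):                     # x: (tokens, hidden)
        scores = self.router(x)               # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the selected experts
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token; the rest are skipped.
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():
                out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

tokens = torch.randn(16, 512)
print(MoELayer()(tokens).shape)  # torch.Size([16, 512])
```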
Overall, the MoE strategy enables the model to have a larger capacity without a corresponding increase in computation, making it an effective approach for improving the efficiency of LLMs.