The key alignment strategies used to ensure the safety of Large Language Models (LLMs) typically combine supervised fine-tuning (SFT) with preference-based optimization techniques such as Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF). These strategies aim to steer models to refuse hazardous inputs and to reduce the likelihood of producing harmful content.
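To make the preference-based part of this recipe concrete, here is a minimal sketch of a DPO-style loss, assuming per-sequence log-probabilities for a preferred (safe) and a rejected (harmful) response have already been computed under the policy being fine-tuned and a frozen reference model. It is an illustration of the general technique, not the training code from the study.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # Log-ratios of the policy vs. the frozen reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen (safe) response over the rejected one.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```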
A phenomenon the researchers call shallow safety alignment significantly limits the effectiveness of this process: the safety alignment in these models often affects only the first few output tokens. This makes the models particularly vulnerable to simple exploits, because once the initial output tokens are steered away from a safe refusal, the rest of the generated output can veer into dangerous territory.
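The fragility can be probed with a simple "prefill" test: force the first tokens of the assistant's reply to a compliant opening and let the model continue. The sketch below illustrates the idea; the model name, the Llama-2-style `[INST]` template, and the placeholder strings are assumptions for illustration, not the paper's evaluation setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # placeholder; any chat model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

harmful_prompt = "<harmful request>"            # placeholder, not an actual request
forced_prefix = "Sure, here is"                 # overrides the usual refusal opening

# Append a forced opening to the assistant turn, so the model only has to
# continue from a non-refusal prefix rather than generate one itself.
text = f"[INST] {harmful_prompt} [/INST] {forced_prefix}"
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```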
Through systematic experiments, the research shows that the main difference in safety behavior between aligned and unaligned models lies in how they model the first few tokens of their outputs. This shallow alignment explains the effectiveness of several attack techniques that focus on starting a harmful generation trajectory. Adversarial suffix attacks and fine-tuning attacks, for instance, work largely by shifting the distribution of these initial response tokens toward a harmful beginning.
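The kind of per-position comparison the study describes can be sketched as follows: score the same response under an aligned model and its unaligned base model and measure where their next-token distributions diverge. Model names and placeholder strings are assumptions; in the study's measurements the divergence is concentrated on the earliest response tokens.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

aligned_name, base_name = "meta-llama/Llama-2-7b-chat-hf", "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(aligned_name)
aligned = AutoModelForCausalLM.from_pretrained(aligned_name)
base = AutoModelForCausalLM.from_pretrained(base_name)

prompt = "<prompt>"        # placeholder input
response = "<response>"    # placeholder response scored token by token
ids = tok(prompt + response, return_tensors="pt").input_ids
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    p = F.log_softmax(aligned(ids).logits, dim=-1)
    q = F.log_softmax(base(ids).logits, dim=-1)

# Per-position KL(aligned || base) over the response tokens.
kl = (p.exp() * (p - q)).sum(-1)[0, prompt_len - 1:]
print(kl)
```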
The study demonstrates that a model's alignment can effectively be reversed merely by changing these starting tokens, underscoring why even small adjustments to the model can compromise it. The authors argue that future alignment techniques should extend their effects deeper into the output, and they present a data augmentation technique that trains on safety alignment examples in which a response begins harmfully but recovers into a safe refusal.
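A minimal sketch of that style of augmentation is shown below. The field names, the truncation length, and the recovery phrasing are assumptions for illustration, not the paper's dataset: each augmented target starts with a truncated harmful continuation and then recovers into a refusal, so the model learns to steer back to safety even when the first response tokens have already gone wrong.

```python
RECOVERY = ("I cannot help with that. Providing this information could cause harm, "
            "so I have to decline.")

def augment(example, max_prefix_tokens=16):
    """Build a (prompt, target) pair whose target = harmful prefix + safe refusal."""
    harmful_prefix = " ".join(example["harmful_response"].split()[:max_prefix_tokens])
    return {
        "prompt": example["prompt"],
        "target": harmful_prefix + " ... " + RECOVERY,  # recovery back to a refusal
    }

# Example usage with a hypothetical record:
record = {"prompt": "<harmful request>", "harmful_response": "<harmful answer text>"}
print(augment(record)["target"])
```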
By widening the gap between aligned and unaligned models at deeper token positions, this method seeks to improve robustness against commonly used exploits. To mitigate fine-tuning attacks, the study also proposes a constrained optimization objective centered on preventing large shifts in the probabilities of the initial tokens. Together, these results illustrate how shallow current model alignments are and offer a possible defense against fine-tuning attacks.
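In the spirit of that constrained objective, one way to sketch it is to regularize the usual fine-tuning loss with a per-position penalty that keeps the fine-tuned model's early-token distributions close to the initially aligned model. The weighting schedule and the KL penalty form below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def constrained_ft_loss(policy_logits, ref_logits, target_ids,
                        early_weight=5.0, decay=0.8):
    """policy_logits, ref_logits: (seq, vocab); target_ids: (seq,)."""
    # Standard fine-tuning term: cross-entropy on the new task's targets.
    ce = F.cross_entropy(policy_logits, target_ids)

    # Per-position KL(policy || reference), penalizing drift in token probabilities.
    p = F.log_softmax(policy_logits, dim=-1)
    q = F.log_softmax(ref_logits, dim=-1)
    kl_per_pos = (p.exp() * (p - q)).sum(-1)  # shape: (seq,)

    # Heavier weights on the earliest positions, decaying with depth, so the
    # initial tokens of the response are the hardest to move.
    positions = torch.arange(kl_per_pos.shape[0], device=kl_per_pos.device).float()
    weights = early_weight * decay ** positions
    return ce + (weights * kl_per_pos).mean()
```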
In conclusion, this study introduces the distinction between shallow and deep safety alignment, demonstrating that current state-of-the-art approaches are comparatively shallow and that this shallowness underlies a number of known exploits. It presents preliminary approaches to mitigate these problems, and the team suggests that future research explore techniques ensuring that safety alignment extends beyond just the first few tokens.
The study, by researchers from Princeton University and Google DeepMind, thus attributes the primary difference in safety behavior between aligned and unaligned models to how they model only the first few tokens of their outputs, the phenomenon it terms shallow safety alignment. Because shallow alignment allows an entire generation to drift into dangerous territory once the initial tokens diverge from a safe refusal, the authors call for future alignment techniques that push safety effects deeper into the output, well beyond the first few tokens.