RLHF and RLAIF play a central role in multilingual LLM safety by mitigating toxicity in non-English languages. Preference tuning performed in one language has shown promise in reducing toxicity of LLM generations across many others, and this zero-shot cross-lingual generalization offers an efficient route to multilingual toxicity mitigation, addressing a critical challenge for LLM safety in diverse linguistic contexts.
The main challenges in multilingual LLM toxicity mitigation are the scarcity of annotated datasets across languages, the resource-intensive nature of existing solutions, and the need for effective cross-lingual generalization techniques. Current approaches often rely on data translated from English, which may not fully capture the complexities of toxicity in each target language.
Researchers address LLM toxicity in non-English languages with techniques such as cross-lingual generalization of Reinforcement Learning from Human Feedback (RLHF) or from AI Feedback (RLAIF), and by translating toxic and non-toxic data from English into the target languages. A recent study shows that Direct Preference Optimization (DPO) trained on English data alone can significantly reduce toxicity in LLM generations across different languages, demonstrating zero-shot cross-lingual generalization; a sketch of the underlying objective follows below.
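To make the DPO setup concrete, the sketch below implements the standard DPO objective on preference pairs where the "chosen" continuation is non-toxic and the "rejected" one is toxic. It is a minimal illustration, not code from the cited study: the function name, the beta value, and the toy inputs are assumptions, and in practice the log-probabilities would come from the fine-tuned policy and a frozen reference model scored on English preference data.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a tensor of summed per-sequence log-probabilities under
    the trainable policy or the frozen reference model; `beta` controls how
    far the policy is allowed to drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between non-toxic (chosen) and toxic (rejected)
    # continuations to be positive.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities standing in for model outputs.
    n = 4
    loss = dpo_loss(torch.randn(n), torch.randn(n),
                    torch.randn(n), torch.randn(n))
    print(f"DPO loss on a toy batch: {loss.item():.4f}")
```

Under the zero-shot cross-lingual claim, only English prompts and chosen/rejected pairs are needed at training time; the toxicity reduction is then evaluated on generations in other languages without any further multilingual fine-tuning.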