
TowerLLM, a new AI model created by Unbabel, has been reported to outperform OpenAI's GPT-4o and other commercially available AI systems in translating between English and six widely spoken European and Asian languages. Unbabel tested TowerLLM against GPT-4o, GPT-4, GPT-3.5, and competing models from Google and DeepL, and TowerLLM achieved higher accuracy in almost every case. Its largest margin was on English-Korean translations, where it surpassed OpenAI's best models by about 1.5%. On English-German translations, GPT-4 and GPT-4o edged it out by a fraction of a percentage point. TowerLLM was also tested on translating documents from specialized professional domains such as finance, medicine, law, and technical writing, where it performed between 1% and 2% better than OpenAI's best models.
Unbabel's results have not been independently verified. If confirmed, they would suggest that GPT-4, long considered the top-performing LLM on most language benchmarks, can now be surpassed by newer AI systems trained with different methods.

TowerLLM was designed, trained, and optimized specifically for translation, which Unbabel says makes it the best-performing translation LLM commercially available. According to the company, it delivers higher translation quality than general-purpose LLMs such as GPT-4o and GPT-3.5, and consistently outperforms Google and DeepL, because it was built from scratch to be multilingual.

TowerLLM was trained for multilingual tasks in two steps. First, it was pretrained on a large public dataset of multilingual text, which Unbabel reports let the model outperform some competing open-source AI models on reasoning tasks in multiple languages. Unlike the training corpora of most LLMs, which skew heavily toward English, this pretraining set was deliberately weighted toward multilingual content.
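Unbabel has not published TowerLLM's exact data mix, but the idea of a deliberately multilingual pretraining corpus can be sketched as weighted sampling across per-language shards. Everything below, from the language list to the proportions and shard paths, is illustrative rather than reported:

```python
import random

# Hypothetical pretraining mix: rather than the English-heavy split
# typical of LLM corpora, weight non-English shards more heavily.
# Languages, paths, and proportions are illustrative only.
language_shards = {
    "en": ("shards/en", 0.35),
    "de": ("shards/de", 0.13),
    "fr": ("shards/fr", 0.13),
    "es": ("shards/es", 0.13),
    "ko": ("shards/ko", 0.13),
    "zh": ("shards/zh", 0.13),
}

def sample_language(rng: random.Random) -> str:
    """Pick the language of the next pretraining document by weight."""
    langs = list(language_shards)
    weights = [weight for _, weight in language_shards.values()]
    return rng.choices(langs, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_language(rng) for _ in range(10)])
```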
Next, TowerLLM was fine-tuned on a carefully curated dataset of high-quality translations between language pairs. To curate it, Unbabel used COMETKiwi, another of its AI models, trained to assess translation quality. This fine-tuning step was crucial to the model's superior translation performance.
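COMETKiwi is available through the open-source `unbabel-comet` package, so the general shape of this kind of quality filtering can be sketched. The checkpoint name, score threshold, and example pairs below are illustrative; Unbabel has not disclosed which model variant or cutoff it used internally:

```python
from comet import download_model, load_from_checkpoint

# Load a reference-free COMET quality-estimation model.
# "Unbabel/wmt22-cometkiwi-da" is a public COMETKiwi checkpoint;
# the exact model Unbabel used for curation is not disclosed.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Candidate translation pairs: source sentence plus translation.
# COMETKiwi scores quality without needing a reference translation.
candidates = [
    {"src": "The agreement enters into force next month.",
     "mt": "Der Vertrag tritt naechsten Monat in Kraft."},
    {"src": "Shares fell sharply after the announcement.",
     "mt": "Banane gelb schnell."},  # a deliberately bad pair
]

scores = model.predict(candidates, batch_size=8, gpus=0).scores

# Keep only pairs above a quality threshold; 0.8 is an illustrative
# cutoff, not a value reported by Unbabel.
THRESHOLD = 0.8
curated = [pair for pair, s in zip(candidates, scores) if s >= THRESHOLD]
print(f"kept {len(curated)} of {len(candidates)} pairs")
```

Filtering with a learned quality estimator like this lets a small, clean fine-tuning set be distilled from a much larger, noisier pool of translation pairs.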
In short, Unbabel credits this combination of broad multilingual pretraining and quality-filtered translation fine-tuning for TowerLLM's performance on translation tasks.