Innodata’s Comprehensive Benchmarking of Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Hallucination Propensity
What are the four key areas of LLM performance evaluated?

The four key areas of LLM performance evaluated are factuality, toxicity, bias, and propensity for hallucinations. Factuality assesses the model's ability to provide accurate information. Toxicity measures its ability to avoid producing offensive content. Bias evaluates the presence of religious, political, gender, or racial prejudice. Propensity for hallucinations checks whether the model generates factually incorrect or nonsensical information.
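The four dimensions above can be pictured as a per-model scorecard. The sketch below is purely illustrative — the field names and thresholds are hypothetical, not Innodata's actual schema:

```python
from dataclasses import dataclass

@dataclass
class EvalScorecard:
    """Illustrative scorecard for the four evaluated dimensions.

    Hypothetical schema: higher is better for factuality; lower is
    better for the three failure rates.
    """
    factuality: float          # share of answers grounded in accurate facts
    toxicity_rate: float       # share of responses with offensive content
    bias_rate: float           # share of responses showing prejudice
    hallucination_rate: float  # share of responses with fabricated claims

    def passes(self, min_factuality: float = 0.8,
               max_bad_rate: float = 0.05) -> bool:
        # Hypothetical go/no-go thresholds for a release gate.
        return (self.factuality >= min_factuality
                and max(self.toxicity_rate, self.bias_rate,
                        self.hallucination_rate) <= max_bad_rate)
```

A model scoring `EvalScorecard(0.9, 0.01, 0.02, 0.03)` would pass this hypothetical gate, while one with factuality below 0.8 would not.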
Which LLM showed strong performance in factuality tests?

Llama2 demonstrated strong performance in factuality tests, excelling in tasks that required grounding answers in verifiable facts. The model was evaluated using a mix of summarization tasks and factual consistency checks, such as the Correctness of Generated Summaries and the Factual Consistency of Abstractive Summaries.
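A factual consistency check of this kind asks whether claims in a generated summary are traceable to the source document. The minimal sketch below uses lexical overlap as a crude proxy; real benchmarks score consistency with trained entailment or QA models, not keyword matching:

```python
def consistency_score(source: str, summary: str) -> float:
    """Fraction of summary content words that also appear in the source.

    A crude proxy for factual grounding: content in a faithful summary
    should be traceable to the source text.
    """
    stopwords = {"the", "a", "an", "of", "to", "in", "and", "is", "was"}
    src_words = set(source.lower().split())
    summ_words = [w for w in summary.lower().split() if w not in stopwords]
    if not summ_words:
        return 0.0
    return sum(w in src_words for w in summ_words) / len(summ_words)
```

For example, a summary composed entirely of words drawn from the source scores 1.0, while a summary introducing unsupported content scores lower.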
How was toxicity assessed in the LLMs?

Toxicity in LLMs was assessed using various prompts designed to elicit potentially toxic responses, and the models' ability to avoid producing offensive or inappropriate content was evaluated. Llama2 demonstrated robust performance in handling toxic content, properly censoring inappropriate language when instructed. However, it showed room for improvement in maintaining safety across multi-turn conversations.
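An evaluation of this shape runs elicitation prompts through a model and measures the share of responses flagged as toxic. The sketch below uses a placeholder keyword blocklist purely for illustration; production evaluations rely on trained toxicity classifiers rather than word lists:

```python
# Placeholder blocklist; a real evaluation would use a trained
# toxicity classifier, not keyword matching.
BLOCKED_TERMS = {"badword1", "badword2"}

def is_toxic(response: str) -> bool:
    """Flag a response if it contains any blocklisted term."""
    return bool(set(response.lower().split()) & BLOCKED_TERMS)

def toxicity_rate(responses: list[str]) -> float:
    """Share of model responses flagged as toxic (0.0 if none given)."""
    if not responses:
        return 0.0
    return sum(is_toxic(r) for r in responses) / len(responses)
```

Extending this to multi-turn safety would mean scoring each turn of a conversation, since a model that refuses a toxic request once may still be coaxed into unsafe output over several turns.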