The primary goals of developing large language models (LLMs) in the field of artificial intelligence are to understand and generate human language. These models aim to perform tasks such as translation, summarization, and question-answering with high accuracy. By enhancing the reasoning capabilities of LLMs, researchers seek to improve their performance on complex multi-step reasoning tasks, such as solving mathematical problems. Ultimately, the development of LLMs aims to create AI systems capable of performing tasks that typically require human intelligence.
Current LLMs make several common kinds of errors when performing tasks that require multiple logical steps. These errors can be broadly grouped into three types: calculation errors, numeric errors, and logical errors.
Calculation Errors: These errors occur when an LLM performs an arithmetic operation incorrectly. For example, in arithmetic reasoning tasks, a model may set up the right operation but still get the addition, subtraction, multiplication, or division wrong, such as stating that 12 × 4 = 46.
Numeric Errors: These errors arise when LLMs misread or mishandle numerical values, for example copying a number incorrectly from the problem, misplacing a decimal point, or misremembering a numerical fact.
Logical Errors: These errors occur when LLMs reason incorrectly or draw invalid inferences. Logical errors can stem from misinterpreting the given data, choosing the wrong operation for a step, or failing to consider all relevant information. They can also result from the model's limited grasp of complex logical relationships or its inability to infer the required operation from the context of the question.
These errors can lead to incorrect final answers in multi-step reasoning tasks, highlighting the need for methods that can more accurately guide LLMs through each step of the reasoning process.
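As a minimal illustration of the first category, calculation errors can sometimes be surfaced mechanically by re-evaluating the arithmetic claims a model writes down. The sketch below is not taken from any particular system; the regular expression, the assumed "a op b = c" step format, and names such as find_calculation_errors are illustrative assumptions.

```python
import re

# Minimal, illustrative sketch (not from any cited system): extract simple
# "a <op> b = c" claims from a chain-of-thought and re-evaluate them exactly.
STEP_PATTERN = re.compile(
    r"(?P<a>-?\d+(?:\.\d+)?)\s*(?P<op>[+\-*/])\s*(?P<b>-?\d+(?:\.\d+)?)\s*=\s*(?P<c>-?\d+(?:\.\d+)?)"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def find_calculation_errors(chain_of_thought: str, tol: float = 1e-9):
    """Return the arithmetic claims whose stated result does not match re-evaluation."""
    errors = []
    for match in STEP_PATTERN.finditer(chain_of_thought):
        a, b, c = float(match["a"]), float(match["b"]), float(match["c"])
        expected = OPS[match["op"]](a, b)
        if abs(expected - c) > tol:
            errors.append((match.group(0), expected))
    return errors

if __name__ == "__main__":
    cot = "Each box holds 12 apples, so 12 * 4 = 46 apples in total. Then 46 - 6 = 40 remain."
    for claim, expected in find_calculation_errors(cot):
        print(f"Suspect step: '{claim}' (re-evaluated: {expected})")
```

Note that the second step in this example ("46 - 6 = 40") is arithmetically consistent even though it propagates the earlier mistake, which is one reason such simple checks are not sufficient and learned reward models, discussed next, are useful.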
Outcome Reward Models (ORMs) and Process Reward Models (PRMs) play a crucial role in training LLMs by providing feedback on their performance. Both types of reward models evaluate the responses generated by an LLM and assign a score based on their correctness.
ORMs focus on evaluating the final output of the LLM, considering only the end result of the model's chain-of-thought. They judge whether the final answer is correct or incorrect, disregarding the intermediate steps taken by the model. ORMs are typically trained with supervised learning, where the model learns from labeled data to predict the correctness of the final answer.
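The coarseness of outcome supervision is easiest to see in how an ORM training example is built: one label per sampled solution, derived solely from the final answer. The sketch below is a hypothetical illustration rather than the construction used in any specific paper; it assumes solutions end with a line of the form "Answer: <value>", and names such as OrmExample and build_orm_example are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical sketch of ORM training-example construction:
# one label per sampled solution, based only on the final answer.

@dataclass
class OrmExample:
    question: str
    solution: str  # the model's full chain-of-thought plus final answer
    label: int     # 1 if the final answer is correct, 0 otherwise


def extract_final_answer(solution: str) -> str:
    """Assumes the solution ends with a line of the form 'Answer: <value>'."""
    for line in reversed(solution.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return ""


def build_orm_example(question: str, solution: str, reference_answer: str) -> OrmExample:
    # The intermediate steps are never inspected: only the final answer
    # determines the label, which is what makes ORM supervision coarse.
    label = int(extract_final_answer(solution) == reference_answer)
    return OrmExample(question, solution, label)


if __name__ == "__main__":
    solution = "12 * 4 = 46\n46 - 6 = 40\nAnswer: 40"
    example = build_orm_example("How many apples remain?", solution, reference_answer="42")
    print(example.label)  # 0: the ORM only learns that the whole solution was wrong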
PRMs, on the other hand, provide finer-grained supervision by evaluating each step in the LLM's chain-of-thought. Instead of considering only the final answer, PRMs assess the correctness of each reasoning step, which lets them give more precise feedback and pinpoint exactly where the model went wrong. PRMs are typically trained with supervised learning on step-level correctness labels, and the resulting per-step rewards can then be used to guide or fine-tune the LLM during reinforcement learning.
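By contrast, process supervision attaches one label to every reasoning step, and a trained PRM's per-step scores can be aggregated into a solution-level score. The sketch below is again hypothetical (names such as PrmExample and score_solution are assumptions); it uses the product of step probabilities as one simple aggregation choice, with the minimum being another common, more conservative option.

```python
from dataclasses import dataclass
from typing import Callable, List
import math

# Hypothetical sketch of PRM supervision: one label per reasoning step.

@dataclass
class PrmExample:
    question: str
    steps: List[str]        # the chain-of-thought split into individual steps
    step_labels: List[int]  # 1 if the step is correct, 0 otherwise


def score_solution(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Aggregate per-step correctness probabilities into a solution-level score.

    The product treats the solution as correct only if every step is;
    the minimum is another common, more conservative choice.
    """
    step_scores = [step_scorer(step) for step in steps]
    return math.prod(step_scores)


if __name__ == "__main__":
    steps = ["12 * 4 = 46", "46 - 6 = 40", "Answer: 40"]

    def fake_prm(step: str) -> float:
        # Stand-in for a trained PRM: pretend it flags the first step as likely wrong.
        return 0.1 if step.startswith("12 * 4") else 0.9

    print(score_solution(steps, fake_prm))  # low overall score
```

Because the score is built from per-step judgments, a low overall score also indicates which step was at fault, which is exactly the finer-grained feedback described above.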
In summary, ORMs evaluate only the final output, while PRMs evaluate each step in the reasoning process. Both types of reward model provide the feedback needed to improve the reasoning capabilities of LLMs and to enhance their performance on complex multi-step reasoning tasks.