System 2 thinking is crucial for solving complex problems because it involves deliberate, analytical, and effortful cognitive processes that require conscious attention and reasoning. Unlike System 1 thinking, which is fast and intuitive but prone to biases and errors, System 2 thinking enables more accurate and logical problem-solving: it carefully evaluates information, considers multiple possibilities, and weighs the consequences of different actions. This leads to better decisions and more effective solutions in complex situations.
Q* estimates optimal Q-values using three methods: offline reinforcement learning, learning from rollouts, and approximation with a stronger LLM. Because these estimators are learned from training data, the framework works without task-specific modifications to the underlying model.
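As a rough illustration of the rollout-based estimator, the sketch below scores a candidate reasoning step by sampling several completions from the current state and keeping the best terminal reward as the Q-value estimate. The `generate_completion` and `reward` callables are hypothetical stand-ins for an LLM sampler and an outcome verifier, not interfaces from the Q* paper itself.

```python
from typing import Callable

def estimate_q_from_rollouts(
    state: str,                                  # partial reasoning trace so far
    action: str,                                 # candidate next reasoning step
    generate_completion: Callable[[str], str],   # hypothetical: LLM samples a completion
    reward: Callable[[str], float],              # hypothetical: scores a full trajectory
    num_rollouts: int = 8,
) -> float:
    """Estimate Q*(state, action) as the best terminal reward observed
    over several sampled completions (a sketch of the rollout-based
    estimator under the assumed interfaces above)."""
    prefix = state + action
    best = float("-inf")
    for _ in range(num_rollouts):
        trajectory = prefix + generate_completion(prefix)
        best = max(best, reward(trajectory))
    return best
```

Taking the maximum over rollouts, rather than the mean, matches the goal of approximating the *optimal* Q-value: we want the value of the best continuation reachable from this step, not the average one.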
The Q* framework enhances LLM multi-step reasoning by formalizing the process as a heuristic search problem and solving it with an A* search algorithm. It associates each state with an f-value, computed as a weighted sum of the aggregated utility of the path so far and a heuristic value. Q* introduces three methods for estimating optimal Q-values and employs plug-and-play Q-value models as heuristic functions, guiding LLMs to select the most promising next steps efficiently. This approach improves LLMs' ability to solve complex problems requiring in-depth, logical thinking beyond simple auto-regressive token generation.
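To make the search loop concrete, here is a minimal best-first search sketch. It assumes the weighted sum takes the form f = g + λ·h, with g as accumulated utility and h supplied by the Q-value model; the source only says "weighted sum," so the exact placement of the weight is an assumption. The `propose_steps`, `utility`, and `q_value` callables are hypothetical interfaces for the policy LLM, the per-step utility, and the learned Q-value heuristic.

```python
import heapq
from typing import Callable, List, Tuple

def q_star_search(
    initial_state: str,
    propose_steps: Callable[[str], List[str]],  # hypothetical: LLM proposes next steps
    utility: Callable[[str], float],            # hypothetical: per-step process utility
    q_value: Callable[[str], float],            # hypothetical: learned Q-value heuristic
    is_terminal: Callable[[str], bool],
    lam: float = 1.0,                           # weight on the heuristic term (assumed form)
    max_expansions: int = 100,
) -> str:
    """A*-style best-first search over reasoning states, always expanding
    the state with the highest f = g + lam * h. A sketch under the assumed
    interfaces above, not the paper's exact implementation."""
    # Max-heap via negated f-values; entries are (-f, g, state).
    frontier: List[Tuple[float, float, str]] = [
        (-lam * q_value(initial_state), 0.0, initial_state)
    ]
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_f, g, state = heapq.heappop(frontier)
        if is_terminal(state):
            return state
        for step in propose_steps(state):
            child = state + step
            child_g = g + utility(child)        # aggregated utility of the path
            f = child_g + lam * q_value(child)  # weighted sum: utility + heuristic
            heapq.heappush(frontier, (-f, child_g, child))
    # Fall back to the best remaining state if no terminal state was reached.
    return max(frontier, key=lambda e: -e[0])[2] if frontier else initial_state
```

The priority queue is what makes the Q-value models "plug-and-play": swapping in a different heuristic changes which branches get expanded first without touching the LLM that proposes the steps.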