
In the SaySelf framework, reasoning chains play a crucial role in enhancing the training process for LLMs. They represent the model's thought process for each query and are used to generate a model-specific dataset for supervised fine-tuning. Here's how reasoning chains are utilized in SaySelf:
Generation of Reasoning Chains: For every query, multiple reasoning chains (sequences of tokens) are sampled from the LLM to capture the model's thought process.
Clustering and Selection: The sampled reasoning chains are clustered by semantic similarity, and one instance is selected from each cluster to create a diverse, representative set of reasoning chains for training (see the sketch after this list).
Summarizing Uncertainty: GPT-4 is prompted to analyze the selected instances from the different clusters and to summarize, in plain language and from a first-person perspective, the model's uncertainty about specific knowledge.
Reinforcement Learning: The LLM's confidence estimate for each response is calibrated with reinforcement learning to ensure accurate confidence estimation. The reward function encourages accurate, high-confidence predictions and penalizes overconfidence in incorrect ones.
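A minimal sketch of the clustering-and-selection step is below. It assumes the reasoning chains have already been sampled from the target LLM, and it substitutes a generic sentence-embedding model and agglomerative clustering for the instruction-finetuned embedder described in the paper; the function name, embedding model, and distance threshold are illustrative.

```python
# Sketch of clustering sampled reasoning chains and picking one per cluster.
# The SentenceTransformer model is a stand-in for the instruction-finetuned
# text embedding model SaySelf uses; the threshold is an assumption.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering


def cluster_reasoning_chains(chains: list[str], distance_threshold: float = 0.3):
    """Group sampled reasoning chains by semantic similarity, one representative each."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder
    embeddings = embedder.encode(chains, normalize_embeddings=True)

    clustering = AgglomerativeClustering(
        n_clusters=None,                      # let the distance threshold decide
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    ).fit(embeddings)

    clusters = defaultdict(list)
    for chain, label in zip(chains, clustering.labels_):
        clusters[label].append(chain)

    # Keep the first chain of each cluster as its representative and remember
    # the cluster size, which feeds the confidence estimate later.
    return [(members[0], len(members)) for members in clusters.values()]
```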
By incorporating reasoning chains in this manner, SaySelf enables LLMs to provide self-reflective rationales that identify gaps in their knowledge and explain their confidence estimates, leading to more accurate and fine-grained confidence assessments.

The SaySelf training framework improves the precision and accuracy of confidence estimations in LLMs through a two-stage process. First, it performs supervised fine-tuning on a dataset that contains summaries of the inconsistencies between multiple reasoning chains. This dataset is created by sampling multiple responses from a vanilla LLM and clustering them by semantic similarity using an instruction-finetuned text embedding model; one instance is retained from each cluster. The confidence estimate is derived from the size of each cluster, which reflects the consistency among the different reasoning paths. For rationale generation, GPT-4 analyzes the inconsistencies among the selected responses from different clusters and summarizes the uncertainties in natural language from a first-person perspective.
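As an illustration of how cluster size can translate into a confidence label, the sketch below maps the fraction of sampled chains that fall into the selected answer's cluster to an integer score; the 1-10 scale and the rounding are assumptions, not the paper's exact recipe.

```python
# Illustrative mapping from cluster statistics to a training-time confidence
# label: a larger cluster (more reasoning paths agreeing on the same answer)
# yields a higher confidence. Scale and rounding are assumptions.
def confidence_from_cluster(cluster_size: int, total_samples: int, scale: int = 10) -> int:
    agreement = cluster_size / total_samples   # fraction of chains in this cluster
    return max(1, round(agreement * scale))    # map agreement to an integer score


# Example: 7 of 10 sampled chains landed in the answer's cluster -> confidence 7.
print(confidence_from_cluster(cluster_size=7, total_samples=10))  # 7
```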
The second stage of the SaySelf framework involves reinforcement learning from task supervision. A reward function is designed to encourage accurate, high-confidence predictions and penalize overconfidence in incorrect answers. The Proximal Policy Optimization (PPO) algorithm is employed to further calibrate the LLM's confidence estimates based on the reward function.
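The sketch below shows one possible shape for such a reward, not the paper's exact formula: correct answers earn a reward that grows with the stated confidence, while incorrect answers are penalized more sharply the more confident the model was. The PPO step itself (e.g., via a library such as trl) would then optimize the policy LLM against rewards computed this way.

```python
# Illustrative reward shape for the RL stage (an assumption, not SaySelf's
# exact formula): reward confident correct answers, penalize overconfident
# mistakes more heavily than cautious ones.
def confidence_reward(is_correct: bool, confidence: int, scale: int = 10) -> float:
    c = confidence / scale        # normalize stated confidence to [0, 1]
    if is_correct:
        return c                  # confident and correct -> larger reward
    return -(c ** 2)              # confident and wrong -> sharply negative reward


print(confidence_reward(True, 9))   # 0.9
print(confidence_reward(False, 9))  # -0.81
```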
This approach significantly reduces calibration errors and maintains task performance while generating insightful rationales. The SaySelf framework equips LLMs with the ability to express fine-grained confidence estimates and generate self-reflective rationales explaining their uncertainty, thus improving the precision and accuracy of their confidence estimations.
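To make "calibration error" concrete, the sketch below computes the standard Expected Calibration Error (ECE), a common metric for this kind of evaluation; the binning scheme and inputs are illustrative rather than the paper's exact setup.

```python
# Minimal ECE sketch: bucket predictions by stated confidence and take the
# weighted gap between each bucket's average confidence and its accuracy.
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)  # values in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1.0 if answer was right
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece


# Well-calibrated 0.9-confidence answers that are right ~90% of the time
# contribute almost nothing to ECE.
```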

Off-the-shelf LLMs like GPT-4 play a crucial supporting role in the SaySelf training framework. SaySelf leverages GPT-4 to automatically generate the self-reflective rationales in the model-specific dataset used for supervised fine-tuning: GPT-4 analyzes the inconsistencies among the model's own sampled responses and summarizes, in natural language, where the model lacks knowledge and why its confidence estimate takes the value it does. This automated annotation enables confidence estimations with increased precision and accuracy without manual labeling, improving the transparency and performance of LLMs in various applications.
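As a rough illustration of how GPT-4 can be prompted for this rationale-generation step, the sketch below sends one representative response per cluster and asks for a first-person summary of the uncertainty. The prompt wording, helper function, and model name are hypothetical; only the general OpenAI chat-completions call pattern is assumed.

```python
# Hypothetical prompt-construction sketch for the rationale-generation step.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def summarize_uncertainty(question: str, representatives: list[str]) -> str:
    """Ask GPT-4 for a first-person summary of why the sampled answers disagree."""
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(representatives))
    prompt = (
        f"Question: {question}\n"
        f"Here are several of my own inconsistent reasoning attempts:\n{numbered}\n"
        "Speaking in the first person, summarize which specific knowledge I am "
        "uncertain about and why these attempts disagree."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```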