High-quality internet data for training LLMs is becoming scarce as the available pool is exhausted: predictions suggest that by 2026 such data will largely be used up, forcing researchers to turn to model-generated, or synthetic, data for training [6].
Synthetic data can significantly influence LLM behavior by addressing data scarcity and enabling model training at scale. It allows diverse datasets to be tailored to specific requirements, improving model performance and generalization. However, synthetic data can also amplify existing biases or introduce new ones, leading to misinformation and misaligned AI systems. It is therefore crucial to ensure the factuality, fidelity, and lack of bias in synthetic data in order to build more powerful and trustworthy language models.
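One reason tailored synthetic datasets are attractive for math training is that ground-truth answers can be known by construction rather than scraped. The sketch below is a minimal, hypothetical illustration of this idea (the function names and problem format are assumptions, not a method from the text): it generates arithmetic question–answer pairs whose labels are computed programmatically.

```python
import random


def make_arithmetic_example(rng: random.Random) -> dict:
    """Generate one synthetic addition problem; the label is exact by construction."""
    a, b = rng.randint(10, 999), rng.randint(10, 999)
    return {
        "question": f"What is {a} + {b}?",
        "answer": str(a + b),  # ground truth computed, not scraped
    }


def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Build a reproducible synthetic dataset of n examples from a fixed seed."""
    rng = random.Random(seed)
    return [make_arithmetic_example(rng) for _ in range(n)]


for example in make_dataset(3):
    print(example["question"], "->", example["answer"])
```

Because generation is seeded and the answer is derived from the sampled operands, every example is verifiably correct, sidestepping the factuality concerns raised above for this narrow problem class.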
Verifying the quality of synthetic math data involves evaluating its fidelity, utility, and privacy. Fidelity is measured with metrics such as statistical similarity, category and range completeness, boundary preservation, and incomplete-data similarity. Utility metrics, such as prediction score and feature importance score, assess how well the generated dataset performs on downstream tasks, while privacy metrics ensure that sensitive information is protected. Model audits and regular reviews of synthetic datasets help maintain quality assurance.
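The fidelity metrics named above can be sketched concretely. The definitions below are one plausible interpretation of "category completeness", "range completeness", and "boundary preservation" (the text names the metrics but not their formulas, so these implementations are assumptions), comparing a synthetic sample against a real reference sample.

```python
def category_completeness(real: list[str], synth: list[str]) -> float:
    """Fraction of categories seen in the real data that also appear in the synthetic data."""
    real_cats = set(real)
    return len(real_cats & set(synth)) / len(real_cats)


def range_completeness(real: list[float], synth: list[float]) -> float:
    """Fraction of the real data's numeric range covered by the synthetic data."""
    real_span = max(real) - min(real)
    if real_span == 0:
        return 1.0
    covered = min(max(synth), max(real)) - max(min(synth), min(real))
    return max(covered, 0.0) / real_span


def boundary_preservation(real: list[float], synth: list[float]) -> bool:
    """Check that synthetic values never fall outside the real data's observed bounds."""
    return min(synth) >= min(real) and max(synth) <= max(real)
```

For example, a synthetic set whose difficulty labels cover only half of the real categories would score 0.5 on category completeness, flagging a coverage gap before the data is used for training.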