
In the context of JEST, "data quality bootstrapping" refers to using a small, curated dataset to guide learning on a much larger, uncurated one. By steering training toward the most learnable data batches, this approach improves pretraining efficiency, yielding faster and more effective training.
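As a rough illustration, the bootstrapping step can be sketched as a learnability score: a reference model pretrained on the small curated set scores each candidate example, and candidates that the learner still finds hard but the reference finds easy are prioritized. This is a minimal sketch under those assumptions, not JEST's actual implementation; the names `learnability_scores`, `learner_losses`, and `reference_losses` are illustrative.

```python
import numpy as np

def learnability_scores(learner_losses: np.ndarray,
                        reference_losses: np.ndarray) -> np.ndarray:
    """Learnability of each candidate example.

    learner_losses:   per-example losses from the model being trained.
    reference_losses: per-example losses from a reference model pretrained
                      on the small curated dataset -- this is where the
                      "bootstrapping" happens: the curated data defines
                      what "easy for a well-trained model" looks like.

    Examples the learner finds hard but the reference finds easy score
    highest: they are learnable and worth training on next.
    """
    return learner_losses - reference_losses

# Toy usage with four candidate examples; the first scores highest
# (hard for the learner, easy for the curated-data reference).
learner = np.array([2.0, 0.5, 1.8, 0.4])
reference = np.array([0.3, 0.4, 1.7, 0.5])
print(learnability_scores(learner, reference))
```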

Data curation strongly affects pretraining performance because it determines which high-quality, diverse subsets a model trains on. Well-curated datasets can reach strong performance with less data, and model-based curation methods like JEST push this further: they score a large "super-batch" of candidates and train only on the most relevant sub-batch, cutting computational overhead and reaching target performance in fewer iterations (see the sketch below). These results demonstrate the effectiveness of model-based data curation for multimodal learning and point to "data quality bootstrapping" as a direction for future research.
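The super-batch-to-sub-batch step can be sketched as follows, assuming per-example scores (e.g., the learnability scores above) are already computed. The independent top-k selection here is a simplification: JEST proper selects examples jointly, so the chosen sub-batch accounts for interactions between examples under the contrastive loss. The names `select_sub_batch` and `filter_ratio` are assumptions for illustration.

```python
import numpy as np

def select_sub_batch(indices: np.ndarray,
                     scores: np.ndarray,
                     filter_ratio: float = 0.8) -> np.ndarray:
    """Keep only the top-scoring fraction of a super-batch.

    With filter_ratio=0.8, a 1000-example super-batch shrinks to a
    200-example sub-batch, and gradients are computed only on that
    subset. Higher filter ratios mean more aggressive curation.
    """
    keep = max(1, int(round(len(indices) * (1.0 - filter_ratio))))
    top = np.argsort(scores)[-keep:]  # positions of the `keep` best scores
    return indices[top]

# Toy usage: keep the best 2 of 10 candidates (filter_ratio=0.8).
rng = np.random.default_rng(0)
candidate_ids = np.arange(10)
scores = rng.normal(size=10)
print(select_sub_batch(candidate_ids, scores, filter_ratio=0.8))
```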

Manual curation is difficult and expensive to scale. Model-based data curation, which uses signals from the model being trained to select high-quality data, therefore holds promise for scaling curation to large-scale pretraining.