
The Zyda dataset contains 1.3 trillion tokens and combines several well-regarded open-source datasets into a single, high-quality corpus. The data sources are RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o, and arXiv.

Zyda is a 1.3-trillion-token dataset designed for training language models. It is a filtered and deduplicated combination of existing high-quality open datasets, specifically RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o, and arXiv. Zyda was developed by Zyphra Technologies, a company that specializes in AI and machine learning solutions. The company claims that models trained on Zyda perform better than models trained on the datasets it was built from, and that it can be used for tasks such as next-word prediction, text generation, and language translation.
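
To give a rough sense of what "filtered and deduplicated" means when several corpora are merged, here is a minimal Python sketch. It is not Zyphra's pipeline, which this article does not describe. The function names, the `min_words` threshold, and the toy source names are all hypothetical. The sketch uses a simple length filter and exact-match hashing of normalized text, whereas production pipelines typically rely on more elaborate quality filters and fuzzy (near-duplicate) matching.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_and_dedup(sources, min_words=5):
    """Merge documents from several named sources into one list.

    Illustrative only: drops documents shorter than `min_words` (a
    hypothetical quality filter) and drops any document whose normalized
    text was already seen in this or an earlier source.
    """
    seen = set()   # hashes of every document kept so far, across all sources
    kept = []      # (source_name, original_text) pairs
    for name, docs in sources.items():
        for doc in docs:
            norm = normalize(doc)
            if len(norm.split()) < min_words:
                continue  # filtered out: too short
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate of an earlier document
            seen.add(digest)
            kept.append((name, doc))
    return kept


# Toy input: two made-up sources that share one near-identical document.
sources = {
    "source_a": [
        "The quick brown fox jumps over the lazy dog.",
        "too short",
    ],
    "source_b": [
        "the quick  brown fox jumps over the lazy dog.",
        "A different document with enough words in it.",
    ],
}

corpus = filter_and_dedup(sources)
for name, doc in corpus:
    print(name, "|", doc)
```

Because duplicates are tracked across sources rather than within each one, a document that appears in two component datasets is kept only once, which is the point of deduplicating a merged corpus.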
