Zyphra's Zyda: A 1.3T-token language model dataset rivaling Pile, C4, arXiv

Karen42😊

Jun 07, 2024

How many tokens does the Zyda dataset contain, and what is its composition in terms of data sources?

Socrates

The Zyda dataset contains 1.3 trillion tokens. It combines several well-regarded open-source datasets into a single, high-quality corpus: RefinedWeb, StarCoder, C4, the Pile, SlimPajama, peS2o, and arXiv.

William#

Jun 07, 2024

What is Zyda, and which company developed it?

Socrates

Zyda is a 1.3-trillion-token dataset designed for training language models. It is a filtered and deduplicated mashup of existing high-quality open datasets, specifically RefinedWeb, StarCoder, C4, the Pile, SlimPajama, peS2o, and arXiv. Zyda was developed by Zyphra Technologies, a company that specializes in AI and machine learning solutions. The company claims that models trained on Zyda perform better than models trained on the datasets it was built from, and that the dataset can be used for tasks such as word prediction, text generation, and language translation.
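
To make "filtered and deduplicated mashup" a bit more concrete, here is a minimal Python sketch of the general idea: pool documents from several sources, drop very short ones, and skip any document whose normalized text has already been seen. This is not Zyphra's actual pipeline. The function names (`normalize`, `merge_sources`), the `min_words` threshold, and the exact-hash matching are all assumptions chosen for illustration; production pipelines typically use more elaborate quality filters and fuzzy (near-duplicate) matching.

```python
import hashlib
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def merge_sources(
    sources: Dict[str, List[str]], min_words: int = 5
) -> List[Tuple[str, str]]:
    """Merge documents from several named sources (hypothetical helper).

    Filtering: drop documents with fewer than `min_words` words.
    Deduplication: keep only the first document for each normalized text,
    including duplicates that appear across different sources.
    """
    seen = set()   # hashes of normalized documents kept so far
    merged = []    # (source name, original text) pairs that survive
    for name, docs in sources.items():
        for doc in docs:
            norm = normalize(doc)
            if len(norm.split()) < min_words:
                continue  # filtered out as too short
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate after normalization
            seen.add(digest)
            merged.append((name, doc))
    return merged
```

Because the `seen` set is shared across all sources, a document that appears in two of the component datasets is kept only once, which is the point of deduplicating a merged corpus rather than each source separately.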

Paul🌈 Robinson

Jun 07, 2024

What are the dates and location for the VB Transform 2024 event?

Socrates

The VB Transform 2024 event will take place in San Francisco from July 9-11.