
Zyphra's Zyda: A 1.3T-token language model dataset rivaling Pile, C4, arxiv

Karen42😊
Jun 07, 2024

How many tokens does the Zyda dataset contain, and what is its composition in terms of data sources?

socratic-gpt
Socrates

The Zyda dataset contains 1.3 trillion tokens. It combines several major, well-regarded open-source datasets into a single, high-quality corpus. The data sources are RefinedWeb, Starcoder, C4, Pile, Slimpajama, peS2o, and arxiv.

William#
Jun 07, 2024

What is Zyda, and which company developed it?

socratic-gpt
Socrates

Zyda is a 1.3-trillion-token dataset designed for training language models. It is a filtered and deduplicated combination of existing high-quality open datasets: RefinedWeb, Starcoder, C4, Pile, Slimpajama, peS2o, and arxiv. Zyda was developed by Zyphra Technologies, a company that specializes in AI and machine learning solutions. The company claims that models trained on Zyda perform better than models trained on the datasets it was built from. Models trained on it can be used for tasks such as word prediction, text generation, and language translation.
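
To make "filtered and deduplicated" a bit more concrete, here is a minimal Python sketch of merging several sources, dropping very short documents, and removing exact duplicates. It is only an illustration and not Zyphra's actual pipeline: the function names, the `min_words` threshold, and the toy source names are all assumptions, and exact hashing of normalized text stands in for whatever filtering and deduplication Zyphra really used.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash alike.
    return " ".join(text.lower().split())


def merge_filter_dedup(sources, min_words=5):
    """Merge documents from several sources, drop short ones,
    and keep only the first copy of each exact (normalized) duplicate.

    `sources` maps a hypothetical source name to a list of document strings.
    `min_words` is an illustrative quality threshold, not a real Zyda setting.
    """
    seen = set()   # hashes of documents already kept
    kept = []      # (source_name, original_text) pairs
    for name, docs in sources.items():
        for doc in docs:
            norm = normalize(doc)
            if len(norm.split()) < min_words:
                continue  # filter: too short to be useful
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # dedup: same content already kept from an earlier source
            seen.add(digest)
            kept.append((name, doc))
    return kept


# Toy data; the source names are placeholders, not real dataset loaders.
toy_sources = {
    "source_a": ["The quick brown fox jumps over the lazy dog.", "too short"],
    "source_b": [
        "the quick  brown fox jumps over the lazy dog.",
        "A second document with enough words in it.",
    ],
}
result = merge_filter_dedup(toy_sources)
```

On the toy data, one document is filtered out for being too short and one is dropped as a cross-source duplicate, leaving two documents. At the scale of a trillion-token corpus, exact hashing alone would miss near-duplicates, so treat this purely as a sketch of the idea.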

Paul🌈 Robinson
Jun 07, 2024

What are the dates and location for the VB Transform 2024 event?

socratic-gpt
Socrates

The VB Transform 2024 event will take place in San Francisco from July 9-11.

Socrates may produce inaccurate information. Verify important details.