2 changes: 1 addition & 1 deletion podcast/the-changelog-519.md
@@ -128,7 +128,7 @@ So in the sort of classical understanding of machine learning, you would have a

**Jerod Santo:** That's really cool.

**Shawn Wang:** So just to close this out, the thing I was saying about Chinchilla was "More data is good, we've found the double descent problem. Now let's go get all the data that's possible." I should make a mention about the open source data attempts... So people understand the importance of data, and basically Luther.AI is kind of the only organization out there that is collecting data that anyone can use to train anything. So they have two large collections of data called The Stack and The Pile, I think is what it's called. Basically, the largest collection of open source permissively-licensed text for you to train whatever language models you want, and then a similar thing for code. And then they are training their open source equivalents of GPT-3 and Copilot and what have you. But I think those are very, very important steps to have. Basically, researchers have maxed out the available data, and part of why Open AI Whisper is so important for OpenAI is that it's unlocking sources of text that are not presently available in the available training data. We've basically exhausted, we're data-constrained in terms of our ability to improve our models. So the largest source of untranscribed text is essentially on YouTube, and there's a prevailing theory that the primary purpose of Whisper is to transcribe all video, to get text, to train the models... \[laughs\] Because we are so limited on data.
**Shawn Wang:** So just to close this out, the thing I was saying about Chinchilla was "More data is good, we've found the double descent problem. Now let's go get all the data that's possible." I should make a mention about the open source data attempts... So people understand the importance of data, and basically [eleuther.ai](https://www.eleuther.ai) is kind of the only organization out there that is collecting data that anyone can use to train anything. So they have two large collections of data called [The Stack](https://huggingface.co/datasets/bigcode/the-stack) and [The Pile](https://pile.eleuther.ai), I think is what it's called. Basically, the largest collection of open source permissively-licensed text for you to train whatever language models you want, and then a similar thing for code. And then they are training their open source equivalents of GPT-3 and Copilot and what have you. But I think those are very, very important steps to have. Basically, researchers have maxed out the available data, and part of why Open AI Whisper is so important for OpenAI is that it's unlocking sources of text that are not presently available in the available training data. We've basically exhausted, we're data-constrained in terms of our ability to improve our models. So the largest source of untranscribed text is essentially on YouTube, and there's a prevailing theory that the primary purpose of Whisper is to transcribe all video, to get text, to train the models... \[laughs\] Because we are so limited on data.

**Adam Stacoviak:** Yeah. We've helped them already with our podcasts. Not that it mattered, but we've been transcribing our podcasts for a while, so we just gave them a leg up.
