GPT-5 Is All About Data
Understanding GPT-5
In this transcript, the speaker discusses GPT-5 and its potential release later this year. He argues that data is the key factor in determining whether it will approach genius-level IQ, assesses the accuracy of leaked information about GPT-5, and considers the number of GPUs it may be trained on.
Factors Affecting GPT-5's Release
- Data is a key factor in determining whether GPT-5 will approach genius-level IQ.
- The completion of GPT-4's training, or an equivalent milestone, around late spring/early summer 2022 suggests that GPT-5 could be released later this year.
Importance of Data for Language Modeling Performance
- Recent landmark models are wastefully big; with enough data, smaller models can match their performance, reducing the need for very large parameter counts.
- High-quality data is crucial for training language models, and we are within one order of magnitude of exhausting it, with estimates projecting exhaustion between 2023 and 2027.
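The "wastefully big" claim echoes compute-optimal scaling results, often summarized as the rule of thumb of roughly 20 training tokens per parameter. As a rough illustration only (the figures below are assumptions, not anything stated in the transcript):

```python
# Rough illustration of compute-optimal ("Chinchilla-style") scaling.
# The ~20 tokens-per-parameter ratio is a commonly cited rule of thumb;
# the model sizes below are assumptions for illustration.

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def optimal_tokens(params: float) -> float:
    """Training tokens needed to train `params` parameters compute-optimally."""
    return params * TOKENS_PER_PARAM

# A 175B-parameter model would "want" ~3.5 trillion training tokens --
# far more than early models of that size were actually trained on,
# i.e. such models were under-trained on data relative to their size.
print(f"{optimal_tokens(175e9) / 1e12:.1f} trillion tokens")
```

This is why "wastefully big" matters: the same compute spent on a smaller model fed more data tends to yield better performance.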
Sources of High-Quality Data
- Sources of high-quality data include scientific papers, books, scraped web content, news, code, and Wikipedia.
Estimate of Available High-Quality Data
- An estimate puts the stock of available high-quality language data at between 4.6 trillion and 17 trillion words.
Conclusion
The availability and quality of data play a crucial role in developing advanced AI language models like GPT. While there is still much to learn about how these models work and what they can do, understanding their limitations and potential applications is essential for anyone interested in AI research.
Introduction
The speaker discusses why GPT-4 or Bing's responses may sometimes read like those of an emoting teenager. He also suggests there may be a reason neither Google nor OpenAI has been forthcoming about where their data comes from.
Possible Reasons for Poor Responses
- GPT-4 or Bing may have scraped the bottom of the web-text barrel, which could explain why its responses sometimes read like those of an emoting teenager.
- Neither Google nor OpenAI has been forthcoming about their data sources, possibly to avoid controversy over attribution and compensation.
GPT-5 Improvements
The speaker discusses how GPT-5 will likely improve upon previous models by scraping as much high-quality data as possible and making other upgrades.
Data Upgrades
- GPT-5 will learn lessons from previous models and scrape as much high-quality data as possible.
- The stock of high-quality data grows by around 10% annually, even without further efficiencies in data use or extraction.
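Taking the claimed 10% annual growth at face value, the compounding of the data stock can be sketched as follows. The 9-trillion-word starting point is an assumed midpoint of the 4.6-17 trillion word estimate mentioned earlier, not a figure from the transcript:

```python
# Sketch: compound growth of the high-quality data stock at ~10%/year.
# The 10% rate comes from the summary above; the starting stock of
# 9 trillion words is an assumed midpoint of the 4.6T-17T estimate.

GROWTH_RATE = 0.10
START_STOCK = 9e12  # assumed starting stock, in words

def stock_after(years: int, start: float = START_STOCK) -> float:
    """Data stock after compounding growth for `years` years."""
    return start * (1 + GROWTH_RATE) ** years

for y in (1, 5, 10):
    print(f"After {y:2d} years: {stock_after(y) / 1e12:.1f} trillion words")
```

Even without better extraction or data efficiency, the usable stock roughly doubles in under eight years at this rate.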
Other Upgrades
- More ways might be found to extract high-quality data from low-quality sources.
- Automating chain-of-thought prompting into the model could yield gains that, even if small, would be significant given that Bing is already strong.
- Language models can teach themselves to use tools such as calculators, calendars, and APIs.
- Models can check if their code compiles and thereby teach themselves better coding.
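The compile-check idea from the last bullet can be sketched as a simple filter: generate candidate code, keep only samples that at least parse, and use those for further training. This is an illustrative sketch, not any lab's actual pipeline; a syntax check via Python's built-in `compile()` stands in for a real compiler:

```python
# Sketch of the "check if code compiles" self-improvement loop.
# Illustrative only: a real pipeline would sample candidates from a
# language model and likely run far stronger checks than parsing.

def compiles(source: str) -> bool:
    """Cheap proxy for 'compiles': does the snippet parse as Python?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def filter_for_training(candidates: list[str]) -> list[str]:
    """Keep only candidates that pass the compile check."""
    return [c for c in candidates if compiles(c)]

# Example: one valid and one broken snippet.
samples = ["def add(a, b):\n    return a + b\n", "def broken(:\n"]
good = filter_for_training(samples)
print(len(good))  # only the syntactically valid snippet survives
```

The surviving samples become training data, giving the model an automated, if weak, correctness signal without human labels.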
Additional Data Sets
- It may be possible to train a model several times using the same data, which could lead to significant improvements.
- Models can generate additional data sets for problems they struggle with, such as those requiring complex plans, with humans filtering the answers for correctness.
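The generate-then-filter idea above can be sketched as a loop: sample answers on hard problems, have humans approve the correct ones, and keep the approved pairs as new training data. `model_answer` and `human_says_correct` are hypothetical stand-ins for model sampling and human review:

```python
# Sketch of model-generated data with human filtering.
# `model_answer` and `human_says_correct` are hypothetical placeholders
# for model sampling and human review; this is illustrative only.

def model_answer(problem: str) -> str:
    # Placeholder: a real system would sample from the model here.
    return f"answer to: {problem}"

def human_says_correct(problem: str, answer: str) -> bool:
    # Placeholder: a human rater would verify the answer here.
    return answer.startswith("answer")

def build_extra_dataset(hard_problems: list[str]) -> list[tuple[str, str]]:
    """Generate answers on hard problems and keep human-approved pairs."""
    dataset = []
    for problem in hard_problems:
        answer = model_answer(problem)
        if human_says_correct(problem, answer):
            dataset.append((problem, answer))
    return dataset

extra = build_extra_dataset(["multi-step plan A", "multi-step plan B"])
print(len(extra))
```

Because human review is the expensive step, it is reserved for filtering rather than authoring, which is what makes this cheaper than writing new data from scratch.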
The Future of AI and Jobs
In this section, the speaker discusses how AI will revolutionize the job market and predicts that AI tutors could replace human tutors by the end of next year. He also talks about the likely divergence between changes to cognitive work and changes to physical work.
Implications for Jobs Market
- AI will revolutionize the jobs market, although it may not lead to AGI.
- Sam Altman tweeted that in 2023, creating a simple iPhone app would cost $30,000 while a plumbing job would cost $300. The speaker wonders what these relative prices will look like in 2028.
Changes to Cognitive Work
- When GPT-5 is released, the best human raters will likely be beaten on some benchmarks, such as reading comprehension.
- This release would have huge implications for summarization and creative writing.
- Logic and critical reasoning tasks such as debating topics or discerning causality in complex scenarios would also be impacted.
- Performance on physics and high-school math problems could improve by an order of magnitude.
AI Tutors
- AI tutors could replace human tutors by the end of next year.
- The release of GPT-5 will coincide with final refinements in text-to-speech, image-to-text, text-to-image, and text-to-video avatars.
Safety Research
- Timelines for GPT-5 depend partly on internal safety research at Google and OpenAI.
- Safety and alignment are crucial goals before releasing AGI models.
Conclusion
In this section, the speaker ends with a quote from Sam Altman emphasizing that safety progress must keep up with capability progress as AI models become more powerful.
Safety Progress
- The ratio of safety progress to capability progress must increase as AI models become more powerful.
- The speaker thanks the audience for watching and encourages them to check out his other videos on Bing chat and its use cases.