Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94
Introduction
The host introduces the guest, Ilya Sutskever, and explains his background in deep learning. He also acknowledges the pandemic and encourages listeners to stay strong.
Introduction of Guest
- Ilya Sutskever is the co-founder and chief scientist of OpenAI.
- He is one of the most cited computer scientists in history with over 165,000 citations.
- The host considers him one of the most brilliant minds in deep learning.
Acknowledgment of Pandemic
- The conversation was recorded before the outbreak of the pandemic.
- Listeners are encouraged to stay strong during this difficult time.
Podcast Information
The host provides information about how to support the podcast and mentions its sponsor, Cash App.
Supporting the Podcast
- Listeners can subscribe on YouTube or review with five stars.
- They can also support on Patreon or connect on Twitter @lexfridman.
Sponsor: Cash App
- Cash App is a finance app that allows users to send money to friends, buy Bitcoin, and invest in stocks with as little as $1.
- Listeners who use code "LexPodcast" when downloading will receive $10, and Cash App will donate $10 to FIRST Robotics.
History of Money and Cryptocurrency
The host discusses cryptocurrency's place in history and recommends a book on the subject.
Cryptocurrency's Place in History
- Cryptocurrency is still very much in its early days of development but has the potential to redefine the nature of money.
Book Recommendation: Ascent of Money
- Ascent of Money by Niall Ferguson is a great book on the history of money.
- Both the book and audiobook are recommended for those interested in the subject.
The Deep Learning Revolution
The host and guest discuss the deep learning revolution and its origins.
Origins of Deep Learning
- Ilya Sutskever was one of the three authors of the AlexNet paper that launched the deep learning revolution.
- In 2010, James Martens invented the Hessian-free optimizer and used it to train a 10-layer neural network end-to-end, from scratch, without pre-training.
- This was a big moment for Ilya: it showed that deep neural networks are powerful. His intuition was that anything a human can do in about 100 milliseconds, a big 10-layer neural network can do too.
Representational Power of Neural Networks
- A big neural network can represent very complicated functions.
- If you have more data than parameters, you won't overfit.
- Neural networks of the time were heavily over-parametrized, but this wasn't discouraging to Ilya.
Training Bigger Neural Nets
In this section, the speaker discusses the main doubt about training a bigger neural net with backpropagation: whether there would be enough compute to get a convincing result. He also explains how Alex Krizhevsky's fast CUDA kernels for training convolutional neural nets helped overcome this doubt.
Overcoming Doubts
- Main doubt was whether there would be enough compute to get a very convincing result.
- Alex Krizhevsky wrote insanely fast CUDA kernels for training convolutional neural nets, which helped overcome the doubts.
Intuition and Inspiration from the Brain
In this section, the speaker talks about how intuition and inspiration from the brain have been important for deep learning researchers. The speaker also discusses how artificial neural networks are inspired by the brain.
Artificial Neural Networks Inspired by Brain
- The whole idea of a neural network is directly inspired by the brain.
- There are certain dimensions in which the human brain vastly outperforms our models but there are some ways in which artificial neural networks have important advantages over the brain.
Importance of Brain as an Intuition Builder
- The brain is a huge source of intuition and inspiration for deep learning researchers.
- Looking at advantages versus disadvantages is a good way to figure out what is an interesting difference between neurons in the brain and our artificial neural networks.
Differences Between Human Brain and Artificial Neural Networks
In this section, the speaker discusses interesting differences between human brains and artificial neural networks that will be important in future research.
Spikes vs Non-Spiking Neural Networks
- One big architectural difference between artificial neural networks and the brain is that the brain uses spikes.
- It's hard to tell whether spikes are important or not, but there are people interested in spiking neural networks.
Cost Function
- The cost function is a way of measuring the performance of the system according to some measure.
Introduction to Supervised Learning
In this section, the speaker discusses whether supervised learning is a difficult concept and how cost functions are used in deep learning.
Is Supervised Learning Difficult?
- The speaker believes that all concepts are easy in retrospect.
- Supervised learning may seem trivial now, but it was not always the case.
Cost Functions in Deep Learning
- Cost functions are mathematical objects that help us reason about the behavior of our system.
- GANs do not have clear cost functions.
- Biological evolution and the economy also do not have clear cost functions.
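The cost-function idea above can be made concrete. A minimal sketch (the function names and toy numbers are illustrative, not from the conversation) of two standard cost functions used in deep learning:

```python
import numpy as np

def mse_cost(predictions, targets):
    """Mean squared error: one common cost function for scoring a system."""
    return float(np.mean((predictions - targets) ** 2))

def cross_entropy_cost(probs, labels):
    """Binary cross-entropy, the usual choice for classifiers."""
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))

preds = np.array([0.9, 0.1, 0.8])
labels = np.array([1.0, 0.0, 1.0])
print(mse_cost(preds, labels))            # ≈ 0.02
print(cross_entropy_cost(preds, labels))  # small, since the predictions are good
```

Both are "mathematical objects that help us reason about the behavior of our system": training consists of pushing these numbers down.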
Are Cost Functions Holding Us Back?
In this section, the speaker discusses whether cost functions in deep learning are holding us back and if there are other things we should consider when designing artificial neural networks.
Limitations of Cost Functions
- Self-play and exploration ideas touch on limitations of cost functions.
- While cost functions serve us well, they may be limiting our ability to come up with new ways of looking at things.
Other Considerations for Artificial Neural Networks
- Spike-timing-dependent plasticity (STDP) is a particular learning rule that uses the relative timing of spikes to determine how to update synapses.
- Temporal dynamics and timing seem to be fundamental properties of the brain that are missing from recurrent neural networks.
Recurrent Neural Networks and Expert Systems
In this section, the speakers discuss the potential comeback of recurrent neural networks and their relation to expert systems.
Recurrence in Neural Networks
- Recurrent neural networks (RNNs) are likely to make a comeback in some form.
- RNNs maintain a high-dimensional hidden state that updates through connections when an observation arrives.
- The knowledge base of an expert system is similar to the hidden state of an RNN.
- Large-scale knowledge bases can be built within neural networks.
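The hidden-state update described above can be sketched in a few lines of numpy (the sizes, weights, and sequence here are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 8, 4
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input weights

def rnn_step(h, x):
    """Update the hidden state when a new observation x arrives."""
    return np.tanh(W_h @ h + W_x @ x)

h = np.zeros(hidden_size)                    # the network's accumulated state
for x in rng.normal(size=(5, input_size)):   # a sequence of 5 observations
    h = rnn_step(h, x)
print(h.shape)  # (8,) — a high-dimensional summary of everything seen so far
```

The vector `h` plays the role the notes compare to an expert system's knowledge base: everything the network has seen is compressed into it.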
Key Ideas Behind Deep Learning's Success
In this section, the speakers discuss the key ideas behind deep learning's success over the past decade.
Underestimation of Neural Networks
- Before deep learning became successful, neural networks were underestimated by machine learning experts.
- People did not believe that large neural networks could be trained effectively.
Need for Supervised Data and Compute
- The fundamental ideas behind deep learning were already present before its success.
- The missing pieces were supervised data and compute power, which showed up in GPUs.
- Conviction was also needed to mix these elements together effectively.
Convincing Skeptics
- Empirical evidence from benchmarks like ImageNet convinced skeptics like Jitendra Malik and Alyosha Efros of deep learning's potential.
- Geoffrey Hinton was one of the early proponents of deep learning.
The Unity of Machine Learning
In this section, the speaker discusses the unity of machine learning and how it applies to different modalities and problems.
Unity in Machine Learning
- The field of machine learning has a lot of unity with overlap of ideas and principles.
- Computer vision and natural language processing are very similar to each other with slightly different architectures.
- Deep learning has subsumed all little sub-specializations in AI, leading to unification.
- Reinforcement learning is neither vision nor language but naturally interfaces and integrates with both.
Understanding Reinforcement Learning
In this section, the speaker talks about reinforcement learning as a problem of understanding and how it differs from traditional static problems.
Reinforcement Learning
- Reinforcement learning is fundamentally different from traditional static problems because it deals with non-stationary worlds where actions change what you see.
- There is a huge amount of commonality between reinforcement learning and other machine learning fields such as taking gradients.
Language Understanding vs Visual Scene Understanding
In this section, the speakers discuss whether language understanding or visual scene understanding is harder. They also explore what it means for a problem to be hard and how it depends on one's tools and definitions.
Comparing Language and Visual Scene Understanding
- Noam Chomsky believes that language is fundamental to everything, while others may have a different perspective.
- Defining a problem as "hard" depends on benchmarks, human-level performance, and effort required to reach that level.
- Both language understanding and visual perception are hard problems that cannot be solved completely in the next three months.
- The difficulty of language understanding depends on one's definition of perfect language understanding.
Vision vs Language: Where Do They Meet?
- It is unclear where vision ends and language begins. One possibility is that achieving deep understanding in either images or language requires using the same kind of system.
- Machine learning can likely achieve both vision and language understanding, but there is no certainty about this.
The Relationship Between Vision, Language, and Surprise
In this section, the speakers discuss how humans continue to impress with their wit, humor, and new ideas. They also explore how surprise can be a source of inspiration.
Humans' Impressive Qualities
- Humans can continuously give pleasurable, interesting, witty new ideas for several decades.
- Humor or wit seems to be a nice source of continued inspiration.
The Role of Surprise
- People click "like" on things that make them laugh, so the "like" signal captures humor, wit, and insight.
- Humor, wit, and insight are subjective tests that can be impressive for some time.
Back Propagation Algorithm and Deep Learning
In this section, the speaker talks about the back propagation algorithm and how it led to the development of deep learning. He also discusses some theories related to neural networks and their similarities with the brain.
Back Propagation Algorithm and Neural Networks
- The back propagation algorithm was a key development in neural networks.
- Making neural networks larger and training them on more data can lead to better performance, similar to how the brain works.
- Optimization is a key factor in making neural networks work well.
Empirical Evidence and Insights
- Empirical evidence has shown that optimization works well for most problems in deep learning.
- However, this does not provide insights into how deep learning actually works.
- Deep learning is like a combination of biology and physics, but there are still many mysterious properties yet to be discovered.
Progress in Deep Learning
- Progress in deep learning has been robust over the past 10 years.
- Each year, progress goes further than expected, indicating that we are still underestimating deep learning's potential.
- While progress may become harder for individual researchers due to increased competition, the field as a whole will continue to make progress.
The Future of Deep Learning
In this section, the speaker discusses the future of deep learning and the role of compute in achieving breakthroughs.
Breakthroughs in Deep Learning
- The stack of deep learning is getting deeper, making it hard for a single person to be world-class in every layer.
- There will be many breakthroughs that will not require a huge amount of compute. However, building systems that do things will require a lot of compute.
- Small groups and individuals can still make important contributions to deep learning research.
Double Descent Phenomenon
- The double descent phenomenon: as a neural network grows, test performance first improves, then gets worse near the point where the model exactly fits the training data, and then improves again as the model gets even bigger.
- This phenomenon occurs for all practical deep learning systems and even linear classifiers.
- Overfitting is when your model is sensitive to the small, random, unimportant details of your data set. A small model with a big data set can be insensitive to this randomness, because the large data set leaves little uncertainty about what the model should be.
Neural Networks
- It's surprising that neural networks don't quickly overfit every time, given their huge number of parameters.
- In the limit of a very large network, training behaves almost linearly in the parameters, which may help explain why huge networks can still perform well.
The Relationship Between Data Dimensionality and Model Performance
In this section, the speaker explains how the dimensionality of data affects model performance.
Impact of Data Dimensionality on Model Performance
- When the model has exactly as many degrees of freedom as the data, small changes in the data set lead to large changes in the model, resulting in worse performance.
- To avoid this regime, it's best for models to have many more parameters than data points. Alternatively, early stopping mitigates the dip by acting as a form of regularization.
- Early stopping means monitoring validation performance during training and stopping when it starts to decline, which helps prevent overfitting.
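The early-stopping procedure described above can be sketched as a generic training loop (`train_step` and `validation_loss` are placeholder callables, and the simulated loss curve is invented for illustration):

```python
def train_with_early_stopping(train_step, validation_loss, max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                      # one epoch of training
        loss = validation_loss()          # measure held-out performance
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                         # validation performance stopped improving
    return best

# Simulated validation curve: improves, then starts to overfit.
losses = iter([1.0, 0.8, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0, 1.1])
best = train_with_early_stopping(lambda: None, lambda: next(losses), max_epochs=10)
print(best)  # 0.7 — training stopped after the dip, keeping the best loss seen
```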
Double Descent Phenomenon
In this section, the speaker discusses double descent phenomenon and its causes.
Causes of Double Descent Phenomenon
- When there are as many degrees of freedom in a dataset as there are in a model, small changes in data lead to noticeable changes in the model. As a result, models become sensitive to randomness and unable to discard spurious correlations.
- However, when there are more parameters than data or vice versa, solutions become insensitive to small changes in datasets.
Back Propagation Algorithm
In this section, the speaker talks about back propagation algorithm and its usefulness.
Usefulness of Back Propagation Algorithm
- Back propagation algorithm is useful because it solves an extremely fundamental problem: finding a neural circuit subject to some constraints.
- It's unlikely that we'll find anything dramatically different from back propagation algorithm in the near future.
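The problem backpropagation solves — adjusting a circuit's weights subject to a constraint on its outputs — shows up even in a minimal example: a tiny two-layer network trained end-to-end on XOR with hand-written gradients (all sizes, the seed, and hyperparameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule from the loss back to every weight.
    dz2 = (p - y) / len(X)             # gradient of cross-entropy w.r.t. logits
    dW2, db2 = h.T @ dz2, dz2.sum(0)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)  # propagate through tanh
    dW1, db1 = X.T @ dz1, dz1.sum(0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad

print((p > 0.5).astype(int).ravel())  # learned XOR: [0 1 1 0]
```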
Can Neural Networks Reason?
In this section, the speaker discusses whether neural networks can reason.
Existence Proof of Neural Network Reasoning
- AlphaGo and AlphaZero are examples of neural networks that play Go at a level better than 99.9% of humans. The neural network itself, without search, is able to reason.
- However, it's important to note that reasoning is a loose term and playing Go may not be considered reasoning by everyone.
- Humans are also an existence proof of reasoning outside of board games.
Neural Networks and Reasoning
In this section, the discussion revolves around neural networks and their ability to reason. The conversation also touches on the search for small circuits and programs in neural networks.
Neural Networks' Ability to Reason
- Neural networks are capable of reasoning, but they will only reason if trained on a task that requires it.
- Neural networks solve problems in the easiest way possible, which may not involve reasoning.
- Neural networks are a search for small circuits that can learn from data.
Search for Small Circuits and Programs
- Finding the shortest program that generates data is theoretically proven but not computable.
- Neural networks are the next best thing that works in practice since we cannot find the shortest program.
- Large circuits with weights containing a small amount of information may explain why neural nets generalize well.
Training Pillar of Deep Learning
This section discusses how training is fundamental to deep learning and how it affects neural network design.
Importance of Training
- Being trainable is an invariant in deep learning; starting from scratch, you can converge towards knowing a lot given resources at your disposal.
- Training comes first before finding the shortest program or any other objective in deep learning.
Finding Small Programs with Deep Learning
This section talks about whether it's possible to use deep learning to find small programs and what challenges exist.
Challenges of Finding Small Programs
- There are no good precedents of people successfully finding programs, but it should be possible in principle.
- Training a deep neural network to find programs is the right way to go about it.
Neural Networks and Long-Term Memory
In this section, the speakers discuss whether neural networks can have long-term memory and act as knowledge bases.
Neural Networks as Knowledge Bases
- Neural networks can act as knowledge bases by aggregating important information over long periods of time to serve as useful representations of state for decision-making.
- The parameters of a neural network are an aggregation of the entirety of its experience, which counts as long-term knowledge.
- A good example of a compressed structured knowledge base is something like Wikipedia or a semantic web.
Interpreting Neural Networks
- While neural networks may not be interpretable in terms of their weights, their outputs should be very interpretable.
- There are two ways to interpret neural networks: analyzing the neurons and layers to understand what they mean, or asking questions and getting answers that add on to your mental model of how the network thinks.
Self-Awareness in Neural Networks
- The speakers agree that self-awareness in neural networks would allow them to know what they know and don't know, invest optimally in increasing their skills, and improve interpretability.
- Information is sticky for humans because we remember useful information well and forget most other information. This process is similar to what neural networks do but less effective at this time.
Writing Good Code and Solving Hard Theorems
In this section, the speaker talks about the importance of writing good code and solving hard problems with out-of-the-box solutions. They also discuss how producing unambiguous results can change conversations.
Importance of Writing Good Code and Solving Hard Theorems
- Writing good code is important.
- Proving hard theorems is important.
- Producing unambiguous results can change conversations.
Neural Networks in Language Models
In this section, the speaker discusses the history of using neural networks in language models. They explain how data and compute have changed the trajectory of deep learning and how larger language models are better at understanding semantics.
History of Using Neural Networks in Language Models
- Elman network was a small recurrent neural network applied to language back in the 80s.
- Data and compute have changed the trajectory of deep learning.
- Larger language models are better at understanding semantics.
Understanding Semantics in Language Models
- Empirically, larger language models exhibit signs of understanding semantics.
- Learning from raw data can help understand the mechanism that underlies language.
GPT-2 and the Transformer
In this section, the speaker discusses GPT-2 and the Transformer, which is a neural network architecture that uses attention. The speaker explains why the Transformer is successful and how it differs from other architectures.
The Transformer
- GPT-2 is a transformer with 1.5 billion parameters that was trained on about 40 billion tokens of text.
- The transformer is a combination of multiple ideas simultaneously, one of which is attention.
- Attention is not the main innovation in the transformer; rather, it's designed to run really fast on GPUs and is not recurrent, making it easier to optimize.
- The combination of using attention, being a great fit for GPUs, and not being recurrent makes the transformer successful.
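The attention operation mentioned above can be sketched in a few lines. This is a single-head, unmasked version with illustrative shapes, not the full transformer block:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: the core operation inside a transformer."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # query–key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted mix of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 16): one mixed value vector per query position
```

Everything here is dense matrix multiplication, which is part of why the architecture is such a good fit for GPUs.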
Surprising Success
- The speaker was surprised by how well transformers worked because progress in language models had been slow compared to other areas like GANs.
- However, once transformers were developed, progress in language models advanced rapidly.
- The next barrier for AI will be when there are dramatic economic impacts from its advances.
Translation and Economic Impact
In this section, the speaker discusses translation as an area where AI has already had a huge impact. They also discuss self-driving cars as another area where AI will have significant economic impact.
Translation
- Billions of people interact with big chunks of the internet primarily through translation.
- Translation has already had a huge positive impact on society.
Self-driving Cars
- Self-driving cars will be hugely impactful and are another area where AI will have significant economic impact.
GPT-2 and Active Learning
In this section, the speakers discuss GPT-2 and active learning. They talk about the potential for unification of language and vision tasks, the simplicity of transformers, the possibility of using active learning to select data, and the need for a problem that requires active learning.
Unification of Language and Vision Tasks
- The speakers discuss the potential for unification towards a kind of multi-task transformers that can take on both language and vision tasks.
Simplicity of Transformers
- The transformer is fundamentally simple to explain and to train.
- Bigger models will continue to show better results in language.
Active Learning
- The speakers discuss the possibility of using active learning to select data.
- People are selective about what they learn, so it would be nice if models could use their own intelligence to decide what data they want to study or reject.
- Active learning needs a problem that requires it; otherwise, it's hard to do research about its capability without a task.
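One standard form of the "model decides what data to study" idea is uncertainty sampling (this is a generic textbook strategy, not something proposed in the conversation; names and numbers are illustrative):

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Pick the k unlabeled examples the model is least confident about.

    probs: (n, n_classes) predicted class probabilities for the unlabeled pool.
    """
    confidence = probs.max(axis=1)     # confidence = top predicted probability
    return np.argsort(confidence)[:k]  # least-confident examples first

pool = np.array([[0.90, 0.10],
                 [0.55, 0.45],
                 [0.99, 0.01],
                 [0.60, 0.40]])
print(uncertainty_sample(pool, 2))  # [1 3]: the two most ambiguous examples
```

The model then requests labels for just those examples, spending human effort where it is most informative.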
Releasing Powerful AI Models
In this section, the speakers talk about releasing powerful AI models like GPT-2 and how we should think about their impact before releasing them.
Impact of Powerful AI Models
- There is nervousness about what is possible with powerful AI models like GPT-2 because they can generate realistic text that could be used by bots in ways we can't even imagine yet.
Releasing Powerful AI Models
- It seems wise to start thinking about the impact of our systems before releasing them.
- The field of AI is exiting a state of childhood and entering a state of maturity, so we need to start thinking about the impact of our systems.
The Release of Small Models
In this section, the speaker talks about the release of small models and how people have used them in various ways. They also discuss the potential negative consequences of releasing powerful models.
Releasing Small Models
- A small model was released and many people used it in cool ways.
- Staged release is part of the answer to what to do once a system like this is created.
- There may be moral and ethical responsibility when you have a very powerful model to communicate potential negative consequences.
- It's important to gradually build trust between companies because all AI developers are building technology that will become increasingly more powerful.
Collaboration for AGI Development
In this section, the speaker discusses collaboration for AGI development and what it takes to build a system of human-level intelligence.
Building an AGI System
- It's possible to discuss these kinds of models with colleagues elsewhere and get their take on what to do.
- Ultimately, we're all in this together, so it's important to gradually build trust between companies.
- Building a really powerful AI system requires thinking about its potential negative consequences.
Components for Building an AGI System
- Deep learning plus some ideas are required to build an AGI system.
- Self-play will be one of those ideas as it has the ability to surprise us in truly novel ways and produce creative solutions to problems.
- Self-play has been used in game or simulation contexts, but it's an important direction for AGI development.
- An AGI system would surprise us fundamentally by finding a surprising solution to a problem that's also useful.
Simulation vs Real World
- It's unclear how much of the path to AGI will be done in simulation versus having a system that operates in the real world.
Self-Play and Reinforcement Learning
The discussion is about the criticisms of self-play and reinforcement learning. It is argued that while these techniques have shown amazing results in simulated environments, they are yet to be demonstrated in non-simulated environments.
Transfer from Simulation to Real World
- Transfer from simulation to the real world has been exhibited many times by different groups.
- OpenAI demonstrated a robot hand trained entirely in simulation, showing that sim-to-real transfer can occur.
- 100% of the training was done in simulation, and the policy learned was very adaptive.
- The policy was trained in simulation to be robust to many different variations, but not to the specific kind of perturbations seen in the video.
- The better the transfer capabilities become, the more useful simulation will be: you could experience something in simulation, learn the lesson, and carry it with you to the real world.
Embodied AGI System
The discussion is about whether an AGI system needs a body or not.
Importance of Having a Body
- Having a body is useful as it allows one to learn things that cannot be learned without it.
- Evidence shows that people who were born deaf and blind were able to compensate for their lack of modalities (e.g., Helen Keller).
Consciousness in AGI Systems
- It's hard to define consciousness, but it's definitely interesting and fascinating.
- If artificial neural nets are sufficiently similar to the brain, there should at least exist artificial neural networks that are conscious too.
Intelligence and AI Progress
In this section, the speakers discuss the concept of intelligence and how it relates to AI progress. They also talk about the difficulty in judging progress in AI.
What is a Good Test of Intelligence?
- The speakers discuss what would be a good test of intelligence for an AI system.
- A deep learning system that solves a pedestrian task like machine translation or computer vision without making mistakes that humans wouldn't make would be impressive.
- Skepticism towards deep learning arises when they make mistakes that don't make sense, which humans wouldn't make under any circumstances.
- The search for one case where the system fails in a big way where humans would not is how people analyze the progress of AI.
Is GPT-2 Smarter Than Humans?
- The speakers discuss whether GPT-2 is smarter than humans in certain areas.
- GPT-2 has more breadth of knowledge, and perhaps even depth on certain topics, compared to humans.
- However, there are still certain mistakes that GPT-2 makes which humans wouldn't.
Talking to an AGI System
- The speakers discuss what they would do if they were able to spend time with an AGI system.
- They would ask all kinds of questions and try to get it to make a mistake. They would also ask for advice on various topics.
Power and AGI Systems
- The speakers discuss power and the creation of AGI systems as being one of the most powerful things in the 21st century.
- Abraham Lincoln's quote applies here: "Nearly all men can stand adversity, but if you want to test a man's character, give him power."
Building an AGI System
In this section, the speaker discusses building an AGI system that is controlled by humans and how it can be aligned with human values.
Controlling AGI Systems
- The speaker proposes a democratic process where different entities in different countries or cities vote for what the AGI that represents them should do.
- Multiple AGIs can exist for a city or country, and they would be controlled by a board that can fire the CEO and reset parameters.
- Humans will have control over the AI systems they build, and it's possible to program an AGI to have a deep drive to help humans flourish.
Relinquishing Power
- There has to be a relinquishing of power between creating the AGI system and having democratic board members with the AGI at the head.
- The speaker finds it trivial to relinquish control over an AGI system since he wouldn't want to be in a position of power like that.
- Most people may not want to be in such positions of power either.
Aligning Values
- Specific mechanisms are needed to align AI values with human values as we develop AI systems.
- Humans have internal reward functions, and there are ideas on how to train value functions based on human judgments on different situations.
- The objective function implicit in human existence is still unknown.
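One concrete version of "training value functions based on human judgments" is a Bradley–Terry-style reward model fit on pairwise preferences. This is a toy sketch under invented assumptions (linear reward model, synthetic preference data), not OpenAI's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3
w = np.zeros(dim)  # linear reward model: r(x) = w @ x

# Toy pairwise judgments: the "human" prefers situations whose first
# feature is larger (the +1 shift below encodes that preference).
pairs = [(rng.normal(size=dim) + np.array([1.0, 0.0, 0.0]),
          rng.normal(size=dim)) for _ in range(200)]

lr = 0.1
for _ in range(100):
    for better, worse in pairs:
        margin = w @ better - w @ worse
        p = 1.0 / (1.0 + np.exp(-margin))       # P(model agrees with the human)
        w -= lr * (p - 1.0) * (better - worse)  # gradient step on -log p

print(w)  # the first coordinate dominates: the model learned the preference
```

The learned reward function can then stand in for human judgment when training a policy.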
The Meaning of Life
In this section, the speakers discuss the idea of an objective function and how it relates to the meaning of life. They also talk about how humans create their own objective functions based on their wants and desires.
Objective Function and Human Wants
- The question implies that there is an external answer for the meaning of life, but what's going on is that we exist and should try to make the most of it.
- Humans want things, and these wants create drives that cause them to act. These individual objective functions can change over time.
- There might be some kind of fundamental objective function from which everything else emerges, such as survival and procreation.
Regrets and Moments of Pride
- Everyone has regrets about choices they've made in the past with hindsight. However, at the time, they did the best they could.
- There are moments when people feel proud of what they've accomplished or done. These moments bring happiness.
Happiness
- Academic accomplishments can be a source of pride, but happiness comes from looking at things in a positive light.
- Happiness comes largely from our perspective on things rather than external factors.
- Being humble in the face of uncertainty seems to be part of achieving happiness.
Conclusion
In this section, Lex thanks Ilya Sutskever for his insights into machine learning and discusses Alan Turing's ideas on simulating child brains.
Alan Turing's Ideas on Machine Learning
- Instead of trying to simulate the adult mind, why not try to simulate the child's and subject it to an appropriate course of education? This could lead to obtaining the adult brain.
Final Thoughts
- Lex thanks Ilya Sutskever for his insights into machine learning and meaningful discussions on the meaning of life and happiness.