Yann LeCun | Self-Supervised Learning, JEPA, World Models, and the future of AI

Introduction to the Conference

Welcome and Context

  • Michael Freedman introduces the conference on the Geometry of Machine Learning at Harvard's CMSA, highlighting its significance and a change of venue due to high attendance.
  • He mentions Yann LeCun, Chief AI Scientist at Meta, as a key speaker whose participation encouraged other speakers to join.

Yann LeCun's Perspective on AI

Self-Identification and Audience Engagement

  • Yann LeCun admits he is not a mathematician or computer scientist but will discuss machine learning in an accessible manner for a general audience.

The Future of AI

  • He emphasizes the need for significant advancements in AI to approach human-level intelligence, noting current limitations in existing technologies.
  • LeCun envisions a future where AI assists individuals daily through wearable devices, necessitating systems with human-like intelligence for effective interaction.

Current Limitations of AI Architectures

Inefficiencies in Learning Techniques

  • Current AI architectures are inadequate compared to human and animal learning efficiency; traditional methods like supervised and reinforcement learning fall short.

Emergence of Self-Supervised Learning

  • Self-supervised learning has gained traction but primarily works with discrete symbols (e.g., language), lacking effectiveness with natural signals.

Challenges in Existing Models

Inference Limitations

  • Current models rely on fixed-layer feedforward propagation, limiting their ability to represent complex functions effectively.

Issues with Prediction Methods

  • Auto-regressive prediction methods lead to issues such as divergence or hallucination, further complicating model reliability.

Human-Like Intelligence Requirements

Mental Models and Goal-Oriented Behavior

  • Unlike current chatbots and LLMs, humans and animals possess mental models that guide behavior based on objectives and planning capabilities.

Understanding World Models and Learning in AI

The Need for Advanced Systems

  • We require systems that can understand the physical world, possess persistent memory, plan complex actions, reason effectively, and are controllable and safe.

Insights on Infant Learning

  • Infants develop basic concepts about the world within their first few months of life, such as object permanence and categorization of objects. This learning occurs without verbal communication.
  • By around nine months old, infants grasp intuitive physics concepts like gravity and inertia; they become surprised when witnessing unexpected physical behaviors (e.g., an unsupported object floating).

Challenges in Machine Learning

  • Current AI lacks the ability to learn as efficiently as human children do: we still have no fully autonomous robots or self-driving cars, even though a child acquires comparable skills quickly. Despite advances that let AI pass exams and solve math problems, it falls short on practical tasks.
  • A 10-year-old can perform household tasks with minimal training, while current technology struggles to replicate even basic dexterity found in animals like cats or primates.

The Paradox of AI Capabilities

  • Many tasks considered intellectually challenging for humans (like chess) are algorithmically simple compared to the complexity of physical interactions that machines struggle with. This highlights a significant gap between cognitive abilities and physical dexterity in AI systems.

Data Consumption Comparison

  • Large language models (LLMs) are trained on vast amounts of text (e.g., 30 trillion tokens); reading that much text would take a human hundreds of thousands of years. In contrast, a four-year-old has taken in a comparable volume of data through vision alone over just four years of life.
  • The redundancy present in visual data is crucial for training systems effectively; it allows them to capture the structure and dependencies necessary for learning through self-supervised methods. Relying solely on text will therefore not lead to human-level AI.
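The "hundreds of thousands of years" figure can be sanity-checked with a back-of-envelope calculation. The words-per-token ratio and reading speed below are illustrative assumptions, not figures from the talk:

```python
# Back-of-envelope check of the "hundreds of thousands of years" claim.
# Assumed figures (not from the talk): ~0.75 words per token and a
# nonstop reading speed of ~250 words per minute.
tokens = 30e12                       # 30 trillion training tokens
words = tokens * 0.75                # rough words-per-token conversion
minutes = words / 250                # nonstop reading at 250 wpm
years = minutes / (60 * 24 * 365)

print(f"{years:,.0f} years of nonstop reading")
```

Under these assumptions the result lands on the order of 10^5 years, consistent with the claim.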

Future Directions for Robotics

  • Creating useful robots that perform complex tasks autonomously will require advances well beyond current technologies, which rely heavily on additional sensors and mapping techniques. Despite impressive demonstrations seen online, many companies do not yet know how to make these robots genuinely intelligent.

AI Inference and Optimization

Limitations of Current AI Systems

  • Current AI systems excel in narrow tasks but require careful training, indicating a significant opportunity for researchers to advance AI without massive investments.
  • The limitations of inference arise from forward propagation through a fixed number of operations, necessitating more sophisticated computational methods.

Proposed Inference Methodology

  • A preferable design involves extracting information from inputs to create representations, followed by using a large neural network to assess compatibility with proposed outputs.
  • For example, if an image of an elephant is provided as input together with the label "elephant," the output should ideally be zero; any other label should yield a higher output, indicating incompatibility.

Search and Optimization in Inference

  • In this model, inference is performed by a search process that minimizes the scalar output (an energy function), in contrast to traditional forward-propagation methods.
  • This optimization approach aligns with classical AI techniques like probabilistic inference and planning problems, which can often be framed as optimization challenges.
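The inference-by-energy-minimization idea can be sketched in a few lines. The quadratic energy below is a toy stand-in for a learned compatibility network, and gradient descent stands in for whatever search procedure a real system would use:

```python
# Toy illustration of inference-as-optimization: instead of computing
# the answer in one forward pass, search for the output y that
# minimizes a scalar energy E(x, y).
def energy(x, y):
    # Low (zero) when y is compatible with x; higher otherwise.
    return (y - 2.0 * x) ** 2

def infer(x, y0=0.0, lr=0.1, steps=200):
    """Gradient descent on y, with x held fixed."""
    y = y0
    for _ in range(steps):
        grad = 2.0 * (y - 2.0 * x)   # dE/dy for the toy energy
        y -= lr * grad
    return y

print(infer(3.0))  # converges toward 6.0, the minimum-energy output
```

Note that the amount of computation now depends on the search, not on a fixed number of layers, which is exactly the property fixed-depth feedforward inference lacks.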

Zero-Shot Learning and Human Cognition Models

  • The optimization method supports zero-shot learning—solving new problems without prior specific training—similar to human cognitive processes described by dual-system theories (System 1 vs. System 2).
  • System 1 refers to instinctive decision-making while System 2 involves deliberate reasoning; current language models do not effectively replicate this nuanced cognitive process.

Limitations of Large Language Models (LLMs)

  • LLMs operate on sequences of symbols through auto-regressive prediction but face inherent limitations because a fixed amount of computation is allocated per generated token.
  • Techniques like "chain of thought" extend computation time by generating more tokens, though this is considered a workaround rather than a solution.

Challenges in Token Generation

  • Auto-regressive generation leads to divergent processes where predicting subsequent tokens becomes uncertain due to the vast probability distribution over potential outputs.
  • Each generated token risks moving outside the correct sequence tree, complicating accurate response generation as it may lead away from valid answers.

Understanding Language Models and World Models

The Limitations of Language Models

  • The hypothesis suggests that the probability of a sequence being correct decreases exponentially with the number of symbols, assuming errors are independent.
  • Current language models (LMs) generate responses without deep cognitive processing, unlike humans who conceptualize answers before articulating them.
  • GPT-style models are trained to reproduce their input sequence shifted by one position; causal masking prevents them from learning a trivial identity function, since each prediction may rely only on previous symbols.
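The exponential-decay hypothesis is easy to make concrete: if each token is wrong with independent probability e, a correct n-token sequence survives with probability (1 - e)^n. A minimal numeric illustration:

```python
# Numeric illustration of the compounding-error hypothesis: if each
# generated token is wrong with independent probability e, the chance
# the whole sequence stays on a correct path is (1 - e)**n, which
# decays exponentially in the sequence length n.
def p_correct(e, n):
    return (1.0 - e) ** n

for n in (10, 100, 1000):
    print(n, p_correct(0.01, n))
```

Even a 1% per-token error rate leaves only about a 37% chance of a fully correct 100-token answer, and essentially none at 1000 tokens, under the independence assumption stated above.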

Advantages and Challenges in Training

  • GPT architecture allows efficient training over long sequences, making it popular despite its limitations.
  • A key goal is to emulate human-like world models that can predict future states based on current observations and actions taken.

Concept of World Models

  • A world model represents the current state derived from past observations and predicts future states based on imagined actions.
  • Training a world model involves minimizing prediction errors between observed states and predicted next states using an encoder-predictor framework.
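The encoder-predictor training loop can be sketched on a toy system. Everything here is illustrative: the "world" is a hidden linear rule, the encoder is the identity, and plain SGD minimizes the squared prediction error:

```python
import random

# Toy world-model training loop: learn to predict the next state from
# (state, action) pairs by minimizing squared prediction error.
# The hidden dynamics and all constants are illustrative.
TRUE_A, TRUE_B = 0.9, 0.5            # hidden rule: s' = 0.9*s + 0.5*u

def rollout(n):
    data, s = [], 0.0
    for _ in range(n):
        u = random.uniform(-1, 1)    # random action
        s_next = TRUE_A * s + TRUE_B * u
        data.append((s, u, s_next))
        s = s_next
    return data

def train(data, lr=0.02, epochs=300):
    a, b = 0.0, 0.0                  # predictor parameters
    for _ in range(epochs):
        for s, u, s_next in data:
            err = (a * s + b * u) - s_next   # prediction error
            a -= lr * err * s                # SGD on squared error
            b -= lr * err * u
    return a, b

random.seed(0)
a, b = train(rollout(200))
print(round(a, 2), round(b, 2))      # recovers the hidden dynamics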

Predicting Future Events in Videos

  • Researchers aim to train generative models to predict subsequent frames in videos by masking parts of the input video during training.
  • The challenge lies in predicting detailed outcomes at pixel levels for complex scenarios, as many plausible futures exist.

Limitations in Video Prediction

  • Neural networks often produce blurry predictions when tasked with forecasting future frames due to uncertainty about potential actions (e.g., acceleration or braking).
  • Techniques involving latent variables can improve predictions by allowing neural networks to explore multiple plausible futures but still struggle with natural video complexities.

Real-world Application Challenges

  • In practical scenarios, such as predicting details from a room video, systems fail to accurately forecast specific textures or appearances due to inherent unpredictability.

Understanding Predictive Models in Video Analysis

The Challenge of Pixel-Level Prediction

  • The task of predicting pixel-level details in natural signals, especially video, is deemed nearly impossible due to the complexity involved.
  • LLMs do not predict a single token but a probability distribution over tokens; representing such distributions becomes intractable in high-dimensional continuous spaces such as video.

Joint Embedding Predictive Architecture (JEPA)

  • A proposed solution is the Joint Embedding Predictive Architecture (JEPA), which predicts representations instead of individual pixels.
  • This architecture utilizes an encoder to process video data while training a predictor to minimize prediction error within representation space, simplifying the prediction task by abstracting unnecessary details.
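A structural sketch of this objective, with trivial stand-ins for the encoder and predictor, shows where the loss lives: in representation space rather than input space. A real system also needs anti-collapse machinery, discussed later in the talk:

```python
# Structural sketch of a JEPA-style objective: the loss is the
# prediction error in *representation* space, not pixel space.
# Encoder and predictor here are trivial stand-ins, just to show
# where each piece sits.
def encoder(x):
    return [sum(x) / len(x)]        # toy 1-d representation

def predictor(s_x):
    return [s_x[0]]                 # toy predictor: identity

def jepa_loss(x, y):
    s_x, s_y = encoder(x), encoder(y)
    pred = predictor(s_x)
    # squared error between predicted and actual representations
    return sum((p - t) ** 2 for p, t in zip(pred, s_y))

print(jepa_loss([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0: representations match
```

Note that the loss is zero even though the two inputs differ pixel by pixel: the encoder has abstracted away details the predictor is not asked to reproduce.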

Representation and Prediction in Science

  • The essence of understanding our world lies in finding representations that facilitate predictions without needing all intricate details; this principle is foundational to scientific inquiry.
  • For example, predicting planetary trajectories simplifies when using appropriate representations like elliptical orbits rather than complex motion patterns. Understanding these abstractions allows for effective long-term predictions with minimal information.

Hierarchical Abstractions in Science

  • Scientific knowledge is built upon hierarchical abstractions where each level enables broader predictions while discarding lower-level details; this hierarchy spans from particles to ecosystems.
  • In physics, concepts such as renormalization group theory and entropy illustrate how higher-level predictions can be made without accounting for every detail at lower levels, emphasizing the importance of abstraction in scientific fields.

Future Directions in AI Research

  • The discussion transitions into how these predictive models can inform intelligent systems capable of anticipating outcomes based on their actions, hinting at future developments in AI research over the next decade.

Intelligent AI Agents and World Models

Understanding Intelligent AI Agents

  • An intelligent AI agent observes the world through a perception system, which provides insights into its current state. However, it also relies on memory to understand aspects of the world that are not currently perceivable.
  • The agent utilizes a world model to predict the outcomes of imagined actions, generating a sequence of potential states resulting from these actions.

Task Objectives and Energy Functions

  • A task objective is defined as an energy function that quantifies how well a goal has been achieved. It produces a scalar value indicating success (zero) or failure (a positive number).
  • Additional cost functions serve as guardrails to ensure safety in action execution. For instance, if tasked with getting coffee, the robot must avoid harmful actions towards obstacles like people.
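The combined objective can be sketched as a single scalar cost: a task term that is zero at the goal plus a hardwired guardrail term. All names and numbers below are illustrative:

```python
# Sketch of the combined objective: a task energy that is zero when
# the goal is reached, plus a hardwired guardrail cost that penalizes
# unsafe states regardless of the task.
def task_cost(state, goal):
    return (state["pos"] - goal) ** 2          # zero exactly at the goal

def guardrail_cost(state):
    # large penalty whenever the agent gets too close to a person
    return 1000.0 if abs(state["pos"] - state["person_pos"]) < 1.0 else 0.0

def total_cost(state, goal):
    return task_cost(state, goal) + guardrail_cost(state)

safe = {"pos": 4.0, "person_pos": 0.0}
unsafe = {"pos": 0.5, "person_pos": 0.0}
print(total_cost(safe, goal=4.0))    # 0.0: at the goal, safely
print(total_cost(unsafe, goal=4.0))  # guardrail term dominates
```

Because the guardrail is part of the objective rather than of the learned policy, the planner can never trade it away for task progress.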

Guardrails in Action Planning

  • The robot's operation is constrained by hardwired guardrail objectives, ensuring it cannot deviate from safe actions while searching for an optimal action sequence.
  • This approach exemplifies classical planning used in robotics, where optimization influences decision-making processes.

Predictive Modeling and Control

  • A world model capable of making predictions can be applied iteratively to plan sequences of actions using learned models rather than traditional handwritten equations.
  • Classical optimal control methods like Nonlinear Model Predictive Control (NMPC) face challenges when dealing with complex neural network-based world models due to irregularities in cost functions.
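Planning with a learned world model can be sketched with random-shooting MPC: sample candidate action sequences, roll each through the model, and keep the best. Random search is used here precisely because, as noted above, gradient-based optimization can struggle with rough learned cost landscapes; the one-dimensional model is a stand-in:

```python
import random

# Random-shooting MPC with a stand-in world model: imagine many action
# sequences, score each by predicted distance to the goal, keep the best.
def model(state, action):
    return state + action            # stand-in for a learned predictor

def plan(state, goal, horizon=5, n_candidates=2000):
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = [random.uniform(-1, 1) for _ in range(horizon)]
        s = state
        for a in seq:                # imagined rollout, no real actions
            s = model(s, a)
        cost = (s - goal) ** 2       # terminal task cost
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

random.seed(0)
seq, cost = plan(state=0.0, goal=3.0)
print(len(seq), round(cost, 4))      # 5 actions; small residual cost
```

All the rollouts happen inside the model; only the winning sequence would ever be executed in the world.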

Challenges in Optimization and Non-determinism

  • The complexity arises from discrete choices leading to significant variations in costs despite minor changes in initial actions. This poses major optimization issues within control systems.
  • Handling non-determinism involves latent variables sampled from distributions, which complicates planning under uncertainty because rollouts are no longer deterministic.

Future Aspirations for Hierarchical Models

  • The ultimate goal is to develop hierarchical models for advanced planning capabilities; however, this remains an unachieved aspiration within the field at present.

Hierarchical Planning and Cognitive Architectures

The Concept of Hierarchical Planning

  • The speaker discusses the challenge of planning a journey from New York to Paris, emphasizing that detailed actions cannot be planned at a micro-level but rather at a higher abstraction level.
  • To reach Paris, one must first understand the process of going to the airport and catching a plane, which involves abstract models of transportation.
  • A sub-goal emerges: getting to the airport. This shifts focus from distance to Paris to distance to the airport, illustrating how objectives can change based on context.
  • The speaker outlines specific steps needed to achieve this sub-goal, such as taking an elevator and navigating through obstacles in an office environment.
  • At certain points in this hierarchy, individuals may rely on instinctive actions (System 1), indicating that not all planning requires formal structure.

Training Systems for Hierarchical Planning

  • The discussion transitions into how hierarchical planning necessitates world models operating at various timescales and levels of abstraction.
  • Cognitive architectures are introduced as frameworks combining perception, memory (akin to human hippocampus), world models (linked with prefrontal cortex), and cost functions for action sequence searches.
  • The training of these world models is proposed through self-supervised learning methods, highlighting their evolution since the early 1990s with Siamese networks.

Energy Functions in Learning Models

  • A conceptual framework is presented where systems produce scalar outputs representing energy levels that measure compatibility between observed input-output pairs (X and Y).
  • The goal is for learning machines to generate low energy outputs for known training samples while producing higher energies for untrained samples; achieving this balance is complex.
  • An implicit function approach allows representation of dependencies between variables X and Y without requiring a direct functional mapping due to potential multiple outcomes for given inputs.
  • Caution is advised against merely minimizing prediction error during training as it could lead systems to ignore variable relationships entirely.


Energy Function and Its Challenges

Energy Minimization in Neural Networks

  • The energy function can collapse if trained solely to minimize the energy of training samples, leading to a flat energy surface.
  • To prevent this collapse, contrastive methods generate points outside the data manifold and push the energy up at those points.
  • A significant challenge arises when the dimensionality of the space is high; generating enough contrastive points becomes exponentially difficult and inefficient.
  • Regularized methods are preferred as they include a term that minimizes low-energy volume, ensuring that reducing energy in one area necessitates an increase elsewhere.

Transitioning from Energy-Based Models

  • Energy-based models can be converted into probabilistic models via the Gibbs distribution, but the normalization term is often intractable for reasonable energy functions.
  • A variety of proposed methods exist within this framework, categorized as either contrastive or regularized; however, detailed exploration of these methods is not covered here.

Self-Supervised Learning Techniques

Contrastive Learning for Image Representation

  • Self-supervised learning can utilize contrastive techniques to train systems for tasks like image recognition through joint embedding architectures.
  • After training, only the encoder is used to produce representations while a simple classifier is added on top for supervised learning tasks such as depth estimation.

Mechanism of Contrastive Methods

  • Contrastive methods involve presenting pairs of images (original vs. distorted/corrupted), training the system to reconstruct original images from their altered versions.
  • A loss function pulls similar representations together while pushing dissimilar ones apart; however, these methods typically do not yield representations beyond 200 dimensions when applied to datasets like ImageNet.
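The pull/push mechanism can be shown with a toy margin-based contrastive loss on 2-d points; a real system would apply this to encoder outputs over batches of augmented images, and the margin value below is an arbitrary choice:

```python
import math

# Toy margin-based contrastive loss on 2-d embeddings: pull the
# positive pair together, push negatives apart until they clear a
# margin. Negatives already beyond the margin contribute nothing.
def dist(a, b):
    return math.dist(a, b)

def contrastive_loss(anchor, positive, negatives, margin=1.0):
    pull = dist(anchor, positive) ** 2
    push = sum(max(0.0, margin - dist(anchor, n)) ** 2 for n in negatives)
    return pull + push

anchor, positive = [0.0, 0.0], [0.1, 0.0]
negatives = [[2.0, 0.0], [0.2, 0.0]]
print(round(contrastive_loss(anchor, positive, negatives), 3))
```

The curse of dimensionality mentioned above shows up here: in high dimensions, covering the space with enough informative negatives becomes exponentially expensive.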

Distillation Methods and Their Efficacy

Understanding Distillation Techniques

  • Distillation methods have shown greater success than traditional contrastive approaches but lack comprehensive theoretical understanding regarding their effectiveness.
  • In distillation, an input is transformed into a different version which then passes through two encoders with slightly different weights; prediction error minimization occurs without backpropagating gradients through one encoder.

Weight Management in Distillation

  • The second encoder's weights are maintained as a running average over time rather than updated rapidly via gradient descent, creating stability during training.
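The running-average update is simple to state precisely: the teacher's weights track an exponential moving average of the student's. The momentum value below is illustrative:

```python
# Sketch of the distillation-style weight update: the second
# ("teacher") encoder is not trained by gradient descent; its weights
# track an exponential moving average of the student's weights.
def ema_update(teacher_w, student_w, momentum=0.99):
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_w, student_w)]

teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(500):                 # student held fixed for illustration
    teacher = ema_update(teacher, student)
print([round(w, 3) for w in teacher])  # teacher has drifted to the student
```

The slow drift is what gives the student a stable, slightly stale target to predict, which is believed to be part of why these methods avoid collapse.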

Theoretical Insights and Observations

  • Despite the absence of an explicit objective function being minimized during training, distillation techniques perform well; this phenomenon remains somewhat mysterious.
  • Some theoretical work suggests fixed points exist within linear encoders that prevent collapse during training processes.

DINO: A Breakthrough in Self-Supervised Learning

Overview of DINO

  • DINO, developed by researchers in France at FAIR Paris, demonstrates impressive results when trained on distorted versions of ImageNet and large datasets.
  • Recent findings indicate that self-supervised systems like DINO can match or exceed the performance of supervised systems while requiring significantly less labeled data.

Advantages of Self-Supervised Learning

  • Investing in self-supervised learning methods and collecting unlabeled data is more beneficial than spending resources on labeling data for traditional supervised learning.
  • DINO generates generic image representations applicable across various fields such as medical imaging, biological analysis, astrophysics, and remote sensing.

Application in World Modeling

  • Research led by Lerrel Pinto at NYU explores using DINO's representations to train world models for robotic planning tasks.
  • The approach involves feeding images into the DINO encoder to predict future states based on actions taken by a robot.

Performance Insights

  • The system shows improved performance over previous models like Dreamer V3 from DeepMind by optimizing action sequences to minimize the distance to target images.
  • A demo illustrates the system's ability to plan actions effectively within complex environments, achieving goals with limited action steps.

Zero-Shot Learning Capabilities

  • The model showcases zero-shot learning capabilities, allowing it to accomplish tasks without prior training or reinforcement learning.
  • Related work by Amir Bar focuses on navigation tasks where robots predict their next view based on transformations of previous views.

Future Directions in Training Models

  • Ongoing research includes advancements like V-JEPA (Video Joint Embedding Predictive Architecture), which utilizes two encoders sharing weights for efficient training and representation prediction.

MAE (Masked Autoencoder): An Overview of Video Representation Learning

Introduction to the Masked Autoencoder (MAE)

  • The MAE (Masked Autoencoder) is compared against an alternative project by colleagues at FAIR, highlighting its effectiveness in reconstructing images.
  • This model operates by corrupting images through patch removal and training a system to predict the full image from these corrupted inputs.

Advancements in Video Representation

  • A recent version extends this concept to video, where sections of the video are masked, and the system learns to predict the complete representation from partial data.
  • The model can learn common sense reasoning; for instance, it recognizes impossible events in videos, such as an object disappearing unexpectedly.

Common Sense Learning in Models

  • The ability of the model to detect unrealistic scenarios indicates a level of intuitive physics learned completely unsupervised.
  • A new iteration, V-JEPA 2, includes phases for training on video and action-conditioned prediction useful for robotic planning.

Robotic Action Planning

  • Demonstrations show how the system plans actions in unfamiliar environments without prior calibration or knowledge about camera positioning.
  • The goal-oriented behavior is exemplified by moving objects like cups on tables through planned sequences.

Techniques for Preventing System Collapse

  • Current research focuses on regularization methods that prevent systems from collapsing during training by maximizing information output from encoders.
  • Two strategies are proposed: ensuring diverse sample representations (rows differ), and maintaining unique variable information across representations (columns differ).

Maximizing Information Content

  • Techniques involve computing Gram matrices and covariance matrices to ensure orthogonality among samples and variables respectively.
  • These methods aim to maximize the information content of representations, but face a fundamental difficulty: such quantities can only be estimated via upper bounds, and maximizing an upper bound does not guarantee the true information content increases.
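The two criteria (rows differ, columns differ) can be sketched as explicit penalties in the style of variance/covariance regularization methods; the threshold and the tiny batch below are illustrative:

```python
# Sketch of the two anti-collapse regularizers: keep each embedding
# dimension's variance above a threshold (samples differ) and push
# off-diagonal covariances toward zero (dimensions carry distinct
# information). Pure Python, on a tiny batch of 2-d embeddings.
def variance_penalty(batch, target=1.0):
    d = len(batch[0])
    penalty = 0.0
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / (len(col) - 1)
        penalty += max(0.0, target - var ** 0.5)   # hinge on std dev
    return penalty

def covariance_penalty(batch):
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    penalty = 0.0
    for j in range(d):
        for k in range(j + 1, d):
            cov = sum((row[j] - means[j]) * (row[k] - means[k])
                      for row in batch) / (n - 1)
            penalty += cov ** 2
    return penalty

collapsed = [[0.5, 0.5]] * 4                # every sample identical
spread = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
print(variance_penalty(collapsed), variance_penalty(spread))
print(covariance_penalty(spread))
```

A collapsed batch is maximally penalized by the variance term, while the spread, decorrelated batch incurs no penalty from either term.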

Recommendations for Future Research Directions

  • Suggestions include abandoning traditional generative models in favor of joint embedding predictive architectures that focus on representation learning rather than input space prediction.
  • Emphasis is placed on using regularized methods over reinforcement learning due to inefficiencies associated with trial-and-error approaches.

Video description

Geometry of Machine Learning Special Lecture 9/16/2025 Speaker: Yann LeCun, NYU & META Title: Self-Supervised Learning, JEPA, World Models, and the future of AI