The F=ma of Artificial Intelligence [Backpropagation]

The Evolution of Back Propagation in Neural Networks

Introduction to Back Propagation

  • In the early 1970s, Paul Werbos, a Harvard graduate student, discovered back propagation, a method for training multi-layer neural networks. He likened it to Newton's laws, suggesting its fundamental importance in understanding intelligence.

Initial Rejection and Subsequent Success

  • AI pioneer Marvin Minsky dismissed Werbos's discovery, arguing that back propagation could not learn complex tasks. Despite this skepticism, back propagation proved effective across a wide range of applications over the following decades.

Impact on Modern AI Models

  • Today, nearly all modern AI models utilize back propagation for training. The video illustrates how Meta's Llama 3.2 model processes input data and updates its parameters through this algorithm.

Mechanism of Back Propagation

  • The algorithm adjusts the model's 1.2 billion parameters based on how well the model predicts the next token in the input text, identifying which connections within the model need updating to improve prediction accuracy.

Importance of Attention Patterns

  • Back propagation effectively modifies weights associated with attention patterns in the model. For instance, it focuses on key tokens like "capital" and "France" when predicting "Paris," showcasing its ability to learn complex behaviors.

Understanding Loss Landscapes and Training Approaches

Transitioning from Visual to Mathematical Approaches

  • The series shifts from visual representations of loss landscapes to a mathematical approach for understanding back propagation. This transition aims to clarify how these algorithms function at a deeper level.

Simplifying Learning Problems

  • To illustrate back propagation’s mechanics, the discussion simplifies the learning problem by using GPS coordinates instead of text inputs. This allows for clearer demonstrations of how models predict outcomes based on numerical data.

Model Architecture Overview

  • A smaller model is introduced that predicts cities based on longitude coordinates using three neurons—one for each city (Paris, Madrid, Berlin). Each neuron operates as a simple linear equation involving weights and biases.

Understanding the Learning Algorithm

Adjusting Model Parameters

  • The learning algorithm's role is to adjust the parameters m and b to solve the overall task, with neuron outputs labeled h instead of y.
  • Outputs from neurons can be passed into various functions; the final model outputs y_hat must represent probabilities between 0 and 1.

Softmax Function

  • The softmax function transforms neuron outputs into probabilities by exponentiating each output and normalizing them.
  • For example, if neuron outputs are 1, 2, and 1, softmax assigns a probability of 58% to Paris. If values change to 1, 10, and 1, Paris gets a probability of 99.98%.
  • Softmax amplifies differences in neuron outputs; although the function looks complicated, calculus makes it straightforward to differentiate.
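The softmax computation described above can be sketched in a few lines of Python (a minimal illustration, not the video's code):

```python
import numpy as np

def softmax(h):
    # Subtract the max before exponentiating for numerical stability;
    # softmax is unchanged by shifting all inputs by a constant.
    e = np.exp(h - np.max(h))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 1.0])))   # middle entry ≈ 0.58
print(softmax(np.array([1.0, 10.0, 1.0])))  # middle entry ≈ 0.9998
```

Note how raising one output from 2 to 10 pushes its probability from roughly 58% to nearly certain, matching the amplification behavior described above.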

Initial Model Predictions

  • Before training, model weights are initialized randomly; in the example, the slope parameters start at m_1 = 1, m_2 = 0, m_3 = -1.
  • An example prediction using these coordinates incorrectly assigns a high probability to Madrid (0.91).

Measuring Model Performance

  • Cross entropy loss is used as a performance metric: it calculates the negative logarithm of predicted probabilities.
  • For instance, if the predicted probability for Paris is only 8.6%, cross entropy loss equals approximately 2.45.
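The loss calculation is a one-liner; a small illustrative sketch:

```python
import math

# Cross entropy loss depends only on the probability the model
# assigned to the correct answer.
def cross_entropy(p_correct):
    return -math.log(p_correct)

print(cross_entropy(0.086))  # ≈ 2.45
```

A perfect prediction (probability 1.0) gives a loss of 0, and the loss grows without bound as the probability of the correct answer approaches 0.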

Gradient Descent Process

  • To improve predictions, adjustments are made to parameters based on computed slopes relative to loss.
  • By plotting loss against parameter changes (like increasing m_2), insights about improving model performance can be gained.

Optimizing Parameter Updates

Numerical Estimates vs Exact Solutions

  • While numerical estimates for slopes can guide updates in gradient descent, they are computationally intensive and potentially inaccurate.

Historical Context of Backpropagation

  • In the late '50s at Stanford, early neural network training by Bernard Widrow and Ted Hoff relied on numerical slope estimates; backpropagation, discovered later, made these expensive estimates unnecessary.

Calculus Application in Backpropagation

  • Backpropagation efficiently computes these slopes using calculus: it takes partial derivatives of the loss with respect to model parameters like m_2.

Loss Function Derivation

  • The relationship between cross entropy loss and model output probabilities allows for deriving complete equations linking inputs with all parameters.

Understanding Backpropagation in Neural Networks

The Chain Rule and Its Application

  • The process of backpropagation is complex but can be simplified by treating each layer of a neural network independently, allowing for efficient calculation of derivatives.
  • A simple example illustrates this: two compute blocks, where the first maps input x to output y with y = 2x, and the second maps y to z with z = 4y. Combined, these give z = 8x.
  • Instead of combining equations directly, we can compute derivatives separately for each block: dy/dx = 2 from the first block and dz/dy = 4 from the second. The chain rule then gives dz/dx = dz/dy * dy/dx = 4 * 2 = 8.
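The two-block chain-rule example can be checked directly in Python:

```python
# Two compute blocks from the example: y = 2x, then z = 4y.
def block1(x):
    return 2 * x

def block2(y):
    return 4 * y

# Local derivatives of each block:
dy_dx = 2.0
dz_dy = 4.0

# Chain rule multiplies the local derivatives:
dz_dx = dz_dy * dy_dx
print(dz_dx)  # 8.0

# Sanity check with a finite difference on the composed function:
x, eps = 3.0, 1e-6
numeric = (block2(block1(x + eps)) - block2(block1(x))) / eps
print(round(numeric, 3))  # ≈ 8.0
```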

Simplifying Derivatives in Neural Networks

  • In neural networks, we apply similar principles by breaking down derivatives into manageable parts. Specifically, we separate out partial derivatives related to linear models and loss functions.
  • The logarithm from cross-entropy loss and exponentials from softmax operations effectively cancel out during calculations, simplifying our results significantly.
  • The resulting partial derivative for our model's output probabilities becomes straightforward: it equals dL/dh = y_hat - y , where y_hat represents predicted probabilities.
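This simplified derivative can be verified numerically; the sketch below uses hypothetical output values and assumes the second class is the correct one:

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

h = np.array([1.0, 2.0, 1.0])   # raw neuron outputs (hypothetical)
y = np.array([0.0, 1.0, 0.0])   # one-hot label: second class is correct

y_hat = softmax(h)
analytic = y_hat - y            # the simplified derivative dL/dh

# Finite-difference check against the full loss L = -log(softmax(h)[correct])
eps = 1e-6
numeric = np.zeros_like(h)
for i in range(len(h)):
    hp = h.copy()
    hp[i] += eps
    numeric[i] = (-np.log(softmax(hp)[1]) - (-np.log(y_hat[1]))) / eps

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```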

One-Hot Encoding and Its Implications

  • Using one-hot encoding for ground truth labels defines a vector that is 1 at the correct city and 0 elsewhere; for example, if Paris corresponds to the second neuron, the label is y_1 = 0, y_2 = 1, y_3 = 0.
  • Calculating partial derivatives reveals insights about how changes in outputs affect loss; e.g., increasing certain outputs will either increase or decrease loss based on their respective values.

Understanding Model Parameters' Impact

  • To derive further insights into model training, we need to understand how intermediate outputs relate to model parameters through equations like h_2 = m_2 * x + b_2.
  • Here, considering parameters as variables rather than constants helps clarify how learning occurs—specifically how changes in parameters influence neuron outputs based on input values.

Finalizing Gradient Computation

  • By replacing terms in our expressions with derived results (e.g., substituting for slopes), we can compute complete expressions for gradients needed during training.
  • For example, calculating gradient values shows that if parameter adjustments are made (like increasing parameter m2), expected changes in loss can be quantified accurately.
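Putting the pieces together, the gradient for each slope follows from the chain rule as dL/dm_i = (y_hat_i - y_i) * x, and for each bias as dL/db_i = y_hat_i - y_i. A minimal sketch (the input value and label assignment are hypothetical, not the video's exact numbers):

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

x = 2.35                          # longitude input (hypothetical value)
m = np.array([1.0, 0.0, -1.0])    # initial slopes from the example
b = np.zeros(3)                   # biases
y = np.array([0.0, 1.0, 0.0])     # one-hot label for the correct city

h = m * x + b                     # each neuron: h_i = m_i * x + b_i
y_hat = softmax(h)

# Chain rule: dL/dm_i = (dL/dh_i) * (dh_i/dm_i) = (y_hat_i - y_i) * x
grad_m = (y_hat - y) * x
grad_b = (y_hat - y)
print(grad_m, grad_b)
```

Notice the gradients on the bias terms always sum to zero, since the predicted probabilities and the one-hot label both sum to one.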

Gradient Descent and Back Propagation in Model Training

Understanding Gradient Descent

  • To minimize loss in model training, parameters are adjusted in the direction opposite to the gradient. In the example, this means reducing m_1 and increasing m_2 to decrease loss.
  • The process of gradient descent is visualized as moving downhill on a loss landscape. The learning rate, a scaling factor for gradients, controls the step size taken during this process.
  • Gradients provide local slope information; however, loss landscapes can be complex with rapidly changing slopes. This complexity makes simple downhill visualization incomplete.
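The update rule itself is simple; a sketch of a single step (the gradient values here are hypothetical placeholders):

```python
# One gradient-descent step: nudge each parameter in the direction
# opposite its gradient, scaled by the learning rate (step size).
def sgd_step(params, grads, learning_rate=0.1):
    return [p - learning_rate * g for p, g in zip(params, grads)]

params = [1.0, 0.0, -1.0]
grads = [0.9, -0.9, 0.0]          # hypothetical gradient values
print(sgd_step(params, grads))    # approximately [0.91, 0.09, -1.0]
```

A smaller learning rate takes more cautious steps; a larger one moves faster but risks overshooting on a rapidly changing loss landscape.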

Visualizing Learning Process

  • Gradients will be visualized as bars around connections in the model, where thicker bars indicate larger gradients. A heat map will show model outputs overlaid on a geographical map.
  • Initial training points are plotted on the map with probabilities for cities represented by colors: cyan for Madrid, yellow for Paris, and green for Berlin.
  • For an initial input from Paris, a high output probability is incorrectly assigned to Madrid. This results in significant gradients on the m_1 and m_2 parameters.

Adjusting Model Parameters

  • As training progresses through multiple steps (about 40), the model begins to correctly classify cities like Madrid and Berlin while adjusting its predictions for Paris.
  • Gradients decrease as errors reduce; thus, adjustments become smaller as the model learns more accurately over time.

Analyzing Neuron Behavior

  • Each neuron's linear models evolve during training; initially set slopes change based on learning outcomes—Madrid neuron shifts downhill while others adjust accordingly.
  • The height of each neuron's line corresponds to its output value at given inputs. The softmax function amplifies the highest values leading to final predictions.

Scaling Complexity with Additional Inputs

  • Back propagation scales effectively to larger problems, such as adding new cities (e.g., Barcelona) and using both longitude and latitude inputs (x_1 and x_2).
  • With the additional parameters introduced (two slope values per neuron), each neuron now represents a plane rather than a line, significantly increasing the total number of parameters to update.
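With two inputs, each neuron computes h_i = m_i1 * x_1 + m_i2 * x_2 + b_i, a plane over the map. A sketch with placeholder weights (not the video's values):

```python
import numpy as np

x = np.array([2.35, 48.85])   # (longitude, latitude) — hypothetical input
M = np.array([[ 1.0,  0.5],   # 4 cities × 2 slope values per neuron
              [ 0.0,  1.0],   # (placeholder values for illustration)
              [-1.0,  0.5],
              [ 0.5, -1.0]])
b = np.zeros(4)

# Each neuron's output is now a plane over the two inputs:
h = M @ x + b
print(h.shape)  # one output per city
```

Softmax and backpropagation work on this model unchanged; only the shapes of the parameter arrays grow.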

Training New Models

  • The updated two-input model quickly adapts by fitting planes corresponding to four cities accurately across the map regions designated for each city.

Understanding Complex Patterns in Language Models

The Challenge of Complex Borders

  • The discussion begins with the analogy of planes representing different regions, highlighting the limitations of simple models in understanding complex patterns.
  • A specific example is given regarding the intricate border between Belgium and the Netherlands, illustrating how a single model for each country fails to accurately represent these complexities.

Token Representation in Language Models

  • The Llama model represents tokens (e.g., city names) as vectors of 2,048 floating-point numbers; these vectors are the model's internal representation of language.
  • When inputting text like "the capital of France is," the model generates a vector that progressively aligns with the vector for "Paris" through its layers.

High-Dimensional Language Mapping

  • By analyzing various training examples leading to next tokens like Paris or Madrid, a set of 2,048-dimensional vectors is created, representing coordinates in a high-dimensional language space.
  • The UMAP algorithm projects these vectors from 2,048 dimensions down to two, revealing interesting clusters based on context (e.g., references to treaties or cultural works).

Learning and Model Adaptation

  • Just as disconnected regions must map correctly within their respective countries, language models must adapt their internal representations to ensure accurate predictions for next tokens.
  • Historical context is provided by referencing Marvin Minsky's skepticism towards backpropagation due to its slow convergence and perceived limitations; however, this underestimates its potential for solving complex problems today.

Personal Journey and Educational Goals

  • The speaker shares his personal journey: he stepped away from Welch Labs in 2019 due to financial challenges, worked as a machine learning engineer, and has since returned to education-focused content creation.
  • Reflecting on past educational experiences reveals a desire to improve math and science education through engaging content while acknowledging the need for sustainable business practices first.

Building Support Through Community Engagement

  • In an effort to support his educational initiatives financially, the speaker discusses strategies such as sponsorships and Patreon support aimed at achieving full-time engagement with Welch Labs projects.
Video description

Take your personal data back with Incogni! Use code WELCHLABS and get 60% off an annual plan: http://incogni.com/welchlabs

New Patreon Rewards 29:48 - own a piece of Welch Labs history! https://www.patreon.com/welchlabs

Books & Posters: https://www.welchlabs.com/resources

Sections
0:00 - Intro
2:08 - No more spam calls w/ Incogni
3:45 - Toy Model
5:20 - y=mx+b
6:17 - Softmax
7:48 - Cross Entropy Loss
9:08 - Computing Gradients
12:31 - Backpropagation
18:23 - Gradient Descent
20:17 - Watching our Model Learn
23:53 - Scaling Up
25:45 - The Map of Language
28:13 - The time I quit YouTube
29:48 - New Patreon Rewards!

Special Thanks to Patrons: https://www.patreon.com/welchlabs
Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich, Mitch Jacobs, Lauren Steely

References
Werbos, P. J. (1994). The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. United Kingdom: Wiley. Newton quote is on p. 4; Werbos expands on the analogy on p. 4.
Olazaran, Mikel. "A sociological study of the official history of the perceptrons controversy." Social Studies of Science 26.3 (1996): 611-659. Minsky quote is on p. 393.
Widrow, Bernard. "Generalization and information storage in networks of Adaline neurons." Self-Organizing Systems (1962): 435-461.

Historical Videos
http://youtube.com/watch?v=FwFduRA_L6Q
https://www.youtube.com/watch?v=ntIczNQKfjQ

Code: https://github.com/stephencwelch/manim_videos

Technical Notes
The large Llama training animation shows 8 of 16 layers: specifically layers 1, 2, 7, 8, 9, 10, 15, and 16. Every third attention pattern is shown, and special tokens are ignored. MLP neurons are downsampled using max pooling. Only the weights and gradients above a specific percentile-based threshold are shown. Only query weights are shown going into each attention layer. The coordinates of Paris are subtracted from all training examples in the 4-city example as a simple normalization; this helps with convergence. In some scenes, math is happening at higher precision behind the scenes, and results are rounded, which may create apparent inconsistencies.

Written by: Stephen Welch
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Special thanks to: Emily Zhang

Premium Beat IDs: EEDYZ3FP44YX8OWT, MWROXNAY0SPXCMBS