What are Transformer Neural Networks?

Introduction to Transformers

In this video, the speaker introduces transformers as a type of machine learning model designed for processing sequences. The original paper focused on natural language tasks, but transformers can be applied to other types of data as well.

What are Transformers?

  • Transformers are a type of machine learning model designed for processing sequences.
  • They were introduced in the paper "Attention is All You Need" back in 2017 and have since gained widespread use.
  • They can be used for natural language tasks, as in models such as BERT and GPT-3, but also for other types of data like images (e.g., DALL-E).
  • One motivation behind developing transformers was to limit the lengths of paths that signals must traverse when learning long-range dependencies.

Architecture of Transformers

  • The architecture of the transformer can be intimidating at first glance, but many blocks are things you may already be familiar with like feed-forward networks.
  • The key novel components are positional encoding and multi-head attention blocks which we'll look at shortly.
  • Transformers allow for more parallelization than recurrent models while also handling long-range dependencies. They leverage self-attention to compute updated representations for each sequence element in parallel.

Encoder and Decoder Components

  • The encoder has several components including an embedding layer, positional encodings, and multi-head attention blocks.
  • Raw input sentences are mapped to unique integer tokens using a vocabulary mapping before being passed through an embedding layer.
  • Attention mechanisms consist of queries, keys, and values. A query vector is compared to a set of key vectors to determine how compatible they are.
  • Attention works by computing a dot product between the query and each key. Determining compatibility between queries and keys in this way is known as dot-product attention.
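
The tokenization and embedding steps above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the toy vocabulary, the whitespace tokenizer, and the names `tokenize` and `embed` are assumptions made for the example, and the embedding table is randomly initialized rather than learned.

```python
import random

# Hypothetical toy vocabulary; real models use much larger (often subword) vocabularies.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
d_model = 4  # embedding dimensionality, chosen small for illustration

random.seed(0)
# Embedding table: one d_model-dimensional row per token id (random here, learned in practice).
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in vocab]

def tokenize(sentence):
    """Map each whitespace-separated word to its integer id (unknown words map to <unk>)."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

def embed(token_ids):
    """Look up the embedding row for each token id."""
    return [embedding_table[i] for i in token_ids]

token_ids = tokenize("The cat sat")
vectors = embed(token_ids)  # one d_model-dimensional vector per word
```

Each word ends up as a fixed-length vector, which is the form the attention blocks operate on.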

Conclusion

Transformers are a powerful type of machine learning model that can handle long-range dependencies while allowing for more parallelization than recurrent models. They have been widely used in natural language tasks but can also be applied to other types of data like images. The architecture of transformers may seem intimidating at first glance, but many components are familiar from other models.

Vectors and Scaling Factor

The performance of dot-product attention suffers for larger values of the key dimensionality d_k. The authors suspect that this is due to the softmax's gradient vanishing for high-magnitude inputs. Dividing the dot products by sqrt(d_k) helps counteract this.

Scaling Factor

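  • If the components of the query and key vectors are independent with mean zero and variance one, the variance of their dot product is d_k, so dividing by the standard deviation sqrt(d_k) keeps the scores at unit variance.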
  • Once the weights are normalized, we use them to take a linear combination of the value vectors. This will be the output of the attention mechanism.
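
As a rough sketch of the steps just described (score each key against the query, scale by sqrt(d_k), normalize with a softmax, then mix the values), here is attention for a single query in plain Python. The function names `softmax` and `attend` and the toy vectors are illustrative choices, not taken from the video.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # Compatibility of the query with each key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Output is the weighted linear combination of the value vectors.
    d_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return output, weights

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attend(query, keys, values)
```

Because the query matches the first key more closely, the first value vector receives the larger weight in the output.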

Attention Operation

We stack the query, key, and value vectors into three matrices, which we'll call Q, K, and V. The attention operation can now be defined efficiently via a couple of matrix multiplications.
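
Written out, the matrix form of the attention operation is the following, where the softmax is applied to each row and d_k is the dimensionality of the keys (this is the scaled dot-product attention formula from "Attention is All You Need"):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The product Q K^T holds every query-key dot product at once, and multiplying the normalized weights by V takes all the linear combinations of value vectors in one step.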

Multi-head Attention Block

  • With the query, key, and value vectors stacked into the matrices Q, K, and V, the output of attention is another matrix in which each row is an updated representation for the word in the corresponding sequence position.
  • If we only use a single attention head, it leads to a kind of averaging effect that limits the resolution of the learned representations, so multiple attention heads are proposed.
  • For h attention heads, we'll have h sets of learned projection matrices W_Q, W_K, and W_V, where i indexes over the heads. These matrices project each of the word representations to h different query, key, and value vectors.

Multi-head Attention Block

The overall multi-head attention block computes the original attention function h times, once per head. Here, Q, K, and V are all identical: they are just the stacked word representations at a given layer. Later, in the decoder, we'll see how Q, K, and V can come from different sources.

Multi-head Attention Block

  • The queries and keys of course have equal dimensionality d_k due to the dot product. In practice, the value dimensionality d_v is often set to be equal to d_k.
  • The h output matrices are concatenated and then multiplied by another weight matrix W_O to linearly project the learned representations back to the original embedding dimensionality.
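
Putting the pieces together, a multi-head attention block might be sketched as below. This uses plain Python lists to stay dependency-free, random weights in place of learned ones, and the illustrative choice d_k = d_v = d_model / h; the helper names (`matmul`, `attention`, `multi_head_attention`, `random_matrix`) are assumptions for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax applied per row."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """x: n x d_model; heads: list of (W_Q, W_K, W_V); w_o: (h * d_v) x d_model."""
    # Self-attention: queries, keys, and values are all projections of the same x.
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate the h head outputs along the feature dimension.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    # Project back to the original embedding dimensionality.
    return matmul(concat, w_o)

rng = random.Random(0)
n, d_model, h = 3, 8, 2
d_k = d_v = d_model // h
x = random_matrix(n, d_model, rng)
heads = [(random_matrix(d_model, d_k, rng),
          random_matrix(d_model, d_k, rng),
          random_matrix(d_model, d_v, rng)) for _ in range(h)]
w_o = random_matrix(h * d_v, d_model, rng)
out = multi_head_attention(x, heads, w_o)  # n rows, each of length d_model
```

The output has the same shape as the input, which is what lets these blocks be stacked.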

Positional Encodings

Unlike a recurrent model, multi-head attention does not process the input sequentially, so on its own it has no notion of word order; positional encodings supply this information. Just as we map each word token to a fixed-length vector embedding, we will do the same for position tokens.

Fixed Encoding Scheme

  • One way of doing this is simply to learn them, with a separate learnable lookup table for position tokens.
  • Alternatively, the authors propose a handcrafted fixed encoding scheme. Essentially, each dimension follows a sinusoidal curve, decreasing in frequency as the dimension index i increases. Even dimensions follow a sine curve and odd dimensions follow a cosine, but these could be split up in any arbitrary way.
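
A minimal sketch of the fixed scheme follows, using the formulation from the original paper: dimension 2i is sin(pos / 10000^(2i / d_model)) and dimension 2i + 1 is the matching cosine. The function name `positional_encoding` is an illustrative choice, and d_model is assumed to be even.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    encoding = []
    for i in range(d_model // 2):
        # The frequency decreases as the dimension index i increases.
        angle = position / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle))  # even dimension 2i
        encoding.append(math.cos(angle))  # odd dimension 2i + 1
    return encoding

pe0 = positional_encoding(0, 8)  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
pe5 = positional_encoding(5, 8)
```

These vectors are added to the word embeddings, so they share the embedding dimensionality.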

Unique Positional Encodings

The authors make the case that these hand-crafted encodings not only represent global position because each vector is unique but they're also well-suited to relative positioning.

Unique Positional Encodings

  • Take the vector at position t then the positional encoding at position t plus k for some fixed offset k can be computed as a linear function of the encoding at position t.
  • The authors also argue that these fixed encodings can potentially let the model extrapolate to longer sequences during deployment than those encountered during training, which seems reasonable.
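
The relative-positioning claim can be checked numerically. In the sketch below, each (sine, cosine) pair of the encoding at position t is rotated by an angle that depends only on the offset k, which is a linear map, and the result is compared against the encoding computed directly at position t + k. The helper `shift_encoding` is a hypothetical name introduced for this check.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    encoding = []
    for i in range(d_model // 2):
        angle = position / (10000 ** (2 * i / d_model))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding

def shift_encoding(encoding, k, d_model):
    """Map the encoding at position t to the one at t + k using only k (not t)."""
    shifted = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = encoding[2 * i], encoding[2 * i + 1]
        # Angle-addition identities: a rotation of the (sin, cos) pair by omega * k.
        shifted.append(s * math.cos(omega * k) + c * math.sin(omega * k))
        shifted.append(c * math.cos(omega * k) - s * math.sin(omega * k))
    return shifted

t, k, d_model = 7, 3, 8
direct = positional_encoding(t + k, d_model)
via_linear_map = shift_encoding(positional_encoding(t, d_model), k, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, via_linear_map))
```

The two results agree up to floating-point error, and the coefficients of the map never involve t.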

Transformer Architecture

In this section, the speaker explains the transformer architecture and how it works.

Encoder Layer

  • A residual connection adds the word representations from before the multi-head attention block to the block's output representations.
  • Layer normalization takes our word vectors and normalizes each one individually to have zero mean and variance one.
  • The second sub-layer is a position-wise feed-forward network that applies a simple network of two fully connected layers, with a ReLU activation between them, to each word representation.
  • We apply a residual connection and layer normalization after the second sublayer.
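
The add-and-norm and feed-forward steps can be sketched for a single word vector as follows. This is a simplified illustration: the layer normalization omits the learned gain and bias, and the tiny hand-picked weights (`w1`, `b1`, `w2`, `b2`) stand in for learned parameters.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one word vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(sublayer_input, sublayer_output)])

def feed_forward(vector, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU in between, applied to one position."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vector, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    return [sum(hv * w for hv, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]

# Toy sizes: d_model = 2, hidden width = 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

x = [1.0, 2.0]
ffn_out = feed_forward(x, w1, b1, w2, b2)
normalized = add_and_norm(x, ffn_out)
```

The same function is applied independently at every sequence position, which is why the network is called position-wise.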

Full Encoder

  • We can stack encoder layers up n times to form the full encoder.
  • The end result is a fixed-length vector representation for each word that considers the full context of the sequence.

Decoder Layer

  • We compute embeddings and positional encodings for the decoder's input, which is the target sequence.
  • Next comes a masked multi-head attention block, where a mask is needed to ensure that we respect the temporal dependency of the output sentence.
  • During training, when we update the representation at a given position, it should pay zero attention to any of the later sequence positions.
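
One common way to implement the mask, sketched here under the assumption that the scores have already been computed, is to set the scores for later positions to negative infinity before the softmax so that they receive exactly zero weight. The names `causal_mask` and `masked_softmax` and the score values are illustrative.

```python
import math

def causal_mask(n):
    """mask[i][j] is True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax that assigns zero weight to disallowed positions."""
    masked = [s if allowed else float("-inf") for s, allowed in zip(scores, mask_row)]
    m = max(masked)  # finite, because each position may always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3 x 3 matrix of attention scores (rows are queries, columns are keys).
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.4, 0.3]]
mask = causal_mask(3)
weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, mask)]
```

The first position can only attend to itself, while the last position attends to the whole prefix.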

Cross or Encoder Decoder Attention Block

  • The third attention block implements what's called cross-attention or encoder-decoder attention, where the queries come from the previous decoder layer but the keys and values come from the final output of the encoder.
  • Each decoder layer applies a residual connection and layer normalization to each sub-layer.

Training Model

  • A final linear layer is applied at each position, mapping each vector representation to a set of unnormalized logits over the output vocabulary, followed by a softmax.
  • During training, we use a typical maximum likelihood objective, adjusting the transformer's parameters to maximize the probability of the next target token at each step; the masking keeps this consistent with the autoregressive dependence.
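
Maximizing the likelihood of the target tokens is equivalent to minimizing their average negative log-likelihood. The sketch below computes that loss from per-position logits; the logits and target ids are made-up values, and the function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]

def sequence_nll(logits_per_position, target_ids):
    """Average negative log-likelihood of the target token at each position."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)

# Toy example: two output positions over a three-word vocabulary.
logits_per_position = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
target_ids = [0, 2]
loss = sequence_nll(logits_per_position, target_ids)
```

Lowering this loss raises the probability the model assigns to each correct next token.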

Comparison between RNNs and Transformers

The speaker compares the computational cost of self-attention in Transformers, computed with standard matrix multiplication, to that of applying weight matrices in basic RNNs. They also discuss how the quadratic cost in n can become expensive relative to RNNs for longer inputs.

Cost Comparison

  • With standard matrix multiplication, self-attention leads to O(n^2 * d) operations per layer for Transformers.
  • A basic RNN applies weight matrices of size d by d to d-dimensional vector representations of the input and hidden state at each of the n steps in the sequence, resulting in O(n * d^2) operations.
  • If the sequence length n is smaller than the representation dimension d, then we don't suffer a high cost with Transformers. This is often the case if we're only operating on shorter sequences like a couple of sentences.
  • For longer inputs, the quadratic cost in n can become expensive relative to RNNs.
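
To make the trade-off concrete, the sketch below plugs numbers into the leading-order per-layer costs, taking self-attention as n^2 * d and a basic recurrent layer as n * d^2. Constant factors are ignored, and the sequence lengths and dimension are arbitrary example values.

```python
def self_attention_ops(n, d):
    """Leading-order cost of self-attention: every pair of positions, d work each."""
    return n * n * d

def recurrent_ops(n, d):
    """Leading-order cost of a basic RNN: a d x d matrix product at each of n steps."""
    return n * d * d

d = 512
short_n, long_n = 50, 4096

short_attention = self_attention_ops(short_n, d)  # 1,280,000
short_recurrent = recurrent_ops(short_n, d)       # 13,107,200
long_attention = self_attention_ops(long_n, d)    # 8,589,934,592
long_recurrent = recurrent_ops(long_n, d)         # 1,073,741,824
```

The crossover happens at n = d: below it self-attention is cheaper, above it the quadratic term dominates.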

Connection between Transformers and Graph Neural Networks

The speaker discusses how transformers can be viewed as a kind of GNN that treats sequences as fully connected graphs.

Transformer as GNN

  • There's a nice connection between transformers and graph or message passing neural networks (GNN).
  • We can view transformers as a kind of GNN that treats sequences as fully connected graphs.
  • The transformer allows all nodes to preferentially communicate during message passing according to attention weights.

Interesting Work on Transformers

The speaker talks about some interesting work done on transformers over the past few years, including making them more efficient, developing theory on why they actually work well, and some cool applications.

Recent Work on Transformers

  • There has been a ton of interesting work on transformers in the past few years.
  • Some examples include making them more efficient, developing theory on why they actually work well, and some really cool applications.
  • Check out the links in the description to learn more.

Video description

This short tutorial covers the basics of the Transformer, a neural network architecture designed for handling sequential data in machine learning.

Timestamps:

  • 0:00 - Intro
  • 1:18 - Motivation for developing the Transformer
  • 2:44 - Input embeddings (start of encoder walk-through)
  • 3:29 - Attention
  • 6:29 - Multi-head attention
  • 7:55 - Positional encodings
  • 9:59 - Add & norm, feedforward, & stacking encoder layers
  • 11:14 - Masked multi-head attention (start of decoder walk-through)
  • 12:35 - Cross-attention
  • 13:38 - Decoder output & prediction probabilities
  • 14:46 - Complexity analysis
  • 16:00 - Transformers as graph neural networks

Original Transformers paper:

  • Attention is All You Need - https://arxiv.org/abs/1706.03762

Other papers mentioned:

  • (GPT-3) Language Models are Few-Shot Learners - https://arxiv.org/abs/2005.14165
  • (DALL-E) Zero-Shot Text-to-Image Generation - https://arxiv.org/abs/2102.12092
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - https://arxiv.org/abs/1810.04805
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity - https://arxiv.org/abs/2101.03961
  • Finetuning Pretrained Transformers into RNNs - https://arxiv.org/abs/2103.13076
  • Efficient Transformers: A Survey - https://arxiv.org/abs/2009.06732
  • Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth - https://arxiv.org/abs/2103.03404
  • Do Transformer Modifications Transfer Across Implementations and Applications? - https://arxiv.org/abs/2102.11972
  • Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies - https://ml.jku.at/publications/older/ch7.pdf
  • Transformers are Graph Neural Networks (blog post) - https://thegradient.pub/transformers-are-graph-neural-networks

Video style inspired by 3Blue1Brown
Music: Trinkets by Vincent Rubinetti

Links:

  • YouTube: https://www.youtube.com/ariseffai
  • Twitter: https://twitter.com/ari_seff
  • Homepage: https://www.ariseff.com

If you'd like to help support the channel (completely optional), you can donate a cup of coffee via the following:

  • Venmo: https://venmo.com/ariseff
  • PayPal: https://www.paypal.me/ariseff