RAAIS 2019 - Ashish Vaswani, Senior Research Scientist at Google AI
Introduction
In this section, Ashish Vaswani introduces himself and his topic of discussion, self-attention.
Self-Attention as an Inductive Bias
- Ashish Vaswani is a researcher at Google AI.
- Self-attention is a good inductive bias for language, images, and music.
- Variable length data is common in deep learning problems.
- Recurrent neural networks (RNNs) are popular for variable length data but have limitations.
- Convolutional sequence models are efficient but struggle with long-distance dependencies.
- Attention-based models can address these issues.
Problems with Sequential Computation
In this section, Ashish discusses the limitations of sequential computation and how attention-based models can overcome them.
Desiderata for Learning Representations of Variable Length Data
- Sequential computation is slow and cannot be parallelized efficiently.
- RNNs have a single hidden state that must contain all information about the past, which limits their ability to model hierarchy or coreference relationships.
- Convolutional sequence models are efficient but struggle with long-distance dependencies.
The Role of Attention-Based Models
- Attention has been critical for modeling encoder/decoder relationships in machine translation models.
- Attention allows for content-based pooling and multiplicative interactions between positions.
- This mechanism enables linking any two distant positions in one hop.
Basic Mechanism of Attention-Based Models
In this section, Ashish explains the basic mechanism behind attention-based models.
Content-Based Similarity
- Attention computes a representation for each position based on its compatibility with neighboring positions.
- This compatibility measure is based on content-based similarity.
- Attention can be thought of as content-based pooling.
Benefits of Attention-Based Models
- Attention allows for multiplicative interactions and gating.
- Attention is efficient and can be parallelized at every position.
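The content-based pooling described above can be sketched in a few lines of NumPy (an illustrative sketch, not code from the talk; the sizes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_based_pooling(x):
    """Self-attention as content-based pooling: each position's new
    representation is a weighted average over all positions, with weights
    given by dot-product (content) similarity."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)        # pairwise compatibility, one matmul
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ x                   # pull content from every position

x = np.random.randn(5, 8)                # 5 positions, model dimension 8
y = content_based_pooling(x)             # shape (5, 8)
```

Note that every position attends to every other position in a single step, which is the "one hop" link between distant positions mentioned above, and the whole computation is matrix multiplications, hence parallelizable.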
Introduction to the Transformer Model
In this section, the speaker introduces the transformer model and explains how it differs from previous models.
The Transformer Model
- The transformer model is a large-scale machine translation model that uses a standard encoder/decoder architecture.
- The encoder consists of a stack of self-attention and feed-forward layers, while the decoder consists of a stack of decoder self-attention, encoder-decoder attention, and feed-forward layers.
- Attention is used to represent text by creating queries and keys through learned linear transformations. Compatibility scores are obtained through matrix multiplication followed by a softmax, and the resulting weights pull information from other positions through the values. The whole operation reduces to a few matrix multiplications and is easily parallelizable.
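The query/key/value computation can be sketched as follows (a single-head sketch with made-up sizes; `w_q`, `w_k`, `w_v` stand in for the learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(x, w_q, w_k, w_v):
    """Single-head attention: learned projections produce queries, keys,
    and values; matmul + softmax gives compatibility scores that are used
    to pull in the values."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # compatibility of each pair
    return softmax(scores) @ v               # weighted pooling of values

d_model, d_head, n = 16, 8, 6                # illustrative sizes only
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = attention_layer(x, w_q, w_k, w_v)      # shape (6, 8)
```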
Advantages of Attention Mechanism
In this section, the speaker discusses why the attention mechanism is advantageous compared to other computational paradigms.
Computational Complexity
- Compared to other computational paradigms such as convolutional sequence models or LSTM-based models, the attention mechanism has quadratic computational complexity in the number of positions but only linear complexity in the hidden dimension. This makes it very favorable for machine translation, where hidden dimensions are typically larger than the number of positions.
- The attention mechanism can be easily parallelized due to its simple structure, which made the transformer roughly three times faster than other models at the time.
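The complexity argument can be made concrete with back-of-envelope arithmetic: per layer, self-attention costs on the order of n²·d (every pair of n positions interacts in d dimensions), while a recurrent layer costs on the order of n·d² (one d×d matrix multiply per step). A rough sketch with illustrative sizes (not figures from the talk):

```python
# Rough per-layer cost comparison (constants dropped), illustrating why
# quadratic cost in sequence length n is acceptable when n < d.
n, d = 70, 1024              # illustrative sentence length vs. hidden size

self_attention = n * n * d   # every pair of positions interacts
recurrent      = n * d * d   # one d x d matmul per step

print(self_attention, recurrent, recurrent / self_attention)
```

With these numbers the recurrent layer is roughly d/n ≈ 15 times more expensive, which is why quadratic cost in sequence length is a good trade when sentences are short relative to the hidden dimension.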
Learning Hierarchy
- The attention mechanism helps learn hierarchy by providing constant path length interactions between any two tokens in a sequence, which is essential for language processing. If you care about learning hierarchy in your data, self-attention should do the job.
Inductive Biases of Self Attention Mechanism
In this section, the speaker discusses what kinds of inductive biases the self-attention mechanism can model and whether it can transfer to other domains.
Transfer Learning
- Self attention mechanism can be used for transfer learning in other domains such as images or music.
Multi-Head Attention
- To recover some of what convolutions provide, multiple attention layers ("heads") are run in parallel; this is called multi-head attention. It lets the model combine different pieces of information from different representations at different positions.
- Multi-head attention is parallelizable and was state-of-the-art when the paper was published. It outperformed convolutional sequence models and LSTM-based models while training roughly three times faster, thanks to a simple structure that is easy to parallelize.
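Multi-head attention can be sketched as independent heads whose outputs are concatenated (a simplified sketch with made-up sizes; the final output projection present in the transformer is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads):
    """Run several attention heads in parallel, each with its own learned
    projections, then concatenate their outputs."""
    outs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(k.shape[-1])
        outs.append(softmax(scores) @ v)     # each head: (n, d_head)
    return np.concatenate(outs, axis=-1)     # (n, num_heads * d_head)

d_model, d_head, n, num_heads = 16, 4, 6, 4  # illustrative sizes only
x = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
out = multi_head_attention(x, heads)         # shape (6, 16)
```

Each head can specialize in pulling different information from different positions, which is the sense in which multiple heads recover some of the expressiveness of multiple convolution filters.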
Non-Local Means and Self-Similarity
This section discusses the concept of self-similarity in music and images and how it has been exploited by non-local means for image denoising. The speaker explains how this approach inspired the use of self-attention modules for image generation.
Self-Similarity and Non-Local Means
- Repeating motifs at different time scales reference each other, creating a hierarchy in music.
- Non-local means exploits the property that many image patches are similar to each other in order to denoise images.
- Content-based similarity between patches is used to pull information together.
- Self-attention modules replace tokens with patches, pooling based on content-based similarity.
- Position information must be explicitly injected into attention since it is invariant to permutation.
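One standard way to inject position information, the fixed sinusoidal encodings from the original transformer paper, added to the inputs before attention, can be sketched as:

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Fixed sinusoidal position encodings: each position gets a unique
    pattern of sines and cosines at geometrically spaced frequencies, so
    the permutation-invariant attention can see order."""
    pos = np.arange(n)[:, None]              # (n, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2)
    angles = pos / (10000 ** (2 * i / d))
    enc = np.zeros((n, d))
    enc[:, 0::2] = np.sin(angles)            # even channels: sine
    enc[:, 1::2] = np.cos(angles)            # odd channels: cosine
    return enc

pe = sinusoidal_positions(50, 16)            # (50, 16), values in [-1, 1]
```

Learned position embeddings (mentioned below for images) are a drop-in alternative: a trainable table indexed by position instead of this fixed formula.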
Image Generation with Self Attention
- The Image Transformer model replaces words with pixels.
- Absolute positions or learned position representations can be used for position information.
- Because images are large, attention is restricted to local receptive fields, yet it still outperforms convolutions.
- Computational profiling allows for fewer layers and sparsely parametrized models.
- Two-dimensional representations of positions are added for super resolution tasks.
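One plausible way to build the two-dimensional position representations mentioned above is to encode rows and columns separately and concatenate them; this is a sketch under the assumption that each axis gets half the channels, and the exact scheme used in the paper may differ:

```python
import numpy as np

def positions_2d(h, w, d):
    """Hypothetical 2D position encoding: half the channels encode the
    row index, half encode the column index (sinusoidal per axis)."""
    def sin_cos(n, dd):
        pos = np.arange(n)[:, None]
        i = np.arange(dd // 2)[None, :]
        ang = pos / (10000 ** (2 * i / dd))
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)  # (n, dd)

    rows = sin_cos(h, d // 2)                         # (h, d/2)
    cols = sin_cos(w, d // 2)                         # (w, d/2)
    return np.concatenate(
        [np.repeat(rows[:, None, :], w, axis=1),      # broadcast rows over cols
         np.repeat(cols[None, :, :], h, axis=0)],     # broadcast cols over rows
        axis=-1)                                      # (h, w, d)

pe = positions_2d(8, 8, 32)                           # (8, 8, 32)
```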
Results of Super Resolution Model
This section discusses the results of using local attention for super resolution tasks. The speaker compares their results to state-of-the-art models at the time.
State-of-the-Art Results
- Local attention was used for both one-dimensional and two-dimensional cases.
- Results were state-of-the-art at the time and outperformed previous convolutional models.
- The quality of images has improved dramatically with self-attention and auto-regressive models.
- Samples at different temperatures show diversity in generated images.
Introduction to Self-Attention Models
In this section, the speaker discusses the use of self-attention models in image classification and introduces a new kind of attention stem that can outperform convolutions.
Self-Attention vs Convolutions
- Self-similarity models can provide diversity in image classification.
- Self-attention has been used with convolutions but is typically applied towards the end of models or after pooling to reduce spatial dimensions.
- A pure self-attention model was created by replacing all spatial convolutions in ResNet-50 with self-attention, including an attention stem in place of the initial seven-by-seven convolution.
- Relative self-attention was introduced using 2D coordinate geometry to explicitly use relative positions.
Results and Future Work
- Replacing convolutions with self-attention resulted in better performance on object detection with RetinaNet, even with fewer parameters and flops.
- Convolutions still perform better at lower layers, computing primitives such as edge detectors, while self-attention is better at higher layers for spatial mixing of content.
- Optimized native kernels for GPUs and TPUs are being developed to improve computation speed.
Generative Music Using Language Models
In this section, the speaker explains how generative music works using an event-based intermediate representation modeled by a language model.
Event-Based Intermediate Representation
- Generative music uses an event-based intermediate representation instead of going from score to sound directly.
- The performance representation is modeled with a language model, and the predicted event sequence is then synthesized into music.
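A toy version of such an event-based representation, using a hypothetical MIDI-like vocabulary rather than the exact one from the talk, might look like this:

```python
# Hypothetical event vocabulary in the spirit of the performance
# representation: music becomes a sequence of discrete events that a
# language model can predict one token at a time.
events = [
    ("NOTE_ON", 60),      # start middle C
    ("TIME_SHIFT", 120),  # advance time by 120 ms
    ("VELOCITY", 80),     # set loudness for subsequent notes
    ("NOTE_ON", 64),
    ("TIME_SHIFT", 240),
    ("NOTE_OFF", 60),
    ("NOTE_OFF", 64),
]

# Map each distinct event to an integer token, as a language model expects.
vocab = {e: i for i, e in enumerate(sorted(set(events)))}
tokens = [vocab[e] for e in events]   # one integer per event
```

The language model is trained on these token sequences exactly as it would be on text, and a synthesizer turns the predicted events back into audio.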
Self-Attention in Music Generation
In this section, the speaker discusses how self-attention is used in music generation and its limitations.
Self-Attention in Music Generation
- The vanilla transformer carries over some information but loses its memory of what it has generated; it has no memory of longer sequences.
- The transformer uses relative position to capture timing information through a relative self-attention mechanism.
- The music transformer can repeat motifs that are not present in the training data, but it may be stitching them together instead of memorizing them.
- A visualization shows the model introducing a tremolo and repeating it by attending to all the events and time steps where it occurred, skipping things in between.
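The relative self-attention idea, adding a learned embedding for each pairwise distance to the content-based logits, can be sketched as follows (a naive O(n²) loop in the style of Shaw et al.'s relative attention; the music transformer uses a memory-efficient variant, and all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relative_attention(q, k, v, rel_emb, max_dist):
    """Add a learned embedding for each pairwise distance (j - i, clipped
    to [-max_dist, max_dist]) to the content logits, so relative timing
    information is captured."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    for i in range(n):
        for j in range(n):
            dist = int(np.clip(j - i, -max_dist, max_dist)) + max_dist
            logits[i, j] += q[i] @ rel_emb[dist] / np.sqrt(d)
    return softmax(logits) @ v

n, d, max_dist = 6, 8, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
rel_emb = rng.normal(size=(2 * max_dist + 1, d))  # one vector per distance
out = relative_attention(q, k, v, rel_emb, max_dist)  # shape (6, 8)
```

Because the embedding depends only on the offset j - i, the same motif shifted in time produces the same relative pattern, which is what lets the model repeat material coherently.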
Implications of Self-Attention
In this section, the speaker talks about the implications of self-attention beyond music generation.
Implications of Self-Attention
- Self-attention is good at modeling constant path length between positions and allows for translational equivariance, which extends naturally to graphs.
- Attention is invariant to permutations, but if you know the geometry beforehand, you can inject any geometry into the model. This makes it useful for classification on spheres or other bodies with known geometries.
- Self-attention's content-based way of pulling information could be useful for image segmentation and video compression.
- There is ongoing work on less auto-regressive generation, on character-level modeling combining recurrence and self-attention, and on generating high-fidelity images with self-attention and convolutions.