Transformer models: Encoders

Understanding Encoder Architecture

In this section, we will learn about the encoder architecture and how it works. We will also understand the numerical representation of each word and its contextualized value.

Encoder Architecture

  • BERT is a popular encoder-only architecture that converts words into numerical representations.
  • The encoder outputs exactly one vector of numbers per input word.
  • The numerical representation is also called a "Feature vector" or "Feature tensor".
  • The dimension of the vector is defined by the model's architecture; for the base BERT model it is 768.
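The points above can be sketched with the `transformers` library. This is a minimal example, assuming `transformers` and `torch` are installed; it uses the "bert-base-uncased" checkpoint, whose hidden size is 768:

```python
# Sketch: extracting BERT feature vectors with the transformers library.
# "bert-base-uncased" is the base BERT checkpoint (hidden size 768).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional feature vector per input token
# (including the special [CLS] and [SEP] tokens).
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```

The `last_hidden_state` tensor is the "feature tensor" described above: one row per token, each row a 768-dimensional feature vector.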

Contextualized Value

  • Each vector is a numerical representation of the corresponding word.
  • Its value takes into account the context around the word, i.e., the words on both its left and right.
  • This contextualized value holds the "meaning" of that word in the text.

Self-Attention Mechanism

  • The self-attention mechanism relates different positions (i.e., different words) in a single sequence in order to compute a representation of that sequence.
  • As a result, the representation of each word is influenced by the other words in the sequence.
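The mechanism described above can be sketched from scratch with NumPy. This is an illustrative single-head version with made-up dimensions; real transformer layers use learned projection weights and multiple heads:

```python
# Minimal sketch of single-head self-attention in NumPy (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every position attends to every position in the sequence (both sides),
    # so each output vector mixes in information from the whole sequence.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8          # hypothetical sizes for the sketch
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because the attention weights span the entire sequence, changing any one input word changes the output vectors at every position, which is exactly how the contextualized representations arise.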

When to Use Encoders?

In this section, we will learn about when encoders should be used and their applications.

Standalone Models

  • Encoders can be used as standalone models for various tasks such as sequence classification, question answering, and masked language modeling.
  • BERT was trained to predict hidden words in sequences using Masked Language Modeling (MLM).
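As a sketch of one of the standalone uses mentioned above, extractive question answering can be run through the `pipeline` API (assuming `transformers` is installed; the default checkpoint for this task is an encoder-based model):

```python
# Sketch: extractive question answering with an encoder-based pipeline.
from transformers import pipeline

qa = pipeline("question-answering")
result = qa(
    question="What does the encoder output for each word?",
    context="The encoder outputs one feature vector per input word.",
)
# The answer is a span extracted directly from the context.
print(result["answer"])
```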

Sequence Classification

  • Encoders perform well on sequence classification tasks such as sentiment analysis, where the goal is to identify the sentiment of a given sequence.

Examples Where Encoders Shine

In this section, we will learn about some examples where encoders really shine.

Masked Language Modeling (MLM)

  • MLM is the task of predicting a hidden word in a sequence of words.
  • Encoders shine in this scenario because bidirectional information (context from both sides of the masked word) is crucial for predicting the hidden word.
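A minimal MLM sketch using the `fill-mask` pipeline (assuming `transformers` is installed; "bert-base-uncased" uses `[MASK]` as its mask token):

```python
# Sketch: masked language modeling with the fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
preds = unmasker("The capital of France is [MASK].")

# Top candidates for the masked word, with their probabilities.
for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))
```

The model uses the words on both sides of `[MASK]` to rank candidate fillers, which is why bidirectional encoders are a natural fit for this task.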

Sequence Classification

  • Sentiment analysis is an example of a sequence classification task where encoders can be used to identify the sentiment of a given sequence.
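The sentiment analysis example above can be sketched with the `sentiment-analysis` pipeline (assuming `transformers` is installed; the default checkpoint for this task is an encoder model, DistilBERT fine-tuned for sentiment classification):

```python
# Sketch: sequence classification (sentiment analysis) with an
# encoder-based pipeline.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love the clarity of this course!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```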

Video description

A general high-level introduction to the Encoder part of the Transformer architecture. What is it, and when should you use it?

This video is part of the Hugging Face course: http://huggingface.co/course

Related videos:
  • The Transformer architecture: https://youtu.be/H39Z_720T5s
  • Decoder models: https://youtu.be/d_ixlCubqQw
  • Encoder-Decoder models: https://youtu.be/0_4KEb08xrE

To understand what happens inside the Transformer network on a deeper level, we recommend the following blog posts by Jay Alammar:
  • The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
  • The Illustrated GPT-2: https://jalammar.github.io/illustrated-gpt2/
  • Understanding Attention: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

For a code-oriented perspective, we recommend The Annotated Transformer, by Harvard NLP: https://nlp.seas.harvard.edu/2018/04/03/attention.html

Have a question? Check out the forums: https://discuss.huggingface.co/c/course/20
Subscribe to our newsletter: https://huggingface.curated.co/