In this series of articles, we explore a special type of sequence-to-sequence model: the Transformer. Transformers are big architectures with a lot of parts, and they are used for language modeling, machine translation, image captioning and text generation.

Generally speaking, a sequence-to-sequence model receives a sequence of input data and produces another sequence of data as its output. This is quite different from standard feed-forward neural networks and convolutional neural networks, which accept a fixed-size vector as input and produce a fixed-size vector as output. For example, if you want to translate the sentence “You are awesome.” from English to Serbian, a sequence-to-sequence model receives the input word by word and generates the output “Ti si super”.

Sequence-to-sequence Model

This is also more aligned with the way humans think. We do not throw everything away and start every thought from scratch. We use context and the information we received beforehand. As you are reading this, your understanding of every word is based on your understanding of the previous words.

Of course, behind the scenes is where the spooky stuff is happening. These models are essentially made of two main components: the Encoder and the Decoder. The Encoder is the component that receives each element of the input sequence and encodes it (duh!) into a vector, which is called the context. This context carries information about the whole sequence, and it is sent over to the Decoder. The Decoder is the component that interprets the context and resolves it into meaningful output.

Encoder and Decoder
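
To make the Encoder, context and Decoder roles a bit more concrete, here is a minimal sketch in plain Python. It is not the implementation we build later in the series, and it is not a Transformer: it assumes a tiny recurrent update for both components, random untrained weights, and a made-up vocabulary. All names in it (HIDDEN, rnn_step, Encoder, Decoder and so on) are hypothetical and exist only for this illustration. Because nothing is trained, the generated tokens are meaningless. The point is only the data flow: a variable-length input is folded into one fixed-size context vector, and the Decoder unrolls that vector into an output sequence of its own length.

```python
import math
import random

random.seed(0)  # reproducible toy weights (assumption: untrained, random values)

HIDDEN = 4  # size of the context vector, chosen arbitrarily for this sketch
SOURCE_VOCAB = ["you", "are", "awesome", "."]
OUTPUT_TOKENS = ["ti", "si", "super", "<end>"]


def random_vector(size):
    return [random.uniform(-1, 1) for _ in range(size)]


def random_matrix(rows, cols):
    return [random_vector(cols) for _ in range(rows)]


def matvec(matrix, vector):
    # Plain matrix-vector product.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(state, embedding, w_state, w_input):
    # One recurrent update: mix the previous state with the current input.
    from_state = matvec(w_state, state)
    from_input = matvec(w_input, embedding)
    return [math.tanh(a + b) for a, b in zip(from_state, from_input)]


class Encoder:
    def __init__(self):
        self.embeddings = {tok: random_vector(HIDDEN) for tok in SOURCE_VOCAB}
        self.w_state = random_matrix(HIDDEN, HIDDEN)
        self.w_input = random_matrix(HIDDEN, HIDDEN)

    def encode(self, tokens):
        # Read the input one token at a time; the final state is the context.
        state = [0.0] * HIDDEN
        for tok in tokens:
            state = rnn_step(state, self.embeddings[tok], self.w_state, self.w_input)
        return state


class Decoder:
    def __init__(self):
        self.embeddings = {
            tok: random_vector(HIDDEN) for tok in ["<start>"] + OUTPUT_TOKENS
        }
        self.w_state = random_matrix(HIDDEN, HIDDEN)
        self.w_input = random_matrix(HIDDEN, HIDDEN)
        self.w_out = random_matrix(len(OUTPUT_TOKENS), HIDDEN)

    def decode(self, context, max_len=10):
        # Start from the context and emit one token per step (greedy choice).
        state = context
        token = "<start>"
        output = []
        for _ in range(max_len):
            state = rnn_step(state, self.embeddings[token], self.w_state, self.w_input)
            scores = matvec(self.w_out, state)
            token = OUTPUT_TOKENS[scores.index(max(scores))]
            if token == "<end>":
                break
            output.append(token)
        return output


encoder = Encoder()
decoder = Decoder()

context = encoder.encode(["you", "are", "awesome", "."])
translation = decoder.decode(context)

print("context vector:", context)
print("decoded tokens (untrained, so not a real translation):", translation)
```

Notice that the context has the same size no matter how long the input sentence is, while the output length is decided by the Decoder itself, either by emitting the end token or by hitting max_len.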

On this site, we have already covered two predominant types of sequence-to-sequence models: Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM). However, these types of networks have several problems. In order to address these problems, Transformers were introduced in the paper with an awesome title, Attention Is All You Need. You see, the Transformer architecture builds on top of the principles mentioned above that the rest of the sequence-to-sequence models use, along with concepts like Attention. In this series of articles, we go through the nitty-gritty details of these monumental architectures and implement them using Python and TensorFlow.


If you want to know more about how all of these things are connected, check out these articles:

Read more posts from the author at Rubik’s Code.