In one of the previous articles, we kicked off our exploration of the Transformer architecture, a sequence-to-sequence model used for language modeling, machine translation, and similar tasks. Because Transformers are massive systems, we decided to split the implementation across several articles and build it part by part. In this one, we cover the Encoder and the Decoder.
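Before diving into the from-scratch implementation, here is a minimal bird's-eye sketch of how the two halves fit together, written with PyTorch's built-in layers purely for illustration. The hyperparameter values and the use of `torch.nn` modules are our own assumptions for this preview, not the code developed in this series.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumed for this sketch, not from the series).
d_model, n_heads, n_layers = 512, 8, 6

# The Encoder: a stack of identical layers that turns the source sequence
# into a contextual representation ("memory").
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# The Decoder: a stack of layers that consumes the target sequence while
# attending to the encoder's memory.
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

# Dummy embedded sequences with shape (sequence_length, batch_size, d_model).
src = torch.rand(10, 2, d_model)
tgt = torch.rand(7, 2, d_model)

memory = encoder(src)        # encode the source sequence
out = decoder(tgt, memory)   # decode against the encoder's output
print(out.shape)             # torch.Size([7, 2, 512])
```

This is only a preview of the overall shape of the architecture; the rest of the series builds each of these components from scratch.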