In one of the previous articles, we kicked off our implementation of the Transformer architecture. The Transformer is a huge system with many different parts. It tackles the same sequence problems as Recurrent Neural Networks and LSTMs, but tries to overcome their shortcomings. Because it is a massive architecture, we decided to split the implementation into several articles and build it part by part. Thus far we handled the “low level” elements, so to say. Since the main goal of our Transformer is to translate Russian into English, we first had to handle the data and implement the positional encoding and attention layers. Now, we can utilize those parts and implement the other elements. As a reminder, this is what the Transformer architecture looks like:

High-level overview of Transformer architecture

Of course, this is just a high-level overview of this architecture. As you can see, there are multiple Encoder and Decoder layers stacked together and connected to each other. What you cannot see from this image are the details of data preprocessing and the structure of each of these layers. This is all explained in the amazing “Attention is all you need” paper. In this paper, the architecture of each encoder and decoder layer is presented like this:

Single Encoder-Decoder Layer Structure

In this article, we pick up where we left off last time. To be more precise, we build the Encoder and Decoder layers and then the Encoder and Decoder themselves. Apart from that, we will build a so-called pre-processing layer that utilizes the layers created in the previous article.


In order to run the code from this and all other articles in the series, you need to have Python 3 installed on your local machine. In this example, to be more specific, we are using Python 3.7. The implementation itself is done using TensorFlow 2.0. The complete guide on how to install and use TensorFlow 2.0 can be found here. Another thing that you need to install is the TensorFlow Datasets (TFDS) package. You can do so by running the command:

pip install tensorflow-datasets

This module contains a large collection of data sets that can be used for training purposes. We will use one of these data sets for our model. Here is the list of modules that need to be imported for the complete Transformer implementation:

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Embedding, Dropout
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers.schedules import LearningRateSchedule
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import Mean, SparseCategoricalAccuracy
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt

Make sure that you have them all installed.

Pre-Processing Layer

So, let's first create a layer that utilizes the Embedding and Positional Encoding we implemented in the previous article. As we mentioned there, Embedding is the process that maps text into a vector based on its semantic meaning. Words are transferred into some sort of vector representation (or embedding) in an n-dimensional latent space. In this latent space, vectors that are close to each other belong to words that have similar semantic meaning. The Embedding layer is available as a part of the TensorFlow library.

The semantic meaning of a word also depends on the position of that word in a sentence and on its relationship with the other words in that same sentence. That is why information about the relative position of every word in a sequence is required, and this is what the positional encoding vector provides. This process is proposed in the “Attention is all you need” paper. You can find more about how the relative position can be derived from the current position of the word in that paper. Now, we need to combine positional encoding and Embedding like this:

Pre-Processing Layer

Here is the code:

class PreProcessingLayer(Layer):
    def __init__(self, num_neurons, vocabular_size):
        super(PreProcessingLayer, self).__init__()
        # Size of the embedding vectors
        self.num_neurons = num_neurons
        # Add embeddings and positional encoding
        self.embedding = Embedding(vocabular_size, self.num_neurons)
        positional_encoding_handler = PositionalEncoding(vocabular_size, self.num_neurons)
        self.positional_encoding = positional_encoding_handler.get_positional_encoding()
        # Add dropout to reduce over-fitting
        self.dropout = Dropout(0.1)

    def call(self, sequence, training, mask):
        sequence_length = tf.shape(sequence)[1]
        # Map token ids to embedding vectors and scale them by sqrt(num_neurons)
        sequence = self.embedding(sequence)
        sequence *= tf.math.sqrt(tf.cast(self.num_neurons, tf.float32))
        # Add positional information for the positions present in this batch
        sequence += self.positional_encoding[:, :sequence_length, :]
        sequence = self.dropout(sequence, training=training)
        return sequence

It is pretty straightforward. We utilize the Embedding layer from tensorflow.keras.layers and the PositionalEncoding implementation from the previous article. Note that at the end of this structure we add a dropout layer in order to avoid over-fitting. This is a practice we use for other layers as well.
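To make the data flow of this layer easier to follow, here is a minimal NumPy sketch of the same three steps: embedding lookup, scaling, and adding the positional encoding. It is only an illustration. The function sinusoidal_positional_encoding is a hypothetical stand-in for the PositionalEncoding class from the previous article (it uses the sine and cosine formula from the paper), the embedding table is random rather than learned, and dropout is left out.

```python
import numpy as np

def sinusoidal_positional_encoding(max_positions, num_neurons):
    """Hypothetical stand-in for PositionalEncoding; returns shape (1, max_positions, num_neurons)."""
    positions = np.arange(max_positions)[:, np.newaxis]   # (max_positions, 1)
    dims = np.arange(num_neurons)[np.newaxis, :]          # (1, num_neurons)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / num_neurons)
    angles = positions * angle_rates                      # (max_positions, num_neurons)
    encoding = np.zeros_like(angles)
    encoding[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions use cosine
    return encoding[np.newaxis, ...]

def pre_process(token_ids, embedding_table, positional_encoding):
    """Embedding lookup, scaling by sqrt(num_neurons), then adding positions."""
    num_neurons = embedding_table.shape[1]
    sequence_length = token_ids.shape[1]
    embedded = embedding_table[token_ids]                 # (batch, seq_len, num_neurons)
    embedded = embedded * np.sqrt(num_neurons)
    return embedded + positional_encoding[:, :sequence_length, :]

rng = np.random.default_rng(0)
vocabulary_size, num_neurons = 50, 8
embedding_table = rng.normal(size=(vocabulary_size, num_neurons))  # random, not learned
positional_encoding = sinusoidal_positional_encoding(vocabulary_size, num_neurons)

token_ids = np.array([[3, 7, 7, 1], [4, 0, 2, 9]])        # 2 sentences, 4 tokens each
output = pre_process(token_ids, embedding_table, positional_encoding)
print(output.shape)  # (2, 4, 8)
```

Note that token 7 appears twice in the first sentence. Both occurrences share the same embedding, but they end up with different vectors because a different positional encoding is added to each of them.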

Encoder Layer

Encoder and Decoder layers have similar structures. The Encoder layer is a bit simpler though. Here is what it looks like:

Encoder Layer Structure

Essentially, it utilizes a Multi-Head Attention Layer and a simple Feed Forward Neural Network. As you can see in the image, there are also several normalization steps. Note that in this case this refers to layer normalization. In order to reduce training time, instead of using batch normalization like we would with standard feed forward neural networks, we use a modified approach called layer normalization. This approach is common in sequence-to-sequence models, since it is not obvious how to apply batch normalization to such models. In essence, we compute the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. This is all hidden from us if we use LayerNormalization from tensorflow.keras.layers.
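The computation described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the Keras implementation: the function name layer_normalization is ours, and the learnable scale and offset that LayerNormalization also applies are left out.

```python
import numpy as np

def layer_normalization(x, epsilon=1e-6):
    """Normalize every example over its own features (the last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

# Two training cases with very different scales
batch = np.array([[1.0, 2.0, 3.0, 4.0],
                  [10.0, 20.0, 30.0, 40.0]])

normalized = layer_normalization(batch)
# Both rows normalize to approximately [-1.342, -0.447, 0.447, 1.342]
print(normalized.round(3))
```

Each row is normalized using only its own mean and variance, so the result for one training case does not depend on the other cases in the batch. Batch normalization would instead compute the statistics per feature across the batch.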

Effectively, this means that we use layer normalization after each Multi-Head Attention or Feed Forward Neural Network Layer. Also, as we mentioned, we use a Dropout layer to avoid over-fitting. So, we create two helper functions that will build these combos for us:

def build_multi_head_attention_layers(num_neurons, num_heads):
    multi_head_attention_layer = MultiHeadAttentionLayer(num_neurons, num_heads)
    dropout = Dropout(0.1)
    normalization = LayerNormalization(epsilon=1e-6)
    return multi_head_attention_layer, dropout, normalization

def build_feed_forward_layers(num_neurons, num_hidden_neurons):
    feed_forward_layer = Sequential()
    # Expand to num_hidden_neurons, then project back to num_neurons so that
    # the output can be added to the input of the sub-layer
    feed_forward_layer.add(Dense(num_hidden_neurons, activation='relu'))
    feed_forward_layer.add(Dense(num_neurons))
    dropout = Dropout(0.1)
    normalization = LayerNormalization(epsilon=1e-6)
    return feed_forward_layer, dropout, normalization

Now we can build the Encoder layer with ease:

class EncoderLayer(Layer):
    def __init__(self, num_neurons, num_hidden_neurons, num_heads):
        super(EncoderLayer, self).__init__()
        # Build multi head attention layer and necessary additional layers
        self.multi_head_attention_layer, self.attention_dropout, self.attention_normalization = \
            build_multi_head_attention_layers(num_neurons, num_heads)
        # Build feed-forward neural network and necessary additional layers
        self.feed_forward_layer, self.feed_forward_dropout, self.feed_forward_normalization = \
            build_feed_forward_layers(num_neurons, num_hidden_neurons)

    def call(self, sequence, training, mask):
        # Calculate attention output and add it to the input before normalizing
        attention_output, _ = self.multi_head_attention_layer(sequence, sequence, sequence, mask)
        attention_output = self.attention_dropout(attention_output, training=training)
        attention_output = self.attention_normalization(sequence + attention_output)
        # Calculate output of feed forward network
        output = self.feed_forward_layer(attention_output)
        output = self.feed_forward_dropout(output, training=training)
        # Combine two outputs
        output = self.feed_forward_normalization(attention_output + output)
        return output

In the constructor, we created all necessary layers and then just connected them based on the schema we saw on the image above.
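The second half of the call function, where the output of the feed forward network is added to its input and then normalized, can be illustrated with a small NumPy sketch. It assumes the position-wise network from the paper, which expands to num_hidden_neurons and then projects back to num_neurons. The weights are random, dropout is omitted, and the names feed_forward and add_and_norm are ours.

```python
import numpy as np

def layer_normalization(x, epsilon=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

def feed_forward(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)   # expand to num_hidden_neurons and apply ReLU
    return hidden @ w2 + b2                 # project back to num_neurons

def add_and_norm(sub_layer_input, sub_layer_output):
    # The addition only works when both tensors have the same shape
    return layer_normalization(sub_layer_input + sub_layer_output)

rng = np.random.default_rng(0)
num_neurons, num_hidden_neurons = 8, 32
w1 = rng.normal(size=(num_neurons, num_hidden_neurons))
b1 = np.zeros(num_hidden_neurons)
w2 = rng.normal(size=(num_hidden_neurons, num_neurons))
b2 = np.zeros(num_neurons)

sequence = rng.normal(size=(2, 5, num_neurons))   # (batch, seq_len, num_neurons)
output = add_and_norm(sequence, feed_forward(sequence, w1, b1, w2, b2))
print(output.shape)  # (2, 5, 8)
```

Because the output has the same shape as the input, layers of this kind can be stacked on top of each other without any additional reshaping.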

Decoder Layer

The Decoder layer is somewhat more complicated, because it has an additional Multi-Head Attention Layer:

Decoder Layer Structure

Thanks to the helper functions, we can implement this layer fairly easily as well:

class DecoderLayer(Layer):
    def __init__(self, num_neurons, num_hidden_neurons, num_heads):
        super(DecoderLayer, self).__init__()
        # Build multi head attention layers and necessary additional layers
        self.multi_head_attention_layer1, self.attention_dropout1, self.attention_normalization1 = \
            build_multi_head_attention_layers(num_neurons, num_heads)
        self.multi_head_attention_layer2, self.attention_dropout2, self.attention_normalization2 = \
            build_multi_head_attention_layers(num_neurons, num_heads)
        # Build feed-forward neural network and necessary additional layers
        self.feed_forward_layer, self.feed_forward_dropout, self.feed_forward_normalization = \
            build_feed_forward_layers(num_neurons, num_hidden_neurons)

    def call(self, sequence, enconder_output, training, look_ahead_mask, padding_mask):
        # First attention block: self-attention over the target sequence
        attention_output1, attention_weights1 = self.multi_head_attention_layer1(sequence, sequence, sequence, look_ahead_mask)
        attention_output1 = self.attention_dropout1(attention_output1, training=training)
        attention_output1 = self.attention_normalization1(sequence + attention_output1)
        # Second attention block: combines the encoder output with the result of the first block
        attention_output2, attention_weights2 = self.multi_head_attention_layer2(enconder_output, enconder_output, attention_output1, padding_mask)
        attention_output2 = self.attention_dropout2(attention_output2, training=training)
        attention_output2 = self.attention_normalization2(attention_output1 + attention_output2)
        # Feed forward network with its own dropout and normalization
        output = self.feed_forward_layer(attention_output2)
        output = self.feed_forward_dropout(output, training=training)
        output = self.feed_forward_normalization(attention_output2 + output)
        return output, attention_weights1, attention_weights2

The only difference is that we use two Multi-Head Attention Layers before Feed Forward Neural Network Layer. Ok, let’s now combine all these layers into Encoder and Decoder structures.
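The Decoder layer receives two masks, look_ahead_mask and padding_mask. As a rough illustration of what such masks can look like, here is a NumPy sketch. It rests on assumptions that are not confirmed by this article: that a value of 1 marks a position to hide, that padding uses token id 0, and that the attention layer adds mask * -1e9 to the scores before the softmax. The helper names are ours, so check them against the attention implementation from the previous article.

```python
import numpy as np

def create_padding_mask(token_ids, pad_id=0):
    """1.0 marks padded positions; shape (batch, 1, 1, seq_len) broadcasts over heads and queries."""
    return (token_ids == pad_id).astype(np.float32)[:, np.newaxis, np.newaxis, :]

def create_look_ahead_mask(size):
    """1.0 above the diagonal: position i must not attend to positions j > i."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

def masked_softmax(scores, mask):
    """Assumed masking scheme: push masked scores to a large negative value before softmax."""
    scores = scores + mask * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

look_ahead_mask = create_look_ahead_mask(3)
print(look_ahead_mask)
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]

# With equal scores, every position spreads its attention over itself and earlier positions only
weights = masked_softmax(np.zeros((3, 3)), look_ahead_mask)
print(weights[1])  # [0.5 0.5 0. ]

padding_mask = create_padding_mask(np.array([[5, 3, 0, 0]]))
print(padding_mask.shape)  # (1, 1, 1, 4)
```

The look-ahead mask keeps the decoder from seeing words it has not generated yet, while the padding mask keeps attention away from the filler tokens used to bring all sentences in a batch to the same length.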

Encoder & Decoder

In the “Attention is all you need” paper, the authors use 6 Encoder Layers to build the Encoder and 6 Decoder Layers to build the Decoder. This number is a design choice rather than a requirement, so we use a parameter to define how many layers there should be. Here is what the Encoder class looks like:

class Encoder(Layer):
    def __init__(self, num_neurons, num_hidden_neurons, num_heads, vocabular_size, num_enc_layers=6):
        super(Encoder, self).__init__()
        self.num_enc_layers = num_enc_layers
        self.pre_processing_layer = PreProcessingLayer(num_neurons, vocabular_size)
        self.encoder_layers = [EncoderLayer(num_neurons, num_hidden_neurons, num_heads) for _ in range(num_enc_layers)]

    def call(self, sequence, training, mask):
        sequence = self.pre_processing_layer(sequence, training, mask)
        for i in range(self.num_enc_layers):
            sequence = self.encoder_layers[i](sequence, training, mask)
        return sequence

So, we first create the PreProcessingLayer. This layer applies embedding and positional encoding to the input sequence. Then we create several EncoderLayer instances. Their number is defined by the num_enc_layers parameter. In the overridden call function (note that we are still inheriting from the Layer class) we connect all of this into a single unified Encoder.

In the same way we created Encoder we create Decoder as well:

class Decoder(Layer):
    def __init__(self, num_neurons, num_hidden_neurons, num_heads, vocabular_size, num_dec_layers=6):
        super(Decoder, self).__init__()
        self.num_dec_layers = num_dec_layers
        self.pre_processing_layer = PreProcessingLayer(num_neurons, vocabular_size)
        self.decoder_layers = [DecoderLayer(num_neurons, num_hidden_neurons, num_heads) for _ in range(num_dec_layers)]

    def call(self, sequence, enconder_output, training, look_ahead_mask, padding_mask):
        # Collect the attention weights of every decoder layer
        attention_weights = {}
        # The pre-processing layer does not use its mask argument
        sequence = self.pre_processing_layer(sequence, training, look_ahead_mask)
        for i in range(self.num_dec_layers):
            sequence, attention_weights1, attention_weights2 = self.decoder_layers[i](sequence, enconder_output, training, look_ahead_mask, padding_mask)
            attention_weights['decoder_layer{}_attention_weights1'.format(i+1)] = attention_weights1
            attention_weights['decoder_layer{}_attention_weights2'.format(i+1)] = attention_weights2
        return sequence, attention_weights

The only difference in the constructor is that we use DecoderLayer instead of EncoderLayer. Apart from that, we have to return attention weights as the output of the overridden call function. This is because we need to use these values for the Transformer training process.


We got one step closer to finishing the complete Transformer architecture. In this article, we utilized Embedding, Positional Encoding and Attention Layers to build Encoder and Decoder Layers. Apart from that, we learned how to use Layer Normalization and why it is important for sequence-to-sequence models. Finally, we used the layers we created to build the Encoder and Decoder structures, essential parts of the Transformer. In the next Transformer article, we will combine all these things together and create the complete model.

Thank you for reading!

Read more posts from the author at Rubik’s Code.