So far in our journey through the architecture of the Transformer, we have covered several topics. First, we had a chance to see what this large system looks like from a high level. We saw how this type of sequence-to-sequence model builds on the same principles as Recurrent Neural Networks and LSTMs, and which principles it uses to overcome their shortcomings. Because Transformers consist of so many parts, we decided to split the implementation into several articles. We started from the ground up and built the low-level elements first.

So far we have covered data pre-processing, attention layers, and the Encoder and Decoder. The main goal of our Transformer is to translate Russian into English, so the first thing we had to do was implement positional encoding and attention layers. Then we combined those “low-level” parts into Encoder and Decoder layers. After that, we stacked those layers to create the big Encoder and Decoder components. Now we have to combine those components as well and move one level up. To sum up, we are finalizing our Transformer, which should look something like this:

High-level overview of Transformer architecture

Of course, this is just a high-level overview of the architecture. If you want to see how each individual Encoder and Decoder is built, check out our previous article. In short, the Encoder and Decoder layers are stacked together and connected to each other.

Note that the complete implementation is based on the “Attention is all you need” paper, so we rely heavily on what is defined there. We suggest reading this paper if you are serious about doing any kind of development in sequence-to-sequence modeling.

Transformer Class

Ok, so in the previous article we implemented the big Encoder and Decoder blocks. We stacked a number of layers into an architecture that should look like this:

We also added pre-processing layers that perform Embedding and Positional Encoding. So we created and connected each Encoder and Decoder individually and added data processing beforehand. Now let’s combine them into a Transformer class and add a final Linear layer on top. The Linear layer is just a single Dense layer that projects the Decoder output to the size of the target vocabulary. Here is what the Transformer class looks like:

from tensorflow.keras import Model
from tensorflow.keras.layers import Dense

class Transformer(Model):
    def __init__(self, num_layers, num_neurons, num_hidden_neurons, num_heads, input_vocabular_size, target_vocabular_size):
        super(Transformer, self).__init__()
        # Encoder and Decoder are the classes implemented in the previous article
        self.encoder = Encoder(num_neurons, num_hidden_neurons, num_heads, input_vocabular_size, num_layers)
        self.decoder = Decoder(num_neurons, num_hidden_neurons, num_heads, target_vocabular_size, num_layers)
        # Final linear layer: one logit per token of the target vocabulary
        self.linear_layer = Dense(target_vocabular_size)

    def call(self, transformer_input, tar, training, encoder_padding_mask, look_ahead_mask, decoder_padding_mask):
        encoder_output = self.encoder(transformer_input, training, encoder_padding_mask)
        decoder_output, attention_weights = self.decoder(tar, encoder_output, training, look_ahead_mask, decoder_padding_mask)
        output = self.linear_layer(decoder_output)
        return output, attention_weights

Since we did all the heavy lifting in previous articles, this one is a cakewalk. We just instantiate the Encoder and Decoder classes we implemented in the previous article and add a Dense layer on top. Note that we inherit from the Model class, so we are able to train and get predictions using this class. Apart from that, note that we need to pass along the masks that the Encoder and Decoder use during the training process.


Ok, now to the fun part: training. Since we follow the “Attention is all you need” paper, we use the Adam optimizer, as its authors suggested. However, since the learning rate varies during training in this paper, we need to create a custom scheduler that is able to do this.

Scheduler and Optimizer

The formula used for changing learning rate during the training is:
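As given in the “Attention is all you need” paper, where d_model corresponds to num_neurons in our code and step_num is the current training step:

```latex
lrate = d_{model}^{-0.5} \cdot \min\left( step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5} \right)
```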

In a nutshell, the learning rate increases linearly in the first part of the training, until the number of training steps reaches warmup_steps. After that, it decreases proportionally to the inverse square root of the step number. In the paper, the value 4000 is used for warmup_steps, so we do the same. This means that for the first 4000 steps the learning rate will increase and then it will slowly decrease. Something like this:

Variable Learning Rate
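To get a feeling for the actual numbers, here is a minimal pure-Python sketch of the same formula. The function name transformer_learning_rate is ours and only serves this illustration; the defaults mirror the values used in this article (128 neurons, 4000 warm-up steps), and steps are assumed to be counted from 1.

```python
import math

def transformer_learning_rate(step, num_neurons=128, warmup_steps=4000):
    """Learning rate at a given training step (counted from 1), following the paper's formula."""
    decay = step ** -0.5                    # inverse square root decay
    warmup = step * warmup_steps ** -1.5    # linear warm-up
    return num_neurons ** -0.5 * min(decay, warmup)

# Print the rate at a few steps before, at, and after the end of the warm-up
for step in (1, 1000, 4000, 16000, 64000):
    print(f"step {step:>6}: learning rate {transformer_learning_rate(step):.6f}")
```

With these settings the rate peaks at step 4000. At step 1000 it is a quarter of the peak, and at step 16000 it has dropped back to half of the peak.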

The previously mentioned formula is implemented within the Schedule class:

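import tensorflow as tf
from tensorflow.keras.optimizers.schedules import LearningRateSchedule

class Schedule(LearningRateSchedule):
    def __init__(self, num_neurons, warmup_steps=4000):
        super(Schedule, self).__init__()
        self.num_neurons = tf.cast(num_neurons, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # The optimizer may pass an integer step, while rsqrt needs a float
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                   # inverse square root decay
        arg2 = step * (self.warmup_steps ** -1.5)    # linear warm-up
        return tf.math.rsqrt(self.num_neurons) * tf.math.minimum(arg1, arg2)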

Note that this class inherits from LearningRateSchedule. Because of this, we can pass an object of this class to the optimizer and control the learning rate during the training process. Something like this:

from tensorflow.keras.optimizers import Adam

learning_rate = Schedule(num_neurons)
optimizer = Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

Padding Loss Function

Since all sequences are padded, we need to apply a padding mask when the loss is calculated as well. SparseCategoricalCrossentropy is used as the objective function, and it is masked in the padded_loss_function function like this:

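from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import Mean, SparseCategoricalAccuracy

loss_objective_function = SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def padded_loss_function(real, prediction):
    # Token id 0 is padding; the mask is True for real tokens only
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss = loss_objective_function(real, prediction)
    mask = tf.cast(mask, dtype=loss.dtype)
    # Zero out the loss at padded positions
    loss *= mask
    return tf.reduce_mean(loss)

training_loss = Mean(name='training_loss')
training_accuracy = SparseCategoricalAccuracy(name='training_accuracy')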

There is nothing special about this function. We calculate the loss using the predefined objective function and then multiply it by the mask, so that padded positions contribute zero loss.
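To make the effect of the mask concrete, here is a small pure-Python sketch with made-up per-token loss values. The helper masked_mean_loss is hypothetical and not part of the article's code. It returns two numbers: the mean over all positions, which is what reduce_mean computes after masking, and the mean over non-padding positions only, which is a common variant that we show purely for comparison.

```python
def masked_mean_loss(token_ids, per_token_loss, pad_id=0):
    """Return (mean over all positions, mean over non-padding positions)."""
    # 1.0 for real tokens, 0.0 for padding
    mask = [0.0 if token == pad_id else 1.0 for token in token_ids]
    masked = [loss * m for loss, m in zip(per_token_loss, mask)]
    mean_all = sum(masked) / len(masked)
    mean_real = sum(masked) / max(sum(mask), 1.0)
    return mean_all, mean_real

# Made-up example: three real tokens followed by two padding tokens
token_ids = [12, 7, 45, 0, 0]
per_token_loss = [2.0, 1.0, 3.0, 9.0, 9.0]

mean_all, mean_real = masked_mean_loss(token_ids, per_token_loss)
print(mean_all, mean_real)
```

In both cases the large loss values at the padded positions are ignored. The difference is only in the denominator: the first number shrinks as more padding is added to a batch, while the second does not.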

Training Process

Finally, we can start the training process. First, we need to initialize all the necessary parameters and instantiate an object of the Transformer class:

# Initialize helpers
data_container = DataHandler()
maskHandler = MaskHandler()
# Initialize parameters
num_layers = 4
num_neurons = 128
num_hidden_neurons = 512
num_heads = 8
# Initialize vocabulary sizes
input_vocablar_size = data_container.tokenizer_ru.vocab_size + 2
target_vocablar_size = data_container.tokenizer_en.vocab_size + 2
# Initialize learning rate
learning_rate = Schedule(num_neurons)
optimizer = Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
# Initialize transformer
transformer = Transformer(num_layers, num_neurons, num_hidden_neurons, num_heads, input_vocablar_size, target_vocablar_size)

Then we create the train_step function. This is the TensorFlow function that is in charge of the training process. We made this choice because we wanted to speed up execution by using a TensorFlow graph. Here is what it looks like:

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

@tf.function(input_signature=train_step_signature)
def train_step(input_language, target_language):
    # The decoder is fed all tokens but the last and predicts all tokens but the first
    target_input = target_language[:, :-1]
    target_output = target_language[:, 1:]

    # Create masks
    encoder_padding_mask = maskHandler.padding_mask(input_language)
    decoder_padding_mask = maskHandler.padding_mask(input_language)
    look_ahead_mask = maskHandler.look_ahead_mask(tf.shape(target_input)[1])
    decoder_target_padding_mask = maskHandler.padding_mask(target_input)
    combined_mask = tf.maximum(decoder_target_padding_mask, look_ahead_mask)

    # Run training step
    with tf.GradientTape() as tape:
        predictions, _ = transformer(input_language, target_input, True, encoder_padding_mask, combined_mask, decoder_padding_mask)
        total_loss = padded_loss_function(target_output, predictions)

    gradients = tape.gradient(total_loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    # Update the running metrics
    training_loss(total_loss)
    training_accuracy(target_output, predictions)

In essence, this function receives two inputs, i.e. two sequences, which are defined in the signature. The shapes are broadly defined so that batches with different sequence lengths do not trigger re-tracing of the graph. In the beginning, we create the masks for the Encoder and Decoder. They are passed on to the call of the transformer. Then we use GradientTape and run the Transformer.
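The target shifting and the mask combination can be illustrated on a single sequence with plain Python lists. This is only a sketch under assumptions: the three helper functions are simplified stand-ins, not the MaskHandler from the previous article, the token ids are made up, and we assume the convention that 1 marks a position attention must ignore, which is the convention under which combining masks with a maximum makes sense.

```python
def padding_mask(token_ids, pad_id=0):
    """1 marks a padded position that attention should ignore."""
    return [1 if token == pad_id else 0 for token in token_ids]

def look_ahead_mask(size):
    """1 marks a future position (column > row) that must stay hidden."""
    return [[1 if col > row else 0 for col in range(size)] for row in range(size)]

def combine_masks(padding, look_ahead):
    """Element-wise maximum: a position is masked if either mask says so."""
    return [[max(p, cell) for p, cell in zip(padding, row)] for row in look_ahead]

# Made-up ids: start token, two words, end token, two padding tokens
target = [101, 5, 9, 102, 0, 0]
target_input = target[:-1]    # what the decoder is fed
target_output = target[1:]    # what the decoder should predict

combined = combine_masks(padding_mask(target_input), look_ahead_mask(len(target_input)))
for row in combined:
    print(row)
```

Each row of the combined mask belongs to one decoder position. The first row hides everything except the start token, and the last column is hidden in every row because it holds padding.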

We pick up the predictions and use them to calculate the loss with the padded loss function we defined previously. Once that is done, we use the optimizer to modify the Transformer’s trainable parameters. In the end, we update training_loss and training_accuracy. After this, we can easily start training the Transformer:

from tqdm import tqdm

for epoch in tqdm(range(20)):
    for (batch, (input_language, target_language)) in enumerate(data_container.train_data):
        train_step(input_language, target_language)
    print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch, training_loss.result(), training_accuracy.result()))

The output looks something like this:

Training process output

When we run the trained Transformer, here are the results:

Input:  это проблема, которую мы должны решить.
Predicted: this is a problem that we have to solve .
Real: this is a problem we have to solve .
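The code that produces such a prediction is not shown in this article. One common way to do it is greedy decoding: start from a start token, repeatedly ask the model for the scores of the next token, pick the best one, and stop at the end token. Below is a minimal sketch of that loop. Everything in it is an assumption made for illustration: next_token_logits is a hypothetical callable standing in for a call to the trained Transformer, and toy_model with its token ids is a made-up stub, not the real model.

```python
def greedy_decode(next_token_logits, start_id, end_id, max_length=40):
    """Generate token ids one at a time, always picking the highest-scoring token."""
    output = [start_id]
    for _ in range(max_length):
        logits = next_token_logits(output)
        # Index of the largest logit is the predicted token id
        next_id = max(range(len(logits)), key=logits.__getitem__)
        output.append(next_id)
        if next_id == end_id:
            break
    return output

# Hypothetical stand-in for the trained model: it always "translates" to the
# fixed sequence 3, 1, 2 and then emits the end token (id 5).
def toy_model(tokens_so_far):
    script = [3, 1, 2, 5]
    wanted = script[min(len(tokens_so_far) - 1, len(script) - 1)]
    return [1.0 if token_id == wanted else 0.0 for token_id in range(6)]

print(greedy_decode(toy_model, start_id=4, end_id=5))
```

With a real model, the returned ids would be passed to the English tokenizer to obtain the predicted sentence.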


In this article, we finished our journey through the world of Transformers. We finally put all the pieces from the previous articles together and ran this massive architecture. We saw how to create a scheduler that controls the learning rate in the optimizer, and how to create the training process for this structure. In the end, the predicted translation came out very close to the reference one.

Thank you for reading!

Read more posts from the author at Rubik’s Code.