Transformer with Python and TensorFlow 2.0 – Encoder & Decoder

In one of the previous articles, we kicked off our exploration of the Transformer architecture. The Transformer is a huge system with many different parts. It builds on the same encoder-decoder principles that are used with Recurrent Neural Networks and LSTMs, but tries to overcome their shortcomings. Because it is such a massive architecture, we decided to split the implementation into several articles and implement it part by part. Thus far we have handled the “low level” elements, so to say. Since the main goal of our Transformer is to translate Russian into English, we first had to handle the data and implement positional encoding and attention layers. Now we can utilize those parts and implement the other elements. As a reminder, this is what the Transformer architecture looks like:

High-level overview of Transformer architecture

Of course, this is just a high-level overview of the architecture. As you can see, there are multiple Encoder and Decoder layers stacked together and connected to each other. What you cannot see from this image are the details of data preprocessing and the structure of each of these layers. This is all explained in the amazing “Attention is all you need” paper. In this paper, the architecture of each encoder and decoder layer is presented like this:

*Single Encoder-Decoder Layer Structure*

In this article, we pick up where we left off last time. To be more precise, we build the Encoder and Decoder layers and then the Encoder and Decoder themselves. Apart from that, we will build a so-called pre-processing layer that utilizes the layers created in the previous article.

Prerequisites

In order to run the code from this and all other articles in the series, you need to have Python 3 installed on your local machine. In this example, to be more specific, we are using Python 3.7. The implementation itself is done using TensorFlow 2.0. The complete guide on how to install and use TensorFlow 2.0 can be found here. Another thing that you need to install is the TensorFlow Datasets (TFDS) package. You can do so by running the command:

pip install tensorflow-datasets

This module contains a large collection of data sets that can be used for training purposes. We will use one of these data sets for our model. Here is the list of modules that need to be imported for the complete Transformer implementation:

	import tensorflow_datasets as tfds
	import tensorflow as tf
	from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Embedding, Dropout
	from tensorflow.keras.models import Sequential, Model
	from tensorflow.keras.optimizers.schedules import LearningRateSchedule
	from tensorflow.keras.optimizers import Adam
	from tensorflow.keras.losses import SparseCategoricalCrossentropy
	from tensorflow.keras.metrics import Mean, SparseCategoricalAccuracy

	from tqdm import tqdm
	import numpy as np
	import matplotlib.pyplot as plt


Make sure that you have them all installed.

Pre-Processing Layer

So, let's first create a layer that utilizes the Embedding and Positional Encoding we implemented in the previous article. As we mentioned there, Embedding is the process that maps text into a vector based on its semantic meaning. Words are transferred into some sort of vector representation (or embedding) in an n-dimensional latent space. In this latent space, vectors that are close to each other belong to words that have similar semantic meaning. The Embedding layer is available as a part of the TensorFlow library.

The semantic meaning of a word also depends on the position of that word in a sentence and on its relationship with the other words in that sentence. That is why information about the position of every word in a sequence is required, and this information is carried by the positional encoding vector. This process is proposed in the “Attention is all you need” paper, which also explains how relative positions can be derived from the encoding of the current position of the word. Now we need to combine the positional encoding and the Embedding. Here is the code:

	class PreProcessingLayer(Layer):
	    def __init__(self, num_neurons, vocabular_size):
	        super(PreProcessingLayer, self).__init__()

	        # Initialize
	        self.num_neurons = num_neurons

	        # Add embeddings and positional encoding
	        self.embedding = Embedding(vocabular_size, self.num_neurons)
	        positional_encoding_handler = PositionalEncoding(vocabular_size, self.num_neurons)
	        self.positional_encoding = positional_encoding_handler.get_positional_encoding()

	        # Add dropout
	        self.dropout = Dropout(0.1)

	    def call(self, sequence, training, mask):
	        # mask is kept for a uniform call signature; it is not used in this layer
	        sequence_length = tf.shape(sequence)[1]
	        sequence = self.embedding(sequence)

	        # Scale the embeddings and add the positional encoding for each position
	        sequence *= tf.math.sqrt(tf.cast(self.num_neurons, tf.float32))
	        sequence += self.positional_encoding[:, :sequence_length, :]
	        sequence = self.dropout(sequence, training=training)

	        return sequence


It is pretty straightforward. We utilize the Embedding layer from tensorflow.keras.layers and use the PositionalEncoding implementation from the previous article. Note that at the end of this structure we add a dropout layer in order to avoid over-fitting. This is a practice we use for the other layers as well.
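To see the arithmetic of this layer in isolation, here is a minimal NumPy sketch. It is not the PositionalEncoding class from the previous article: the function names, the toy sizes and the random embedding table are assumptions made for illustration, and the sinusoidal formula follows the one described in the “Attention is all you need” paper. Dropout is left out to keep the example deterministic.

```python
import numpy as np

def sinusoidal_positional_encoding(max_positions, num_neurons):
    """Return sinusoidal encodings of shape (1, max_positions, num_neurons)."""
    positions = np.arange(max_positions)[:, np.newaxis]   # (max_positions, 1)
    dims = np.arange(num_neurons)[np.newaxis, :]           # (1, num_neurons)
    # Each pair of dimensions (2i, 2i + 1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / num_neurons)
    angles = positions * angle_rates
    encoding = np.zeros_like(angles)
    encoding[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return encoding[np.newaxis, ...]

def pre_process(token_ids, embedding_table, positional_encoding):
    """Embed token ids, scale by sqrt(num_neurons), add positional encoding."""
    num_neurons = embedding_table.shape[1]
    sequence_length = token_ids.shape[1]
    embedded = embedding_table[token_ids]          # (batch, seq_len, num_neurons)
    embedded = embedded * np.sqrt(num_neurons)
    return embedded + positional_encoding[:, :sequence_length, :]

# Toy sizes, chosen only for this illustration.
rng = np.random.default_rng(0)
vocabulary_size, num_neurons = 50, 8
embedding_table = rng.normal(size=(vocabulary_size, num_neurons))
positional_encoding = sinusoidal_positional_encoding(20, num_neurons)

token_ids = np.array([[3, 7, 1, 0], [5, 5, 2, 9]])   # (batch=2, seq_len=4)
output = pre_process(token_ids, embedding_table, positional_encoding)
print(output.shape)   # (2, 4, 8)
```

Note that the second sequence contains the same token (5) twice. The two embeddings are identical, but the resulting vectors differ because a different positional encoding is added at each position.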

Encoder Layer

Encoder and Decoder layers have similar structures. The Encoder layer is a bit simpler though. Here is what it looks like:

Essentially, it utilizes a Multi-Head Attention Layer and a simple Feed Forward Neural Network. As you can see in the image, there are also several normalization steps. Note that in this case this refers to layer normalization. In order to reduce training time, instead of using batch normalization like we would with standard feed forward neural networks, we use a modified approach called layer normalization. This approach is common in sequence-to-sequence models, since it is not obvious how to apply batch normalization when sequence lengths vary. In essence, we compute the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case, so the statistics do not depend on the other examples in the batch. This is all hidden from us if we use LayerNormalization from tensorflow.keras.layers.
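The computation itself is short. The following NumPy sketch normalizes every position over its feature axis; the function name and the toy data are assumptions made for illustration. The Keras LayerNormalization layer additionally learns a scale and an offset, which are omitted here.

```python
import numpy as np

def layer_normalization(inputs, epsilon=1e-6):
    """Normalize each vector along the last (feature) axis."""
    mean = inputs.mean(axis=-1, keepdims=True)
    variance = inputs.var(axis=-1, keepdims=True)
    return (inputs - mean) / np.sqrt(variance + epsilon)

rng = np.random.default_rng(1)
# (batch=2, seq_len=4, num_neurons=8), deliberately not centered or unit-scaled
batch = rng.normal(loc=5.0, scale=3.0, size=(2, 4, 8))

normalized = layer_normalization(batch)

# Normalizing the first example alone gives the same values as normalizing it
# as part of the batch: the statistics never mix different training cases.
single = layer_normalization(batch[:1])
print(np.allclose(single, normalized[:1]))
```

After normalization every position has a mean close to 0 and a variance close to 1 across its features, regardless of the batch size.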

Effectively, this means that we use layer normalization after each Multi-Head Attention or Feed Forward Neural Network layer. Also, as we mentioned, we use a Dropout layer to avoid over-fitting. So, we create two helper functions that build these combinations for us:

	def build_multi_head_attention_layers(num_neurons, num_heads):
	    multi_head_attention_layer = MultiHeadAttentionLayer(num_neurons, num_heads)
	    dropout = Dropout(0.1)
	    normalization = LayerNormalization(epsilon=1e-6)
	    return multi_head_attention_layer, dropout, normalization

	def build_feed_forward_layers(num_neurons, num_hidden_neurons):
	    feed_forward_layer = Sequential()
	    feed_forward_layer.add(Dense(num_hidden_neurons, activation='relu'))
	    feed_forward_layer.add(Dense(num_neurons))

	    dropout = Dropout(0.1)
	    normalization = LayerNormalization(epsilon=1e-6)
	    return feed_forward_layer, dropout, normalization


Now we can build Encoder layer with ease:

	class EncoderLayer(Layer):
	    def __init__(self, num_neurons, num_hidden_neurons, num_heads):
	        super(EncoderLayer, self).__init__()

	        # Build multi head attention layer and necessary additional layers
	        self.multi_head_attention_layer, self.attention_dropout, self.attention_normalization = \
	            build_multi_head_attention_layers(num_neurons, num_heads)

	        # Build feed-forward neural network and necessary additional layers
	        self.feed_forward_layer, self.feed_forward_dropout, self.feed_forward_normalization = \
	            build_feed_forward_layers(num_neurons, num_hidden_neurons)

	    def call(self, sequence, training, mask):

	        # Calculate attention output (self-attention: the sequence attends to itself)
	        attention_output, _ = self.multi_head_attention_layer(sequence, sequence, sequence, mask)
	        attention_output = self.attention_dropout(attention_output, training=training)
	        # Residual connection followed by layer normalization
	        attention_output = self.attention_normalization(sequence + attention_output)

	        # Calculate output of feed forward network
	        output = self.feed_forward_layer(attention_output)
	        output = self.feed_forward_dropout(output, training=training)

	        # Combine two outputs
	        output = self.feed_forward_normalization(attention_output + output)

	        return output


In the constructor, we create all the necessary layers, and in the call function we connect them based on the schema we saw in the image above.
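The second half of the encoder layer, the feed-forward network wrapped in a residual connection and layer normalization, can be sketched in plain NumPy. The weights, sizes and function names below are assumptions made for illustration, and dropout is omitted; the sketch only shows that the block keeps the shape of its input and treats every position independently.

```python
import numpy as np

def layer_normalization(x, epsilon=1e-6):
    """Normalize each vector along the last (feature) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

def feed_forward(x, w1, b1, w2, b2):
    """Two dense layers with a ReLU in between, applied to every position."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_normalization(x + sublayer_output)

rng = np.random.default_rng(2)
num_neurons, num_hidden_neurons = 8, 32        # toy sizes for the sketch
w1 = rng.normal(size=(num_neurons, num_hidden_neurons)) * 0.1
b1 = np.zeros(num_hidden_neurons)
w2 = rng.normal(size=(num_hidden_neurons, num_neurons)) * 0.1
b2 = np.zeros(num_neurons)

x = rng.normal(size=(2, 5, num_neurons))       # (batch, seq_len, num_neurons)
output = add_and_norm(x, feed_forward(x, w1, b1, w2, b2))
print(output.shape)   # (2, 5, 8)

# Running the same block on a single position gives the same result,
# because the feed-forward network does not mix positions.
single_position = add_and_norm(x[:, 2:3], feed_forward(x[:, 2:3], w1, b1, w2, b2))
print(np.allclose(single_position, output[:, 2:3]))
```

Because the output has the same shape as the input, several such layers can be stacked on top of each other without any reshaping.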

Decoder Layer

The Decoder layer is somewhat more complicated, because it has an additional Multi-Head Attention Layer:

Thanks to the helper functions, we can implement this layer fairly easily as well:

	class DecoderLayer(Layer):
	    def __init__(self, num_neurons, num_hidden_neurons, num_heads):
	        super(DecoderLayer, self).__init__()

	        # Build multi head attention layers and necessary additional layers
	        self.multi_head_attention_layer1, self.attention_dropout1, self.attention_normalization1 = \
	            build_multi_head_attention_layers(num_neurons, num_heads)

	        self.multi_head_attention_layer2, self.attention_dropout2, self.attention_normalization2 = \
	            build_multi_head_attention_layers(num_neurons, num_heads)

	        # Build feed-forward neural network and necessary additional layers
	        self.feed_forward_layer, self.feed_forward_dropout, self.feed_forward_normalization = \
	            build_feed_forward_layers(num_neurons, num_hidden_neurons)

	    def call(self, sequence, encoder_output, training, look_ahead_mask, padding_mask):

	        # First attention block: masked self-attention over the target sequence
	        attention_output1, attention_weights1 = self.multi_head_attention_layer1(sequence, sequence, sequence, look_ahead_mask)
	        attention_output1 = self.attention_dropout1(attention_output1, training=training)
	        attention_output1 = self.attention_normalization1(sequence + attention_output1)

	        # Second attention block: attend over the encoder output
	        # (argument order follows MultiHeadAttentionLayer from the previous article)
	        attention_output2, attention_weights2 = self.multi_head_attention_layer2(encoder_output, encoder_output, attention_output1, padding_mask)
	        attention_output2 = self.attention_dropout2(attention_output2, training=training)
	        attention_output2 = self.attention_normalization2(attention_output1 + attention_output2)

	        # Feed forward network with residual connection and normalization
	        output = self.feed_forward_layer(attention_output2)
	        output = self.feed_forward_dropout(output, training=training)
	        output = self.feed_forward_normalization(attention_output2 + output)

	        return output, attention_weights1, attention_weights2


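The only difference is that we use two Multi-Head Attention Layers before the Feed Forward Neural Network layer. The first one attends over the target sequence using the look-ahead mask, while the second one attends over the output of the Encoder using the padding mask. OK, let's now combine all these layers into the Encoder and Decoder structures.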
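The two masks passed to the decoder layer can be illustrated with a small NumPy sketch. It uses a single attention head without learned projections, and it assumes the convention that a 1 in the mask marks a position that must be ignored and that 0 is the padding token id. These conventions and all function names are assumptions for illustration; the actual MultiHeadAttentionLayer from the previous article may differ.

```python
import numpy as np

def create_look_ahead_mask(size):
    """Upper-triangular mask: position i may not attend to positions after i."""
    return np.triu(np.ones((size, size)), k=1)

def create_padding_mask(token_ids, pad_id=0):
    """Mark padding tokens; shape (batch, 1, seq_len) broadcasts over queries."""
    return (token_ids == pad_id).astype(np.float64)[:, np.newaxis, :]

def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head attention; masked positions receive a weight of zero."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(3)
num_neurons = 8
source_ids = np.array([[4, 9, 2, 7, 0, 0], [5, 3, 0, 0, 0, 0]])   # 0 = padding
encoder_output = rng.normal(size=(2, 6, num_neurons))   # source length 6
target = rng.normal(size=(2, 4, num_neurons))           # target length 4

# First attention block: masked self-attention over the target sequence.
_, self_weights = scaled_dot_product_attention(
    target, target, target, create_look_ahead_mask(4))

# Second attention block: queries come from the decoder, keys and values
# from the encoder output, and padded source positions are masked out.
_, cross_weights = scaled_dot_product_attention(
    target, encoder_output, encoder_output, create_padding_mask(source_ids))

print(self_weights.shape)    # (2, 4, 4)
print(cross_weights.shape)   # (2, 4, 6)
```

In the self-attention weights, everything above the diagonal is zero, so a target position never sees later positions. In the cross-attention weights, the columns that correspond to padded source tokens are zero.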

Encoder & Decoder

In the “Attention is all you need” paper, the authors use 6 Encoder Layers for building the Encoder and 6 Decoder Layers for building the Decoder. This number is of course arbitrary, so we use a parameter to define how many layers there should be. Here is what the Encoder class looks like:

	class Encoder(Layer):
	    def __init__(self, num_neurons, num_hidden_neurons, num_heads, vocabular_size, num_enc_layers=6):
	        super(Encoder, self).__init__()

	        self.num_enc_layers = num_enc_layers

	        self.pre_processing_layer = PreProcessingLayer(num_neurons, vocabular_size)
	        self.encoder_layers = [EncoderLayer(num_neurons, num_hidden_neurons, num_heads) for _ in range(num_enc_layers)]

	    def call(self, sequence, training, mask):

	        sequence = self.pre_processing_layer(sequence, training, mask)
	        for i in range(self.num_enc_layers):
	            sequence = self.encoder_layers[i](sequence, training, mask)

	        return sequence


So, we first create the PreProcessingLayer. This layer applies embedding and positional encoding to the input sequence. Then we create several EncoderLayer instances. Their number is defined by the num_enc_layers parameter. In the overridden call function (note that we are still inheriting from the Layer class) we connect all of this into a single unified Encoder.

In the same way we created the Encoder, we create the Decoder as well:

	class Decoder(Layer):
	    def __init__(self, num_neurons, num_hidden_neurons, num_heads, vocabular_size, num_dec_layers=6):
	        super(Decoder, self).__init__()

	        self.num_dec_layers = num_dec_layers

	        self.pre_processing_layer = PreProcessingLayer(num_neurons, vocabular_size)
	        self.decoder_layers = [DecoderLayer(num_neurons, num_hidden_neurons, num_heads) for _ in range(num_dec_layers)]

	    def call(self, sequence, encoder_output, training, look_ahead_mask, padding_mask):

	        # Collect the attention weights of every decoder layer
	        attention_weights = {}

	        # The pre-processing layer does not use the mask it receives
	        sequence = self.pre_processing_layer(sequence, training, look_ahead_mask)

	        for i in range(self.num_dec_layers):

	            sequence, attention_weights1, attention_weights2 = self.decoder_layers[i](sequence, encoder_output, training, look_ahead_mask, padding_mask)

	            attention_weights['decoder_layer{}_attention_weights1'.format(i+1)] = attention_weights1
	            attention_weights['decoder_layer{}_attention_weights2'.format(i+1)] = attention_weights2

	        return sequence, attention_weights


The only difference in the constructor is that we use DecoderLayer instead of EncoderLayer. Apart from that, we have to return the attention weights of every layer, collected in a dictionary, as the output of the overridden call function. This is because we need to use these values for the Transformer training process.

Conclusion

We got one step closer to finishing the complete Transformer architecture. In this article, we utilized Embedding, Positional Encoding and Attention Layers to build Encoder and Decoder Layers. Apart from that, we learned how to use Layer Normalization and why it is important for sequence-to-sequence models. Finally, we used the created layers to build the Encoder and Decoder structures, essential parts of the Transformer. In the next Transformer article, we will combine all these things together and create the complete model.

Thank you for reading!

Read more posts from the author at Rubik’s Code.

Trackbacks/Pingbacks

Dew Drop – August 19, 2019 (#3012) | Morning Dew - […] Transformer with Python and TensorFlow 2.0 – Encoder & Decoder (Nikola Živković) […]
Transformer with Python and TensorFlow 2.0 – Encoder & Decoder – معتز خالد سعد | Motaz Saad - […] https://rubikscode.net/2019/08/19/transformer-with-python-and-tensorflow-2-0-encoder-decoder/ […]
