All neural network architectures rest on the same principles: neurons with biases and activation functions, linked by weighted connections. However, different problems require a different mix of those neurons and connections. That is how we ended up with a big “neural networks zoo” with different architectures and learning processes. So far in our artificial neural network series, we covered several of those architectures, like Convolutional Neural Networks, Long Short-Term Memory Networks, Self-Organizing Maps, Restricted Boltzmann Machines, etc. This time we are diving into the world of Autoencoders. The idea of Autoencoders can be traced back to 1987 and the work of LeCun. In the beginning, these networks were used for dimensionality reduction, but recently they have been used for generative modeling as well.

The architecture of Autoencoders is very similar to that of standard feed-forward neural networks, but their main goal is different. Because of this similarity, the same learning techniques can be used on this type of network as well. Just like Self-Organizing Maps and Restricted Boltzmann Machines, Autoencoders utilize unsupervised learning. Neural networks that use this type of learning get only input data, and based on that they generate some form of output. In the case of Autoencoders, they try to copy the input information to the output during training.

Undercomplete & Overcomplete

The last sentence might be a bit confusing, so let’s observe the image below to clear things up. Autoencoders got their name because, essentially, they automatically encode information. By this, I don’t mean some form of encryption, but more of a compression. As you can see, the hidden layer in the middle has fewer neurons than the input and output layers. In practice, we can have more hidden layers. The closer a layer is to the middle of the network, the fewer neurons it has. These symmetrical, hourglass-like autoencoders are often called Undercomplete Autoencoders.

We can observe this mathematically too. The first section, up until the middle of the architecture, is called encoding: f(x). The hidden layer in the middle is called the code, and it is the result of the encoding: h = f(x). The last section is called decoding (shocking :)), and it produces the reconstruction of the data: y = g(h) = g(f(x)). So, the Autoencoder gets the information on the input layer, propagates it to the middle layer, and then returns the same information on the output. That is why the information on the neurons of the middle layer is actually the most interesting: it represents the encoded information.
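The composition y = g(f(x)) can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes (five inputs, two code values), the random untrained weights, the tanh encoder and the linear decoder are all assumptions made for the example, not something prescribed above.

```python
import numpy as np

rng = np.random.default_rng(42)

n_input, n_code = 5, 2  # hypothetical sizes: 5 inputs squeezed into 2 code values

# Random, untrained weights (assumption: just enough to show the data flow).
W_enc = rng.normal(size=(n_code, n_input))
b_enc = np.zeros(n_code)
W_dec = rng.normal(size=(n_input, n_code))
b_dec = np.zeros(n_input)

def f(x):
    """Encoder: maps the input x to the code h."""
    return np.tanh(W_enc @ x + b_enc)

def g(h):
    """Decoder: maps the code h to the reconstruction y."""
    return W_dec @ h + b_dec

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
h = f(x)   # code, h = f(x)
y = g(h)   # reconstruction, y = g(h) = g(f(x))

print(h.shape, y.shape)  # (2,) (5,)
```

Since the weights are untrained, y will not resemble x yet. The point is only that the code h is smaller than the input and that the output has the same shape as the input.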

However, sometimes the middle layers are not used like this. Sometimes the middle layer has more neurons than the input and output layers. In that situation, we are trying to get more information out of the input layer than is directly present in the input itself. This kind of Autoencoder is presented in the image below, and it is called an Overcomplete Autoencoder.

As mentioned, the goal of this kind of Autoencoder is to extract more information from the input than is given directly on the input. However, in this situation, nothing stops the Autoencoder from simply moving the information from the input to the output. Rather than limiting the size of the middle layer, we use different regularization techniques, which encourage the Autoencoder to have other properties. This way Autoencoders can be overcomplete and non-linear. Some of the techniques that are used for this are sparsity and denoising, which we will examine later in the article. First, we need to cover the learning process itself.

How do they learn?

We saw that the architecture of Autoencoders resembles that of feed-forward neural networks. Essentially, this means that we can use the same techniques that we used on them. However, we also said that we will use unsupervised learning. The trick is that we will use backpropagation and other learning techniques from supervised learning models, but with only the input data, since the desired output is the same as the input. That is how Autoencoders put all these topics under one roof.

Let’s observe the Autoencoder shown in the image below. It is a pretty simple architecture: five neurons in the input and the output layer, and two in the middle layer. The idea is to represent the information about five inputs with just two values. It is important to note that in Autoencoders the neurons in the middle layer(s) usually use the tanh activation function, and the neurons in the output layer use the softmax activation function. This configuration tends to give good results, so we used it in our example as well. Note that connections marked in green have weight 1 and the rest have weight -1. Another thing that I need to mention before we proceed is that I used binary values for the input in the example, which is not mandatory.

Now let’s put some input data on our input layer. For this example, we will use the array of values [1, 0, 0, 0, 0] and end up with this situation:

Of course, because we are using the softmax activation function on the neurons of the output layer, we will end up with essentially the same values as we have on the input: [1, 0, 0, 0, 0]. That is exactly what we wanted.

Still, to get the correct values for the weights, which were simply given in the previous example, we need to train the Autoencoder. To do so, we need to follow these steps:

  1. Set the input vector on the input layer
  2. Encode input vector into the vector of lower dimensionality – code
  3. Reconstruct input vector by decoding code vector
  4. Calculate reconstruction error d(x, y) = ||x – y||
  5. Backpropagate the error
  6. Minimize reconstruction error over several epochs
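
The steps above can be sketched with plain NumPy for the 5-2-5 network from the example. Treat this as a minimal sketch under a few assumptions that are not part of the original example: the weights start from small random values, all five one-hot vectors are used as the training set, and cross-entropy is swapped in for the norm-based error as the quantity being minimized, because together with softmax it gives the simple output gradient y - x. The learning rate and number of epochs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.eye(5)            # training set: the five one-hot vectors, one per row
n_input, n_code = 5, 2

# Small random starting weights (assumption: any reasonable init would do).
W1 = rng.normal(0.0, 0.5, size=(n_input, n_code))
b1 = np.zeros(n_code)
W2 = rng.normal(0.0, 0.5, size=(n_code, n_input))
b2 = np.zeros(n_input)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

learning_rate, epochs = 0.2, 5000          # arbitrary illustrative values
losses = []

for epoch in range(epochs):
    # Steps 1-2: set the inputs and encode them into the 2-value code.
    H = np.tanh(X @ W1 + b1)
    # Step 3: reconstruct the inputs by decoding the code.
    Y = softmax(H @ W2 + b2)
    # Step 4: measure how far the reconstruction is from the input
    # (cross-entropy here, an assumption; see the note above).
    losses.append(-np.mean(np.sum(X * np.log(Y + 1e-12), axis=1)))
    # Step 5: backpropagate the error through both layers.
    dZ2 = (Y - X) / len(X)                 # softmax + cross-entropy gradient
    dW2 = H.T @ dZ2
    db2 = dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * (1.0 - H ** 2)    # derivative of tanh is 1 - tanh^2
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    # Step 6: update the weights; repeating this over epochs lowers the error.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Reconstruction error d(x, y) = ||x - y|| for each training vector.
reconstruction_error = np.linalg.norm(X - Y, axis=1)
print("first loss:", losses[0], "last loss:", losses[-1])
print("per-input reconstruction error:", reconstruction_error)
```

Notice that the targets passed to the backpropagation step are the inputs themselves, which is exactly what makes this training unsupervised.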

To sum it up, we can use the following function to represent the learning process:

L(x, g(f(x)))

where L is the loss function that penalizes the reconstruction error, which is what the training minimizes. For me, the coolest part of Autoencoders is the way they combine techniques from supervised learning for unsupervised learning. If you need to learn more about how feedforward neural networks learn, what activation functions are, and how backpropagation works, check out the given links.


As mentioned before, Overcomplete Autoencoders use different regularization techniques to limit the direct propagation of the data. This includes adding a certain penalty to the reconstruction error, called the sparsity penalty: Ω(h). The learning process is then represented with the function:

L(x, g(f(x))) + Ω(h)
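
As a small sketch, the penalized loss can be written as follows. The concrete form of Ω(h) is an assumption here: an L1 penalty on the code, scaled by a hypothetical weight `lam`. The text does not prescribe a specific penalty, and other choices are possible.

```python
import numpy as np

def reconstruction_loss(x, y):
    """L(x, y): distance between the input and its reconstruction."""
    return np.linalg.norm(x - y)

def sparsity_penalty(h, lam=0.1):
    """Omega(h): assumed L1 penalty, grows with the total activation of the code."""
    return lam * np.sum(np.abs(h))

def sparse_loss(x, y, h, lam=0.1):
    """L(x, g(f(x))) + Omega(h), with y = g(f(x)) and h = f(x)."""
    return reconstruction_loss(x, y) + sparsity_penalty(h, lam)

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = x.copy()  # pretend the reconstruction is perfect

# Two made-up overcomplete codes (7 values for 5 inputs).
h_dense = np.array([0.9, -0.8, 0.7, 0.9, -0.6, 0.8, 0.7])
h_sparse = np.array([0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# Same reconstruction, but the sparse code is penalized less.
print(sparse_loss(x, y, h_dense), sparse_loss(x, y, h_sparse))
```

Even with a perfect reconstruction, the code with many active neurons gets a higher loss, so training is pushed toward codes where only a few neurons are active.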

In general, this additional term penalizes activity in the code itself, so that only a few neurons of the middle layer respond strongly to any given input while the rest stay close to zero. This kind of Autoencoder is usually used before other supervised learning models to extract additional information from the input data. Often they are presented with this kind of image:


Another way to regularize overcomplete Autoencoders is the so-called denoising technique. This technique adds some random noise to the input values that are fed to the network, while the reconstruction error is still calculated against the original input. Meaning, instead of adding some value to the result of the cost function, like sparse Autoencoders do, we are changing the terms of that cost function. Mathematically, that can be presented like this:

L(x, g(f(x’)))

where x’ is a copy of the input x that has some noise added. This kind of Autoencoder is quite a nice example of how overcomplete Autoencoders can be used, as long as we take care that they don’t learn the identity function.
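
A short sketch shows why the noise helps. The Gaussian noise, its standard deviation `noise_std` and the helper names are assumptions made for this illustration. The "network" used here is deliberately the identity function, the very shortcut we want to prevent.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.3):
    """Return x', a noisy copy of x (assumption: additive Gaussian noise)."""
    return x + rng.normal(0.0, noise_std, size=x.shape)

def plain_loss(x, f, g):
    """L(x, g(f(x))): reconstruction error on the clean input."""
    return np.linalg.norm(x - g(f(x)))

def denoising_loss(x, f, g, noise_std=0.3):
    """L(x, g(f(x'))): the network sees x' but is compared against the clean x."""
    x_noisy = corrupt(x, noise_std)
    return np.linalg.norm(x - g(f(x_noisy)))

def identity(v):
    """A 'cheating' encoder/decoder that just copies its input."""
    return v

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

loss_plain = plain_loss(x, identity, identity)
loss_denoising = denoising_loss(x, identity, identity)

# Copying is perfect under the plain loss, but not under the denoising loss.
print(loss_plain, loss_denoising)
```

Under the plain loss the identity function scores a perfect zero, while under the denoising loss it is penalized for passing the noise through. To do well, the network has to learn something about the structure of the data instead of copying.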


Autoencoders are definitely a special kind of neural network. Even though their structure resembles other architectures and they use similar learning techniques, they give a different view of the whole field. We can see that neural networks can be used not only for regression and classification, but also for compression and feature extraction. This is why these networks are often used before classification or regression models. Also, they apply methods from supervised learning to unsupervised learning, which is an unusual case in the world of neural networks.

Thank you for reading!

This article is part of the Artificial Neural Networks Series, which you can check out here.

Read more posts from the author at Rubik’s Code.