This year was an important year for deep learning and machine learning in general. Things are moving quickly and the number of applications of these technologies keeps growing. We crossed the chasm and deep learning is now in the Early Majority phase. Today, we even have books about neural networks for babies (and for programmers, for that matter :)), which is fascinating. The best way to stay current in this crazy world, apart from reading cool books, is to read important papers on the subject. In this article, we will focus on the 5 papers that left a really big impact on us this year.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Read the complete paper here.

Code that accompanies this paper can be found here.

This one kinda blew us away, right? The field of NLP has been ruled by BERT (Bidirectional Encoder Representations from Transformers) since last year, but in 2019 we got a new king – XLNet. This new architecture by researchers from CMU and Google outperforms BERT on 20 tasks, often by a large margin. Exactly, minds – blown 🙂 The problem is that BERT is trained on corrupted input, which causes a pretrain-finetune discrepancy. In a nutshell, a certain number of tokens in the input sequence are replaced by a special symbol [MASK], and BERT is then trained to recover the original tokens from the corrupted input, using bidirectional context for the reconstruction. The [MASK] symbol never appears during fine-tuning, hence the discrepancy.
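
To make the masking step concrete, here is a minimal sketch of how an input sequence can be corrupted with [MASK]. It is a simplification and not BERT's actual preprocessing: the function name `mask_tokens`, the masking probability and the whitespace tokenization are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the corrupted sequence and a dict {position: original token}
    holding the targets a model would be trained to recover.
    (Hypothetical helper; real BERT preprocessing is more involved.)
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # remember what was hidden at this position
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "new york is a city".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.4, seed=1)
print(corrupted)
print(targets)
```

The model only ever sees `corrupted` during pretraining, while downstream tasks feed it clean text. That gap is the discrepancy described above.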

XLNet keeps the bidirectional context of this autoencoding approach, but obtains it through autoregressive language modeling. This type of language modeling uses the context to predict the next word. However, in a conventional autoregressive model the context is constrained to one direction: it is either forward or backward. That means that if we try to predict some word (token) in a sentence, we look only at the words that come before it, or only at the words that come after it. The best-known autoregressive language models, such as GPT, are built on the Transformer. XLNet uses the more advanced Transformer-XL architecture.

Essentially, autoregressive language modeling and BERT each have advantages over the other, and XLNet brings together the advantages of both while avoiding their weaknesses in a clever way. Just like BERT, XLNet utilizes bidirectional context, which means that the words before and after the token being predicted are both taken into consideration. On the other hand, as an autoregressive language model, XLNet doesn’t rely on input data corruption, and thus avoids BERT’s limitations.
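
How can an autoregressive model see both sides of a token? In the XLNet paper this is done by predicting tokens in a randomly permuted order rather than strictly left to right. The sketch below only illustrates which context is visible for one sampled order. The function name `permutation_contexts` is hypothetical, and no actual model is involved.

```python
import random

def permutation_contexts(tokens, seed=0):
    """For one sampled prediction order, list the context visible
    when each token is predicted.

    Returns the sampled order (a list of positions) and a dict
    {position: tokens already predicted, in sentence order}.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # sample one factorization order
    contexts = {}
    for step, pos in enumerate(order):
        visible = sorted(order[:step])  # positions predicted earlier in this order
        contexts[pos] = [tokens[j] for j in visible]
    return order, contexts

tokens = ["new", "york", "is", "a", "city"]
order, contexts = permutation_contexts(tokens, seed=0)
for pos in order:
    print(f"predict {tokens[pos]!r} given {contexts[pos]}")
```

Depending on the sampled order, a token can be predicted from words to its left, to its right, or both, yet no [MASK] symbol is ever placed in the input.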

Network Pruning via Transformable Architecture Search

Read the complete paper here.

Code that accompanies this paper can be found here.

Network pruning is a fascinating area of deep learning. The idea is to analyse the structure of a neural network and tell its “dead” parameters apart from the useful ones. For example, maybe some layers are actually increasing the loss. A new architecture, called the pruned network, can then be proposed, with an estimated depth and width. After that, the useful parameters from the original network can be transferred to the new network. This is especially useful for deep convolutional neural networks (CNNs), which can get quite big and impractical to deploy on embedded systems. In this case, network pruning can reduce the computation cost of overparameterized CNNs.
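
As a rough illustration of telling “dead” parameters from useful ones, the sketch below ranks the filters of a layer by the L1 norm of their weights and keeps the strongest. This is a generic magnitude-based criterion chosen for illustration, not the method from the paper, and the names `l1_norm` and `prune_filters` are hypothetical.

```python
def l1_norm(filter_weights):
    """Sum of absolute weights, used here as a crude importance score."""
    return sum(abs(w) for w in filter_weights)

def prune_filters(filters, keep_ratio=0.5):
    """Keep the filters with the largest L1 norm.

    Returns the indices of the kept filters (in original order)
    and the kept filters themselves.
    """
    n_keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)),
                    key=lambda i: l1_norm(filters[i]),
                    reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [filters[i] for i in kept]

# Four toy filters; two of them are close to zero ("dead").
filters = [
    [0.9, -1.2, 0.4],
    [0.01, 0.0, -0.02],
    [0.5, 0.5, -0.5],
    [0.0, 0.03, 0.01],
]
kept, pruned = prune_filters(filters, keep_ratio=0.5)
print(kept)  # → [0, 2]
```

The kept weights are what would be carried over to the smaller network.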

The traditional approach to network pruning is to train a large network, prune it, and then fine-tune the pruned network.

This paper suggests a different approach. First, a large network is trained. Then the depth and width of a small network are searched for using Transformable Architecture Search (TAS). Finally, the knowledge from the large network is transferred to the small network using knowledge distillation.
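
The last step, knowledge distillation, is commonly implemented by making the small network match the softened output distribution of the large one. The sketch below shows one common form of that loss, the KL divergence between temperature-scaled softmax outputs. The exact loss used in the paper may differ, and the function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # large network (teacher)
    q = softmax(student_logits, temperature)  # small network (student)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [5.0, 1.0, -2.0]
print(distillation_loss([4.0, 1.5, -1.0], teacher))  # small positive value
print(distillation_loss(teacher, teacher))           # 0.0 when outputs match
```

A higher temperature flattens both distributions, so the student also learns how the teacher ranks the wrong classes.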

Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed

Read the complete paper here.

Code that accompanies this paper can be found here.

When a song is recorded, each instrument is recorded separately into its own track, or stem. Later, during the mixing and mastering phases, those stems are merged together and the song is created. The goal of this paper is to find a way to reverse that process, meaning that each individual stem is extracted from the finished song. The inspiration for this problem can be found in the so-called “cocktail party effect”: the ability of the human brain to separate out and focus on a single conversation amid the surrounding noise of a room full of people chatting.

The proposed architecture merges ideas from the SING neural network architecture and from Wave-U-Net. The first one is used for symbol-to-instrument music synthesis, while the other is one of the approaches for extracting stems from the mix. Essentially, LSTM and convolutional layers are combined in a U-Net architecture. Convolutional layers are used for the encoder, a bidirectional LSTM sits between the encoder and the decoder, and the decoder is built from transposed convolutions. To speed up the model, batch normalization layers are not used. How does this model perform against other architectures? The results can be seen here.
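
One practical consequence of the U-Net shape is that every decoder layer has to produce a signal of exactly the same length as the matching encoder layer, so that skip connections can join them. The sketch below only does that length bookkeeping for 1-D strided convolutions. The kernel size, stride, depth and input length are assumptions chosen for illustration, and no audio is processed.

```python
def conv_out_len(length, kernel, stride):
    """Output length of a 1-D convolution without padding."""
    return (length - kernel) // stride + 1

def deconv_out_len(length, kernel, stride):
    """Output length of a 1-D transposed convolution without padding."""
    return (length - 1) * stride + kernel

def unet_lengths(length, depth=3, kernel=8, stride=4):
    """Signal length after each encoder layer and each decoder layer."""
    enc = [length]
    for _ in range(depth):
        enc.append(conv_out_len(enc[-1], kernel, stride))
    dec = [enc[-1]]
    for _ in range(depth):
        dec.append(deconv_out_len(dec[-1], kernel, stride))
    return enc, dec

enc, dec = unet_lengths(340, depth=3)
print(enc)  # → [340, 84, 20, 4]
print(dec)  # → [4, 20, 84, 340]
```

With an input length of 340 samples the decoder mirrors the encoder exactly. For other lengths the input would need padding or cropping before the skip connections line up.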

StarGAN v2: Diverse Image Synthesis for Multiple Domains

Read the complete paper here.

Code that accompanies this paper can be found here.

We love GANs! Especially when it comes to image creation and manipulation. One very interesting problem in this area is the so-called image-to-image translation problem, where we want to transfer characteristics from one image domain to another. Here, an image domain stands for a set of images that can be grouped as a visually distinctive category. We love solutions that aim to solve this problem, like CycleGAN and StarGAN, so you can imagine how excited we were when, a couple of days ago, we saw the StarGAN v2 paper.

This paper attacks one more problem as well – scalability across domains, meaning that it solves this problem for multiple image domains at once. In essence, this architecture builds on the success of the earlier version of StarGAN and adds style layers to it. It is composed of four modules. The first module is the generator, which is in charge of converting an input image into an output image reflecting the specific style of the domain. Next is the mapping network, which transforms a latent code into style codes for multiple domains. The third is the style encoder, which extracts the style of an image and provides it to the generator. Finally, the discriminator distinguishes between real and fake images from multiple domains.
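
To show how the four modules fit together, here is a toy sketch of the data flow. Every function body is a stand-in written for illustration (simple arithmetic on lists of numbers), not the real networks, and names such as `STYLE_DIM` are hypothetical. Only the inputs and outputs of each module are meant to mirror the description above.

```python
import math

STYLE_DIM = 2  # hypothetical size of a style code

def mapping_network(latent, domain):
    # Toy stand-in: latent code -> style code for the requested domain.
    return [(domain + 1) * z for z in latent[:STYLE_DIM]]

def style_encoder(image, domain):
    # Toy stand-in: summarise a reference image as two half-image means.
    half = len(image) // 2
    stats = [sum(image[:half]) / half,
             sum(image[half:]) / (len(image) - half)]
    return [(domain + 1) * s for s in stats]

def generator(image, style):
    # Toy stand-in: scale and shift the source pixels with the style code.
    scale, shift = style
    return [scale * p + shift for p in image]

def discriminator(image, domain, num_domains=2):
    # Toy stand-in: one real/fake score per domain, return the requested one.
    mean = sum(image) / len(image)
    scores = [math.tanh(mean - d) for d in range(num_domains)]
    return scores[domain]

source = [0.1, 0.4, 0.6, 0.9]
reference = [0.8, 0.7, 0.2, 0.1]
target_domain = 1

# Path 1: a random latent code is mapped to a style code.
style_from_latent = mapping_network([0.5, -0.3], target_domain)
fake_a = generator(source, style_from_latent)

# Path 2: the style code is extracted from a reference image instead.
style_from_reference = style_encoder(reference, target_domain)
fake_b = generator(source, style_from_reference)

print(discriminator(fake_a, target_domain), discriminator(fake_b, target_domain))
```

Both paths end in the same generator. The only difference is where the style code comes from.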

Depth-Aware Video Frame Interpolation

Read the complete paper here.

Code that accompanies this paper can be found here.

Video frame synthesis is an interesting sub-field of signal processing. In general, it is all about synthesizing video frames within an existing video. If this is done in between video frames, it is called interpolation, and if it is done after the video frames, it is called extrapolation. Video frame interpolation is a long-standing topic and has been extensively studied in the literature. In this section, we explore one interesting paper that utilizes deep learning techniques for it. Often the quality of interpolation is reduced due to large object motion or occlusion. In this paper, the authors use deep learning to detect occlusion by exploring the depth information.

In fact, they created an architecture called Depth-Aware video frame INterpolation, or DAIN. This model utilizes depth maps, local interpolation kernels and contextual features to generate video frames. Essentially, DAIN constructs the output frame by merging the input frames, depth maps and contextual features based on optical flow and local interpolation kernels.
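
Why does depth help with occlusion? When two pixels move to the same spot in the in-between frame, the closer one should win. The 1-D sketch below illustrates that idea by averaging the flow vectors that land on the same position, weighted by inverse depth. It is a simplified toy in the spirit of the paper's depth-aware flow projection. The function name `project_flow`, the rounding to the nearest pixel and the inverse-depth weights are assumptions of this sketch.

```python
def project_flow(flow, depth, t=0.5):
    """Project frame-0 -> frame-1 flow to an intermediate time t.

    flow[y]  : displacement of pixel y between the two frames (1-D, in pixels)
    depth[y] : depth of pixel y (smaller means closer to the camera)
    Returns {x: flow from time t back to frame 0} for every position x
    that at least one pixel passes through at time t.
    """
    weighted_sum, weight_total = {}, {}
    for y, (f, d) in enumerate(zip(flow, depth)):
        x = round(y + t * f)  # where pixel y is at time t
        w = 1.0 / d           # closer pixels get larger weights
        weighted_sum[x] = weighted_sum.get(x, 0.0) + w * f
        weight_total[x] = weight_total.get(x, 0.0) + w
    return {x: -t * weighted_sum[x] / weight_total[x] for x in weighted_sum}

# A close, fast-moving pixel (index 0) passes over a distant static one (index 2).
flow = [4.0, 0.0, 0.0]
depth = [1.0, 10.0, 10.0]
projected = project_flow(flow, depth, t=0.5)
print(projected[2])  # dominated by the close pixel's motion
```

A plain average at position 2 would give -1.0, while the depth-weighted result is much closer to the foreground pixel's own value of -2.0.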


In this article, we had a chance to look at some interesting papers and advancements made in the world of deep learning. The field is constantly growing and we expect an even more interesting 2020.

Thank you for reading!

Read more posts from the author at Rubik’s Code.
