At the beginning of every month, we decipher three research papers from the fields of machine learning, deep learning and artificial intelligence that left the biggest impression on us in the previous month. At the end of the article, we also add links to other papers that we found interesting but that were not our focus that month, so you can check those out as well. In February, we explored papers that, as we see it, are going to have a big impact on the future of machine learning and deep learning. In essence, we think that these proposals are going to change the way we do our jobs. Have fun!

The Tree Ensemble Layer: Differentiability meets Conditional Computation

It is an understatement to say that this paper blew us away. It is probably the most elegant solution we have read about this whole year. The authors propose a very interesting way of merging two popular and successful machine learning approaches. In a nutshell, they incorporate decision trees into the neural network structure and try to eliminate the flaws of both approaches. If you are a machine learning engineer, chances are that you have used decision tree ensembles once or twice. They give good results in various applications and are often considered to be the best out-of-the-box learners. One of their features that we are interested in is conditional computation, which refers to their ability to route each sample through a small number of nodes. This feature is especially important because it lets these models activate only a fraction of their architecture, which leads to both statistical and computational benefits.

However, like any classical machine learning approach, their performance heavily depends on feature engineering. On the other hand, neural networks earned their good name exactly because of their ability to perform feature engineering on their own. They have a good mechanism for representation learning, but they are harder to tune and they don’t support conditional computation. That is where the idea of combining these two approaches originated. The authors propose the Tree Ensemble Layer (TEL) for neural networks. This layer is an additive model of differentiable decision trees, and it is trained together with the rest of the network using gradient-based optimization methods like stochastic gradient descent. TEL is accompanied by a new sparse activation function for sample routing and by specialized forward and backward propagation algorithms.

To be precise, this algorithm uses improved soft trees, which are a variant of decision trees that perform soft routing. Standard decision trees perform hard routing: a sample is routed in exactly one direction at every internal node. This is bad for gradient-based optimization, because hard routing is not differentiable. In the soft routing approach, a sample can be routed left and right simultaneously, in different proportions. This makes soft trees differentiable, meaning that we can apply stochastic gradient descent to them, but it doesn’t support conditional computation, because classical soft trees can’t route strictly to the left or strictly to the right. That is why a new activation function is introduced: the smooth-step activation function, which can output exact zeros and ones, unlike the sigmoid function that is often used in soft decision trees.
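To make the difference between the two activations more tangible, here is a minimal sketch in plain Python. It uses a cubic smooth-step of the kind the paper describes; the parameter name `gamma` (the width of the smooth region) and the helper `route` are our own, so treat this as an illustration under those assumptions rather than the authors' implementation.

```python
import math


def sigmoid(t):
    """Logistic function: strictly between 0 and 1 for moderate inputs."""
    return 1.0 / (1.0 + math.exp(-t))


def smooth_step(t, gamma=1.0):
    """Cubic smooth-step: exactly 0 below -gamma/2 and exactly 1 above gamma/2.

    `gamma` is an assumed name for the width of the smooth transition region.
    """
    half = gamma / 2.0
    if t <= -half:
        return 0.0
    if t >= half:
        return 1.0
    # Cubic that reaches 0 and 1 with zero slope at the two boundaries.
    return -2.0 / gamma ** 3 * t ** 3 + 3.0 / (2.0 * gamma) * t + 0.5


def route(t, activation):
    """Return the (left, right) routing proportions for one internal node."""
    left = activation(t)
    return left, 1.0 - left


if __name__ == "__main__":
    for t in (-2.0, 0.0, 0.2, 2.0):
        print(t, route(t, sigmoid), route(t, smooth_step))
```

With the sigmoid, both children always receive a non-zero share of the sample, so the whole tree has to be evaluated. With the smooth-step, inputs far enough from zero are routed entirely to one side, which is what makes conditional computation possible.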

This approach, of course, requires some changes to forward and backward propagation, so that both passes take advantage of conditional computation. The paper gives a specialized algorithm for the forward pass and a modified algorithm for backpropagation.

In experiments on 23 datasets, this approach achieved a speed-up of more than 10x compared with the differentiable trees used in earlier work.
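To illustrate where the savings come from, here is a rough sketch of a forward pass through a single soft tree that skips every subtree a sample cannot reach. It is a simplification under our own assumptions (a complete binary tree stored in heap order, scalar leaf values, hypothetical function names), not the algorithm from the paper, and it leaves out the backward pass entirely.

```python
def smooth_step(t, gamma=1.0):
    """Cubic smooth-step that outputs exact 0 and 1 outside [-gamma/2, gamma/2]."""
    half = gamma / 2.0
    if t <= -half:
        return 0.0
    if t >= half:
        return 1.0
    return -2.0 / gamma ** 3 * t ** 3 + 3.0 / (2.0 * gamma) * t + 0.5


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def tree_forward(x, weights, leaf_values, gamma=1.0):
    """Forward pass through one soft tree with conditional computation.

    weights:     one weight vector per internal node, in heap order
                 (root is 0, children of node i are 2i+1 and 2i+2).
    leaf_values: one scalar per leaf, left to right.
    Returns (output, number of internal nodes actually evaluated).
    """
    n_internal = len(weights)
    output, visited = 0.0, 0
    stack = [(0, 1.0)]  # (node index, proportion of the sample reaching it)
    while stack:
        node, prob = stack.pop()
        if node >= n_internal:  # reached a leaf
            output += prob * leaf_values[node - n_internal]
            continue
        visited += 1
        left = smooth_step(dot(weights[node], x), gamma)
        if left > 0.0:  # skip the left subtree when it is unreachable
            stack.append((2 * node + 1, prob * left))
        if left < 1.0:  # skip the right subtree when it is unreachable
            stack.append((2 * node + 2, prob * (1.0 - left)))
    return output, visited


def ensemble_forward(x, trees, gamma=1.0):
    """Additive model: the layer output is the sum of the tree outputs."""
    return sum(tree_forward(x, w, leaves, gamma)[0] for w, leaves in trees)


if __name__ == "__main__":
    # Toy depth-2 tree with made-up weights and leaf values.
    weights = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    leaves = [1.0, 2.0, 3.0, 4.0]
    print(tree_forward([2.0, 0.1], weights, leaves))
```

For the toy sample in the usage example, the root routes everything to the left, so the right internal node is never evaluated. In deeper trees the same effect prunes whole subtrees.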

Read the complete paper here.

The code that accompanies this paper can be found here.

The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding

As we are aware, the world of deep learning is dominated by supervised learning. However, this situation is slowly changing. A lot of breakthroughs in the field resulted in models that are huge and hard to train. This is why transfer learning emerged. Today, engineers use models that someone else pre-trained and then customize them for a specific task. This paradigm shift is most noticeable in Natural Language Understanding (NLU), where the development of projects differs a lot from the rest of the field. The process of building NLU models starts with the pre-training stage, which trains large neural networks like Transformers in a self-supervised manner on a large unlabeled text corpus. The pre-training stage is followed by the fine-tuning stage, where the pre-trained model is adapted to the specific task.

Since these models are huge, they are quite tricky to handle when it comes to deployment and performance. That is why the final stage, knowledge distillation, is of the utmost importance. If done properly, this stage can compress a large model and ease our deployment process. There are many tools out there that cover some parts of this process. However, none of these tools covers all the necessary steps. Often they do not provide adversarial training, which is key when it comes to fine-tuning NLU models. This paper presents MT-DNN, a new open-source PyTorch-based toolkit that comes from Microsoft. It is available here.

To install it, you need to have Python 3.6 and PyTorch installed in your local environment. Then you need to install the requirements with the command:

pip install -r requirements.txt

Alternatively, you can use Docker instead of a local installation. Pull the Docker image with the command:

docker pull allenlao/pytorch-mt-dnn:v0.5

Then run the Docker container with:

docker run -it --rm --runtime nvidia allenlao/pytorch-mt-dnn:v0.5 bash

You can try it by training a toy MT-DNN model. First, you need to download the data using the command:


Pre-processing is done with the command:

sh experiments/glue/

Finally, run the training with:


One of the key advantages of this tool is that it offers adversarial training, multi-task learning and knowledge distillation out of the box. Users can also perform pre-training from scratch. Apart from that, this tool provides a number of pre-trained NLU models, like BERT, RoBERTa and UniLM. The workflow for creating models using MT-DNN goes as follows:

  • Train a neural language model on a large amount of unlabeled raw text to obtain general contextual representations.
  • Fine-tune the learned contextual representation on downstream tasks, e.g. GLUE (Wang et al., 2018).
  • Distill this large model to a lighter one for online deployment. In the latter two phases, we can leverage multi-task learning and adversarial training to further improve performance.

Adversarial training is available in the fine-tuning and knowledge distillation stages. The process of knowledge distillation is very interesting. It all starts with selecting a set of task-specific labeled training data. After that, an ensemble of neural nets, the so-called teacher, is trained for each task. The teacher is used to generate a set of soft targets for each task-specific training sample. A single MT-DNN, the so-called student, is then trained on these soft targets using multi-task learning and backpropagation. The overall system architecture is based on Lexicon Encoders and works as follows:

The input, which can be a sentence or a set of sentences, is first passed through an embedding layer, so that it is represented as a sequence of embedding vectors. Then the encoder (Transformer or LSTM) captures contextual information for each word and generates contextual embedding vectors. Finally, for each task, an additional task-specific layer is applied, along with the operation for classification, similarity scoring or relevance ranking.
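Going back to the knowledge distillation step described earlier in this section, the idea of training a student on a teacher's soft targets can be sketched in a few lines of plain Python. This is not MT-DNN code: the function names are hypothetical, the logits are made-up numbers, and we assume that the soft targets are the average of the class probabilities predicted by the teachers in the ensemble.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(teacher_logits):
    """Average the class probabilities predicted by every teacher (assumption)."""
    probs = [softmax(logits) for logits in teacher_logits]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]


def soft_cross_entropy(student_logits, targets):
    """Cross-entropy between the soft targets and the student's prediction."""
    student = softmax(student_logits)
    return -sum(q * math.log(p) for q, p in zip(targets, student))


def logit_gradient(student_logits, targets):
    """Gradient of the soft cross-entropy w.r.t. the student logits."""
    return [p - q for p, q in zip(softmax(student_logits), targets)]


if __name__ == "__main__":
    # Made-up logits from two teachers for one training sample, three classes.
    targets = soft_targets([[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]])
    student = [0.0, 0.0, 0.0]
    for step in range(5):
        print(step, round(soft_cross_entropy(student, targets), 4))
        grad = logit_gradient(student, targets)
        student = [z - 0.5 * g for z, g in zip(student, grad)]
```

In a real setup the gradient would of course flow further back into the student network's weights. Here we update the logits directly just to show that the loss against the soft targets goes down.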

Read the complete paper here.

The code that accompanies this paper can be found here.

SUOD: Toward Scalable Unsupervised Outlier Detection

One of the basic steps in data analysis is outlier detection, i.e. the detection of samples that deviate from the general data distribution. During this step, abnormalities in the dataset are detected. This process is often driven by some kind of unsupervised model, since ground truth is usually not available up front. However, these models are unstable, which means that relying on just one unsupervised model is risky. That is why data scientists usually build a number of models and then combine or further analyze their outputs. In essence, that is how outlier ensemble methods were developed. This approach has several flaws, like scalability and computational cost. Apart from that, many of these algorithms (kNN, Local Outlier Factor, Local Outlier Probabilities) rely on distances in Euclidean space, which means that high dimensionality is a problem as well.

That is why the authors of this paper propose SUOD, a three-module acceleration framework that speeds up training and prediction with a large number of unsupervised models. First, the framework generates a random low-dimensional subspace for each unsupervised model, on which the model is then trained. Second, a balanced parallel scheduling heuristic is proposed for increasing efficiency in distributed systems: for each model, SUOD predicts the running time and distributes the workload among workers based on that prediction. Finally, the third module of SUOD replaces unsupervised models with lower-cost supervised regressors that approximate them. As you are probably aware, supervised models are usually faster at prediction time and easier to interpret. This can be compared with some of the knowledge distillation techniques that are used for neural networks. The paper lists the complete SUOD algorithm that ties these three modules together.
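As a rough illustration of the first two modules (not the paper's full algorithm and not the SUOD library API), the sketch below projects the data into a random low-dimensional subspace and then spreads models over workers according to their predicted running times. The function names, the model names and the costs are all made up, and the scheduler is a simple greedy longest-first heuristic that we swapped in for the paper's own scheduling method.

```python
import heapq
import random


def random_projection(data, target_dim, seed=0):
    """Project each row of `data` onto `target_dim` random Gaussian directions."""
    rng = random.Random(seed)
    source_dim = len(data[0])
    scale = 1.0 / target_dim ** 0.5
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(source_dim)]
    return [[sum(row[i] * matrix[i][j] for i in range(source_dim))
             for j in range(target_dim)]
            for row in data]


def balanced_schedule(predicted_costs, n_workers):
    """Greedy stand-in scheduler: the most expensive model goes first,
    always to the worker with the smallest load so far.

    predicted_costs: dict mapping a model name to its forecast running time.
    Returns a dict mapping a worker index to the list of models assigned to it.
    """
    heap = [(0.0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(n_workers)}
    by_cost = sorted(predicted_costs.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))
    return assignment


if __name__ == "__main__":
    # Hypothetical forecast running times (in seconds) for four detectors.
    costs = {"knn": 8.0, "lof": 7.0, "loop": 3.0, "iforest": 2.0}
    print(balanced_schedule(costs, n_workers=2))

    data = [[random.random() for _ in range(20)] for _ in range(5)]
    print(len(random_projection(data, target_dim=4)[0]))
```

In the toy example, both workers end up with a predicted load of 10 seconds. Simply giving the first two models to one worker and the last two to the other would result in 15 seconds versus 5, which is the kind of imbalance the scheduling module is meant to avoid.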

Read the complete paper here.

The code that accompanies this paper can be found here.

Other Amazing Papers from this Month


In this article, we had a chance to read about three really cool papers that, as we see it, are going to change the way we do our jobs. First, we saw how one can utilize differentiable decision trees within neural networks. Then we explored the new Microsoft toolkit for NLU. Finally, we saw how we can speed up our outlier detection. Did you have any favorites this month? Let us know.

Thank you for reading!

Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic is the CAIO at Rubik’s Code and the author of the book “Deep Learning for Programmers“. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.