Well, it is not just end of the year, but end of the decade as well. So, for the last article of 2019, we decided to write about awarded research papers from Neural Information Processing Systems (NeurIPS) conference. Looking at the logo of NeurIPS conference, my wife asked me “What is that? Is that a witchcraft conference?”. Close, but no, it is one of the most important machine learning conferences. In fact, this seems like appropriate end of the decade blog post, since NeurIPS awards is something like the Oscars in the world of machine learning. Every year a bunch of papers are proposed to and best papers are awarded. This was thirty-third NeurIPS conference and it was held in Vancouver from December 8th to 14th, 2019. Interesting thing is that this years’ conference had biggest number of submissions ever. Record-breaking 6743 submissions this year were made this year, which says a lot about the state of machine learning industry. In fact field grew so much, that tickets for this year conference were sold out in under twelve minutes. I know, mind blowing.

From huge number of submitted papers, 1428 were accepted and 3 were awarded. In general there are three categories:

  • Outstanding Paper Award – Best research paper of the conference
  • Outstanding New Directions Paper Award – Research paper that is setting a path for future research.
  • Test of Time Award – Research paper that was published 10 years ago at NeurIPS and that had a lasting impact on our community.

The NeurIPS committee was guided by several criteria. The best paper has to be revolutionary, creative and have certain elegance, but it has feasible, realistic and reproducible as well. It also shouldn’t be over complicated and inefficient. In our opinion committee have done an awesome job 🙂

Outstanding Paper Award

Distribution-Independent PAC Learning of Halfspaces with Massart Noise

Now this is one amazing paper! What strikes us the most is how this paper proposes an elegant new approach to the old problem. In a nutshell, this paper attacks one of the most influential machine learning problems – the problem of learning an unknown halfspace. Or to be more precise, it focuses on an algorithm for learning halfspaces in the distribution-independent PAC model with Massart Noise. Let’s decipher this a little bit. Halfspaces are functions that separate two classes, positive samples and negative samples, by a hyperplane. Basically, binary classification. They are also called Linear Threshold Functions (LTFs), are can be represented like this:

where sign(u) = 1 if u ≥ 0 and sign(u) = −1 if u < 0, w are the weights and x are the features. In a nutshell, they are boolean functions that separate data into two spaces. If you want to observe this from the deep learning perspective it is the problem that Rosenblatt’s Perceptron tried to solve as well. The main issue here is that if data is corrupted, the result depends on the underlying noise model.

One of the algorithms that attack this problem of binary classification is probably approximately correct (PAC) learning. This model analyzes whether and under what conditions a learning agent will output an approximately correct classification. In it’s unsupervised, non-parametric statistical technique primarily used for dimensionality reduction.
Massart Noise extends this approach. This is achieved by flipping the label of each sample/record with some small probability that is unknown to the learning agent. Will the label be flipped or not is decided by factor n. This factor has value smaller than 1/2. In this research, a polynomial-time of 1/ε was proved with an excess risk equal to the Massart Noise level plus ε.

Read the complete paper here.

Honorable mentions:

Outstanding New Directions Paper Award

Uniform Convergence may be Unable to Explain Generalization in Deep Learning

As you know, we at Rubik’s Code love deep learning, so this paper straight up blew us away. We are facing that neural networks are today used in different industries for variety of problems. However, that was not always the case. In fact, a lot of industries are still skeptical about deep learning. They prefer standard machine learning models because they are explainable. And they have a good reason for that. Questions like “Why over-parameterized neural networks generalize well?” are still opened. How neural networks can perform well on data that was never seen, after they are trained on large real training dataset?

Variety of generalization bounds for neural networks has been developed for that particular reason. A generalization bound is a statement about the predictive performance of a learning algorithm, in this case neural network. Basically, neural network is observed as a procedure that takes some finite training data as input and returns a predicted labels on new data. Since we assume that all data, both training and evaluation, has fixed distribution, the quality of mentioned predictions can be measured in terms of its risk. This means that predictions are compared with the distribution of data, and risk represents level of it’s incompatibility. To sum it up, a generalization bound is a probabilistic bound on the defect.

Majority of generalization bounds are based on uniform convergence, which can be defined like this.

A sequence of functions fn converges uniformly to a limiting function f on a set E if, given any arbitrarily small positive number ε, a number N can be found such that each of the functions fn, fn+1, fn+2, etc. differ from f by no more than ε at every point x in E.

Now, this paper challenges this assumption. It proposes set of experiments that proves how uniform convergence can’t explain generalization in deep learning. The experiments are done on MNIST dataset with three over-paramterized models and different hyperparameter settings for different training set sizes is tested. All models are trained on the Stohastic Gradient Descend. To be more precise, overparameterized neural network with only one hidden layer (with 100k neurons) is trained using Stohastic Gradient Descend on dataset with 100 dimensions. Now, if we increase training dataset size test error decreases and generalization improves. However, this paper proves that the decision boundary is not simple and that uniform convergence increases the bounds when training size is increased. This means that uniform convergence can’t completely explain the generalization and that we should develop algorithm-independent techniques that don’t restrict themselves.

Read the complete paper here.

Honorable mentions:

Test of Time

Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Final award went to the paper published on NeurIPS that stood test of time and left enduring impact on machine learning community. Basically, committee overviews a list of accepted papers to NeurIPS ten years ago, that have had the most citations. This year that was a paper by Lin Xiao, who’s research explores the fundamental concepts of modern machine learning. This paper proposed a Regularised Dual Averaging Method (RDA), an optimization technique used for solving online convex optimization problems. Online convex optimization has goal same as Stochastic Gradient Descent – to minimize loss, however it is performed differently. In essence, it is simulated as a game where the player at each timestamp t, predicts a weight vector and the loss.

This approach had many problems before this paper. Probably the biggest impact that this paper had is the optimization method – Batch Optimization. Meaning, that initially only fraction of samples is available. Then the player calculates weight vector in time step t. Once this is done, loss is calculated with sub-gradient based on current weights. This is repeated in the next time step t+1.

Read the complete paper here.


In this article, we explored the most interesting papers from NeurIPS Conference. In our humble opinion, they will shake the world of machine learning in the years to come.

Thanks for reading!

Read more posts from the author at Rubik’s Code.