Last month, DeepMind presented an interesting concept they call PonderNet. This neural network has already stirred some controversy, mostly because of its name. As many authors have suggested, the term “pondering” is a somewhat misleading word to describe the behavior of this network.
Some even said it is arrogant to use that name when PonderNet cannot do any real pondering. So, what is all the commotion about? What is PonderNet, and what can it really do? Let’s see.
In this article we explore:
1. What is DeepMind’s PonderNet all about?
2. PonderNet Architecture and Methods
3. PonderNet Results
1. What is DeepMind’s PonderNet all about?
In general, machine learning algorithms do not take the complexity of the problem or the computational budget and resources into account, at least not out of the box. They aim to solve the problem presented and use the inputs to adapt their inner state. Balancing the computational budget against the complexity of the task at hand is usually a manual job done by a machine learning engineer.
How many times have you run an experiment and the neural network converged long before training was done? On the other hand, how many times have you realized that you need to train your neural network for longer than you initially intended?
In standard neural networks the amount of computation used grows with the size of the inputs, but not with the complexity of the problem being learnt.
1.1 Ideas and Concepts
DeepMind’s PonderNet aims to solve this problem. In essence, PonderNet can stop its computation when it seems that further steps will not produce a better result, and it can extend its computation when doing so may bring further improvement. Basically, PonderNet learns to adapt the amount of computation to the complexity of the problem at hand.
Authors from DeepMind have worked on similar solutions to this problem in the past. There have been architectures that automate the process of minimizing the required computation time. These architectures focus on the halting process, meaning they try to evaluate whether further computation is beneficial and, if not, they stop it.
Basically, they used a discrete latent variable to dynamically adjust the number of computational steps. Some of these architectures are Adaptive Computation Time and Adaptive Early Exit Networks. There were some really interesting experiments with reinforcement learning and RNNs as well. All of these techniques paved the way for DeepMind’s PonderNet.
1.2 What is new then?
PonderNet is an interesting architecture built on top of these past ideas and utilizes them to a certain degree. However, unlike previous attempts to solve this problem, the novel approach that PonderNet proposes allows for low-variance and unbiased gradient estimates. DeepMind’s PonderNet uses one very interesting trick: it defines the halting policy as a probabilistic model.
Now you may see why this architecture brought controversy to the community. Having a halting procedure and policy is hardly the higher level of intelligence that the term pondering insinuates. The Merriam-Webster dictionary defines the word ponder as “to weigh in the mind”, “to reflect on” and “to think about”.
In the authors’ defense, this term has been used in the past for the process of balancing the computational budget; however, we can see why some people have a hard time accepting it.
Ok, back to science! Let’s see what changes PonderNet suggests.
2. PonderNet Architecture and Methods
There are two big proposals that PonderNet makes: a new architecture for neural networks which modifies the forward pass, and a new training loss function. The architecture predicts the probability of halting conditioned on not having halted before. It is interesting that the overall halting probability at each step is modeled as a generalization of the geometric distribution, as we will see in a bit. The loss function, on the other hand, doesn’t aim to minimize the number of computation steps but to encourage exploration.
2.1 PonderNet Architecture
The PonderNet halting policy is built into the step function. This way, the method can be applied to any neural network architecture, from a simple MLP to LSTMs to more complicated architectures like Transformers. A new step function of the form:
ŷn, λn, hn+1 = s(x, hn)
is proposed, along with the initial state h0. In the equation above, ŷn is the output of the network, i.e. the prediction it makes at step n. Note that the final output of PonderNet is the prediction made at the step n at which it halts. The λn is the probability of halting at step n, which is what drives the network to learn the optimal value of n.
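To make the interface concrete, here is a minimal PyTorch-style sketch of such a step function. The choice of a GRU cell, the layer sizes, and the sigmoid on the halting head are illustrative assumptions for this sketch, not taken from the paper’s implementation.

```python
import torch
import torch.nn as nn

class PonderStep(nn.Module):
    """One pondering step: s(x, hn) -> (y_hat_n, lambda_n, h_{n+1})."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.cell = nn.GRUCell(input_dim, hidden_dim)          # any recurrent core works here
        self.output_head = nn.Linear(hidden_dim, output_dim)   # produces the prediction y_hat_n
        self.halt_head = nn.Linear(hidden_dim, 1)               # produces the halting probability lambda_n

    def forward(self, x, h):
        h_next = self.cell(x, h)                                # update the hidden state
        y_hat = self.output_head(h_next)                        # prediction at this step
        lam = torch.sigmoid(self.halt_head(h_next)).squeeze(-1) # halting probability in (0, 1)
        return y_hat, lam, h_next
```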
Here things get interesting and may remind us of reinforcement learning. The authors define a Bernoulli random variable Λn with two states, “continue” (Λn = 0) and “halt” (Λn = 1). The decision process starts from the state “continue” (Λ0 = 0), and “halt” (Λn = 1) is a terminal state. Now, the conditional probability of entering the state “halt” at step n, given that there has been no previous halting, can be defined as:
λn = P(Λn = 1 | Λj = 0, for all j < n)
From this, the probability distribution pn over halting steps, a generalization of the geometric distribution, is constructed:
pn = λn · (1 − λ1) · (1 − λ2) · … · (1 − λn−1)
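As a quick illustration, here is how pn could be computed from the per-step halting probabilities λn. The function name and tensor layout are assumptions made for this sketch.

```python
import torch

def halting_distribution(lambdas):
    """Turn per-step halting probabilities lambda_n into the distribution
    p_n = lambda_n * prod_{j<n} (1 - lambda_j) over halting steps.

    lambdas: tensor of shape (N, batch) with values in (0, 1).
    """
    # probability of still running before step n: prod_{j<n} (1 - lambda_j)
    not_halted = torch.cumprod(1.0 - lambdas, dim=0)
    still_running = torch.cat([torch.ones_like(lambdas[:1]), not_halted[:-1]], dim=0)
    return lambdas * still_running
```

For a constant λn = 0.3 this yields 0.3, 0.21, 0.147, …, which is exactly the geometric distribution that pn generalizes; in practice the leftover probability mass is typically assigned to the last allowed step so that the distribution sums to one.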
The value of N, the maximum number of steps, is also important here. For training, the authors defined a minimum cumulative probability of halting and used this information to determine N. For evaluation, N can be set as a constant, or not set at all.
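A rough sketch of how the unrolling loop might use such a threshold is shown below; the threshold value, the maximum step count, and the step-function interface are assumptions carried over from the sketch above.

```python
import torch

def ponder_forward(step_fn, x, h0, min_cumulative_halt=0.95, max_steps=20):
    """Unroll the step function, collecting per-step predictions and halting probabilities.

    Stops once the cumulative probability of having halted exceeds
    `min_cumulative_halt`, or after `max_steps` steps.
    """
    h = h0
    y_hats, lambdas = [], []
    still_running = torch.ones(x.shape[0])   # prod_{j<n} (1 - lambda_j)
    halted_mass = torch.zeros(x.shape[0])    # sum of p_n so far

    for n in range(max_steps):
        y_hat, lam, h = step_fn(x, h)
        p_n = lam * still_running            # probability of halting exactly at step n
        y_hats.append(y_hat)
        lambdas.append(lam)
        halted_mass = halted_mass + p_n
        still_running = still_running * (1.0 - lam)
        if bool((halted_mass > min_cumulative_halt).all()):
            break

    return torch.stack(y_hats), torch.stack(lambdas)
```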
2.2 PonderNet Loss Function
The second improvement this DeepMind architecture proposes is in the loss function. This function is split into two parts:
L = LRec + β · LReg,  where  LRec = Σn pn · L(y, ŷn)  and  LReg = KL(pn ‖ pG(λp))
The L in the equation above is any loss function that is used for training purposes, like mean squared error or cross-entropy. The λp is a hyper-parameter that defines a geometric prior distribution pG(λp) on the halting policy.
The first part of the PonderNet loss is called reconstruction – LRec. This value is the expectation of the loss L across halting steps. The second part of the PonderNet loss is called regularization – LReg. In essence, it is the KL divergence between the distribution of halting probabilities pn and the prior geometric distribution truncated at N and parameterized by λp. This part of the equation is the one that drives the neural network to explore. In essence, this regularization loss puts pressure on the network to use computation efficiently.
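Putting the two terms together, a minimal sketch of such a loss could look like the following. The values of λp and the weighting factor β, and the use of cross-entropy as the task loss L, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ponder_loss(p, y_hats, y, lambda_p=0.2, beta=0.01):
    """PonderNet-style loss sketch.

    p:      (N, batch) distribution over halting steps, p_n
    y_hats: (N, batch, num_classes) per-step predictions
    y:      (batch,) integer class targets
    """
    N = p.shape[0]

    # Reconstruction term: expectation of the task loss over halting steps.
    step_losses = torch.stack(
        [F.cross_entropy(y_hats[n], y, reduction="none") for n in range(N)]
    )                                                  # (N, batch)
    l_rec = (p * step_losses).sum(dim=0).mean()

    # Regularization term: KL(p || geometric prior with parameter lambda_p, truncated at N).
    steps = torch.arange(N, dtype=torch.float32)
    prior = lambda_p * (1.0 - lambda_p) ** steps       # geometric prior
    prior = prior / prior.sum()                        # truncate and renormalize at N
    prior = prior.unsqueeze(1).expand_as(p)
    l_reg = (p * (torch.log(p + 1e-9) - torch.log(prior + 1e-9))).sum(dim=0).mean()

    return l_rec + beta * l_reg
```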
3. PonderNet Results
The authors provided results on three tasks: a parity task, the bAbI question-answering dataset, and a paired associative inference task. The parity task was introduced with one of the older architectures we already mentioned – Adaptive Computation Time (ACT). PonderNet showed better performance and higher accuracy than the original ACT model on three variations of the task.
In the image below you can see how PonderNet achieved better accuracy on the interpolation and extrapolation tasks. In the third graph, you can see the total number of computation steps, calculated as the number of actual forward passes performed by each network. Blue is PonderNet, green is ACT, and orange is an RNN without adaptive compute.
The bAbI question-answering dataset contains 20 different tasks and is hard for neural network architectures that do not employ adaptive computation. In comparison to other architectures, PonderNet achieved results matching the current SOTA, but it obtained them faster and with a lower average error.
Finally, PonderNet was tested on the paired associative inference task. This task is designed to measure whether a neural network is able to learn relationships among elements distributed across multiple facts or memories. PonderNet achieved only slightly lower accuracy than MEMO, an architecture specifically made for this task. It also showed better results than the Universal Transformer, even though they used the same underlying architecture.
If you want to learn more about PonderNet, check out the full paper here.
Conclusion
PonderNet introduced an interesting concept that has the potential to cross over to the industry and become a new standard. Its universal approach (it can be applied to any neural network architecture) opens that possibility wide. Apart from that, the bit of controversy it brought is a good thing for the technology. There is no bad publicity, right?
Thank you for reading.
Nikola M. Zivkovic
Nikola M. Zivkovic is the author of the books Ultimate Guide to Machine Learning and Deep Learning for Programmers. He loves knowledge sharing and is an experienced speaker. You can find him speaking at meetups and conferences, and as a guest lecturer at the University of Novi Sad.