Last month, DeepMind presented an interesting concept called **PonderNet**. This neural network has already stirred up some controversy, mostly because of its name. As many authors have suggested, “*pondering*” is a somewhat misleading term for the behavior of this network.

Some said it is even arrogant to call it that when PonderNet cannot do any **real** pondering. So, what is all the commotion about? What is **PonderNet**, and what can it really do? Let’s see.


In this article we explore:

**1. What is DeepMind’s PonderNet all about?**

**2. PonderNet Architecture and Methods**

**3. PonderNet Results**

## 1. What is DeepMind’s PonderNet all about?

In general, machine learning algorithms do not take the **complexity** of the problem or the available computational budget into account, at least not out of the box. They aim to solve the problem presented and use the inputs to adapt their inner state. Balancing the complexity of the task at hand against the computational budget is usually done manually by a machine learning engineer.

How many times have you run an experiment in which a neural network converged long before training was done? On the other hand, how many times have you realized that you needed to train your neural network for longer than you initially intended?

In standard neural networks the amount of computation used grows with the size of the inputs, but not with the complexity of the problem being learnt.

### 1.1 Ideas and Concepts

DeepMind’s **PonderNet** aims to solve this problem. In essence, PonderNet can stop computing once it estimates that further computation steps will **not produce** a better prediction. In the same manner, PonderNet **extends** its calculations if doing so may produce further improvement. Note that this is about the number of computation steps the network spends on an input, not about the number of training epochs. Basically, PonderNet learns to **adapt** the amount of computation based on the **complexity** of the problem at hand.

Authors from *DeepMind* have worked on similar solutions to the same problem. In the past, there were architectures that automate the process of **minimizing** the required computation time. These architectures focus on the halting process, meaning they try to evaluate whether further calculations are **beneficial**, and if not, they stop computing.

Basically, they used a *discrete* latent variable to dynamically adjust the number of computational steps. Some of these architectures are Adaptive Computation Time and Adaptive Early Exit Networks. There were some really interesting experiments with reinforcement learning and **RNNs** as well. All these techniques paved the way for *DeepMind’s PonderNet*.

### 1.2 What is new then?

*PonderNet* is an interesting architecture that builds on these past ideas and utilizes them to a certain degree. However, the novel approach that *PonderNet* proposes allows for **unbiased**, **low-variance** gradient estimates, unlike the previous attempts to solve this problem. *DeepMind’s PonderNet* uses one very interesting trick: it defines the halting policy as a **probabilistic model**.

Now you may see why this architecture brought **controversy** to the community. Having a halting procedure and policy is hardly the higher level of intelligence that the term *pondering* insinuates. The Merriam-Webster dictionary defines the word ponder as “*to weigh in the mind*”, “*to reflect on*” and “*to think about*”.

In the authors’ defense, this term was used in the past for the **process** of balancing the computational budget. However, we can see why some people are having a hard time accepting it.

Ok, back to **science**! Let’s see what changes *PonderNet* suggests.

## 2. PonderNet Architecture and Methods

There are two big changes that PonderNet proposes: a new **architecture** for neural networks, which modifies the forward pass, and a new training **loss** function. The architecture predicts the **probability** of halting at each step, conditional on not having halted before. Interestingly, the overall probability of halting at each step is then modeled as a generalization of the geometric distribution, as we will see in a bit. The loss function, on the other hand, doesn’t aim only to minimize the number of computing steps but also to encourage **exploration**.

### 2.1 PonderNet Architecture

PonderNet’s halting policy is built into the step function of the network. This way the method can be applied to any neural network architecture, from simple **MLPs** to **LSTMs** to more complicated architectures like **Transformers**. A new step function of the form

$$\hat{y}_n, h_{n+1}, \lambda_n = s(x, h_n)$$

is proposed, along with the initial state *h0*. In the equation above, *x* is the input, *hn* is the state at step *n*, and *ŷn* is the output of the network, i.e. the **prediction** it makes at step *n*. Note that the final output of *PonderNet* is the prediction made at the step *n* at which it halts. The *λn* is the probability of halting at step *n*, given that the network has not halted earlier, which drives the network to learn the optimal value of *n*.
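To make the forward pass more concrete, here is a minimal sketch of a pondering loop in plain Python. The names `toy_step` and `ponder`, the toy state update and the way *λn* is computed are all made up for illustration; they are not taken from the paper. The only thing the sketch tries to mirror is the interface described above: a step function that returns a prediction, the next state and a halting probability, plus a Bernoulli draw that decides whether to stop.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toy_step(x, h):
    """Hypothetical step function s(x, h_n) -> (y_hat_n, h_next, lambda_n)."""
    h_next = 0.5 * h + 0.5 * x      # toy state update: the state drifts toward x
    y_hat = h_next                  # toy prediction, read directly from the state
    # Toy halting probability: grows as the state gets closer to the input.
    lam = sigmoid(3.0 - 10.0 * abs(x - h_next))
    return y_hat, h_next, lam


def ponder(x, step_fn, h0=0.0, max_steps=20, rng=None):
    """Run step_fn until a Bernoulli(lambda_n) draw says 'halt' or max_steps is hit.

    Returns the prediction made at the halting step and the step index n.
    """
    rng = rng or random.Random(0)
    h = h0
    for n in range(1, max_steps + 1):
        y_hat, h, lam = step_fn(x, h)
        # Forced halt at the last step, otherwise sample the halting decision.
        if n == max_steps or rng.random() < lam:
            return y_hat, n


prediction, steps_used = ponder(1.0, toy_step)
print(f"halted after {steps_used} steps with prediction {prediction:.4f}")
```

In a real model, `toy_step` would be a trainable network (an MLP, an LSTM cell, a Transformer block), and the halting probability would be one of its outputs.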

Here things get interesting and may remind us of reinforcement learning. The authors define a Bernoulli random variable *Λn*, which has two states: “continue” *(Λn = 0)* and “halt” *(Λn = 1)*. The decision process starts from the state “continue” *(Λ0 = 0)*, and “halt” *(Λn = 1)* is a **terminal** state. Now, the conditional probability of entering the state “halt” at step *n*, given that there has been no previous halting, can be defined as:

$$P(\Lambda_n = 1 \mid \Lambda_{n-1} = 0) = \lambda_n \quad \forall\, 1 \le n \le N$$

From this, the unconditional probability *pn* of halting at step *n* is derived as a generalization of the **geometric distribution**:

$$p_n = \lambda_n \prod_{j=1}^{n-1} (1 - \lambda_j)$$
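A small worked example may help here. The sketch below assumes the halting distribution has the form *pn = λn · (1 − λ1) · … · (1 − λn−1)*, i.e. the network halts at step *n* only if it chose “continue” at every earlier step. The function name `halting_distribution` is just an illustrative choice.

```python
def halting_distribution(lambdas):
    """Turn conditional halting probabilities lambda_n into unconditional p_n.

    p_n = lambda_n * prod_{j < n} (1 - lambda_j)
    """
    probs = []
    not_halted = 1.0  # probability of still being in the "continue" state
    for lam in lambdas:
        probs.append(not_halted * lam)
        not_halted *= 1.0 - lam
    return probs


# With a constant lambda the result is the ordinary geometric distribution.
print(halting_distribution([0.5, 0.5, 0.5]))  # [0.5, 0.25, 0.125]
```

When every *λn* is the same, this reduces to the plain geometric distribution, which is why the general case is described as a generalization of it.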

The value of *N*, the maximum number of steps, is also important here. For training, the authors define a minimum cumulative probability of halting and then use it to determine *N*. For evaluation, *N* can be set to a constant, or not set at all.
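One way to read the training rule above is: keep unrolling steps until the accumulated halting probability reaches a chosen threshold, and call that step *N*. The sketch below follows this reading. The function name `steps_for_mass` and the threshold value of 0.95 are assumptions made for the example, not values quoted from the paper.

```python
def steps_for_mass(lambdas, min_cumulative_prob=0.95):
    """Return the smallest N whose cumulative halting probability reaches the threshold.

    Returns None if the given lambdas never accumulate enough probability mass.
    """
    cumulative = 0.0
    not_halted = 1.0
    for n, lam in enumerate(lambdas, start=1):
        cumulative += not_halted * lam   # add p_n
        not_halted *= 1.0 - lam
        if cumulative >= min_cumulative_prob:
            return n
    return None


# With lambda = 0.5 at every step, the cumulative mass is 1 - 0.5**n.
print(steps_for_mass([0.5] * 10))  # 5
```

With a constant *λ* of 0.5, four steps cover 0.9375 of the probability mass and five steps cover 0.96875, so the threshold of 0.95 is first reached at step 5.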

### 2.2 PonderNet Loss Function

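The second improvement this DeepMind architecture proposes is in the area of the loss function. This function is split into two parts:

$$\underbrace{\sum_{n=1}^{N} p_n \, L(y, \hat{y}_n)}_{L_{Rec}} + \underbrace{\beta \, KL\left(p_n \,\|\, p_G(\lambda_p)\right)}_{L_{Reg}}$$

The *L* in the equation above is any loss function that is used for training purposes, like mean squared error or **cross-entropy**. The *λp* is a **hyper-parameter** that defines a geometric prior distribution *pG(λp)* on the halting policy, and *β* is a weight that controls how strongly the second term counts.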

The first part of the *PonderNet* loss is called **reconstruction** (**LRec**). This value represents the expectation of the loss *L* across halting steps. The second part of the *PonderNet* loss is called **regularization** (**LReg**). It is the **KL divergence** between the distribution of halting probabilities *pn* and the prior geometric distribution truncated at *N* and parameterized with *λp*. This part of the equation is the one that drives the neural network to **explore**, because the prior gives non-zero probability to every possible number of steps. At the same time, this regularization loss puts **pressure** on the network to use computation efficiently.
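The two terms can be sketched in a few lines of plain Python. This is only an illustration under several assumptions: the halting probabilities `p` are taken to sum to 1 over the *N* unrolled steps, the truncated geometric prior is renormalized so that it also sums to 1, and the regularization term is scaled by a weight `beta` whose default value here is arbitrary. All function names are hypothetical.

```python
import math


def truncated_geometric(lambda_p, n_steps):
    """Geometric prior over steps 1..N, renormalized after truncation (an assumption)."""
    raw = [lambda_p * (1.0 - lambda_p) ** n for n in range(n_steps)]
    total = sum(raw)
    return [r / total for r in raw]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ponder_loss(step_losses, p, lambda_p, beta=0.01):
    """Return (total, reconstruction, regularization).

    step_losses[n] is the task loss of the prediction made at step n + 1,
    p[n] is the probability of halting at that step.
    """
    rec = sum(pn * ln for pn, ln in zip(p, step_losses))
    reg = kl_divergence(p, truncated_geometric(lambda_p, len(p)))
    return rec + beta * reg, rec, reg


# A policy that always halts at the first step is penalized by the KL term.
total, rec, reg = ponder_loss([0.8, 0.3, 0.1], [1.0, 0.0, 0.0], lambda_p=0.5)
print(f"reconstruction={rec:.3f} regularization={reg:.3f} total={total:.3f}")
```

In the usage example, the policy that always stops at step one never sees the lower losses of the later steps, and its KL term is greater than zero because the prior also expects some mass on steps two and three. That is the exploration pressure described above.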

## 3. PonderNet Results

The authors provided results on three tasks: the *parity task*, the *bAbI* question answering dataset, and the *paired associative inference* task. The parity task was introduced together with one of the older architectures we already mentioned, *Adaptive Computation Time (ACT)*. *PonderNet* showed better performance and higher **accuracy** than the original ACT model on three variations of the task.

The results figure in the paper shows how PonderNet achieved better accuracy on the interpolation and extrapolation tasks. Its third graph shows the total number of computing steps, calculated as the number of actual forward passes performed by each network. Blue is PonderNet, green is ACT and orange is an RNN without adaptive compute.

The *bAbI* question answering dataset contains 20 different tasks, and it is hard for neural network architectures that do not employ adaptive computation. In comparison to other architectures, PonderNet matched the results of the current SOTA; however, it obtained them **faster** and with a **lower average error**.

Finally, *PonderNet* was tested on the *paired associative inference task*. The task is designed to **measure** whether a neural network is able to learn relationships among elements distributed across multiple facts or **memories**. *PonderNet* achieved just slightly lower accuracy than MEMO, an architecture made specifically for this task. It also showed better results than the *Universal Transformer*, even though they used the same architecture.

If you want to learn more about PonderNet, check out the full paper, *PonderNet: Learning to Ponder*.

## Conclusion

PonderNet presents an interesting concept that has the potential to cross over to industry and become a new standard. Its universal approach (it can be applied to any neural network architecture) leaves that possibility wide open. Apart from that, the bit of controversy it brought is a good thing for the technology. There is no such thing as bad publicity, right?

Thank you for reading.


#### Nikola M. Zivkovic

Nikola M. Zivkovic is the author of books: **Ultimate Guide to Machine Learning** and **Deep Learning for Programmers**. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups, conferences, and as a guest lecturer at the University of Novi Sad.
