A couple of weeks ago, I wrote about Perceiver, a transformer-based neural network from DeepMind that can process different types of data. While this is a very specific architecture, it shows where the AI field is heading. In general, transformers and transformer-based models have gained a lot of popularity over the past couple of years. It seems that researchers with enough GPU power can do pretty much anything with them.
The best example is GPT-3, a 175-billion-parameter neural network from OpenAI. Microsoft utilizes Transformers as well, most notably for Bing Search. This month, the Microsoft Research team proposed a novel approach – Make Every feature Binary (MEB). Built on top of their existing Transformer-based models, it is the largest universal model served at Microsoft, with 135 billion parameters.
In this article we cover:
1. What is “Make Every feature Binary” trying to improve?
2. MEB Architecture
3. MEB Data and Training
4. MEB Results
1. What is “Make Every feature Binary” trying to improve?
In general, the idea behind this complementary approach is to give the Transformer-based model a more nuanced understanding of the data. What do I mean by this? Well, the problem with many NLP models is that they tend to overgeneralize. For example, the majority of NLP models will fill the sentence “(blank) can fly.” with the word “birds”. However, not all birds can fly.
It seems that these models are missing something, right? That is why MEB assigns each fact to a feature, which gives the Transformer-based model the power to assign a weight to each feature and come up with smarter answers like “birds can fly, except penguins, etc.”

1.1 Increased Model Capacity
This approach brings another benefit: it uses vast amounts of data more efficiently. Models for ranking web results usually stop improving after a few hundred million rows of training data, due to limited feature representation and model capacity.
Now, Microsoft has a lot of Bing search results, meaning they have a lot of data. They want their model to keep learning even after it has seen hundreds of millions of rows. MEB combined with a Transformer-based model is able to do just that: even though it is trained on three years of Bing Search data, it continues to learn as more data is added.
1.2 Uncovering Hidden Intent
Another cool feature of this model is that it can learn relationships beyond semantic ones. It seems that this ability is a consequence of the increased model capacity. In essence, it can learn hidden intents between a query and a document. Microsoft provided this table:

The example from the table above shows that this model learned that the term “Hotmail” is strongly correlated with the term “Microsoft Outlook,” even though the two are not close in terms of semantic meaning.
2. MEB Architecture
The architecture of MEB looks like this:

This model is composed of five layers: a binary feature input layer, a feature embedding layer, a pooling layer, and two dense layers. The input layer holds 9 billion binary features, generated from 49 feature groups, and each binary feature is encoded into a 15-dimensional embedding vector. After pooling and the two dense layers, the model outputs a click probability estimate.
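To make the five-layer structure concrete, here is a minimal Keras sketch of this kind of model. Keep in mind that the real MEB has roughly 9 billion binary features and runs on Microsoft’s own infrastructure, so the feature counts, layer widths, and input format below are purely illustrative assumptions.

```python
import tensorflow as tf

# Illustrative sizes only -- the real MEB uses roughly 9 billion binary features
# from 49 feature groups, which would never fit on a single machine.
NUM_BINARY_FEATURES = 1_000_000   # assumed vocabulary of binary features
EMBEDDING_DIM = 15                # each binary feature maps to a 15-dim embedding
MAX_ACTIVE_FEATURES = 200         # assumed number of "on" features per example

# 1) Binary feature input layer: indices of the features that are "on" for this example.
feature_ids = tf.keras.Input(shape=(MAX_ACTIVE_FEATURES,), dtype=tf.int32)

# 2) Feature embedding layer: one 15-dimensional vector per binary feature.
embedded = tf.keras.layers.Embedding(NUM_BINARY_FEATURES, EMBEDDING_DIM)(feature_ids)

# 3) Pooling layer: sum the embeddings of all active features into one vector.
#    (A real implementation would mask padding indices before summing.)
pooled = tf.keras.layers.Lambda(lambda e: tf.reduce_sum(e, axis=1))(embedded)

# 4) and 5) Two dense layers, ending in a click probability estimate.
hidden = tf.keras.layers.Dense(64, activation="relu")(pooled)
click_probability = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

model = tf.keras.Model(inputs=feature_ids, outputs=click_probability)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```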
3. MEB Data and Training
Feature engineering and training are the keys to the success of this platform. MEB is trained on three years of Bing Search data. The data itself is composed of key-value pairs, where the key is a query and the value is a document, labeled with whether the user was satisfied with the search result or not. For each of these pairs, binary features are extracted from the query text, the document URL, title, and body text.
The features themselves are defined by so-called N-gram-level relationships between queries and documents. N-grams are simply sequences of N terms. All features are binary, and there are three main types (a small sketch of how such features might be extracted follows the list):
- Query and Document N-gram pair features
- One-hot encoding of bucketed numeric features
- One-hot encoding of categorical features

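As a rough illustration of these three feature types, the sketch below builds binary features for a single query-document pair. The exact n-gram sizes, bucket boundaries, and feature naming Microsoft uses are not public, so everything here is an assumption.

```python
def ngrams(text, n=2):
    """Split text into lowercase n-grams of n consecutive terms."""
    terms = text.lower().split()
    return {" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)}

def extract_binary_features(query, doc_title, doc_length, doc_language):
    """Return the set of binary feature names that are 'on' for one query-document pair."""
    features = set()

    # 1) Query and document N-gram pair features (bigrams assumed here).
    for q_gram in ngrams(query):
        for d_gram in ngrams(doc_title):
            features.add(f"pair:{q_gram}|{d_gram}")

    # 2) One-hot encoding of a bucketed numeric feature (bucket boundaries are assumed).
    length_bucket = min(doc_length // 1000, 9)
    features.add(f"doc_length_bucket:{length_bucket}")

    # 3) One-hot encoding of a categorical feature.
    features.add(f"doc_language:{doc_language}")

    return features

# One query-document pair switches on a handful of binary features; everything else stays 0.
print(extract_binary_features("microsoft outlook sign in", "Hotmail login page", 4200, "en"))
```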
Woodblock, Microsoft’s large-scale training platform, is used to train this model. The platform is built on top of TensorFlow. Continuous training is used as well: the MEB model is trained daily with new data coming from Bing, and the refreshed model is automatically deployed. The whole process is illustrated in the image above.
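The continuous-training loop itself is not published, but conceptually it amounts to something like the following sketch. The function names and the deployment step are hypothetical placeholders, not real Woodblock APIs.

```python
import datetime

def daily_refresh(model, load_click_logs, deploy):
    """Hypothetical daily refresh loop: continue training on yesterday's Bing click
    data and push the updated model to serving. `load_click_logs` and `deploy` are
    placeholder callables standing in for the real data and serving machinery."""
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    feature_ids, clicked = load_click_logs(yesterday)   # binary feature ids + click labels
    model.fit(feature_ids, clicked, epochs=1)           # continue from current weights
    deploy(model)                                       # automatic deployment step
    return model
```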
4. MEB Results
This model provided some interesting results for Bing Search. Namely:
- An almost 2 percent increase in clickthrough rate (CTR) on the top search results. Those results were found “above the fold” without the need to scroll down.
- A reduction in manual query reformulation by more than 1 percent. Users needing to manually reformulate queries means they didn’t like the results they found with their original query.
- A reduction of clicks on pagination by over 1.5 percent. Users needing to click on the “next page” button means they didn’t find what they were looking for on the first page.
Conclusion
In this article, we explored how Microsoft utilizes the “Make Every feature Binary” approach together with a Transformer-based model to increase the performance of Bing Search.
Thanks for reading!

Nikola M. Zivkovic
Nikola M. Zivkovic is the author of the books Ultimate Guide to Machine Learning and Deep Learning for Programmers. He loves sharing knowledge, and he is an experienced speaker. You can find him speaking at meetups and conferences, and as a guest lecturer at the University of Novi Sad.