So far, in our ML.NET journey, we focused on computer vision problems like Image Classification and Object Detection. In this article, we change direction a bit and explore NLP (Natural Language Processing) and the set of problems we can solve with machine learning.

Natural language processing (NLP) is a subfield of artificial intelligence whose main goal is to help programs understand and process natural language data. The output of this process is a computer program that can "understand" language.



Back in 2018, Google presented a paper with a deep neural network called Bidirectional Encoder Representations from Transformers, or BERT. Because of its simplicity, it became one of the most popular NLP algorithms. With this algorithm, anyone can train their own state-of-the-art question answering system (or a variety of other models) in just a few hours. In this article, we will do just that: use BERT to create a question answering system.


BERT is a neural network based on the Transformer architecture. That is why in this article we will first explore that architecture a bit and then continue to a more advanced understanding of BERT:

  1. Prerequisites
  2. Understanding Transformers Architecture
  3. BERT Intuition
  4. ONNX Model
  5. Implementation with ML.NET

1. Prerequisites

The implementations provided here are done in C#, and we use the latest .NET 5, so make sure that you have installed this SDK. If you are using Visual Studio, it comes with version 16.8.3. Also, make sure that you have installed the following packages:

You can do the same from the Package Manager Console:
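Assuming the standard ML.NET ONNX setup, the packages in question are most likely the ones below (these names are the standard NuGet identifiers; pin versions as needed for your environment):

```
Install-Package Microsoft.ML
Install-Package Microsoft.ML.OnnxRuntime
Install-Package Microsoft.ML.OnnxTransformer
```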

You can do a similar thing using Visual Studio's Manage NuGet Packages option:

ML.NET NuGet Package

If you need to catch up with the basics of machine learning with ML.NET check out this article.

2. Understanding Transformers Architecture

Language is sequential data. Basically, you can observe it as a stream of words, where the meaning of each word depends on the words that came before it and on the words that come after it. That is why computers have such a hard time understanding language: in order to understand one word, you need its context.

Also, sometimes you need to provide a sequence of data (words) as the output as well. A good example to demonstrate this is the translation of English into Serbian. As the input to the algorithm, we use a sequence of words, and for the output, we need to provide a sequence as well.

In this example, the algorithm needs to understand English and to understand how to map English words to Serbian words (inherently, this means that some understanding of Serbian must exist as well). Over the years, many deep learning architectures were used for this purpose, like Recurrent Neural Networks and LSTMs. However, it was the use of the Transformer architecture that changed everything.


RNN and LSTM networks didn't fully satisfy the need, since they are hard to train and prone to vanishing (and exploding) gradients. Transformers aimed to solve these problems and bring better performance and a better understanding of the language. They were introduced back in 2017 in the legendary paper "Attention Is All You Need".

In a nutshell, they use an Encoder-Decoder structure and self-attention layers to better understand language. If we go back to the translation example, the Encoder is in charge of understanding English, while the Decoder is in charge of understanding Serbian and mapping English to Serbian.


During the training process, the Encoder is supplied with word embeddings from the English language. Computers don't understand words; they understand numbers and matrices (sets of numbers). That is why we convert words into some vector space, meaning we assign a certain vector to each word in the language (map it to some latent vector space). These are word embeddings. There are many available word embeddings, like Word2Vec.

However, the position of a word in the sentence is also important for the context. That is why positional encoding is done; that is how the Encoder gets information about the word and its context. The self-attention layer of the Encoder determines the relations between the words and gives us information about how relevant each word of the sentence is. This is how the Encoder understands English. The data then goes through a deep neural network and then to the Mapping-Attention layer of the Decoder.
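As an illustration (not part of the original BERT code), the sinusoidal positional encoding proposed in "Attention Is All You Need" can be sketched in C# like this:

```csharp
using System;

public static class PositionalEncoding
{
    // Sinusoidal positional encoding from "Attention Is All You Need":
    // PE(pos, 2i)   = sin(pos / 10000^(2i/d))
    // PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    public static double[] Encode(int position, int dimension)
    {
        var encoding = new double[dimension];
        for (int i = 0; i < dimension; i++)
        {
            // i / 2 is integer division: consecutive sin/cos pairs share a frequency.
            double angle = position / Math.Pow(10000, (2.0 * (i / 2)) / dimension);
            encoding[i] = i % 2 == 0 ? Math.Sin(angle) : Math.Cos(angle);
        }
        return encoding;
    }
}
```

Each word embedding is summed with the encoding of its position, so identical words at different positions get different representations.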

However, before that, the Decoder learns the same kind of information about the Serbian language. It learns how to understand Serbian in the same way, using word embeddings, positional encoding and self-attention. The Mapping-Attention layer of the Decoder then has both kinds of information, about the English language and about the Serbian language, and it just learns how to map words from one language to another. To learn more about Transformers, check out this article.

3. BERT Intuition

BERT uses this Transformer architecture to understand language. To be more precise, it utilizes the Encoder. This architecture achieved two big milestones. First, it achieved bidirectionality: every sentence is learned in both directions, so both the previous and the future context are captured. BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (Wikipedia). It is also one of the first pre-trained models for NLP. We learned about transfer learning for computer vision; however, before BERT this concept didn't really take off in the world of NLP.

This makes a lot of sense, because you can train a model on a lot of data and, once it understands the language, fine-tune it for more specific tasks. That is why the training of BERT can be separated into two phases: pre-training and fine-tuning.


In order to achieve bidirectionality, BERT is pre-trained with two methods:

  • Masked Language Modeling – MLM
  • Next Sentence Prediction – NSP

Masked Language Modeling uses masked input: some words in the sentence are masked, and it is BERT's job to fill in the blanks. Next Sentence Prediction gives two sentences as the input and expects BERT to predict whether one sentence follows the other. In reality, both of these methods happen at the same time.
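As a simplified illustration of the masking step (the real BERT procedure is more involved: of the selected tokens, 80% become [MASK], 10% become a random token, and 10% stay unchanged), masking roughly 15% of the tokens can be sketched like this:

```csharp
using System;
using System.Linq;

public static class MaskedLanguageModeling
{
    // Illustrative only: replaces roughly 15% of the tokens with [MASK],
    // as in BERT pre-training. Always masks at least one token.
    public static string[] MaskTokens(string[] tokens, Random random)
    {
        int maskCount = Math.Max(1, (int)(tokens.Length * 0.15));
        var masked = (string[])tokens.Clone();
        var indices = Enumerable.Range(0, tokens.Length)
            .OrderBy(_ => random.Next())   // shuffle positions
            .Take(maskCount);              // pick the first maskCount of them
        foreach (int index in indices)
            masked[index] = "[MASK]";
        return masked;
    }
}
```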


During the fine-tuning phase, we train BERT for a specific task. This means that if we want to create a question answering solution, we need to train just the additional layers of BERT. This is exactly what we do in this tutorial. All we need to do is replace the output layers of the network with a fresh set of layers designed for our specific purpose. As the input, we have a passage of text (or context) and a question, and as the output, we expect the answer to the question.


For example, our system should use the two sentences "Jim is walking through the woods." (passage or context) and "What is his name?" (question) to provide the answer "Jim".

4. ONNX Models

Before we dive into the implementation of the question answering application with ML.NET, we need to cover one more theoretical thing: the Open Neural Network Exchange (ONNX) file format. This file format is an open-source format for AI models and it supports interoperability between frameworks.

Basically, you can train a model in one machine learning framework like PyTorch, save it, and convert it into ONNX format. Then you can consume that ONNX model in a different framework like ML.NET. That is exactly what we do in this tutorial. You can find more information on the ONNX website.

ONNX Model

In this tutorial, we use the pre-trained BERT model. This model is available here, as BERT-Squad. In essence, we import this model into ML.NET and run it within our application.

One very interesting and useful thing about ONNX models is that there are a bunch of tools we can use for a visual representation of the model. This is very useful when we use pre-trained models, as we do in this tutorial.

We often need to know the names of the input and output layers, and this kind of tool is good for that. So, once we download the BERT model, we can load it with one of the tools for visual representation. In this guide, we use Netron, and here is just part of the output:

ONNX Model

I know, it is insane; BERT is a big model. You might wonder how you can use this and why you need it. However, in order to work with ONNX models, we usually need to know the names of the input and output layers of the model. Here is how that looks for BERT:

ONNX Model

5. Implementation with ML.NET

If you take a look at the BERT-Squad repository from which we downloaded the model, you will notice something interesting in the dependency section: the model relies on a separate tokenization script. This means that we need to perform tokenization on our own. Word tokenization is the process of splitting a large sample of text into words. This is a requirement in natural language processing tasks where each word needs to be captured and subjected to further analysis. There are many ways to do this.

Effectively, we perform word encoding, and for that we use WordPiece tokenization as explained in this paper. Our tokenizer is a ported version of an existing implementation. To implement the complete solution, we structured it like this:


Here in the Assets folder, you can find the downloaded .onnx model and a folder with the vocabulary that the model uses. The MachineLearning folder contains all the necessary code that we use in this application. The Trainer and Predictor classes are there, just like the classes which model the data. In a separate folder, we can find the helper class for loading files and the extension classes for Softmax on the Enumerable type and for splitting strings.


This solution is inspired by the implementation of Gjeran Vlot, which can be found here.

5.1 Data Models

You may notice that in the DataModel folder we have two classes, for the input and predictions of BERT. The BertInput class is there to represent the input. Its fields are named and sized like the input layers of the model:
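A sketch of such a class, assuming the layer names and the 256-token sequence length that Netron reports for the ONNX Model Zoo BERT-Squad export (verify these against your own model):

```csharp
using Microsoft.ML.Data;

// Input data model: the column names and vector sizes must exactly match
// the input layers of the ONNX model.
public class BertInput
{
    [VectorType(1)]
    [ColumnName("unique_ids_raw_output___9:0")]
    public long[] UniqueIds { get; set; }

    [VectorType(1, 256)]
    [ColumnName("segment_ids:0")]
    public long[] SegmentIds { get; set; }

    [VectorType(1, 256)]
    [ColumnName("input_mask:0")]
    public long[] InputMask { get; set; }

    [VectorType(1, 256)]
    [ColumnName("input_ids:0")]
    public long[] InputIds { get; set; }
}
```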

The BertPredictions class uses the BERT output layers:
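A corresponding sketch, again assuming the BERT-Squad output layer names as shown by Netron ("unstack:0" and "unstack:1" hold the start and end logits of the answer span):

```csharp
using Microsoft.ML.Data;

// Output data model, mapped to the output layers of the BERT-Squad ONNX model.
public class BertPredictions
{
    [VectorType(1, 256)]
    [ColumnName("unstack:0")]
    public float[] StartLogits { get; set; }

    [VectorType(1, 256)]
    [ColumnName("unstack:1")]
    public float[] EndLogits { get; set; }

    [VectorType(1)]
    [ColumnName("unique_ids:0")]
    public long[] UniqueIds { get; set; }
}
```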

5.2 Trainer

The Trainer class is quite simple; it has only one method, BuildAndTrain, which uses the path to the pre-trained model.

In the mentioned method, we build the pipeline. Here we apply the ONNX model and connect the data models to the layers of the BERT ONNX model. Note that we have a flag that we can use to run this model on the CPU or on the GPU. Finally, we fit this model to empty data. We do this so we can load the data schema, i.e. to load the model.
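Under those assumptions, the method can be sketched as follows (the layer names are the ones Netron reports for BERT-Squad, and no actual training happens; fitting to an empty data view only loads the schema):

```csharp
using System.Collections.Generic;
using Microsoft.ML;

public class Trainer
{
    private readonly MLContext _mlContext = new MLContext(seed: 11);

    // Builds a pipeline that applies the ONNX model and "fits" it to an
    // empty data view -- we only need the schema, there is no training.
    public ITransformer BuildAndTrain(string bertModelPath, bool useGpu)
    {
        var pipeline = _mlContext.Transforms.ApplyOnnxModel(
            modelFile: bertModelPath,
            outputColumnNames: new[] { "unstack:0", "unstack:1", "unique_ids:0" },
            inputColumnNames: new[]
            {
                "unique_ids_raw_output___9:0",
                "segment_ids:0",
                "input_mask:0",
                "input_ids:0"
            },
            gpuDeviceId: useGpu ? 0 : (int?)null);

        return pipeline.Fit(
            _mlContext.Data.LoadFromEnumerable(new List<BertInput>()));
    }
}
```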

5.3 Predictor

The Predictor class is even simpler. It receives a trained and loaded model and creates a prediction engine. Then it uses this prediction engine to create predictions on new samples.
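A sketch of such a class, assuming the BertInput and BertPredictions data models described above:

```csharp
using Microsoft.ML;

public class Predictor
{
    private readonly PredictionEngine<BertInput, BertPredictions> _predictionEngine;

    public Predictor(ITransformer model)
    {
        // A PredictionEngine makes single predictions on one input at a time.
        var mlContext = new MLContext();
        _predictionEngine = mlContext.Model
            .CreatePredictionEngine<BertInput, BertPredictions>(model);
    }

    public BertPredictions Predict(BertInput input)
        => _predictionEngine.Predict(input);
}
```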

5.4 Helpers and Extensions

There is one helper class and two extension classes. The helper class FileReader has a method for reading a text file. We use it later to load the vocabulary from a file. It is very simple:
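A minimal sketch of that helper, assuming the vocabulary file stores one token per line:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileReader
{
    // Reads a text file and returns its trimmed, non-empty lines.
    public static List<string> ReadFile(string filePath)
    {
        return File.ReadAllLines(filePath)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToList();
    }
}
```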

There are two extension classes: one for performing the Softmax operation on a collection of elements, and another for splitting a string and yielding one result at a time.
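As a hedged sketch of what those extensions can look like (names are illustrative, not the article's exact code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Extensions
{
    // Numerically stable softmax: subtract the maximum score before
    // exponentiating, then normalize so the probabilities sum to 1.
    public static IEnumerable<(T Item, float Probability)> Softmax<T>(
        this IEnumerable<T> collection, Func<T, float> scoreSelector)
    {
        var items = collection.ToList();
        float max = items.Max(scoreSelector);
        var exponentials = items
            .Select(item => (Item: item,
                             Value: (float)Math.Exp(scoreSelector(item) - max)))
            .ToList();
        float sum = exponentials.Sum(pair => pair.Value);
        return exponentials.Select(pair => (pair.Item, pair.Value / sum));
    }

    // Splits a string on the given delimiters, yielding pieces one at a
    // time and keeping the delimiters as separate results.
    public static IEnumerable<string> SplitAndKeep(
        this string text, params char[] delimiters)
    {
        int start = 0;
        int index;
        while ((index = text.IndexOfAny(delimiters, start)) >= 0)
        {
            if (index > start)
                yield return text.Substring(start, index - start);
            yield return text.Substring(index, 1);
            start = index + 1;
        }
        if (start < text.Length)
            yield return text.Substring(start);
    }
}
```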

5.5 Tokenizer

Ok, so far we explored the simple parts of the solution. Let's proceed with the more complicated and important ones; let's check out how we implemented tokenization. First, we define the list of default BERT tokens. For example, two sentences should always be separated with the token [SEP] to differentiate them. The [CLS] token always appears at the start of the text and is specific to classification tasks.

The process of tokenization is done within the Tokenizer class. There are two public methods: Tokenize and Untokenize. The first one splits the received text into sentences. Then, for each sentence, each word is transformed into an embedding. Note that one word can be represented with multiple tokens.

For example, the word "embeddings" is represented as the array of tokens ['em', '##bed', '##ding', '##s']. The word has been split into smaller subwords and characters. The two hash signs preceding some of these subwords are just our tokenizer's way to denote that this subword or character is part of a larger word and preceded by another subword.

So, for example, the '##bed' token is separate from the 'bed' token. Another thing the Tokenize method does is return the vocabulary index and the segmentation index. Both are used as BERT inputs. To learn more about why this is done this way, check out this article.

The other public method is Untokenize. This method is used to reverse the process. Basically, as the output of BERT we get various embeddings. The goal of this method is to convert this information into meaningful sentences.

This class has multiple methods that enable this process.
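As an illustration of the core idea (not the article's exact code), the greedy longest-match-first step of WordPiece for a single word can be sketched like this, assuming the vocabulary has been loaded into a set:

```csharp
using System.Collections.Generic;

public static class WordPiece
{
    // Greedy longest-match-first WordPiece tokenization of a single word.
    // Subword continuations are stored in the vocabulary with a "##" prefix.
    public static List<string> TokenizeWord(string word, ISet<string> vocabulary)
    {
        var tokens = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            // Try the longest possible substring first, then shrink it.
            int end = word.Length;
            string match = null;
            while (end > start)
            {
                string candidate = word.Substring(start, end - start);
                if (start > 0) candidate = "##" + candidate;
                if (vocabulary.Contains(candidate)) { match = candidate; break; }
                end--;
            }
            if (match == null) return new List<string> { "[UNK]" };
            tokens.Add(match);
            start = end;
        }
        return tokens;
    }
}
```

With a vocabulary containing "em", "##bed", "##ding" and "##s", the word "embeddings" produces exactly the ['em', '##bed', '##ding', '##s'] split described above.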

5.6 BERT

The Bert class puts all these things together. In the constructor, we read the vocabulary file and instantiate the Trainer, Tokenizer and Predictor objects. There is only one public method: Predict. This method receives the context and the question. As the output, the answer with its probability is retrieved:

The Predict method performs several steps. Let's explore it in more detail.

First, this method performs tokenization of the question and the passed context (the passage based on which BERT should give the answer). Then we build a BertInput object from this information: all the tokenized information is padded to the input size so it can be used as a BERT input.

Then we get the predictions of the model from the Predictor. This information is then additionally processed, and the best predictions are found within the context. That is, BERT picks words from the context that are most likely to be the answer, and we pick the best ones. Finally, these words are untokenized.
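As a hedged sketch of that span-selection step (names and the maximum-length cap are illustrative, not the article's exact code), the best answer span can be found by summing the start and end logits of every candidate span:

```csharp
using System;

public static class AnswerSelection
{
    // Picks the (start, end) token span whose summed start/end logits are
    // highest. A real implementation also restricts the span to the context
    // portion of the input; the max answer length cap is shown here.
    public static (int StartIndex, int EndIndex, float Score) BestSpan(
        float[] startLogits, float[] endLogits, int maxAnswerLength = 30)
    {
        var best = (StartIndex: 0, EndIndex: 0, Score: float.MinValue);
        for (int start = 0; start < startLogits.Length; start++)
        {
            int limit = Math.Min(endLogits.Length, start + maxAnswerLength);
            for (int end = start; end < limit; end++)
            {
                float score = startLogits[start] + endLogits[end];
                if (score > best.Score)
                    best = (start, end, score);
            }
        }
        return best;
    }
}
```

The selected token span is then untokenized back into words, and a softmax over the scores gives the reported probability.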

5.7 Program

The Program class utilizes what we implemented in the Bert class. First, let's define the launch settings:

We define two command line arguments: "Jim is walking through the woods." and "What is his name?". As we already mentioned, the first one is the context and the second one is the question. The Main method is minimal:
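A sketch of what it can look like, assuming a Bert constructor that takes the vocabulary and model paths (the file paths below are placeholders) and a Predict method that returns the answer tokens with a probability:

```csharp
using System;

public class Program
{
    public static void Main(string[] args)
    {
        // args[0] is the context (passage), args[1] is the question,
        // both supplied through the launch settings' command line arguments.
        var bert = new Bert(
            "Assets/Vocabulary/vocab.txt",   // placeholder path
            "Assets/Model/bertsquad-10.onnx" // placeholder path
        );

        var (tokens, probability) = bert.Predict(args[0], args[1]);

        Console.WriteLine(string.Join(" ", tokens));
        Console.WriteLine($"Probability: {probability}");
    }
}
```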

Technically, we create a Bert object with the path to the vocabulary file and the path to the model. Then we call the Predict method with the command line arguments. As the output, we get this:

We can see that BERT is 91% sure that the answer to the question is ‘Jim’ and it is correct.


In this article, we learned how BERT works. To be more specific, we had a chance to explore how the Transformer architecture works and how BERT utilizes that architecture to understand language. Finally, we learned about the ONNX model format and how we can use it with ML.NET.

Thanks for reading!

Nikola M. Zivkovic


CAIO at Rubik's Code

Nikola M. Zivkovic is CAIO at Rubik's Code and the author of the book "Deep Learning for Programmers". He loves knowledge sharing and is an experienced speaker. You can find him speaking at meetups and conferences, and as a guest lecturer at the University of Novi Sad.

Rubik's Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.
