In the previous article, I wrote about how you can use Huggingface transformers with ML.NET. The process is fairly simple: you convert the Huggingface model to the ONNX file format and load it with ML.NET.

However, while working with BERT models from Huggingface in combination with ML.NET, I stumbled upon several challenges. The problem was not loading and using the model, but preparing the data and building all the other helpers for that process.

The biggest challenge by far was that I needed to implement my own tokenizer and pair it with the correct vocabulary. So, I decided to extend the implementation to cover other models and publish it as an open-source project. If you find it useful, feel free to contribute. In this article, I will focus on the supported models and use cases.

Ultimate Guide to Machine Learning with Python

This bundle of e-books is specially crafted for beginners. Everything from Python basics to the deployment of Machine Learning algorithms to production in one place. Become a Machine Learning Superhero TODAY!

In this article we cover:

1. Motivation for this project

2. BERT Tokenizers NuGet Package

3. Use Case Example

1. Motivation for this project

If we were using a Huggingface model in Python, we could load both the tokenizer and the model from Huggingface like this:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')

And then use it like this:

import torch

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Calculate embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

As I mentioned, I wanted to use BERT models from Huggingface within ML.NET. However, ML.NET offers no such convenience. Thanks to the available tools, it was easy to export Huggingface models into ONNX files and import them into ML.NET from there. The real problems start with tokenization, since no BERT tokenizer is available in ML.NET.

BERT models require specifically structured data. To make things worse, different models use different vocabularies for tokenization. For example, BERT for the German language will not understand the same tokens as the BERT Multilingual model. This means that a tokenizer implementation that works for one model is not good enough for another.
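To see why the vocabulary matters so much, here is a minimal Python sketch of greedy, longest-match-first WordPiece-style tokenization (the toy vocabularies below are invented for illustration; real BERT vocabularies contain around 30,000 entries):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Find the longest vocabulary entry that matches from `start`.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

# Two toy vocabularies split the same word differently.
vocab_a = {"token", "##izer"}
vocab_b = {"tok", "##en", "##izer"}

print(wordpiece_tokenize("tokenizer", vocab_a))  # ['token', '##izer']
print(wordpiece_tokenize("tokenizer", vocab_b))  # ['tok', '##en', '##izer']
```

The same word produces different token sequences, and therefore different token IDs, depending on the vocabulary, which is exactly why a tokenizer must be paired with the vocabulary of the specific model.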

ONNX Model

On top of that, some Huggingface BERT models use cased vocabularies, while others use uncased vocabularies. There is a lot of room for mistakes and too little flexibility for experiments. For example, let's analyze the BERT Base Model from Huggingface.

Its "official" name is bert-base-cased. The name indicates that it uses a cased vocabulary, i.e. the model distinguishes between lowercase and uppercase letters. Its inputs and outputs are:

Bert Base Huggingface Model

Let's focus on the inputs. We need to provide token IDs (input_ids), along with an attention mask (attention_mask) and segment indexes (token_type_ids).
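To make the shape of these three inputs concrete, here is a hedged Python sketch that builds them for a single, already tokenized sentence, padded to a fixed sequence length of 32 (the token IDs in the example are invented for illustration):

```python
def build_bert_inputs(token_ids, seq_len=32, pad_id=0):
    """Pad token IDs to seq_len and derive the attention mask and segment IDs."""
    ids = token_ids[:seq_len]                    # truncate if too long
    pad = seq_len - len(ids)
    input_ids = ids + [pad_id] * pad             # real tokens, then padding
    attention_mask = [1] * len(ids) + [0] * pad  # 1 = attend, 0 = ignore
    token_type_ids = [0] * seq_len               # single sentence: all segment 0
    return input_ids, attention_mask, token_type_ids

# Hypothetical token IDs for a short sentence like "[CLS] hello world [SEP]"
ids, mask, types = build_bert_inputs([101, 7592, 2088, 102])
print(ids[:6])   # [101, 7592, 2088, 102, 0, 0]
print(mask[:6])  # [1, 1, 1, 1, 0, 0]
```

The fixed length matters because, as we will see below, ML.NET requires input columns of a fixed shape.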

In ML.NET, we first need to build classes that handle the input and output of the model, and then use the ApplyOnnxModel function to load it. You can find out more about this in the previous article:

public class ModelInput
{
    [VectorType(1, 32)]
    [ColumnName("input_ids")]
    public long[] InputIds { get; set; }

    [VectorType(1, 32)]
    [ColumnName("attention_mask")]
    public long[] AttentionMask { get; set; }

    [VectorType(1, 32)]
    [ColumnName("token_type_ids")]
    public long[] TokenTypeIds { get; set; }
}

public class ModelOutput
{
    [VectorType(1, 32, 768)]
    [ColumnName("last_hidden_state")]
    public float[] LastHiddenState { get; set; }

    [VectorType(1, 768)]
    [ColumnName("pooler_output")]
    public float[] PoolerOutput { get; set; }
}
var pipeline = _mlContext.Transforms
    .ApplyOnnxModel(modelFile: bertModelPath,
                    shapeDictionary: new Dictionary<string, int[]>
                    {
                        { "input_ids", new [] { 1, 32 } },
                        { "attention_mask", new [] { 1, 32 } },
                        { "token_type_ids", new [] { 1, 32 } },
                        { "last_hidden_state", new [] { 1, 32, 768 } },
                        { "pooler_output", new [] { 1, 768 } },
                    },
                    inputColumnNames: new[] { "input_ids",
                                              "attention_mask",
                                              "token_type_ids" },
                    outputColumnNames: new[] { "last_hidden_state",
                                               "pooler_output" },
                    gpuDeviceId: useGpu ? 0 : (int?)null,
                    fallbackToCpu: true);

Still, we don't have an easy way to create tokens and provide them as input to this model. Creating a correct ModelInput object was a challenge, as was making sense of the model's output. Now we can use the BERTTokenizers NuGet package for this purpose.

2. BERT Tokenizers NuGet Package

This NuGet package should make your life easier. The goal is to match the ease of use of the Python API as closely as possible. The complete stack provided by the Python API of Huggingface is very user-friendly, and it paved the way for many people to use SOTA NLP models in a straightforward way. Hopefully, one day we will be able to do the same with C# as well.

BERTTokenizers NuGet Package

To install the BERTTokenizers NuGet package, use this command:

dotnet add package BERTTokenizers

Or you can install it with Package Manager:

Install-Package BERTTokenizers

2.1 Supported Models and Vocabularies

At the moment, BertTokenizers supports the following vocabularies:

  • BERT Base Cased – class BertBaseTokenizer
  • BERT Large Cased – class BertLargeTokenizer
  • BERT German Cased – class BertGermanTokenizer
  • BERT Multilingual Cased – class BertMultilingualTokenizer
  • BERT Base Uncased – class BertBaseUncasedTokenizer
  • BERT Large Uncased – class BertLargeUncasedTokenizer

With this, a wide range of models is supported: not just BERT models, but also, for example, DistilBERT models.

2.2 Available Methods

Tokenizer UML

Every class provides the same three functions:

  • Encode – the inputs to this method are the sequence length (mandatory, because ML.NET doesn't support input of variable length) and the sentence that needs to be encoded. As output, this method provides a list of tuples with Token ID, Token Type and Attention Mask for each token in the encoded sentence.
  • Tokenize – in case you need more information about the tokens, or you want to perform padding on your own, this method will do the trick. The input is a sentence that needs to be tokenized, and the output is a list of tuples with Token, Token ID and Token Type for each token in the sentence.
  • Untokenize – this method reverses the process and merges the list of tokens back into meaningful words. It is used on the output of the model.
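The idea behind Untokenize can be sketched in a few lines of Python: WordPiece continuation pieces, marked with a leading "##", are merged back onto the preceding piece (this is a simplified illustration of the concept, not the library's actual C# implementation):

```python
def untokenize(tokens):
    """Merge '##'-prefixed WordPiece continuation pieces back into words."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # append the continuation to the previous word
        else:
            words.append(tok)     # a new word starts here
    return words

print(untokenize(["tok", "##en", "##izer", "works"]))  # ['tokenizer', 'works']
```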

The UML of the project looks something like this:

BERTTokenizers Project UML

3. Use Case Example

Let's go back to the BERT Base Model. For it, we built our input class like this:

public class ModelInput
{
    [VectorType(1, 32)]
    [ColumnName("input_ids")]
    public long[] InputIds { get; set; }

    [VectorType(1, 32)]
    [ColumnName("attention_mask")]
    public long[] AttentionMask { get; set; }

    [VectorType(1, 32)]
    [ColumnName("token_type_ids")]
    public long[] TokenTypeIds { get; set; }
}
Sentiment Analysis Visual

Before we create an object to send into the pipeline, we need to create tokens, which we can do like this:

var tokenizer = new BertBaseTokenizer();

var encoded = tokenizer.Encode(32, sentence);

var bertInput = new ModelInput()
{
    InputIds = encoded.InputIds,
    AttentionMask = encoded.AttentionMask,
    TokenTypeIds = encoded.TokenTypeIds,
};

Important note: the first parameter of the Encode method must match the sequence size specified in the VectorType decorator of the ModelInput class.

Conclusion

In this article, we saw how we can use the BERTTokenizers NuGet package to easily build tokens for BERT input.

Thanks for reading!
