We live in a time when we are always in a hurry: our daily activities leave us short on time, and we are overwhelmed by the amount of information available through the Internet and the media. We do not always have time to read or watch everything that really interests us, but for the texts that reach us every day there is a solution.
We can read a summary of the text instead of the whole text. A summary gives us insight into what the text is about while saving the time of reading it in full. Summaries are already used everywhere, as abstracts, conclusions and so on. Text processing is now widely used in industry, and text summarization is still considered a challenging task.
In a summarization task, it is difficult to answer the question of whether a summary is good. One of the most important questions we want to answer is: is the summary informative enough? With Huggingface Transformers, summaries are not hard to produce. In this tutorial, we will show you how to summarize a text using some of the Huggingface Transformers models. In this article, we cover:
- Prerequisites
- Choosing models and the theory behind them
- Pegasus
- BART
- T5
- Examples with implementation
- Experiments using pipeline
- Experiments using Auto Tokenizer and Auto Model
1. Prerequisites
In order to follow this tutorial, you need Python version 3.6 or higher installed. You can install it as part of Anaconda or independently, and you can work in whichever environment you prefer: PyCharm, Visual Studio Code, Jupyter Notebook and so on. You also need to install the transformers library with the command:
pip install transformers
After you install transformers, import the library with the command:
import transformers
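To verify that the installation succeeded, you can print the installed version:

import transformers

# Print the installed version to confirm the library is available
print(transformers.__version__)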
It is important to note that we will use only pre-trained models and we will not perform fine-tuning in this tutorial.
2. Choosing models and the theory behind them
The Huggingface Hub has a Models section where you can choose the task you want to deal with; in our case we choose the Summarization task. Transformers are a well-known solution when it comes to complex language tasks such as summarization.
The summarization task uses a standard encoder-decoder Transformer, a neural network built around the attention mechanism. Transformers introduced attention, which captures the relationships between all the words that occur in a sequence. In this tutorial we will use one text example and three models in our experiments.
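To give a rough intuition of what attention computes, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is a simplified illustration, not the exact implementation used inside these models:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: matrices of shape (sequence_length, d_model)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                         # weighted sum of value vectors

np.random.seed(0)
tokens = np.random.randn(4, 8)   # toy example: 4 tokens, 8-dimensional embeddings
print(scaled_dot_product_attention(tokens, tokens, tokens).shape)   # (4, 8)

Each output vector is a mixture of all value vectors, weighted by how strongly the corresponding token attends to every other token.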
We decided to experiment with the following models:
- Pegasus
- BART
- T5
2.1 Pegasus
Pegasus uses a standard Transformer encoder-decoder, but its pre-training task resembles an extractive summary: important sentences are removed from the input document and generated together as one output sequence from the remaining sentences.
In practice, the selected sentences are replaced with a mask token in the encoder input, and the decoder learns to generate those missing gap sentences. The paper introduces this gap-sentence generation objective and compares several strategies for selecting the sentences. More information about the model can be found in the paper PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
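To make the idea concrete, here is a toy Python sketch of how a gap-sentence training example could be constructed. The helper function and the sentence selection are purely illustrative; the real model selects sentences by their ROUGE-based importance:

def make_gsg_example(sentences, gap_indices, mask_token="<mask_1>"):
    # Replace the selected "important" sentences with a mask token (encoder input)
    encoder_input = [mask_token if i in gap_indices else s for i, s in enumerate(sentences)]
    # The masked sentences become the target the decoder has to generate
    target = [s for i, s in enumerate(sentences) if i in gap_indices]
    return " ".join(encoder_input), " ".join(target)

document = [
    "The Eiffel Tower is 324 meters tall.",
    "It was finished in 1889.",
    "It is one of the most visited monuments in the world.",
]
model_input, model_target = make_gsg_example(document, gap_indices={0})
print(model_input)    # masked document the encoder sees
print(model_target)   # gap sentence the decoder learns to generate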
According to the authors, the model was pre-trained for 1.5M steps instead of 500k because they observed slower convergence of the pre-training perplexity, and the SentencePiece tokenizer was updated to encode the newline character. The PEGASUS-large (mixed, stochastic) model achieved the best results on almost all downstream tasks.
2.2 BART
BART is a sequence-to-sequence model trained as a denoising autoencoder: the input text is corrupted and the model learns to reconstruct the original. Because it maps an input sequence to a different output sequence, BART has found applications in many tasks besides text summarization, such as question answering and machine translation.
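As a rough illustration of the denoising idea, the sketch below corrupts a sentence by masking a random span, which the model would then learn to reconstruct. This is a simplification of the noising functions described in the paper (text infilling, sentence permutation and others):

import random

def corrupt(text, span_length=3, mask_token="<mask>"):
    # Mask a random contiguous span of tokens, producing a "noisy" encoder input
    tokens = text.split()
    start = random.randrange(0, max(1, len(tokens) - span_length))
    return " ".join(tokens[:start] + [mask_token] + tokens[start + span_length:])

original = "During its construction, the Eiffel Tower surpassed the Washington Monument."
print(corrupt(original))   # corrupted input fed to the encoder
print(original)            # target the decoder learns to reconstruct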
The BART checkpoint we use is pre-trained on English and fine-tuned on the CNN/Daily Mail dataset. More information about the model can be found in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Lewis et al.
According to the paper, BART outperforms the best previous work, which leverages BERT, by roughly 6.0 points on all ROUGE metrics, representing a significant advance in performance on this problem.
2.3 T5
XL-Sum is a dataset of 1 million annotated article-summary pairs from the BBC. It covers 44 languages and is the largest such dataset in terms of the amount of data collected from a single source.
The checkpoint we use is mT5, a pre-trained multilingual T5 model fine-tuned on the XL-Sum dataset. More details can be found in the paper XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages.
For many of the languages, XL-Sum provides the first publicly available abstractive summarization dataset and benchmarks. The authors also make their dataset curation tool available to researchers, which should help the dataset grow over time.
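If you want to inspect XL-Sum yourself, the sketch below loads it with the Hugging Face datasets library. The dataset identifier csebuetnlp/xlsum, the english configuration and the field names are assumptions based on the dataset card, so check the Hub listing before relying on them:

from datasets import load_dataset

# Dataset name, configuration and field names are assumptions based on the dataset card
xlsum_english = load_dataset("csebuetnlp/xlsum", "english", split="train")
sample = xlsum_english[0]
print(sample["title"])
print(sample["summary"])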
3. Examples with implementation
In all examples we will use the same text. It is taken from the Hugging Face model card for the google/pegasus-xsum model.
text_example = (
    "The tower is 324 meters (1,063 ft) tall, about the same height "
    "as an 81-storey building, and the tallest structure in Paris. Its base is square, "
    "measuring 125 meters (410 ft) on each side. During its construction, the Eiffel "
    "Tower surpassed the Washington Monument to become the tallest man-made structure "
    "in the world, a title it held for 41 years until the Chrysler Building in New York "
    "City was finished in 1930. It was the first structure to reach a height of 300 meters. "
    "Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is "
    "now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters, "
    "the Eiffel Tower is the second tallest free-standing structure in France "
    "after the Millau Viaduct."
)
3.1 Examples using Pipeline
Huggingface Transformers offer the option to download a model with the so-called pipeline, which is the easiest way to try a model and see how it works.
The pipeline hides the complex code of the transformers library behind a simple API for multiple tasks such as summarization, sentiment analysis, named entity recognition and many more. For the PEGASUS model, we can use the following code:
from transformers import pipeline
summarizer = pipeline("summarization", model = "google/pegasus-xsum")
summarizer(text_example)
[{'summary_text': 'The Eiffel Tower is a landmark in Paris, France.'}]
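The pipeline also forwards generation arguments such as min_length and max_length to the underlying generate call, so you can nudge the model towards longer or shorter summaries. The values below are just an example:

# Reusing the summarizer created above; the length limits are example values
summarizer(text_example, min_length=30, max_length=60)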
The BART model can be used in a similar manner:
from transformers import pipeline
summarizer = pipeline("summarization", model = "facebook/bart-large-cnn")
summarizer(text_example)
[{'summary_text': 'The tower is 324 meters (1,063 ft) tall, about the same height
as an 81-storey building. Its base is square, measuring 125 meters (410 ft) on each
side. During its construction, the Eiffel Tower surpassed the Washington Monument
to become the tallest man-made structure in the world.'}]
The same goes for the T5:
from transformers import pipeline
summarizer = pipeline("summarization", model= "csebuetnlp/mT5_multilingual_XLSum")
summarizer(text_example)
[{'summary_text': 'The Eiffel Tower has become the tallest free-standing building
in the world.'}]
3.2 Examples using AutoTokenizer and AutoModel
Often, we want to automatically retrieve the relevant model given the name of a pretrained checkpoint. That is possible thanks to Huggingface AutoClasses. AutoClasses are split into AutoConfig, AutoModel and AutoTokenizer. Instantiating one of them with a model name or path creates the relevant architecture for that model.
For PEGASUS use this code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained('google/pegasus-xsum')
tokenizer = AutoTokenizer.from_pretrained('google/pegasus-xsum')
tokens_input = tokenizer.encode("summarize: "+ text_example, return_tensors='pt', max_length=512, truncation=True)
ids = model.generate(tokens_input, min_length=80, max_length=120)
summary = tokenizer.decode(ids[0], skip_special_tokens=True)
print(summary)
"The Eiffel Tower in Paris, France, is the world's tallest free-standing
structure and one of the most famous buildings in the world, having opened to the public
on 1 September 1889, the same year it was officially opened to the public by the French
President, Charles de Gaulle, in a ceremony at the Arc de Triomphe on the Champs-Elysees
in Paris, France."
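The generate method also accepts additional decoding arguments. For instance, the sketch below enables beam search with num_beams (the value 4 is just an example), which can change the resulting summary:

# Beam search instead of the default decoding; num_beams=4 is an arbitrary example value
ids = model.generate(tokens_input, min_length=80, max_length=120, num_beams=4, early_stopping=True)
summary = tokenizer.decode(ids[0], skip_special_tokens=True)
print(summary)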
BART can be utilized in a similar way:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokens_input = tokenizer.encode("summarize: "+text_example, return_tensors='pt', max_length=512, truncation=True)
ids = model.generate(tokens_input, min_length=80, max_length=120)
summary = tokenizer.decode(ids[0], skip_special_tokens=True)
print(summary)
"The tower is 324 meters(1,063 ft) tall, about the same height as an 81-storey building.
Its base is square, measuring 125 meters (410 ft) on each side.
During its construction, the Eiffel Tower surpassed the Washington Monument to become
the tallest man-made structure in the world. It held the title for 41 years until
the Chrysler Building in New York City was finished in 1930."
Finally, here is the code for the T5:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
tokens_input = tokenizer.encode("summarize: "+text_example, return_tensors='pt', max_length=512, truncation=True)
summary_ids = model.generate(tokens_input, min_length=80, max_length=120)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
"The Eiffel Tower has become the world's tallest building, marking its 50th anniversary.
But what exactly is it and what does it mean for those who want to be able to afford it?
The BBC has been looking at some of the key facts about the structure and why it is being built
in the Mediterranean."
The pipeline experiments produced shorter summaries than the experiments with Auto Model and Auto Tokenizer. Among the pipeline outputs, I prefer the summary from the third model, csebuetnlp/mT5_multilingual_XLSum.
Using Auto Model and Auto Tokenizer gave us more detailed summaries in all cases, since we explicitly set the minimum length of each summary. In my opinion google/pegasus-xsum produced the best summary, but csebuetnlp/mT5_multilingual_XLSum was informative as well. In the output of facebook/bart-large-cnn I did not like the first sentence: it does not say which tower it is about, and I would like to know that from the beginning of the summary.
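If you want to compare the three checkpoints side by side, a small loop over the pipeline API is enough. This is a minimal sketch, and the length limits are just example values:

from transformers import pipeline

checkpoints = [
    "google/pegasus-xsum",
    "facebook/bart-large-cnn",
    "csebuetnlp/mT5_multilingual_XLSum",
]

for checkpoint in checkpoints:
    # Each iteration downloads (or loads from cache) one model and summarizes the same text
    summarizer = pipeline("summarization", model=checkpoint)
    result = summarizer(text_example, min_length=30, max_length=120)
    print(checkpoint)
    print(result[0]["summary_text"])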
4. Conclusion
In the end, depending on what you want to achieve, you can choose from various models on Hugging Face. You can find models for specific languages as well as multilingual models. In this tutorial we showed how to use different models with the transformers pipeline and with Auto Model and Auto Tokenizer.
Thank you for reading!
Kristina Licenberger
NLP Engineer