At the end of every month, we decipher three research papers from the fields of machine learning, deep learning and artificial intelligence that left the biggest impact on us that month. At the end of the article, we also add links to other papers that we found interesting but that were not our focus that month, so you can check those out as well.
In general, we try to present papers that are going to leave a big impact on the future of machine learning and deep learning. We believe that these proposals are going to change the way we do our jobs and push the whole field forward. Have fun!
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
We tried to write an objective review of this paper, i.e. of this toolkit, but we had a problem: everything we wrote sounded like a commercial. We are not paid to write anything like that; we are just truly excited about Stanza, an awesome new toolkit from Stanford. Transfer learning seems to be the future of deep learning, even when it comes to NLP. Several NLP toolkits on the market, such as CoreNLP, Flair and spaCy, already provide various NLP options to engineers. However, all of them have some limitations. Most of them support only a few major languages, they are often under-optimized, and they have problems with input text that has been tokenized with some other tool. This limits them when the text comes from multiple sources and in a number of languages.
That is why the authors present Stanza, an open-source Python natural language processing toolkit supporting a whopping 66 human languages. Stanza has multiple advantages. It has a fully neural pipeline that takes raw text as input and outputs annotations including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. Its design is language-agnostic, which is what makes supporting 66 languages possible. Finally, its performance is fantastic. On top of that, Stanza provides pre-trained models and a Python API for the Java Stanford CoreNLP.
In essence, Stanza is composed of two main layers:
- Neural multilingual NLP pipeline
- Python interface to Java CoreNLP
The pipeline is composed of multiple models whose purpose varies from tokenization to syntactic analysis. Every one of these components is designed with modularity and the processing of many human languages in mind. Stanford’s Java CoreNLP is a really good tool with multiple options, especially when it comes to the English language. However, thus far it has been hard to utilize this tool from Python. The second big module of Stanza is exactly that: a Python API to CoreNLP. Stanza is easily installed with pip install stanza. Keep in mind that you should install PyTorch beforehand. Here is how Stanza is used on this Marcus Aurelius quote:
    import stanza

    stanza.download('en')  # Download the English models (needed only once)
    nlp = stanza.Pipeline('en', use_gpu=False)  # Default neural pipeline in English on CPU
    doc = nlp("When you arise in the morning, think of what a precious privilege it is to be alive - "
              "to breathe, to think, to enjoy, to love.")
    doc.sentences[0].print_dependencies()  # Print (word, head index, relation) for the first sentence
The last line of the code gives output that looks like this:
    ('When', '3', 'mark')
    ('you', '3', 'nsubj')
    ('arise', '8', 'advcl')
    ('in', '6', 'case')
    ('the', '6', 'det')
    ('morning', '3', 'obl')
    (',', '8', 'punct')
    ('think', '0', 'root')
    ('of', '10', 'case')
    ('what', '8', 'obl')
    ('a', '13', 'det')
    ('precious', '13', 'amod')
    ('privilege', '10', 'nsubj')
    ('it', '15', 'nsubj')
    ('is', '10', 'acl:relcl')
    ('to', '18', 'mark')
    ('be', '18', 'cop')
    ('alive', '15', 'xcomp')
    ('-', '8', 'punct')
    ('to', '21', 'mark')
    ('breathe', '8', 'advcl')
    (',', '21', 'punct')
    ('to', '24', 'mark')
    ('think', '21', 'conj')
    (',', '24', 'punct')
    ('to', '27', 'mark')
    ('enjoy', '21', 'conj')
    (',', '27', 'punct')
    ('to', '30', 'mark')
    ('love', '27', 'xcomp')
    ('.', '8', 'punct')
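Each tuple contains the word, the index of its syntactic head and the dependency relation. Head indices are 1-based positions within the sentence, and 0 marks the root. As a minimal sketch of how to read this output, here is one way to regroup the first eight tuples by their head using plain Python. The helper name build_tree is ours for illustration and is not part of Stanza:

```python
def build_tree(triples):
    """Group words under their head; a head index of 0 marks the root."""
    children = {}
    root = None
    # Word positions are 1-based, matching the head indices in the output.
    for position, (word, head, relation) in enumerate(triples, start=1):
        head = int(head)
        if head == 0:
            root = word
        else:
            children.setdefault(head, []).append((position, word, relation))
    return root, children


# First eight tuples of the output above.
triples = [
    ('When', '3', 'mark'),
    ('you', '3', 'nsubj'),
    ('arise', '8', 'advcl'),
    ('in', '6', 'case'),
    ('the', '6', 'det'),
    ('morning', '3', 'obl'),
    (',', '8', 'punct'),
    ('think', '0', 'root'),
]

root, children = build_tree(triples)
print(root)                                   # think
print([word for _, word, _ in children[3]])   # ['When', 'you', 'morning']
```

So "think" is the root of the sentence, and "When", "you" and "morning" all attach to the third word, "arise".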
The cool thing is that you can add various processors to the pipeline and build whatever you need. Each processor performs a single operation, and in the previous example we used them all. Here is the list of processors:
- tokenize – Tokenizes the text and performs sentence segmentation.
- mwt – Expands multi-word tokens (MWTs) into multiple words when they are predicted by the tokenizer.
- pos – Labels tokens with their universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats).
- lemma – Generates the word lemmas for all words in the Document.
- depparse – Provides an accurate syntactic dependency parsing analysis.
- ner – Recognizes named entities for all token spans in the corpus.
So, technically we could do something like this:
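    import stanza

    stanza.download('en')
    # Run only tokenization, multi-word token expansion and POS tagging
    nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos', use_gpu=False)
    doc = nlp("When you arise in the morning, think of what a precious privilege it is to be alive - "
              "to breathe, to think, to enjoy, to love.")
    # Dependency output needs the depparse processor, so print the POS tags instead
    for word in doc.sentences[0].words:
        print(word.text, word.upos)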
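When you pick a subset of processors, keep in mind that some of them build on the output of others. The sketch below shows one way to assemble the processors string for a given goal. The requirement table is an assumption based on our reading of the Stanza documentation, and the helper name processors_for is hypothetical, so double-check both against the docs for your Stanza version and language:

```python
# Assumed processor requirements (our reading of the Stanza docs, not an official API).
REQUIRES = {
    "tokenize": [],
    "mwt": ["tokenize"],
    "pos": ["tokenize", "mwt"],
    "lemma": ["tokenize", "mwt", "pos"],
    "depparse": ["tokenize", "mwt", "pos", "lemma"],
    "ner": ["tokenize", "mwt"],
}


def processors_for(goal):
    """Return a comma-separated processors string that ends with `goal`."""
    return ",".join(REQUIRES[goal] + [goal])


print(processors_for("depparse"))  # tokenize,mwt,pos,lemma,depparse
print(processors_for("ner"))       # tokenize,mwt,ner
```

The resulting string can be passed as the processors argument of stanza.Pipeline.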
If you want to try out Stanza, there is even an interactive web demo that runs the pipeline.
TensorFlow Quantum: A Software Framework for Quantum Machine Learning
Deep learning has been a hot topic for the past couple of years. Google, with TensorFlow, certainly pushed the boundaries and brought these technologies to the mainstream. During this period, another interesting technology had huge growth in both academia and industry: quantum computing. The rapid development of quantum hardware resulted in equally rapid development of quantum applications. These new ways of processing data will eventually change the way we think about handling information on a computer in general, and they will affect every field of computer science. That is how quantum machine learning (QML) algorithms emerged too. They tackle a wide range of applications in both supervised and unsupervised learning. TensorFlow Quantum is a new Google library intended to accelerate the development of quantum machine learning algorithms.
One of the main challenges is that classical algorithms and data need to be adapted before they can work with quantum processors. That is the case with machine learning algorithms as well. Apart from that, any data emerging from an underlying quantum mechanical process can be considered quantum data. The first generation of QML focused on utilizing quantum processing power to improve the performance of linear algebra calculations. The main benefit of quantum processors was their ability to perform fast linear algebra on a state space that grows exponentially with the number of qubits. Recently, the second generation of QML emerged. These algorithms focus on heuristic methods that can be studied empirically thanks to the increased computational capability of quantum hardware. This is similar to how machine learning evolved into deep learning in the first place. These new algorithms use parameterized quantum transformations called parameterized quantum circuits (PQCs) or Quantum Neural Networks (QNNs).
Due to the current state of quantum processors, the authors anticipate that investigations into various possible hybrid quantum-classical machine learning algorithms will be a productive area of research. Technically, this means that quantum computers will be most useful as hardware accelerators, working together with traditional computers. That is where we are at the moment and where TensorFlow Quantum (TFQ), a new quantum framework from Google, is headed. The main goal of this framework is to bridge the quantum computing and machine learning communities. In essence, it is a combination of two libraries: TensorFlow (Google’s ML framework) and Cirq (Google’s quantum framework).
In TFQ, circuits and other quantum computing constructs are represented as tensors. Converting these quantum tensors into classical information is done by ops, via simulators or real quantum devices. To be more precise, Cirq objects are converted to TensorFlow string tensors using the tfq.convert_to_tensor method. Here is a simple example of what that looks like:
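    import cirq
    import sympy
    import tensorflow as tf
    import tensorflow_quantum as tfq

    qubit = cirq.GridQubit(0, 0)
    theta = sympy.Symbol('theta')
    c = cirq.Circuit(cirq.X(qubit) ** theta)  # One-qubit circuit parameterized by theta
    c_tensor = tfq.convert_to_tensor([c] * 3)  # Batch of three copies of the circuit
    # The numeric values were lost in this listing; 0, 1 and 2 are illustrative placeholders
    theta_values = tf.constant([[0.0], [1.0], [2.0]])
    m = cirq.Z(qubit)  # Observable to measure
    paulis = tfq.convert_to_tensor([[m]] * 3)  # Shape [batch, n_ops]: one observable per circuit
    expectation_op = tfq.get_expectation_op()
    output = expectation_op(c_tensor, ['theta'], theta_values, paulis)
    abs_output = tf.math.abs(output)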
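To get a feeling for what the expectation op computes for this particular circuit, we can work it out by hand. The sketch below is plain Python and does not use TFQ or Cirq. It assumes the qubit starts in the |0> state and uses the identity X = H Z H, so that X**theta = H diag(1, exp(i*pi*theta)) H. The function names are ours:

```python
import cmath
import math


def x_power_on_zero(theta):
    """Amplitudes of X**theta applied to |0>, via X**theta = H diag(1, e^{i*pi*theta}) H."""
    phase = cmath.exp(1j * math.pi * theta)
    amp0 = (1 + phase) / 2  # amplitude of |0>
    amp1 = (1 - phase) / 2  # amplitude of |1>
    return amp0, amp1


def z_expectation(theta):
    """Expectation of Z: probability of |0> minus probability of |1>."""
    amp0, amp1 = x_power_on_zero(theta)
    return abs(amp0) ** 2 - abs(amp1) ** 2


for theta in (0.0, 0.5, 1.0):
    print(theta, z_expectation(theta), math.cos(math.pi * theta))
```

Up to floating-point error, the two printed values agree: the expectation of Z is cos(pi * theta), so it goes from 1 at theta = 0, through 0 at theta = 0.5, to -1 at theta = 1.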
The TFQ snippet above is simple and has no QNNs. In general, the steps of building and training QNNs are similar to those of building standard neural networks. Here they are:
- Prepare Quantum Dataset – As with regular neural networks, we need to build a dataset first. This is done by creating unparameterized cirq.Circuit objects and then injecting them into a computation graph with tfq.convert_to_tensor.
- Evaluate Quantum Model – In this step, we evaluate how well our quantum model is performing. Its main goal is to perform a quantum computation in order to extract information hidden in a quantum subspace.
- Sample or Average – In this step, we extract classical information in the form of samples from a classical random variable. The distribution of this random variable depends on the quantum state and on the measured observable.
- Evaluate Classical Model – We use deep neural networks to distill correlations between measured expectations.
- Evaluate Cost Function – The cost function is calculated based on the results from the previous step.
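- Evaluate Gradients & Update Parameters – After evaluating the cost function, the gradients are computed and the parameters of the model are updated so that the cost decreases.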
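The last few steps can be sketched with a toy example in plain Python. This is not how TFQ implements training, just an illustration under several assumptions of ours: the "quantum model" is a single X**theta gate on a qubit that starts in |0>, its averaged measurement of Z is replaced by the closed form cos(pi * theta) instead of a simulator or a real device, there is no classical network on top, and the gradient is obtained with the parameter-shift rule, a technique we picked for the illustration:

```python
import math


def expectation(theta):
    """Averaged Z measurement after X**theta on |0>; closed form cos(pi * theta)."""
    return math.cos(math.pi * theta)


def cost(theta, target):
    """Squared error between the expectation and the target value."""
    return (expectation(theta) - target) ** 2


def cost_gradient(theta, target):
    """Gradient of the cost with respect to theta."""
    # Parameter-shift rule: the derivative of the expectation is obtained from
    # two more evaluations of the same model at shifted parameter values.
    d_expectation = math.pi / 2 * (expectation(theta + 0.5) - expectation(theta - 0.5))
    return 2 * (expectation(theta) - target) * d_expectation


def train(theta, target, learning_rate=0.05, steps=50):
    """Plain gradient descent on the single parameter theta."""
    for _ in range(steps):
        theta -= learning_rate * cost_gradient(theta, target)
    return theta


theta = train(theta=0.3, target=0.0)
print(theta, cost(theta, 0.0))
```

Starting from theta = 0.3, the loop moves theta towards 0.5, where the expectation of Z is zero and the cost vanishes.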
To find out how to create a minimal example of a hybrid quantum-classical model for binary classification, you can check this Jupyter notebook. All in all, TensorFlow Quantum fits quite nicely into the current TensorFlow ecosystem.
Domain Adaptation by Class Centroid Matching and Local Manifold Self-Learning
Domain adaptation is an integral part of machine learning, and we use it often. The term generally describes learning from a source data distribution and using the resulting model on a different (but related) target data distribution. This process is, as you are probably aware, difficult with real-world data. The real world is messy, and data is collected from different sources under different conditions, so it can hardly satisfy the identical probability distribution hypothesis. As a result, a model created on the source domain cannot be directly applied to the target domain. A model that overfits its training data is one example of this problem.
The main goal of the approaches that try to solve this problem is to reduce the difference between the source distribution and the target distribution. Usually, these techniques focus on discovering a common feature space in which the difference between these distributions is minimal. The authors of this paper found that the problem with these methods is that they make label predictions for target samples independently and thus ignore the structure of the data distribution. That is why the authors created CMMS, a new method for assigning pseudo-labels to target samples with the help of class centroids in the two domains. This way, the distribution structure of both domains can be emphasized.
The first step in this novel approach is to find clusters of data in the target domain, which is done using K-means. This converts the domain distribution discrepancy minimization problem into a class centroid matching problem, which can be solved efficiently by nearest neighbor search. The authors also noticed that this process can be problematic for performance, so they used a local manifold learning strategy to improve it, meaning they learned the data similarity matrix according to the local connectivity in the low-dimensional space, not in the projected common space. To sum it up, the main idea of the proposed approach is to emphasize the data distribution structure through class centroid matching of the two domains and local manifold structure self-learning for the target data. That is what CMMS stands for: class Centroid Matching and local Manifold Self-learning. It is pretty clever and effective. Experiments on five datasets showed that CMMS outperforms several state-of-the-art methods in both unsupervised and semi-supervised scenarios.
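To make the centroid matching idea more concrete, here is a heavily simplified sketch in plain Python. It is not the authors' implementation: it skips the learned projection and the local manifold self-learning entirely, works directly on 2D points, and initializes K-means from the source class centroids, which is our own choice for keeping the example deterministic. All function names and the toy data are hypothetical:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans(points, init_centers, iterations=20):
    """Plain Lloyd's K-means with caller-supplied initial centers."""
    centers = list(init_centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, [nearest(p, centers) for p in points]


def pseudo_label(source_points, source_labels, target_points):
    """Cluster the target data, then label each cluster by its nearest source class centroid."""
    classes = sorted(set(source_labels))
    source_centroids = [
        centroid([p for p, y in zip(source_points, source_labels) if y == c])
        for c in classes
    ]
    target_centroids, assignments = kmeans(target_points, source_centroids)
    cluster_to_class = [classes[nearest(tc, source_centroids)] for tc in target_centroids]
    return [cluster_to_class[a] for a in assignments]


# Toy data: the target domain is a shifted copy of the source domain.
source_points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
source_labels = [0, 0, 0, 0, 1, 1, 1, 1]
target_points = [(1.5, 1.0), (1.0, 1.5), (2.0, 2.0), (6.5, 6.0), (6.0, 6.5), (7.0, 7.0)]

print(pseudo_label(source_points, source_labels, target_points))  # [0, 0, 0, 1, 1, 1]
```

Note that every target sample inherits the label of its whole cluster, so the pseudo-labels follow the structure of the target data instead of being predicted one sample at a time.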
Other Amazing Papers from this Month
- A Novel AI-enabled Framework to Diagnose Coronavirus COVID 19 using Smartphone Embedded Sensors: Design Study
- Fixing the train-test resolution discrepancy: FixEfficientNet
- An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs
- Equalization Loss for Long-Tailed Object Recognition
- Benchmarking Graph Neural Networks
- jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models
In this article, we had a chance to read about three really cool papers that, as we see it, are going to change the way we perform our jobs. First, we saw how one can utilize Stanza for NLP. Then we explored the new Google library for quantum machine learning, TensorFlow Quantum. Finally, we saw how we can improve domain adaptation using CMMS. Did you have any favorites this month? Let us know.
Thank you for reading!
Nikola M. Zivkovic
CAIO at Rubik's Code
Nikola M. Zivkovic is the CAIO at Rubik’s Code and the author of the book “Deep Learning for Programmers”. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.
Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.