The code that accompanies this article can be downloaded here.


When I was a kid, almost every superhero had a voice-controlled computer, so you can imagine how profound my first encounter with Alexa was. The kid in me was happy and excited. Of course, my engineering instincts soon kicked in, and I analyzed how these devices work. It turned out that neural networks handle this complicated problem. In fact, neural networks simplified the problem so much that today it is quite easy to build one of these applications on your computer using Python. We covered one example of how you can do this in one of the previous articles.

But it wasn’t always like that. The first attempts were made back in 1952 by three Bell Labs researchers, who built a system for single-speaker digit recognition with a vocabulary of 10 words. By the 1980s, this number had grown dramatically: vocabularies reached 20,000 words, and the first commercial products started appearing. Dragon Dictate was one of the first such products, originally priced at $9,000. Alexa is more affordable today, right?

Today we can even build these systems inside the browser using TensorFlow.js. In previous articles, we used this library to create standard feed-forward neural networks and convolutional neural networks. In this article, we take a different approach and use a pre-trained TensorFlow.js model, which can also serve as the basis for transfer learning. We will use it to build an application that lets you draw with your voice. But let’s not get ahead of ourselves; first, let’s learn a bit more about pre-trained models in general and about the concrete model we will use in our solution.

Transfer Learning

Transfer learning is the process of taking a model previously trained on one dataset and applying it to a different dataset. Of course, the model usually needs to be slightly modified and/or re-trained on the second dataset. However, because the model has already been trained once, it takes less time for it to adapt to the second dataset. Usually, these models ship without the output layer, exposing just the “core”. TensorFlow.js provides several such pre-trained models that can be used out of the box, either directly or in a transfer learning setting. In this article, we will use one of them purely out of the box.
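To make the distinction concrete, here is a minimal sketch of both modes using the @tensorflow-models/speech-commands package; the custom word list, epoch count, and variable names are illustrative assumptions, not code from this article’s solution:

```javascript
async function transferDemo() {
  // Load the pre-trained "core" model.
  const base = speechCommands.create('BROWSER_FFT');
  await base.ensureModelLoaded();

  // Used directly, the model already recognizes its built-in vocabulary:
  //   base.listen(result => { ... }, { probabilityThreshold: 0.75 });

  // In a transfer learning setting, we reuse the core and train a new
  // output layer on words of our own choosing, recorded from the mic.
  const transfer = base.createTransfer('my-words');
  await transfer.collectExample('hello');
  await transfer.collectExample('goodbye');
  await transfer.train({ epochs: 25 });
  transfer.listen(result => console.log(result.scores), { probabilityThreshold: 0.75 });
}
```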

There are several areas where pre-trained models are suitable, and speech recognition is one of them. The model we use is called Speech Command Recognizer. Essentially, it is a JavaScript module that recognizes spoken commands comprised of simple English words. The default vocabulary, ‘18w’, includes the digits from “zero” to “nine” and the words “up”, “down”, “left”, “right”, “go”, “stop”, “yes”, and “no”. Additional categories for “unknown word” and “background noise” are also available. Apart from the already mentioned ‘18w’ vocabulary, an even smaller vocabulary, ‘directional4w’, is available. It contains only the four directional words (‘up’, ‘down’, ‘left’, ‘right’).
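Loading the model with either vocabulary takes a single call. Here is a minimal sketch; the speechCommands global comes from the @tensorflow-models/speech-commands package:

```javascript
// Default '18w' vocabulary: digits plus directions, 'go', 'stop', 'yes', 'no'.
const recognizer18w = speechCommands.create('BROWSER_FFT');

// Smaller 'directional4w' vocabulary: only 'up', 'down', 'left', 'right'.
const recognizer4w = speechCommands.create('BROWSER_FFT', 'directional4w');

recognizer4w.ensureModelLoaded().then(() => {
  // Prints the list of words the loaded model can recognize.
  console.log(recognizer4w.wordLabels());
});
```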

How does it work?

There are many approaches when it comes to combining neural networks and audio. Speech is often handled with some sort of recurrent neural network or LSTM. However, the Speech Command Recognizer uses a simple architecture described in the paper Convolutional Neural Networks for Small-footprint Keyword Spotting. This approach is based on image recognition and the convolutional neural networks we examined in the previous article. At first glance, that might be confusing, since audio is a one-dimensional continuous signal across time, not a 2D spatial problem.

This architecture utilizes a spectrogram: a visual representation of the spectrum of frequencies of a signal as it varies with time. Essentially, a window of time into which a word should fit is defined. This is done by grouping audio signal samples into segments. Then the strengths of the frequencies in each segment are analyzed, and segments that may contain words are identified. These segments are converted into spectrograms, i.e. one-channel images that are used for word recognition:

Spectrogram example

The image produced by this pre-processing is then fed into a multi-layer convolutional neural network similar to the one we created in the previous article.
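If you are curious, you can peek at the spectrogram the model consumes by asking the recognizer to include it in its results. A small sketch, assuming a recognizer instance created as in the snippet above; includeSpectrogram and the result.spectrogram shape match the Speech Command Recognizer API, while the logging is just illustrative:

```javascript
recognizer4w.listen(result => {
  // result.spectrogram holds the frequency data the network "sees":
  // a flat Float32Array plus the number of values per time frame.
  const { data, frameSize } = result.spectrogram;
  console.log(`frames: ${data.length / frameSize}, frequency bins per frame: ${frameSize}`);
}, { includeSpectrogram: true, probabilityThreshold: 0.75 });
```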

Demo

You have probably noticed that this page asked you for permission to use your microphone. That is because we embedded the demo implementation in this page. For the demo to work, you have to allow it to use the microphone.

Now you can use the commands ‘up’, ‘down’, ‘left’ and ‘right’ to draw on the canvas below. Go ahead, try it out:

Implementation

The whole code that accompanies this blog post can be found here.

First, let’s take a look at the index.html file of our implementation. In one of the previous articles, we presented several ways of installing TensorFlow.js. One of them is to include it within a script tag in the HTML file, and that is how we will do it here as well. Apart from that, we need to add an additional script tag for the pre-trained model. Here is how index.html looks:
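The exact markup ships with the downloadable code linked above; a minimal version along these lines should work (the canvas id and dimensions are assumptions):

```html
<!DOCTYPE html>
<html>
  <head>
    <!-- TensorFlow.js core library -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
    <!-- Pre-trained Speech Command Recognizer model -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/speech-commands"></script>
  </head>
  <body>
    <!-- Canvas on which the voice-controlled drawing happens -->
    <canvas id="canvas" width="600" height="400"></canvas>
    <script src="script.js"></script>
  </body>
</html>
```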

The JavaScript code that contains this implementation is located in script.js. This file should be located in the same folder as the index.html file. To run the whole process, all you have to do is open index.html in your browser and allow it to use your microphone. Now, let’s examine the script.js file, where the whole implementation is located. Here is how the main run function looks:
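The original file is in the accompanying download; a sketch consistent with the description that follows might look like this (variable names and the starting position are assumptions):

```javascript
// Global state shared between functions.
let recognizer;
let canvas, context;
let position = { x: 300, y: 200 }; // start drawing from the middle

async function run() {
  // Create the model with the small four-word vocabulary,
  // since we only need 'up', 'down', 'left' and 'right'.
  recognizer = speechCommands.create('BROWSER_FFT', 'directional4w');

  // Wait until the model is downloaded and loaded into memory.
  await recognizer.ensureModelLoaded();

  // Initialize the canvas we draw on.
  canvas = document.getElementById('canvas');
  context = canvas.getContext('2d');
  context.moveTo(position.x, position.y);

  // Start listening for spoken commands.
  predict();
}

run();
```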

Here we can see the workflow of the application. First, we create an instance of the model and assign it to the global variable recognizer. We use the ‘directional4w’ vocabulary because we only need the ‘up’, ‘down’, ‘left’ and ‘right’ commands. Then we wait for the model to load, which might take some time if your internet connection is slow. Once that is done, we initialize the canvas on which the drawing is performed. Finally, the predict method is called. Here is what happens inside that function:
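Again as a hedged sketch: the listen method and the probabilityThreshold option below are part of the Speech Command Recognizer API, while the surrounding logic follows the description in this article:

```javascript
function predict() {
  // listen() runs an open-ended loop; the callback fires each time
  // the model's top score clears the probability threshold below.
  recognizer.listen(result => {
    // result.scores contains one probability per word in the vocabulary.
    const words = recognizer.wordLabels();
    const scores = Array.from(result.scores);
    const word = words[scores.indexOf(Math.max(...scores))];

    // Translate the recognized word into the next point and draw.
    position = calculateNewPosition(position, word);
    context.lineTo(position.x, position.y);
    context.stroke();
  }, {
    // The callback is invoked only if the maximum probability
    // score is greater than this threshold.
    probabilityThreshold: 0.75
  });
}
```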

This method does the heavy lifting. In essence, it runs an endless loop in which the recognizer listens for the words you are saying. Notice that we are using the probabilityThreshold parameter. This parameter determines whether the callback function is called at all: it is invoked only if the maximum probability score is greater than the threshold. Once we get the word, we know the direction in which we should draw.

Then we calculate the coordinates of the end of the line using the function calculateNewPosition. The step is 10 pixels, meaning each line segment will be 10 pixels long. You can play with both the probabilityThreshold and this length value. Once we have the new coordinates, we use the canvas to draw the line. That is it. Pretty straightforward, right?
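A sketch of that helper, assuming the 10-pixel step mentioned above; the default branch, which ignores unrecognized words, is an assumption:

```javascript
function calculateNewPosition(position, word) {
  const step = 10; // length of each drawn line segment in pixels
  switch (word) {
    case 'up':    return { x: position.x, y: position.y - step };
    case 'down':  return { x: position.x, y: position.y + step };
    case 'left':  return { x: position.x - step, y: position.y };
    case 'right': return { x: position.x + step, y: position.y };
    default:      return position; // e.g. background noise or unknown word
  }
}
```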

Conclusion

In this article, we saw how easily we can use pre-trained TensorFlow.js models. They are a good starting point for simple applications, and we even built an example of one: an application that lets you draw using voice commands. That is pretty cool, and the possibilities are endless. Of course, you can further train these models, get better results, and use them for more complicated solutions; in other words, you can really utilize transfer learning. However, that is a story for another time.

Thank you for reading!


Read more posts from the author at Rubik’s Code.

