The code that accompanies this article can be downloaded here.
When I was a kid, almost every superhero had a voice-controlled computer, so you can imagine what a profound experience my first encounter with Alexa was. The kid in me was happy and excited. Then, of course, my engineering instincts kicked in and I analyzed how these devices work. It turned out that neural networks handle this complicated problem. In fact, neural networks simplified the problem so much that today it is quite easy to build one of these applications on your computer using Python. We covered one example of how you can do this in one of our previous articles.
But it wasn’t always like that. The first attempts were made back in 1952 by three Bell Labs researchers, who built a system for single-speaker digit recognition with a vocabulary of 10 words. By the 1980s this number had grown dramatically: vocabularies reached 20,000 words and the first commercial products started appearing. Dragon Dictate was one of the first such products, originally priced at $9,000. Alexa is more affordable today, right?
Today we can even build these systems inside the browser using TensorFlow.js. In previous articles, we used this library to create standard feed-forward neural networks and convolutional neural networks. In this article, we take a different approach and use a pre-trained TensorFlow.js model, i.e. a transfer learning approach. We will use it to build an application that lets you draw with your voice. But let’s not get ahead of ourselves; first, let’s find out a little bit more about pre-trained models in general and about the concrete model we will use in our solution.
Transfer learning is the process of taking a model previously trained on one dataset and applying it to a different dataset. Of course, the model usually needs to be slightly modified and/or re-trained on the second dataset. However, because the model has already been trained once, it takes less time for it to adapt to the second dataset. Usually, these models are shipped without the output layer, so only the “core” of the network is reused. TensorFlow.js provides several pre-trained models of this kind that can be used out of the box.
They can be used directly or in a transfer learning setting. In this article, we will use the out-of-the-box solution.
How does it work?
There are a lot of approaches when it comes to combining neural networks and audio. Speech is often handled with some sort of recurrent neural network or LSTM. However, the Speech Command Recognizer takes a different route: it turns audio recognition into an image recognition problem.
This architecture utilizes a spectrogram: a visual representation of the spectrum of frequencies of a signal as it varies over time. Essentially, a window of time into which a word should fit is defined. This is done by grouping audio signal samples into segments. Once that is done, the strengths of the frequencies are analyzed and segments that may contain words are identified. These segments are then converted into spectrograms, i.e. one-channel images that are used for word recognition:
The image produced by this pre-processing is then fed into a multi-layer convolutional neural network similar to the one we created in the previous article.
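The windowing step described above can be sketched in plain JavaScript. This is a minimal illustration with a hypothetical helper name and toy frame sizes, not the library’s internal implementation; in the real model, each frame would additionally be run through an FFT to produce one column of the spectrogram.

```javascript
// Group raw audio samples into fixed-length, overlapping frames.
// (Illustrative sketch; real frame/hop lengths depend on the sample rate.)
function frameSignal(samples, frameLength, hopLength) {
  const frames = [];
  for (let start = 0; start + frameLength <= samples.length; start += hopLength) {
    frames.push(samples.slice(start, start + frameLength));
  }
  return frames;
}

// 8 samples, frames of length 4 with a hop of 2 give 3 overlapping frames:
// [0,1,2,3], [2,3,4,5], [4,5,6,7]
const frames = frameSignal([0, 1, 2, 3, 4, 5, 6, 7], 4, 2);
```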
You have probably noticed that this page asked you for permission to use your microphone. That is because we embedded the implementation demo in this page. For the demo to work, you have to allow it to access the microphone.
Now you can use the commands ‘up’, ‘down’, ‘left’ and ‘right’ to draw on the canvas below. Go ahead, try it out:
The whole code that accompanies this blog post can be found here.
First, let’s take a look at the setup code.
Here we can see the workflow of the application. First, we create an instance of the model and assign it to the global variable recognizer. We use the ‘directional4w’ vocabulary because we need only the ‘up’, ‘down’, ‘left’ and ‘right’ commands. Then we wait for the model to load; this might take some time if your internet connection is slow. Once that is done, we initialize the canvas on which drawing is performed. Finally, we call the method that starts listening for voice commands.
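This setup can be sketched roughly as follows. The sketch assumes the speech-commands package is loaded via a script tag that exposes the global `speechCommands` object (as in the official TensorFlow.js examples); the function name `init` and the overall structure are illustrative, not the original code.

```javascript
// Global recognizer instance used by the rest of the app.
let recognizer = null;

async function init() {
  // 'BROWSER_FFT' uses the browser's native Fourier transform;
  // 'directional4w' is the four-word vocabulary: 'up', 'down', 'left', 'right'.
  recognizer = speechCommands.create('BROWSER_FFT', 'directional4w');

  // Wait for the model weights to download; this can take a while
  // on a slow connection.
  await recognizer.ensureModelLoaded();

  // At this point we would initialize the canvas and start listening.
}
```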
This method does the heavy lifting. In essence, it runs an endless loop in which the recognizer listens for the words you are saying. Notice the configuration parameter that we pass to it.
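A hedged sketch of what such a listening loop can look like with this model. `recognizer.listen` and the `probabilityThreshold` option are part of the speech-commands API; the helper `pickWord`, the `drawLine` call and the threshold value are illustrative names and numbers, not the original code.

```javascript
// Pick the most probable word, ignoring low-confidence results.
function pickWord(scores, labels, threshold) {
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return scores[best] >= threshold ? labels[best] : null;
}

// Keep listening until stopped; the callback fires on every detected word.
function startListening() {
  recognizer.listen(result => {
    // result.scores is aligned with recognizer.wordLabels().
    const word = pickWord(result.scores, recognizer.wordLabels(), 0.75);
    if (word) drawLine(word); // 'up' | 'down' | 'left' | 'right'
  }, { probabilityThreshold: 0.75, overlapFactor: 0.5 });
}
```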
Then we calculate the coordinates of the end of the line using a helper function.
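That calculation can be as simple as offsetting the current point by a fixed step in the recognized direction. Here is a minimal sketch; the function name and step size are illustrative, and note that on an HTML canvas the y axis grows downward.

```javascript
// Compute the end point of the next line segment from the recognized word.
function nextPoint(x, y, word, step) {
  switch (word) {
    case 'up':    return { x: x, y: y - step }; // canvas y grows downward
    case 'down':  return { x: x, y: y + step };
    case 'left':  return { x: x - step, y: y };
    case 'right': return { x: x + step, y: y };
    default:      return { x: x, y: y };        // unrecognized word: stay put
  }
}

// From (100, 100), the command 'up' moves the pen 10 pixels up, to (100, 90).
const p = nextPoint(100, 100, 'up', 10);
```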
In this article, we saw how easily we can use pre-trained TensorFlow.js models. They are a good starting point for simple applications. We even built one such example: an application that lets you draw using voice commands. That is pretty cool, and the possibilities are endless. Of course, you can further train these models, get better results and use them for more complicated solutions; in other words, you can really utilize transfer learning. However, that is a story for another time.
Thank you for reading!
Read more posts from the author at Rubik’s Code.