When I was a kid, almost every superhero had a voice-controlled computer. So you can imagine how profound my first encounter with Alexa was. The kid in me was so happy and excited. Of course, then my engineering instincts kicked in and I analyzed how these devices work.

It turned out they use neural networks to handle this complicated problem. In fact, neural networks simplified the problem so much that today it is quite easy to make one of these applications on your computer using Python. But it wasn’t always like that. The first attempts were made back in 1952 by three Bell Labs researchers.

They built a system for single-speaker digit recognition with a vocabulary of 10 words. By the 1980s, however, this number had grown dramatically. Vocabularies grew to 20,000 words and the first commercial products started appearing. Dragon Dictate was one of the first such products and it was originally priced at $9,000. Alexa is more affordable today, right?

However, today we can perform Speech Recognition in the browser with TensorFlow.js. In this article, we cover:

  1. Transfer Learning
  2. How does Speech Recognition work?
  3. Demo
  4. Implementation with TensorFlow.js

1. Transfer Learning 

Historically, image classification is the problem that popularized deep neural networks, especially the visual kind: Convolutional Neural Networks (CNNs). Today, transfer learning is also used for other types of machine learning tasks, like NLP and Speech Recognition. We will not go into detail about what CNNs are and how they work. However, we can say that CNNs became popular after they broke a record in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) back in 2012.

This competition evaluates algorithms for object detection and image classification at a large scale. The dataset that they provide contains 1000 image categories and over 1.2 million images. The goal of the image classification algorithm is to correctly predict the class to which the object belongs. Since 2012, every winner of this competition has used CNNs.

Training deep neural networks can be computationally expensive and time-consuming. To get really good results, you need a lot of computing power, which means a lot of GPUs, and this means… well, a lot of money. You could of course train these big architectures and get SOTA results on cloud environments, but this is also quite expensive.

For a while, these architectures were not available to regular developers. However, the concept of transfer learning changed that, especially for image classification. Today we can use state-of-the-art architectures that won the ImageNet competition thanks to transfer learning and pre-trained models.

1.1 Pre-Trained Models

At this moment one might wonder, “What are pre-trained models?” Essentially, a pre-trained model is a saved network that was previously trained on a large dataset, for example the ImageNet dataset. There are two ways in which you can use one: as an out-of-the-box solution, or with transfer learning. Since large datasets are usually used for some general solution, you can customize a pre-trained model and specialize it for certain problems.

This way you can utilize some of the most famous neural networks without spending too much time and too many resources on training. Additionally, you can fine-tune these models by modifying the behavior of chosen layers. The whole idea revolves around using the lower layers of a pre-trained CNN model and adding layers that customize the architecture for the specific problem.

Essentially, serious transfer learning models are usually composed of two parts, which we call the backbone and the head. The backbone is usually a deep architecture that was pre-trained on the ImageNet dataset, without its top layers. The head is the part of the image classification model that is used for the prediction of custom classes.

The head layers are added on top of the pre-trained model. With these systems, we have two phases: the bottleneck phase and the training phase. During the bottleneck phase, images of the specific dataset are run through the backbone architecture, and the results are stored. During the training phase, the stored output from the backbone is used to train the custom layers.
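The two phases can be illustrated with a toy sketch in plain JavaScript. Everything here is an assumption made for illustration: `backbone` is a hypothetical stand-in for a frozen pre-trained network (it is just a fixed function), the head is a tiny logistic regression, and the dataset is made up. It is not how any real pre-trained model is built or trained.

```javascript
// Hypothetical frozen "backbone": stands in for a pre-trained network.
// Its behavior is fixed and never updated during training.
function backbone(input) {
  const [a, b] = input;
  return [a + b, a - b, 1]; // the last entry acts as a bias feature
}

// Bottleneck phase: run every example through the backbone once and store the result.
function computeBottlenecks(inputs) {
  return inputs.map(backbone);
}

const sigmoid = (z) => 1 / (1 + Math.exp(-z));
const dot = (f, w) => f.reduce((sum, v, i) => sum + v * w[i], 0);

// Training phase: fit a small logistic-regression "head" on the stored features only.
function trainHead(features, labels, epochs = 500, learningRate = 0.5) {
  const weights = new Array(features[0].length).fill(0);
  for (let epoch = 0; epoch < epochs; epoch++) {
    features.forEach((f, n) => {
      const error = sigmoid(dot(f, weights)) - labels[n];
      f.forEach((v, i) => { weights[i] -= learningRate * error * v; });
    });
  }
  return weights;
}

// Full model at prediction time: backbone followed by the trained head.
function predictWithHead(weights, input) {
  return sigmoid(dot(backbone(input), weights));
}

// Made-up dataset: the label is 1 when the first value is larger than the second.
const inputs = [[1, 0], [0, 1], [2, 1], [1, 2], [3, 0], [0, 3]];
const labels = [1, 0, 1, 0, 1, 0];

const cached = computeBottlenecks(inputs);      // bottleneck phase
const headWeights = trainHead(cached, labels);  // training phase
console.log(inputs.map((x) => predictWithHead(headWeights, x) > 0.5 ? 1 : 0));
```

The point of the split is that the expensive backbone runs only once per example, while the cheap head can be trained for many epochs on the cached features.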

There are several areas where using pre-trained models is suitable, and speech recognition is one of them. The pre-trained model we use here is called Speech Command Recognizer. Essentially, it is a JavaScript module that enables recognition of spoken commands made up of simple English words.

The default vocabulary ‘18w’ includes the following words: the digits from “zero” to “nine”, “up”, “down”, “left”, “right”, “go”, “stop”, “yes”, and “no”. Additional categories of “unknown word” and “background noise” are also available. Apart from the already mentioned ‘18w’ vocabulary, an even smaller vocabulary, ‘directional4w’, is available. It contains only the four directional words (‘up’, ‘down’, ‘left’, ‘right’).

2. How does Speech Recognition work?

There are a lot of approaches when it comes to combining neural networks and audio. Speech is often handled using some sort of Recurrent Neural Network, such as an LSTM. However, Speech Command Recognizer uses a simple architecture called Convolutional Neural Networks for Small-footprint Keyword Spotting.

This approach is based on image recognition and the Convolutional Neural Networks we examined in the previous article. At first glance, that might be confusing, since audio is a one-dimensional continuous signal across time, not a 2D spatial problem.

2.1 Spectrogram

This architecture uses a spectrogram, which is a visual representation of the spectrum of frequencies of a signal as it varies with time. Essentially, a window of time into which a word should fit is defined.

This is done by grouping audio signal samples into segments. The strengths of the frequencies in each segment are then analyzed, and segments with possible words are identified. These segments are then converted into spectrograms, i.e. one-channel images that are used for word recognition.

The image that’s made using this pre-processing is then fed into a multi-layer convolutional neural network.
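To make the idea of a spectrogram concrete, here is a minimal sketch in plain JavaScript. It is an illustration only, not the pre-processing that Speech Command Recognizer performs: the function names, the frame size of 64 samples and the hop of 32 samples are assumptions, the synthetic sine wave stands in for recorded audio, and a naive discrete Fourier transform is used where a real pipeline would use an FFT and usually a window function.

```javascript
// Split a signal into fixed-size segments (frames); hopSize controls the overlap.
function frameSignal(signal, frameSize, hopSize) {
  const frames = [];
  for (let start = 0; start + frameSize <= signal.length; start += hopSize) {
    frames.push(signal.slice(start, start + frameSize));
  }
  return frames;
}

// Strength (magnitude) of each frequency bin for one frame, using a naive DFT.
function magnitudeSpectrum(frame) {
  const n = frame.length;
  const magnitudes = [];
  for (let k = 0; k < n / 2; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (2 * Math.PI * k * t) / n;
      re += frame[t] * Math.cos(angle);
      im -= frame[t] * Math.sin(angle);
    }
    magnitudes.push(Math.sqrt(re * re + im * im));
  }
  return magnitudes;
}

// Spectrogram: one row per time frame, one column per frequency bin.
// The result can be treated as a one-channel image.
function spectrogram(signal, frameSize = 64, hopSize = 32) {
  return frameSignal(signal, frameSize, hopSize).map(magnitudeSpectrum);
}

// Synthetic tone that completes exactly 8 cycles in every 64-sample frame.
const tone = Array.from({ length: 256 }, (_, t) => Math.sin((2 * Math.PI * 8 * t) / 64));
const image = spectrogram(tone);
const loudestBin = image[0].indexOf(Math.max(...image[0]));
console.log(image.length, image[0].length, loudestBin); // prints: 7 32 8
```

Each row of `image` describes one moment in time and each column one frequency, which is what turns a one-dimensional audio signal into a 2D input that a convolutional network can process.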

3. Demo

You have probably noticed that this page asked you for permission to use the microphone. That is because we embedded the implementation demo in this page. In order for this demo to work, you have to allow it to use the microphone.

Now you can use the commands ‘up’, ‘down’, ‘left’ and ‘right’ to draw on the canvas below. Go ahead, try it out:


4. Implementation with TensorFlow.js

4.1 HTML File

First, let’s take a look at the index.html file of our implementation. In one of the previous articles, we presented several ways of installing TensorFlow.js. One of them was including it with a script tag in the HTML file. That is how we will do it here as well. Apart from that, we need to add an additional script tag for the pre-trained model. Here is what index.html looks like:

    <!-- TensorFlow.js -->
    <script src=""></script>
    <!-- Pre-trained speech commands model -->
    <script src=""></script>
    <section class='title-area'>
        <h1>TensorFlow.js Speech Recognition</h1>
        <p class='subtitle'>Using pretrained models for speech recognition</p>
    </section>
    <canvas id="canvas" width="1000" height="800" style="border:1px solid #c3c3c3;"></canvas>
    <script src="script.js"></script>

JavaScript code that contains this implementation is located within script.js. This file should be located in the same folder as the index.html file. In order to run this whole process, all you have to do is open index.html inside of your browser and allow it to use your microphone. 

4.2 Script File

Now, let’s examine the script.js file, where the whole implementation is located. Here is how the main run function looks:

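    // Global recognizer instance, shared with predict().
    let recognizer;

    async function run() {
        // Create the recognizer with the four-word directional vocabulary.
        recognizer = speechCommands.create('BROWSER_FFT', 'directional4w');
        await recognizer.ensureModelLoaded();

        // Set up the canvas and the starting point of the drawing.
        var canvas = document.getElementById("canvas");
        var contex = canvas.getContext("2d");
        contex.lineWidth = 10;
        contex.lineJoin = 'round';
        var positionx = 400;
        var positiony = 500;

        predict(contex, positionx, positiony);
    }

    // Start the application once the script is loaded.
    run();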

Here we can see the workflow of the application. First, we create an instance of the model and assign it to the global variable recognizer. We use the ‘directional4w’ vocabulary because we need only the ‘up’, ‘down’, ‘left’ and ‘right’ commands.

Then we wait for the model to be loaded. This might take some time if your internet connection is slow. Once that is done, we initialize the canvas on which the drawing is performed. Finally, the predict function is called. Here is what is happening inside that function:

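    // Map a recognized word to the coordinates of the next point.
    function calculateNewPosition(positionx, positiony, direction) {
        const moves = {
            'up': [positionx, positiony - 10],
            'down': [positionx, positiony + 10],
            'left': [positionx - 10, positiony],
            'right': [positionx + 10, positiony],
            'default': [positionx, positiony]
        };
        // Labels that are not directions keep the position unchanged.
        return moves[direction] || moves['default'];
    }

    function predict(contex, positionx, positiony) {
        const words = recognizer.wordLabels();
        recognizer.listen(({scores}) => {
            // Pair every score with its word and sort by descending score.
            scores = Array.from(scores).map((s, i) => ({score: s, word: words[i]}));
            scores.sort((s1, s2) => s2.score - s1.score);

            var direction = scores[0].word;
            var [x1, y1] = calculateNewPosition(positionx, positiony, direction);

            // Draw a segment from the current position to the new one.
            contex.beginPath();
            contex.moveTo(positionx, positiony);
            contex.lineTo(x1, y1);
            contex.stroke();

            positionx = x1;
            positiony = y1;
        }, {probabilityThreshold: 0.75});
    }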

This function does the heavy lifting. In essence, it runs an endless loop in which the recognizer listens to the words you are saying. Notice that we are using the probabilityThreshold parameter.

This parameter determines whether the callback function is called at all. Essentially, the callback function is invoked only if the maximum probability score is greater than this threshold. When we get the word, we get the direction in which we should draw.
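The effect of the threshold can be sketched as a standalone function. This is only an illustration of the rule described above, not the library's own code; the function name `topWordAboveThreshold` and the example scores are hypothetical.

```javascript
// Return the most probable word, or null when no score is above the threshold.
function topWordAboveThreshold(scores, words, probabilityThreshold) {
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return scores[best] > probabilityThreshold ? words[best] : null;
}

// Hypothetical labels and scores, for illustration only.
const exampleWords = ['up', 'down', 'left', 'right'];
console.log(topWordAboveThreshold([0.05, 0.85, 0.06, 0.04], exampleWords, 0.75)); // "down"
console.log(topWordAboveThreshold([0.30, 0.40, 0.20, 0.10], exampleWords, 0.75)); // null
```

A higher threshold means fewer accidental moves caused by uncertain predictions, at the cost of sometimes ignoring a command that was spoken unclearly.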

Then we calculate the coordinates for the end of the line using the function calculateNewPosition. The step is 10 pixels, meaning the length of each line segment will be 10 pixels. You can play with both the probabilityThreshold and this length value. Once we get the new coordinates, we use the canvas to draw the line. That is it. Pretty straightforward, right?
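If you want to experiment with the length value, one option is to turn the step into a parameter. The variant below is a hypothetical sketch; the name `calculateNewPositionWithStep` and the `step` parameter are not part of the original script.

```javascript
// Variant of calculateNewPosition with a configurable step, in pixels.
function calculateNewPositionWithStep(positionx, positiony, direction, step = 10) {
  const moves = {
    up: [positionx, positiony - step],
    down: [positionx, positiony + step],
    left: [positionx - step, positiony],
    right: [positionx + step, positiony],
  };
  // Any label that is not a direction leaves the position unchanged.
  return Object.prototype.hasOwnProperty.call(moves, direction)
    ? moves[direction]
    : [positionx, positiony];
}

console.log(calculateNewPositionWithStep(400, 500, 'up', 25)); // [ 400, 475 ]
console.log(calculateNewPositionWithStep(400, 500, 'left'));   // [ 390, 500 ]
```

Note that on a canvas the y coordinate grows downward, which is why ‘up’ subtracts the step.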


In this article, we saw how easily we can use the pre-trained models of TensorFlow.js. They are a good starting point for simple applications. We even built one such application, which lets you draw using voice commands. That is pretty cool, and the possibilities are endless. Of course, you can further train these models, get better results, and use them for more complicated solutions; in other words, you can really utilize transfer learning. However, that is a story for another time.

Thank you for reading!

Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic is the author of books: Ultimate Guide to Machine Learning and Deep Learning for Programmers. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups, conferences, and as a guest lecturer at the University of Novi Sad.