This is a guest post by Gilad David Maayan

Creating accurate machine learning models that can identify and localize multiple objects in a single image has long been a core challenge in computer vision. With recent advances in deep learning, however, object recognition applications are easier to develop than before.

In this article, we’ll look at how object recognition works, with a focus on deep learning object recognition for live video streams.

What Is Object Recognition?

Object recognition is a technique for identifying objects in images or videos. When people look at images or watch a video, they can readily identify objects, scenes, and visual details. The goal of object recognition algorithms is to teach a computer to do what comes naturally to humans: understand what an image contains.

Object recognition is a key technology behind driverless cars, enabling them to recognize road signs or to distinguish a pedestrian from a street light. It is also useful in a variety of applications such as medical imaging, industrial inspection, and robotics.

How Object Recognition Works

There are multiple machine learning and deep learning approaches to object recognition. The differences between machine learning and deep learning for object recognition lie in the implementation and execution.

Object Recognition Using Deep Learning

Deep learning models such as convolutional neural networks (CNNs) are used to automatically learn an object's features and identify it. For example, a CNN can learn to tell cats from dogs by analyzing thousands of training images and learning the features that make cats and dogs different.

There are two deep learning techniques that perform object recognition:

  • Training a new model—a process that involves using a large dataset and designing a network architecture that will learn the features and build the model. The results can be impressive, but this approach requires a large amount of training data, and you need to set up the layers and weights in the CNN.
  • Using transfer learning—transfer learning differs from traditional machine learning because it involves using a pre-trained model as a starting point for a secondary task. This method is faster because the model has already been trained on thousands of images.

Deep learning offers a high level of accuracy but requires a large amount of data to make accurate predictions.
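As a rough illustration of the transfer learning approach described above, the sketch below reuses a pre-trained image classifier as a frozen feature extractor and trains only a new output layer. It assumes TensorFlow 2.x with its bundled Keras API. The choice of MobileNetV2, the 224x224 input size, and the function name build_transfer_model are illustrative assumptions, not something this article prescribes. It covers image classification, the simplest case.

```python
import tensorflow as tf


def build_transfer_model(num_classes, weights="imagenet"):
    """Attach a new classification head to a frozen, pre-trained backbone.

    MobileNetV2 and the 224x224 input size are illustrative choices.
    Pass weights=None to build the same architecture without downloading
    pre-trained weights (useful only for checking the wiring).
    """
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights=weights
    )
    base.trainable = False  # freeze the pre-trained feature extractor

    inputs = tf.keras.Input(shape=(224, 224, 3))
    # training=False keeps the backbone's batch-norm layers in inference mode.
    features = base(inputs, training=False)
    pooled = tf.keras.layers.GlobalAveragePooling2D()(features)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(pooled)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling build_transfer_model(num_classes=2) would give a starting point for a cats-versus-dogs classifier in which only the final dense layer is trained, for example with model.fit on your own labeled images. Inputs to MobileNetV2 should first be scaled with tf.keras.applications.mobilenet_v2.preprocess_input.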

Object Recognition Using Machine Learning

Machine learning techniques offer different approaches than deep learning. Common examples of machine learning techniques are:

  • HOG feature extraction with a support vector machine (SVM) model
  • Bag-of-words models with Speeded Up Robust Features (SURF) and Maximally Stable Extremal Regions (MSER) features
  • The Viola-Jones algorithm, which can be used to recognize faces and upper bodies

Challenges of Performing Real-Time Object Detection

Real-time object recognition models should be able to sense the environment, parse the scene, and react accordingly. The model should be able to identify the types of objects in the scene. Once the types of objects have been identified, the model should locate each object by defining a bounding box around it.

There are two stages here: first classifying the objects in the image, and then locating them with bounding boxes (object detection).

Working on a real-time problem raises several challenges:

  • Objects of the same type can vary in shape and can appear under different brightness levels.
  • Deploying object detection models requires a lot of memory and computation power.
  • Detection accuracy has to be balanced against real-time requirements. Meeting the real-time requirements usually costs some accuracy, and vice versa.
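To make the last trade-off concrete, here is a small back-of-the-envelope sketch. It computes how much time a detector has per frame at a given stream frame rate and, if the model is slower than that, how many frames would have to be skipped to keep up. The function names and the example numbers are hypothetical and are not measurements of any particular model.

```python
import math


def frame_budget_ms(stream_fps):
    """Time available per frame, in milliseconds, at the given frame rate."""
    return 1000.0 / stream_fps


def frame_stride(inference_ms, stream_fps):
    """Run detection on every n-th frame so it keeps pace with the stream.

    Returns 1 when the model fits inside the per-frame budget, otherwise
    the smallest stride n for which n frame budgets cover one inference.
    """
    return max(1, math.ceil(inference_ms / frame_budget_ms(stream_fps)))


# Hypothetical numbers: a 25 fps stream leaves 40 ms per frame.
print(frame_budget_ms(25))    # 40.0
print(frame_stride(30, 25))   # 1: a 30 ms model can process every frame
print(frame_stride(100, 25))  # 3: a 100 ms model can process every third frame
```

A slower but more accurate model therefore sees fewer frames, which is one way the accuracy versus speed balance shows up in practice.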

Let’s look at some popular deep learning object detection approaches.

YOLO Object Detection

When it comes to deep learning-based object detection on live video streams, there are three primary families of object detectors you’ll encounter:

  • Variants of R-CNN, including the original R-CNN, Fast R-CNN, and Faster R-CNN
  • Single Shot Detectors (SSDs)
  • YOLO

YOLO is a series of deep learning models designed for fast object detection. It was developed by Joseph Redmon and colleagues and first described in the 2015 paper “You Only Look Once: Unified, Real-Time Object Detection.”

The approach involves a single deep convolutional neural network that splits the input into a grid of cells and each cell directly predicts a bounding box and object classification. The result is a large number of candidate bounding boxes that are consolidated into a final prediction by a post-processing step.
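The two ideas in this paragraph can be sketched in plain Python. The first helper maps a point to its grid cell. The other two implement non-maximum suppression (NMS), the post-processing step commonly used to consolidate overlapping candidate boxes. This is a simplified sketch, not YOLO's actual implementation: the 7x7 default grid follows the original YOLO paper, while the function names, the (x1, y1, x2, y2) box format, and the 0.5 overlap threshold are assumptions made for illustration.

```python
def cell_for_point(cx, cy, img_w, img_h, grid=7):
    """Return the (row, col) of the grid cell that contains point (cx, cy)."""
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the best-scoring box and drop same-label boxes that overlap it.

    detections is a list of (box, score, label) tuples.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [
            d for d in remaining
            if d[2] != best[2] or iou(d[0], best[0]) < iou_threshold
        ]
    return kept


candidates = [
    ((10, 10, 110, 110), 0.9, "dog"),
    ((20, 20, 120, 120), 0.8, "dog"),    # overlaps the first dog box heavily
    ((200, 200, 300, 300), 0.7, "cat"),
]
for box, score, label in non_max_suppression(candidates):
    print(label, score, box)
```

In this example the second "dog" box overlaps the first with an IoU of about 0.68, so it is suppressed and two detections remain.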

There are three main variations of the approach: YOLOv1, YOLOv2, and YOLOv3. The first version proposed the general architecture, the second refined the design and used predefined anchor boxes to improve bounding box proposals, and the third refined the model architecture and training process.

Although the models are close to, but not quite as accurate as, Region-Based Convolutional Neural Networks (R-CNNs), they are popular for their detection speed on real-time video or camera feeds.

TensorFlow Object Detection API

The TensorFlow Object Detection API is an open-source framework built on top of TensorFlow that makes it easy to construct, train, and deploy object detection models using transfer learning. This is useful because building an object detection model from scratch can be difficult, and training it can take a very long time.

The pre-trained models provided with the API were trained on the COCO (Common Objects in Context) dataset, which contains roughly 300k images of commonly found objects.

The API provides five pre-trained models, each with a different execution speed and detection accuracy (measured as mean average precision, mAP, on COCO). See the table below:

Model name                                     Speed    COCO mAP   Outputs
ssd_mobilenet_v1_coco                          fast     21         Boxes
ssd_inception_v2_coco                          fast     24         Boxes
rfcn_resnet101_coco                            medium   30         Boxes
faster_rcnn_resnet101_coco                     medium   32         Boxes
faster_rcnn_inception_resnet_v2_atrous_coco    slow     37         Boxes
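One way to read the table is as a lookup: decide how slow a model you can tolerate, then take the most accurate model within that limit. The sketch below encodes the table's values and does exactly that. The pick_model helper and the ranking of the speed labels are hypothetical conveniences for illustration and are not part of the TensorFlow Object Detection API.

```python
# (model name, speed class, COCO mAP), copied from the table above.
MODELS = [
    ("ssd_mobilenet_v1_coco", "fast", 21),
    ("ssd_inception_v2_coco", "fast", 24),
    ("rfcn_resnet101_coco", "medium", 30),
    ("faster_rcnn_resnet101_coco", "medium", 32),
    ("faster_rcnn_inception_resnet_v2_atrous_coco", "slow", 37),
]

# Assumed ordering of the table's speed labels, fastest first.
SPEED_RANK = {"fast": 0, "medium": 1, "slow": 2}


def pick_model(slowest_acceptable="fast"):
    """Return the highest-mAP model that is no slower than the given class."""
    limit = SPEED_RANK[slowest_acceptable]
    candidates = [m for m in MODELS if SPEED_RANK[m[1]] <= limit]
    return max(candidates, key=lambda m: m[2])[0]


print(pick_model("fast"))    # ssd_inception_v2_coco
print(pick_model("medium"))  # faster_rcnn_resnet101_coco
```

For live video, the "fast" row is usually the practical starting point, with the slower models reserved for cases where accuracy matters more than frame rate.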


There are several object recognition architectures for live video streaming. In this article we covered the YOLO model and the TensorFlow Object Detection API, which lets you create or use an object detection model by building on pre-trained models and transfer learning.

Gilad David Maayan is a technology writer who has worked with over 150 technology companies including SAP, Samsung NEXT, NetApp and Imperva, producing technical and thought leadership content that elucidates technical solutions for developers and IT leadership.