Recognition of Various Objects from a Certain Categorical Set in Real Time Using Deep Convolutional Neural Networks

Alexander Driaba (Bachelor's student), Volgograd, Russia, casha.dryaba@mail.ru
Aleksei Gordeev (Supervisor), Volgograd, Russia, alexurgor2008@gmail.com
Vladimir Klyachin (Ph.D. in Physical and Mathematical Sciences), Volgograd, Russia, klyachin.va@volsu.ru

Institute of Mathematics and Informational Technologies, Volgograd State University

Abstract

When creating a mobile autonomous robot, it became necessary to solve the problem of recognizing a certain categorical set of objects on a Raspberry Pi 3 Model B board with an Intel Neural Compute Stick 2. This article discusses one approach to solving this problem, using a neural network based on the MobileNet-SSD architecture. The problem is solvable, provided that the necessary equipment is available.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: S. Hölldobler, A. Malikov (eds.): Proceedings of the YSIP-3 Workshop, Stavropol and Arkhyz, Russian Federation, 17-09-2019–20-09-2019, published at http://ceur-ws.org

1 Introduction

The main goal of the project is the implementation of a computer vision system in an autonomous mobile robot equipped with a camera. While moving, the robot must be able to recognize people and other environmental objects in real time. As an approach to this problem we chose deep neural networks, one of the most popular methods for solving recognition tasks. A similar project is the self-driving robot Ben, developed by Intel, whose goal was to reduce the number of road accidents by creating a plausible simulation of traffic. In our work, we first of all consider the problem of detecting various dangerous objects, such as explosives or gaps in the road.

2 Deep convolutional neural network

The deep neural network we used for our detection and recognition task is MobileNet-SSD, for which pre-trained weights can be retrieved from the Internet.

2.1 SSD

SSD (Single Shot MultiBox Detector) is a framework whose purpose is to perform localization (tracking, bounding boxes) and classification at once. The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections [10]. "Single Shot" means that both localization and classification take place in a single pass during recognition: the network simply "looks" at the image once. SSD starts from a core network. This network is pre-trained on a large dataset, such as ImageNet, which allows it to learn a rich set of features. The core network is used for transfer learning: the input image is propagated up to a predetermined layer to obtain a feature map, and this map is then passed forward to the object detection layers.
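To make the non-maximum suppression step mentioned above concrete, the following is a minimal sketch in Python; the function name nms and the 0.5 overlap threshold are our own illustrative choices, not part of any reference implementation.

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy non-maximum suppression.

        boxes  -- array of shape (N, 4) with [x1, y1, x2, y2] corners
        scores -- array of shape (N,) with detection confidences
        Returns the indices of the boxes that survive suppression.
        """
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]          # highest score first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # intersection of the kept box with all remaining boxes
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            # drop every box that overlaps the kept one too much
            order = order[1:][iou <= iou_threshold]
        return keep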
2.2 MultiBox

The term MultiBox means that SSD can recognize objects of different classes even if their bounding boxes overlap. This is possible thanks to the system of priors. Priors are fixed-size bounding boxes whose sizes are pre-calculated based on the sizes and positions of the ground-truth bounding boxes. The priors are selected so that their Intersection over Union (IoU) with the ground-truth boxes is greater than 50%. Using fixed priors removes the need to train a separate MultiBox predictor. In other words, as the image progresses through the layers of the convolutional network it is divided into cells, and for each cell several standard bounding boxes of different aspect ratios are tested. Figure 1 shows an example of the generated feature maps with different cell sizes; the objects successfully recognized within the fixed frames are highlighted in color. An object that cannot be recognized at a given cell size may still be found with a smaller number of partitions.

Figure 1: Feature maps example

Just as in Faster R-CNN, bounding box offsets are predicted. For each bounding box, the probabilities of all class labels within the region are also computed. Calculating the probabilities over all the bounding boxes and a wide range of classes allows us to detect potentially overlapping objects. When training SSD, we use the loss function of the MultiBox algorithm, which combines a categorical cross-entropy loss for classification with a smooth L1 loss for localization. The SSD framework also includes the concept of hard-negative mining to increase training accuracy: during training, cells that have a low IoU with ground-truth objects are treated as negative examples.
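The matching of priors to ground-truth boxes described above can be sketched as follows. This is a deliberately simplified illustration of the idea, under our own naming (iou, match_priors, with the 0.5 threshold taken from the 50% IoU criterion above); the actual SSD matching strategy is more elaborate, so this should not be read as the framework's exact training code.

    import numpy as np

    def iou(box_a, box_b):
        """Intersection over Union of two [x1, y1, x2, y2] boxes."""
        x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def match_priors(priors, ground_truth, threshold=0.5):
        """Label each prior with the index of a matched object, or -1."""
        labels = np.full(len(priors), -1)
        if len(ground_truth) == 0:
            return labels
        for i, prior in enumerate(priors):
            overlaps = [iou(prior, gt) for gt in ground_truth]
            best = int(np.argmax(overlaps))
            if overlaps[best] > threshold:
                labels[i] = best      # positive example for this object
            # priors with low IoU stay -1: hard-negative candidates
        return labels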
2.3 MobileNet

As the base network we used MobileNet. We could have chosen a VGG or ResNet architecture; however, we settled on MobileNet as the fastest option, albeit at some cost in accuracy. MobileNetV1 is an architecture from Google known primarily for its smaller set of parameters and lower network complexity, achieved through fewer addition and multiplication operations. MobileNet is a convolutional neural network architecture intended for devices with limited computing power [8].

The overall architecture of the neural network is presented in Figure 2. We use MobileNet up to the conv6 layer, and then separate all the other convolution layers. Each feature map is connected to the final recognition layer, which allows the detection and localization of objects of different scales.

Figure 2: MobileNet-SSD Architecture

3 Preparing the dataset

In our work, the goal was to recognize the boundaries and labels of the following categories of objects:

• aeroplane, bicycle, boat, bus, car, motorbike, train
• bird, cat, cow, dog, horse, sheep
• bottle, chair, diningtable, pottedplant, sofa, tvmonitor
• person, background, face

The dataset we used was collected by us and annotated using the dlib library. Images were tagged with the imglab tool, which is included in this library. One of the advantages of this tool is the ability to mark some regions as "ignored". If we used dlib to train the detector, dlib would exclude such regions from the training set. Examples of such regions are objects that are too small or objects that cannot be separated from each other; in such cases, our model would not be able to isolate patterns and, accordingly, learn.

4 Training

To train the model on our dataset, we use the TensorFlow Object Detection API (TFOD API). Since the TFOD API does not know how to separate the "ignored" regions from the usual ones, we converted the training and testing files to a format the API understands, removing all ignored regions (a sketch of this conversion is given at the end of this section). As a result, we obtained several .record files and a file containing all our categories. Besides the dataset, we also needed a configuration file, which we took from the TensorFlow Detection Model Zoo, specifically for the architecture "ssd_mobilenet_v1_coco.config" [11]. After that, we ran the training procedure for our model; preliminary results can be seen in Figure 3. The left graph shows the change in the loss function, while the right graph shows the precision of our model.

Figure 3: Loss and precision
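A minimal sketch of this conversion step is shown below. It assumes the standard TFOD API feature keys and the TensorFlow 1.x TFRecord writer; the helper names and the pixel-to-normalized-coordinate handling are our own illustrative choices, and the filtering of "ignored" regions is assumed to have already happened before this function is called.

    import tensorflow as tf

    def _floats(xs): return tf.train.Feature(float_list=tf.train.FloatList(value=xs))
    def _bytes(xs):  return tf.train.Feature(bytes_list=tf.train.BytesList(value=xs))
    def _ints(xs):   return tf.train.Feature(int64_list=tf.train.Int64List(value=xs))

    def make_example(jpeg_bytes, width, height, boxes, labels, names):
        """Build one tf.train.Example in the TFOD API format.

        boxes  -- list of (xmin, ymin, xmax, ymax) in pixels, with
                  "ignored" regions already filtered out by the caller
        labels -- integer class ids; names -- byte strings of class names
        """
        feature = {
            'image/encoded':            _bytes([jpeg_bytes]),
            'image/format':             _bytes([b'jpeg']),
            'image/height':             _ints([height]),
            'image/width':              _ints([width]),
            # TFOD expects box corners normalized to [0, 1]
            'image/object/bbox/xmin':   _floats([b[0] / width  for b in boxes]),
            'image/object/bbox/ymin':   _floats([b[1] / height for b in boxes]),
            'image/object/bbox/xmax':   _floats([b[2] / width  for b in boxes]),
            'image/object/bbox/ymax':   _floats([b[3] / height for b in boxes]),
            'image/object/class/label': _ints(labels),
            'image/object/class/text':  _bytes(names),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))

    # writer = tf.python_io.TFRecordWriter('train.record')   # TF 1.x API
    # writer.write(make_example(...).SerializeToString())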
5 SSD issues and limits

It is worth noting that while using the SSD framework we encountered two main problems. The first stems from the architecture itself: SSD handles small objects poorly. This can be mitigated by increasing the resolution of the input images; however, this dramatically reduces the speed of the network. The second problem can arise with similar objects, such as tables and chairs in the background: such objects can confuse the TFOD API because of hard-negative mining, in which cells containing these objects can be marked as negative examples. This is especially aggravated by the "ignored" regions that we discarded at the dataset preparation stage.

6 Equipment

Due to the small size of the robot, the equipment was selected according to two criteria:
- it should be small;
- it should consume as little energy as possible.

Both criteria are satisfied by the Raspberry Pi 3 Model B board we chose. By itself, the Raspberry Pi 3 Model B has a 64-bit quad-core ARM processor with a clock frequency of 1.2 GHz [1]. However, the board alone is not enough: if you try to deploy the network on the bare board, the frame rate is insufficient for the robot to recognize objects in real time (less than 1 FPS). The lack of performance forces us to add an Intel Neural Compute Stick 2 to the board. The Intel Neural Compute Stick is popularly called a "neural stick". Using this device significantly improves the performance of the board and makes it possible to analyze images coming from the camera in real time. The device is based on the Movidius Myriad X Vision Processing Unit (VPU), a specialized chip containing 16 general-purpose cores and components that accelerate image recognition by neural networks [2]. The choice fell on the Myriad chip (used in the stick) primarily because of its relatively low price and, as already mentioned, low power consumption. Images are captured with a Logitech C310 USB camera. Figure 4 shows our equipment.

Figure 4: Equipment used: (a) Raspberry Pi 3 Model B, (b) Neural Compute Stick

7 Software

To work with the neural stick, we must install and configure the OpenVINO toolkit. The Raspberry Pi 3 Model B runs the Raspbian operating system. To ensure the interaction between the Raspberry Pi and the Intel Neural Compute Stick 2, we use the OpenVINO toolkit, which provides an API for interacting with the VPU [3].

8 OpenVINO toolkit

The OpenVINO toolkit takes pre-trained models from various deep learning frameworks and optimizes them for specific Intel hardware (in our case, the VPU). To use a pre-trained TensorFlow or Caffe model, it must be converted from the original deep learning framework format into the special OpenVINO intermediate representation, which consists of a .bin and an .xml file:
- the .bin file stores the frozen weights of all layers of the network;
- the .xml file stores a serialized graph reflecting the structure of the model, which determines exactly how the weights interact with each other [5].

The Model Optimizer takes the finished model and makes the resulting network shallower, discarding redundant layers while maintaining comparable accuracy. This optimized model is then loaded into the Inference Engine, which has all the necessary tools and optimizations for execution on various types of devices. The Inference Engine is a framework with a set of classes and API functions used to run inference with neural networks. It provides an API for reading the intermediate representation, setting input and output formats, and executing the optimized model on specific devices, such as FP32 CPUs, Intel HD Graphics, the Myriad VPU, etc. [6].

9 Application structure

The inference cycle is shown in Figure 5. The video signal comes from the camera and is split into frames. Each frame is pre-processed using the OpenCV library before it reaches the plug-in. The Inference Engine plug-in allows us to run the computations on the Myriad chip. After the neural network has processed the frame (detection, presence or absence of objects, tracking with the appropriate bounding box and label), the results are used to overlay the bounding boxes on the original image, which is then displayed on the screen. The whole cycle of the application's work can be called an "inference loop"; a minimal sketch of such a loop is given below.

Figure 5: The general scheme of the application
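The following is a minimal sketch of such an inference loop in Python, using OpenCV's DNN module with the Inference Engine backend. It assumes the model has already been converted to the .xml/.bin intermediate representation by the Model Optimizer; the file names and the 0.5 confidence threshold are our own illustrative choices.

    import cv2

    # Load the OpenVINO intermediate representation (.xml + .bin)
    net = cv2.dnn.readNet('mobilenet-ssd.xml', 'mobilenet-ssd.bin')
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)   # run on the neural stick

    cap = cv2.VideoCapture(0)                            # USB camera
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MobileNet-SSD expects a 300x300 input blob
        blob = cv2.dnn.blobFromImage(frame, size=(300, 300))
        net.setInput(blob)
        detections = net.forward()                       # shape: [1, 1, N, 7]
        h, w = frame.shape[:2]
        for det in detections[0, 0]:
            confidence = float(det[2])
            if confidence > 0.5:
                # det[3:7] holds normalized [x1, y1, x2, y2]
                x1, y1, x2, y2 = (det[3:7] * [w, h, w, h]).astype(int)
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.imshow('detections', frame)
        if cv2.waitKey(1) == 27:                         # Esc to quit
            break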
10 Results

In our experiments, the bottle was recognized best when it was located close to the camera; the chair at times disappeared from the recognized objects, while the sofa was detected the whole time (even though it was only partially in the frame); the monitor was recognized best of all, with no problems. At a resolution of 320x240 the board produces a stable 30 FPS. Increasing the resolution to 1280x720 (HD) drops performance to 9 FPS. For the normal functioning of the robot, with real-time image acquisition at a good resolution, one INCS2 stick is not enough; it is best to use 2 to 4 sticks. Only then is it possible to achieve acceptable real-time recognition speed [9].

Figures 6-8 show the results of the network: it recognizes a plastic bottle, a tablet, and a sofa and chair.

Figure 6: Bottle
Figure 7: Tablet
Figure 8: Chair and sofa

11 Discussion

From the results, we found that our object detection and recognition system, using MobileNet-SSD on a Raspberry Pi with an INCS 2, allows us to recognize the most standard and typical objects with good precision. If we want to raise the processing speed, we need to use at least 4 sticks, which we are going to work on in the near future.

References

[1] Raspberry Pi 3: Specs, benchmarks & testing. URL: https://www.raspberrypi.org/magpi/raspberry-pi-3-specs-benchmarks/
[2] Jean-Luc Aufranc. Intel Neural Compute Stick 2 with Myriad X VPU Finally Announced. URL: https://www.cnx-software.com/2018/11/14/intel-neural-compute-stick-2-myriad-x-vpu/
[3] Neal Smith. Get Started with Neural Compute Stick. URL: https://software.intel.com/ru-ru/articles/get-started-with-neural-compute-stick
[4] Introduction to Intel Deep Learning Deployment Toolkit. URL: http://docs.openvinotoolkit.org/latest/_docs_IE_DG_Introduction.html
[5] Model Optimizer Developer Guide. URL: https://docs.openvinotoolkit.org/latest/_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html
[6] Inference Engine Developer Guide. URL: https://docs.openvinotoolkit.org/latest/_docs_IE_DG_Deep_Learning_Inference_Engine_DevGuide.html
[7] Wei Liu et al. SSD: Single Shot MultiBox Detector. URL: https://arxiv.org/abs/1512.02325
[8] Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. URL: https://arxiv.org/abs/1704.04861
[9] Katsuya Hyodo. MobileNet-SSD-RealSense Library. URL: https://qiita.com/PINTO/items/94d5557fca9911cc892d#24-fps-boost-raspberrypi3-with-four-neur
[10] Adrian Rosebrock. Non-Maximum Suppression for Object Detection in Python. URL: https://www.pyimagesearch.com/2014/11/17/non-maximum-suppression-object-detection-pyth
[11] Tensorflow detection model zoo. URL: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md