Models and means of object recognition using artificial neural
networks
Vasyl Teslyuk1, Bohdan Borkivskyi1, Hamza Ali Alshawabkeh2
1
    Lviv Polytechnic National University, Lviv, Ukraine
2
    Al-Baha University, Kingdom of Saudi Arabia


                 Abstract
                 The purpose of this work is to present a generalized approach for object detection using
                 neural networks in various environments.
                 The novelty of the research is an offered combination of the machine learning method –
                 neural network for object detection and the classical method – custom programmed algorithm
                 for selecting objects of our interest. The effectiveness of the approach is achieved by
                 combining the two methods and using their strengths – ability of neural networks to learn
                 from images dataset and work with new images and ability to deliver unique business value
                 using custom programmed algorithms. This paper describes solving this problem, finding the
                 necessary methods and algorithms.

                 Keywords 1
                 Neural network, convolutional network, object detection.

1. Introduction
   Today the task of object recognition can not be considered a perfectly accomplished task. Different
approaches try to optimize specific aspects, such as the quality or speed of recognition, but in the
process suffer from other properties, such as the ability of algorithms to generalize. Examples of such
products are the system of recording intruders, which finds cars very well, but can not perform any
other functions; or a system that recognizes hundreds of different objects but is slow. Therefore,
finding a software solution that simultaneously tries to optimize all aspects is an urgent task.
   Object recognition systems are very useful in various fields, such as: security systems, violation
detection systems, social robotic systems. In order to improve the quality of object recognition,
systems are designed to perform a specific task that involves the use of a small amount of data. In this
case, to solve another task, it is necessary to design a separate system. That is, this approach is not
universal and cannot be used in the case of a large amount of input data.
   An example of a task that requires generalized action is a robotic system to help visually impaired
people find things. If such a system can find only a few objects, its practical value will be low. But if
such a system can find many commonly used things, and can work both indoors and outdoors, it will
be much better able to help people.

MoMLeT+DS-2022: 4th International Workshop on Modern Machine Learning Technologies and Data Science, November 25-26, 2022,
1

Leiden-Lviv, The Netherlands-Ukraine
EMAIL: vasyl.m.teslyuk@lpnu.ua (V. Teslyuk); bohdan.p.borkivskyi@lpnu.ua (B. Borkivskyi); Halshwabkah@bu.edu.sa (Hamza A.
Alshawabkeh)
ORCID: 0000-0002-5974-9310 (V. Teslyuk), 0000-0003-3301-476X (B. Borkivskyi), 0000-0003-3859-8055 (Hamza Ali Alshawabkeh)
              ©️ 2020 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)
   But there is problem in using systems, that know how to work with a lot of different objects. Under
some circumstances we need to account only for specific set of objects. For example, when we
process footage from road camera – it can be used in system to keep track of people on crossroads,
and we should account for pedestrians and cyclists, or it can be used to look for vehicles and we need
to search cars and buses only. That is why developing an algorithm that can process the same
information in different ways is a relevant task.


2. Model selection
   When developing a computer vision system, great attention should be paid to the choice of the
model that will be the basis of the algorithm, because the quality and speed of the system depends on
the characteristics of the selected model.
   Developers in the field of artificial intelligence and computer vision are constantly improving
existing and trying to invent new approaches [1], and there are several that have already become
widely used around the world, because they have proven their effectiveness in applied tasks.
However, each of the algorithms has both strengths and weaknesses that need to be investigated to
determine which of the algorithms is best suited to solve our problem.
   The strength of the Faster R-CNN model [2] is the accuracy of object recognition, since the
architectural feature of this model is the presence of two subnets [3, 4] at once (Figure 1). One
performs the usual task of using a convolutional network to extract a feature map from the input data.
The role of the second subnet is the region proposal network (Region Proposal Network), which,
instead of pre-defined regions, searches for suitable regions on its own, which, in combination with a
large training sample, gives a high result on test images.


                              Figure 1: Faster R-CNN model structure

   But these architectural features are also weakness of the network. Since two sub-models are
involved in the operation of the model, this reduces the speed of the entire model and, accordingly,
the processing time of each sample.

  YOLO is another network that is often used to solve object search problems. The strength of this
model [5] is reflected in the name - "you only look once", that is, it does not have the weakness of
Faster R-CNN. The model is built mainly using convolutional layers, with the exception of a few FC
layers (Figure 2).


                                   Figure 2: YOLO model architecture

   Since the image passes through the network layers only once, it reduces the running time of the
model and increases performance. But this model is not without its drawbacks - due to the peculiarity
of the model's operation and the organization of its layers, in particular the last layer with a size of
7x7 and the possibility of building only 2 regions for each location - the total number of proposed
regions is only 98. Due to these spatial limitations of the model, it is difficult to find small objects , or
objects that appear in groups.

   SSD [6] has a somewhat similar methodology to YOLO, which is also reflected in its name -
"single frame detector". The network consists of convolutional layers (Figure 3), divided into two
parts. The first part, which is based on the VGG-16 convolutional neural network [7], is a typical
classification model, and is used to obtain a feature map, after which, using additional convolutional
layers of different sizes, additional feature maps of a smaller size are obtained.


                                    Figure 3: SSD model architecture

   Due to the peculiarity of the organization of layers in this network, at the output we get 8732
regions (against 98 in YOLO), which significantly increases the number of potential candidates for
recognition.

    For a better understanding of the effectiveness of each model, let's get acquainted with the results
[8] of testing the models (Table 1) on the same data.

Table 1: Models comparison

              Model                             Speed (fps)                       Accuracy (mAP)
                SSD                                 59                                 74.3
           Faster R-CNN                              7                                 73.2
               YOLO                                 45                                 63.4
  Taking into account the reviewed features of the models, their advantages and disadvantages,
which are confirmed by the results of comparative testing, the best variant of the model is SSD.

3. Dataset selection
   When implementing a system using artificial intelligence methods of the "learning with a teacher"
type, it is important to collect a high-quality array of data, which will later be used to train the model.
The main criterion when choosing a data set is its diversity, both in terms of object type and the
number of unique instances. Since there will be no practical use in systems for recognizing several
objects, the number of classes should start from several tens of units.

   The VOC dataset was introduced in 2005 and was used in the PASCAL VOC Challenge from
2005 to 2012 [9]. During this time, the variability of the data increased from 4 to 20 classes, which are
divided into 4 superclasses. In total, the data contains 10,000 images with 24,000 objects on them.
   A significant drawback of this data set when creating a generalized object recognition system is
the presence of only six classes for marking household items.

    CIFAR-100, as well as its simplified version CIFAR-10, are datasets presented by researchers at
the Department of Computer Science at the University of Toronto [10]. This dataset has 100 classes
divided into 20 larger superclasses. Each class contains 500 images for training and 100 images for
testing the model, giving a total of 60,000 images.

   ImageNet is a set of images that is ordered according to the WordNet hierarchy, in which
thousands of images correspond to each link of the system [11]. Two research needs in the field of
computer vision inspired the creation of this dataset. The first is the growing demand for highly
accurate metrics for evaluating object classification systems, and the second is the critical need for
large volumes of data to create more generalizable machine learning methods.
   The most widely used subset of this set is ILSVRC, which was created during 2012-2017 and
contains 1.2 million images, divided into a thousand classes, with a size of 166Gb. Such a volume of
data is very valuable for powerful laboratories, and at the same time is not appropriate for creating
compact computer vision systems [12, 13], since training models outside of laboratory conditions
becomes a very long process.

   COCO is a large-scale data set for solving the problems of object recognition and segmentation,
image caption generation, which was created with the support of such IT companies as Microsoft and
Facebook [14].
   The updated version of this set, released in 2017, contains 120 thousand images, which are divided
into 80 classes. Unlike the previous sets, this dataset maintains a balance between different types of
classes, and around 30 classes are available to label household objects, which allows for a well-
generalized recognition system.


4. Results processing
    A typical object detection pipeline has one component for creating proposals for classification.
Proposals are nothing more than candidate regions for the object we are interested in. Most
approaches use a sliding window over the feature map and assign confidence scores depending on the
features computed in that window. Nearby windows have somewhat similar scores and are considered
candidate regions. This leads to hundreds of offers. Since the proposal generation method must have
high completeness (recall), we keep loose constraints at this stage. However, processing these
multiple propositions through a classification network is cumbersome. This leads to a technique that
filters offers based on some criteria called Non-maximum Suppression [15].
    The main steps of the algorithm (Figure 4):
   Step 1. As input, the algorithm accepts an array B with a list of proposals, corresponding
confidence estimates S, and a certain threshold value N.
   Step 2. We choose the proposal with the highest confidence value and transfer it from the input
array to the output array D.
Figure 4: NMS algorithm scheme
   Step 3. We compare this entry with the rest of the entries of the input array by calculating the value
of the IoU (Intersection over Union) metric. If the received value is greater than the specified
threshold value (elements are very similar) - we delete this offer.
   Step 4. Repeat steps 2-3 until the input array becomes empty.
   Step 5. We return the original array D with unique offers of regions.

    To ensure the generalized operation of the system, a model capable of recognizing different types
of objects is used. But in some situations, it is not advisable to search for all possible objects: a car
that can be seen far away in the window of the room, or household appliances when the automated
system moves around the room.
    To ensure the search of only relevant objects, a filtering algorithm was developed according to the
mode (Figure 5), which compares information about each separately taken classified object, and filters
it according to information about the classes of the specified mode.
    At the core of this filtering algorithm is creation of proper configuration, based on what type of
detection is performed, e.g. for keeping track of pedestrians relevant classes would be person, cyclist,
etc, for keeping track of transport relevant classes would be car, bus, van, scooter, etc. The input of
the algorithm is list of previously found objects, and based on operating mode of the system, only
corresponding objects from configuration are considered.


                  Figure 5: Object detection result with filtering (“road” mode applied)


   As can be seen from Figure 6, although image contains variety of different objects (cars,
pedestrians) and model can identify all of them, based on “road” mode setup (blue color) only
vehicles are displayed as recognized objects. In this example “road” mode stands for vehicles.
                            Figure 6: Object detection result with no filtering


5. Experiments
    Testing took place on computers of various capacities. The weakest computer on which the
operating system was tested is the Raspberry Pi 4 Model B [16], according to the minimum hardware
requirements formed based on the characteristics of this device (Table 2). Although a device with 8Gb
of RAM was used during system testing, the program files themselves and additional configuration
files are small in size, so a model with 2Gb of RAM will be enough for work.

Table 2: Minimum hardware requirements

            Parameter                                                Value
               CPU                                  Quad core Cortex-A72 64-bit 1.5GHz
              RAM                                         2GB LPDDR4-3200 SDRAM
           Storage size                                               8GB
           Camera port                          Availability of one USB/Micro USB/CSI-2 port

   A short video of the operation of the robotic system for finding household items was recorded to
check the functionality of the system. During the operation of the system, various actions were
performed on the user interface - switching the available operating modes, turning on and off the
debugging mode.
   During the testing, no problems were found in the operation of the system, no errors in the
recognition of objects or the application of the selected modes.

   After checking the correct operation of the system, experiments were carried out on several
devices of different power in order to find out the speed of the system in conditions of different
configuration of devices.
   The first experiment was conducted on a Dell Latitude E6330 computer, the characteristics of
which are presented in Table 3. Thanks to the combination of these characteristics, this model copes
well with the role of an office computer and can be used for solving problems using finding objects in
offices.
Table 3: Dell computer specifications

             Parameter                                                  Value
               Name                                              Dell Latitude E6330
                CPU                                 Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
           System bit rate                                              64 bit
                RAM                                               16Gb DDR3-1600
            Storage size                                                128GB

   The second experiment was conducted using a portable Raspberry Pi 4 Model B computer, the
characteristics of which are presented in Table 4:
Table 4: Raspberry Pi computer specifications

              Parameter                                              Value
                Name                                         Raspberry Pi 4 Model B
                 CPU                                          Broadcom BCM2711,
                                                  Quad core Cortex-A72 (ARM v8) SoC @ 1.5GHz
           System bit rate                                           64 bit
                RAM                                        8GB LPDDR4-3200 SDRAM
            Storage size                                              8GB

   The third experiment was conducted using a more powerful computer to reveal the potential
capabilities of the system. The experiment used a modern Lenovo ThinkPad E14 Gen 2 laptop, the
characteristics of which are presented in Table 5:

Table 5: Lenovo computer specifications

             Parameter                                                 Value
               Name                                         Lenovo ThinkPad E14 Gen 2
                CPU                              11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
           System bit rate                                             64 bit
                RAM                                              32Gb DDR4-3200
            Storage size                                              512GB

   The experiments consisted in determining the speed of the system on different devices by
calculating the time spent on processing one frame. Having obtained the average value of the time
spent, we obtained the average number of processed frames per second. After conducting several
experiments on both devices, the performance results presented in Table 6 were obtained:

Table 6: Experiments results

              Device                            Processing time (s)               Speed (fps)
                Dell                                   0.09                          11.1
            Raspberry Pi                               0.34                          2.94
              Lenovo                                   0.05                           20

    From the obtained results, it follows that none of the devices can be used in real-time systems with
a minimum average fps of 30 frames per second.
    However, a modern Lenovo computer with an indicator of 20fps can be used to solve most tasks,
not only recognizing objects in a confined space, but also recognizing objects that move at high speed,
in particular, streams of cars.
    Although the Dell computer loses almost 2 times in terms of speed, the result of 10fps is also good,
and the system on such a device can also be used to solve many tasks that involve tracking not such
fast objects.
    According to the results of the experiments, the Raspberry Pi computer showed the worst results,
which corresponds to its technical characteristics. The image processing speed of 3fps does not allow
using this device to recognize objects that move even at an average speed. But the device can be used
to implement the object search technology for a robotic system, since mostly such systems move at a
low speed, which is sufficient for high-quality image processing with a long processing time.


6. Conclusions
    As a result of the work, the existing software solutions for solving the problem of finding the
object in the image were investigated, available approaches to the design and construction of neural
networks for object recognition were analyzed. A comparative characterization of known models for
object recognition was carried out, namely SSD, YOLO and Faster R-CNN, and experiments were
conducted to determine the qualitative characteristics of the proposed models. As a result, the choice
was made in favor of the SSD model.
    A comparative characterization of different datasets to use in supervised learning was conducted.
COCO dataset was selected as the main one.
    An algorithm for filtering model results according to the selected operating mode was developed
and a program with a graphical interface was developed to demonstrate its work.
    Combination of fast SSD model, extensive COCO dataset with 80 different object classes, and
filtering algorithm produces efficient object detection system, that allows processing information as
fast as 20 frames per second and find only relevant objects in different environments based on the
working setup.
    The developed system can be used in different situations. Example of case where this technology
can be useful is surveillance system. Based on what we need surveillance for, we can configure
corresponding mode: detecting people in office or detecting vehicles in logistic company.
    Another example of a situation where a solution could be useful is assistive robots. If they work
outdoors they can help navigate person across city. If they work indoors, the same system can
navigate person across building and help locating relevant things as well.


7. References
 [1] Jiuxiang Gu et al. Recent Advances in Convolutional Neural Networks. 2015. arXiv:
     1512.07108 [cs.CV].
 [2] Shaoqing Ren and Kaiming He and Ross Girshick and Jian Sun. Faster R-CNN: Towards Real-
     Time Object Detection with Region Proposal Networks. 2016. arXiv: 1506.01497 [cs.CV]
 [3] Ross Girshick. Fast R-CNN. 2015. arXiv: 1504.08083 [cs.CV]
 [4] Teslyuk V., Kazarian A., Kryvinska N., Tsmots I. Optimal artificial neural network type
     selection method for usage in smart house systems. Sensors, 2021, 21(1), pp. 1–14, 47
 [5] Joseph Redmon and Santosh Divvala and Ross Girshick and Ali Farhadi. You Only Look Once:
     Unified, Real-Time Object Detection. 2016. arXiv: 1506.02640 [cs.CV]
 [6] Liu, Wei and Anguelov, Dragomir and Erhan, Dumitru and Szegedy, Christian and Reed, Scott
     and Fu, Cheng-Yang and Berg, Alexander C. SSD: Single Shot MultiBox Detector. 2016. arXiv:
     1512.02325 [cs.CV]
 [7] Hussam Qassim and David Feinzimer and Abhishek Verma. Residual Squeeze VGG16. 2017.
     arXiv: 1705.03004 [cs.CV]
 [8] Sik-Ho Tsang. SSD — Single Shot Detector (Object Detection). 2018. URL:
     https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11
[9] Everingham, M. and Eslami, S. M. A. and Van~Gool, L. and Williams, C. K. I. and Winn, J.
    and Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. International
    Journal of Computer Vision (Jan 2015) p. 98-136.
[10] Alex      Krizhevsky.      CIFAR-10      and    CIFAR-100     datasets.   [Web    resource]:
    http://www.cs.toronto.edu/~kriz/cifar.html
[11] Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and
    Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein
    and Alexander C. Berg and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.
    2015. arXiv: 1409.0575 [cs.CV]
[12] Hrytsyk V., Nazarkevych M. Real-Time Sensing, Reasoning and Adaptation for Computer
    Vision Systems. Lecture Notes on Data Engineering and Communications Technologies, 2022,
    77, pp. 573–585.
[13] Berezsky O., Zarichnyi M., Pitsun O. Development of a metric and the methods for
    quantitative estimation of the segmentation of biomedical images. Eastern-European Journal of
    Enterprise Technologies, 2017, 6(4-90), pp. 4–11.
[14] Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross
    Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and
    Piotr Dollár. Microsoft COCO: Common Objects in Context. 2015. arXiv: 1405.0312 [cs.CV]
[15] Zekun Luo, Zheng Fang, Sixiao Zheng, Yabiao Wang, Yanwei Fu. NMS-Loss: Learning with
    Non-Maximum Suppression for Crowded Pedestrian Detection. 2021. arXiv: 2106.02426
    [cs.CV]
[16] Dayal A, Paluru N, Cenkeramaddi LR, J. S, Yalavarthy PK. Design and Implementation of
    Deep Learning Based Contactless Authentication System Using Hand Gestures. Electronics.
    2021; 10(2):182. https://doi.org/10.3390/electronics10020182.