Using a high-performance deep learning platform to accelerate
object detection

                S O Stepanenko1, P Y Yakimov1,2


                1 Samara National Research University, Moskovskoe Shosse 34A, Samara, Russia, 443086
                2 Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and
                Photonics" RAS, Molodogvardejskaya street 151, Samara, Russia, 443001


                e-mail: serega.stepanenko.97@gmail.com


                Abstract. Object classification using neural networks is a highly relevant task today. YOLO is
                one of the most widely used frameworks for object classification. It provides high accuracy, but
                its processing speed is not high enough, especially on computers with limited performance.
                This article investigates the use of the NVIDIA TensorRT framework to optimize YOLO with the
                aim of increasing image processing speed. While preserving the efficiency and quality of the
                neural network, TensorRT increases the processing speed by optimizing the network architecture
                and the calculations on a GPU.


1. Introduction
Object detection is becoming more and more popular [1]. It has become possible thanks to the development
of new powerful computational devices and the use of neural networks, which can find objects in an image
with high accuracy. Creating a system based on an artificial neural network is not difficult, because a
large number of frameworks simplify the task and reduce network development to function calls. The
object detection problem requires high computational power, and real tasks such as video stream
processing require powerful equipment [2]. For example, YOLO runs at about 40 FPS on an NVIDIA GTX
Titan X [3], SSD runs at 19 FPS on an NVIDIA GTX Titan X [4], Faster R-CNN runs at 5 FPS on a Tesla K40
[5], and Fast R-CNN runs at 0.5 FPS [3]. All of these algorithms except YOLO run below the frame rate of
a common camera.
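
The comparison with a camera frame rate can be made concrete by converting FPS into a time budget per
frame. The Python sketch below uses the FPS values quoted above; the camera rate of 30 FPS is an
assumption made for illustration and is not taken from the cited works.

```python
# FPS figures quoted in the text above.
reported_fps = {
    "YOLO (GTX Titan X)": 40.0,
    "SSD (GTX Titan X)": 19.0,
    "Faster R-CNN (Tesla K40)": 5.0,
    "Fast R-CNN": 0.5,
}

# Assumption: a "common" camera delivers 30 frames per second.
CAMERA_FPS = 30.0


def frame_time_ms(fps: float) -> float:
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps


def keeps_up(fps: float, camera_fps: float = CAMERA_FPS) -> bool:
    """True if the detector processes frames at least as fast as they arrive."""
    return fps >= camera_fps


# Detectors that fit into the per-frame budget of the assumed camera.
real_time = [name for name, fps in reported_fps.items() if keeps_up(fps)]

for name, fps in reported_fps.items():
    print(f"{name}: {frame_time_ms(fps):.1f} ms per frame")
print("Keeps up with the camera:", real_time)
```

Under this assumption only YOLO stays within the budget of roughly 33 ms per frame.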
    Nowadays there are many solutions for object detection [6]. They use different detection algorithms
and differ in accuracy and processing speed [7]. Most existing solutions use CUDA [8] to process data in
parallel. CUDA increases the processing speed, but there are other ways to increase it as well.
Optimizing the neural network architecture can make processing faster while keeping the accuracy at the
same level. However, this is not always easy, especially if the network has a very complex architecture.
There is a way to increase the processing speed of a neural network without spending much time on
changing the program.


                    V International Conference on "Information Technology and Nanotechnology" (ITNT-2019)
Data Science
S O Stepanenko, P Y Yakimov


There is a platform that increases the processing speed of a neural network by using algorithms that
optimize its architecture and by using the capabilities of NVIDIA GPUs to accelerate calculations. This
platform is called TensorRT [9]. TensorRT provides an API for creating neural networks and can also
optimize models from many popular frameworks. This makes it convenient in many cases, because a program
can be accelerated without spending many resources on changing the code.

2. Convolutional neural networks inference technologies
The word inference means obtaining the result of a neural network that has been trained on some data
set. This article considers the use of the TensorRT platform [9] to accelerate an object detection
algorithm called YOLO.

2.1. YOLO
YOLO [10] is an algorithm for object classification and detection based on convolutional neural
networks. The advantage of convolutional neural networks for tasks of this type is that they can process
images with a simpler architecture than standard neural networks. There are many implementations of
YOLO, based on different frameworks and written in different programming languages. The standard
implementation is based on the Darknet neural network framework, which is written in C. YOLO starts by
resizing the input image to 448x448x3, where 448x448 is the image size and 3 is the number of color
channels. First the image is passed through a modified GoogLeNet, which forms the first 20 layers of the
network. The output of this part of the network is 1024 feature maps of size 14x14. The feature maps are
then passed through a sequence of convolutional and pooling layers. On entering the first fully connected
layer there are 1024 feature maps of size 7x7. After passing through 2 fully connected layers, the
network predicts the class of an object and its position in the image [11].
To define object bounds, the YOLO algorithm first imposes a grid of size SxS on the image. Objects are
then predicted for each cell. A vector of size 5*B+C is created for each grid cell, where B is the number
of bounding boxes predicted by a cell, C is the number of classes that the network can predict, and 5 is
the number of values that describe one bounding box. The first 5*B values of the vector give, for each
box, the coordinates of its center inside the grid cell, its height and width, and the probability that
the box has been defined correctly. The remaining C values give the class probabilities for an object
whose center falls inside this cell. As a result, there are S*S*B bounding boxes with class
probabilities. For every class, the boxes are sorted in descending order of probability and the
non-maximum suppression algorithm is applied. Finally, all boxes are examined: the maximum class
probability is taken for every box, and if it is greater than zero the box is drawn on the image [3].
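
The size of the network output and the suppression step described above can be sketched in Python as
follows. The values S=7, B=2 and C=20 are those of the original YOLO paper [11] and serve only as an
example. The corner-based box format, the IoU threshold of 0.5 and all function names are assumptions of
this sketch, and the suppression is shown for a single class.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def cell_vector_length(b: int, c: int) -> int:
    """Length of the prediction vector of one grid cell: 5 values per box plus C classes."""
    return 5 * b + c


def output_size(s: int, b: int, c: int) -> int:
    """Total number of values predicted for an SxS grid."""
    return s * s * cell_vector_length(b, c)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept for one class, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical detections: the first two boxes cover the same object.
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]

print(output_size(7, 2, 20))               # values in the S=7, B=2, C=20 configuration
print(non_max_suppression(boxes, scores))  # indices of the surviving boxes
```

In this example the second box overlaps the first with an IoU of 81/119, which exceeds the threshold, so
only the first and third boxes survive.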

2.2. TensorRT
TensorRT is a deep learning platform by NVIDIA [12]. At the time of writing there are 5 versions of
TensorRT. Every new version supports a greater number of neural network layer types and mathematical
operations. TensorRT provides ready-made parsers for many popular frameworks, including TensorFlow,
Caffe2, PyTorch, MXNet, Microsoft Cognitive Toolkit and Chainer. TensorFlow has built-in TensorRT 3.0
[9]. If the network is created with one of these frameworks, using TensorRT is very simple: it is enough
to use the corresponding parser. The process of creating a network with a TensorRT parser is shown in
figure 1. If the network is not created with these frameworks, the TensorRT API can be used to transfer
the network model.




                 Figure 1. Workflow for creating a network with TensorFlow and TensorRT.

   The advantage of TensorRT is that it accelerates a neural network in two ways: by an algorithm that
simplifies the network architecture without changing its functionality, and by using the capabilities
of NVIDIA GPUs to accelerate calculations.
   To simplify the network architecture, TensorRT analyzes a graph that represents the network model. If
the graph contains repeated elements, TensorRT merges them. As a result, the network becomes smaller.
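
The idea of merging repeated graph elements can be illustrated with a toy example. The internal
optimizer of TensorRT is not described in this article, so the sketch below is not its algorithm: it
only shows, on a hypothetical graph representation (a dictionary of named nodes), how nodes that apply
the same operation to the same inputs can be collapsed into one.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, Tuple[str, ...]]  # (operation label, names of input nodes)


def merge_repeated_nodes(graph: Dict[str, Node], order: List[str]) -> Dict[str, Node]:
    """Collapse nodes with identical operation and inputs.

    `order` must list the node names in topological order (inputs first).
    """
    canonical: Dict[Tuple[str, Tuple[str, ...]], str] = {}  # signature -> surviving node
    alias: Dict[str, str] = {}                               # removed node -> surviving node
    merged: Dict[str, Node] = {}
    for name in order:
        op, inputs = graph[name]
        # Redirect inputs that pointed at nodes which were merged away.
        inputs = tuple(alias.get(i, i) for i in inputs)
        # Source nodes (no inputs) are never merged with each other.
        key = (op, inputs) if inputs else (op, (name,))
        if key in canonical:
            alias[name] = canonical[key]
        else:
            canonical[key] = name
            merged[name] = (op, inputs)
    return merged


# Hypothetical graph with two identical branches applied to the same input.
graph = {
    "image": ("input", ()),
    "conv_a": ("conv3x3:w1", ("image",)),
    "conv_b": ("conv3x3:w1", ("image",)),
    "relu_a": ("relu", ("conv_a",)),
    "relu_b": ("relu", ("conv_b",)),
    "add": ("add", ("relu_a", "relu_b")),
}
order = ["image", "conv_a", "conv_b", "relu_a", "relu_b", "add"]

merged = merge_repeated_nodes(graph, order)
print(len(graph), "->", len(merged), "nodes")
```

In this example the duplicated convolution and activation are each computed once, so the six-node graph
shrinks to four nodes while producing the same result.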
   Acceleration on a GPU is possible due to the use of “Tensor Cores”. These cores make it possible to
use the half-precision data type float16 for calculations, whereas the CUDA-based implementations
considered in this article use the float32 data type. The processing speed increases because data of
this type are transferred and processed much faster. This kind of acceleration is available only on the
small number of GPUs that provide this technology.
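
The trade-off between float32 and float16 can be illustrated on the CPU with the Python standard
library, without any GPU. The sketch below compares storage size and rounding error of the two types;
the 448x448x3 input size is taken from section 2.1, while the test value 0.1 is an arbitrary choice.

```python
import struct

FLOAT32 = "<f"  # IEEE 754 single precision
FLOAT16 = "<e"  # IEEE 754 half precision


def storage_bytes(fmt: str) -> int:
    """Number of bytes occupied by one value of the given format."""
    return struct.calcsize(fmt)


def round_trip(value: float, fmt: str) -> float:
    """Store a value in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Memory needed for one 448x448x3 network input in each data type.
input_values = 448 * 448 * 3
bytes_fp32 = input_values * storage_bytes(FLOAT32)
bytes_fp16 = input_values * storage_bytes(FLOAT16)

# Rounding error introduced by each data type for an arbitrary value.
value = 0.1
error_fp32 = abs(round_trip(value, FLOAT32) - value)
error_fp16 = abs(round_trip(value, FLOAT16) - value)

print(f"float32: {bytes_fp32} bytes per input, rounding error {error_fp32:.2e}")
print(f"float16: {bytes_fp16} bytes per input, rounding error {error_fp16:.2e}")
```

Half precision halves the amount of data to transfer at the cost of a larger rounding error, which is
why the accuracy of the network should be checked after switching the data type.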

3. YOLO implementation
To compare processing speed, implementations of YOLO with and without the TensorRT platform were
considered, using one data set and the same trained models.

3.1. Implementation of YOLO without use of TensorRT

3.1.1. Darknet
To compare performance, an implementation of the YOLO algorithm based on the Darknet neural network
framework was considered. YOLO was run on a GPU, which required CUDA 10.0 and OpenCV. The YOLOv2 model
was used. Before running, the project must be built by running the command make from the project folder.
The build produces an executable file, which is run with the required options. The command for running
the program is as follows: ./darknet detect path_to_cfg_file path_to_weights_file. Darknet can process
video from a file and from a webcam.

3.1.2. Darkflow
Darkflow is another implementation of the YOLO algorithm, written in Python and based on TensorFlow. It
requires CUDA 9.0, TensorFlow 1.0, NumPy, and OpenCV 3.0 or above. To run Darkflow and the other
implementations on one PC, both CUDA 10.0 and CUDA 9.0 must be installed; to switch between the CUDA
versions it is enough to update the environment variables. Darkflow can process a video stream. Before
running, an installation script must be executed. After installation, the program can be run with the
command: flow --model path_to_cfg_file --load path_to_weights_file --imgdir
path_to_folder_with_images --gpu percent, where percent is a number from 0 to 1 that sets the fraction of
the GPU to use: 0 means 0% usage and 1 means 100% usage. If --gpu is not specified, the processing is
done on a CPU. In this case the processing speed is probably significantly lower than on a GPU.
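
When both implementations are run repeatedly in experiments, it can be convenient to assemble the two
command lines from a script. The Python sketch below only builds the argument lists given above and
does not execute them; the helper names and the placeholder paths are hypothetical.

```python
import shlex
from typing import List, Optional


def darknet_command(cfg: str, weights: str) -> List[str]:
    """Argument list for the Darknet command quoted in section 3.1.1."""
    return ["./darknet", "detect", cfg, weights]


def darkflow_command(cfg: str, weights: str, imgdir: str,
                     gpu: Optional[float] = None) -> List[str]:
    """Argument list for the Darkflow command quoted in section 3.1.2.

    `gpu` is the fraction of the GPU to use (0 to 1); None means CPU processing.
    """
    cmd = ["flow", "--model", cfg, "--load", weights, "--imgdir", imgdir]
    if gpu is not None:
        if not 0.0 <= gpu <= 1.0:
            raise ValueError("gpu must be between 0 and 1")
        cmd += ["--gpu", str(gpu)]
    return cmd


def as_shell(cmd: List[str]) -> str:
    """Render an argument list as a shell command line."""
    return " ".join(shlex.quote(part) for part in cmd)


# Placeholder paths, as in the text above.
print(as_shell(darknet_command("path_to_cfg_file", "path_to_weights_file")))
print(as_shell(darkflow_command("path_to_cfg_file", "path_to_weights_file",
                                "path_to_folder_with_images", gpu=1.0)))
```

Such argument lists could be passed to subprocess.run to launch the programs, provided that they are
installed as described above.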

3.2. Implementation of YOLO with use of TensorRT
This article presents an implementation of YOLO with TensorRT 5.0 [13]. Before launching the program,
all dependencies must be installed: CUDA 10.0, TensorRT 5.0 and OpenCV 3.4.0. Files that contain the
trained model weights and the network configuration are also required; they can be found on the official
web site of the YOLO developers. A trained YOLOv2 model is used for the research. This model can detect
80 classes of objects. First the project is built using make. Then the project is set up by specifying
the paths to all dependencies and to the weights and configuration files. Then the data type must be
chosen: Float32, Float16 or Int8. If the GPU does not support Tensor Cores, the program can be run with
Float32 only. Both single images and batches of images can be processed. Video processing is possible
when the Deepstream SDK is used in addition. The Deepstream SDK is developed by NVIDIA to process data
in streams; it uses TensorRT, CUDA and the Video Codec SDK. At the time of writing, the latest version
of the Deepstream SDK is 3.0. Video can also be processed without the Deepstream SDK if the source code
is changed to extract frames from video streams, a capability provided by OpenCV. To run the program,
type the following command:
    trt-yolo-app
    The following options are available for this command:
     • Batch_size – the number of images processed at a time.
     • Decode – either True or False; controls decoding of images. True by default.
     • Seed – a parameter for the random number generator.
    After the program has finished, files containing the processed images are saved to a folder.
To process a video, OpenCV can be used to extract frames from the video stream. Another way to process
a video is to use Deepstream, a library by NVIDIA for stream processing. Deepstream uses libraries that
accelerate stream processing and uses TensorRT and CUDA as well.
    Darkflow was also modified in order to run on TensorRT.

4. Experimental research
2 implementations of the YOLO algorithm were used for the experimental research, both with the same
trained YOLOv2 model. 2416 images were used as input. Output images on which objects were found were
saved to a folder. For the implementation without TensorRT, the processing time of every image was
written to a file. For the implementation with TensorRT, the processing time of every image was not
recorded; only the average time was calculated. All experiments were done on PCs with the
characteristics presented in table 1.

                                 Table 1. Main characteristics of the PC.
                   GPU                                 CPU                       Memory
                   NVIDIA GeForce GT 710               AMD FX-4300               4 GB
                   NVIDIA GeForce GTX 950              Intel Core i5-6500        8 GB
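
The average FPS values reported in this section can be derived from per-image processing times. The
Python sketch below shows one way to do this, assuming that FPS is defined as the number of images
divided by the total processing time; the function names and the example timings are hypothetical and
are not the measurements of this article.

```python
import time
from typing import Callable, Iterable, List


def measure_times(process: Callable[[object], object],
                  images: Iterable[object]) -> List[float]:
    """Run `process` on every image and return the time of each call in seconds."""
    times: List[float] = []
    for image in images:
        start = time.perf_counter()
        process(image)
        times.append(time.perf_counter() - start)
    return times


def average_fps(times: List[float]) -> float:
    """Average FPS: number of processed images divided by the total time."""
    total = sum(times)
    if total <= 0:
        raise ValueError("total processing time must be positive")
    return len(times) / total


# Hypothetical per-image times in seconds (not measured data).
example_times = [0.25, 0.25, 0.5]
example_fps = average_fps(example_times)  # 3 images in 1.0 s
print(example_fps)
```

Dividing the image count by the total time gives a different value from averaging the per-image FPS,
so the same definition should be used for every implementation that is compared.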
    The average FPS of the Darkflow implementation on the image set is presented in table 2.

                                    Table 2. FPS of Darkflow.
                         GPU                                 FPS
                         NVIDIA GeForce GT 710               1.31
                         NVIDIA GeForce GTX 950              10.53
                         NVIDIA Tesla P100                   120
                         NVIDIA GeForce GTX 2080 TI          170
    The average FPS of Darknet on the image set is presented in table 3.

                                            Table 3. FPS of Darknet.
                                          GPU                             FPS
                                          NVIDIA GeForce GT 710           1.2
                                          NVIDIA GeForce GTX 950          6.25




  Darknet processes the images more slowly than Darkflow. The average FPS of the YOLO implementation
based on the TensorRT API is presented in table 4.

                              Table 4. FPS of the YOLO implementation based on TensorRT.
                                       GPU                               FPS
                                       NVIDIA GeForce GT 710             5
                                       NVIDIA GeForce GTX 950            50

    YOLO in TensorRT works faster than both Darkflow and Darknet. This implementation is written in C++
using the TensorRT API.
    For the TensorRT implementation, the running time was compared for different batch sizes from 1 to
16. With a batch size greater than 16 it was not possible to allocate GPU memory. The running time for
different batch sizes is presented in figures 2 and 3 for the 2 GPUs.
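
Splitting the image set into batches can be sketched in Python as follows. The image count of 2416 and
the batch size limit of 16 come from the text; the file names and the helper function are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Split `items` into consecutive batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical file names standing in for the 2416 input images.
images = [f"image_{i:04d}.jpg" for i in range(2416)]

for batch_size in (1, 5, 16):
    batches = make_batches(images, batch_size)
    print(f"batch size {batch_size}: {len(batches)} batches, "
          f"last batch holds {len(batches[-1])} images")
```

With a batch size of 16 the set divides evenly into 151 batches, whereas a batch size of 5 leaves a
final batch with a single image, which is one of the reasons why the running time does not change
smoothly with the batch size.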


    Figure 2. Running time of the algorithm for different batch sizes on NVIDIA GeForce GT 710.


  Figure 3. Running time of the algorithm for different batch sizes on NVIDIA GeForce GTX 950.




   The difference in running time between batch sizes on the 2 GPUs is not significant: the worst and
the best results differ by a factor of about 1.27. This could be related to various causes, and it is
difficult to determine the optimal batch size in advance; it should be found experimentally.
   The average FPS of Darkflow with TensorRT is presented in table 5.

                                            Table 5. FPS of Darkflow with TensorRT.
                                            GPU                                  FPS
                                            NVIDIA GeForce GT 710                1.78
                                            NVIDIA GeForce GTX 950               14.41
                                            NVIDIA GeForce GTX 2080 TI           200
   Darkflow with TensorRT is faster than Darkflow by a factor of about 1.36 on NVIDIA GeForce GT 710,
1.37 on NVIDIA GeForce GTX 950 and 1.18 on NVIDIA GeForce GTX 2080 TI. The FPS of all implementations
used is presented in figure 4.
                  [Bar chart: FPS (vertical axis, 0 to 250) for NVIDIA GT 710, NVIDIA GTX 950,
                  NVIDIA GTX 2080 TI and NVIDIA Tesla P100; series: Darknet, Darkflow, TensorRT C++,
                  Darkflow + TensorRT.]

                                           Figure 4. FPS of all used implementations.


                                                  Figure 5. A processed frame.




    YOLO implemented with the TensorRT API shows the greatest acceleration: it is about 10 times.
Darkflow with TensorRT is also accelerated, but much less.
    The use of TensorRT did not reduce the accuracy of YOLO. An example of an image processed by YOLO
is presented in figure 5.
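
The speed-up factors can be recomputed from the FPS values in tables 2 to 5. The Python sketch below
does this for the GPUs on which both compared implementations were measured; the dictionary layout and
the function name are choices made for this illustration.

```python
# FPS values copied from tables 2 to 5.
fps = {
    "Darknet": {"GT 710": 1.2, "GTX 950": 6.25},
    "Darkflow": {"GT 710": 1.31, "GTX 950": 10.53,
                 "Tesla P100": 120.0, "GTX 2080 TI": 170.0},
    "TensorRT C++": {"GT 710": 5.0, "GTX 950": 50.0},
    "Darkflow + TensorRT": {"GT 710": 1.78, "GTX 950": 14.41, "GTX 2080 TI": 200.0},
}


def speedup(fast: str, slow: str) -> dict:
    """Ratio of FPS values, rounded to two decimals, for GPUs measured with both."""
    return {
        gpu: round(fps[fast][gpu] / fps[slow][gpu], 2)
        for gpu in fps[fast]
        if gpu in fps[slow]
    }


darkflow_gain = speedup("Darkflow + TensorRT", "Darkflow")
tensorrt_gain = speedup("TensorRT C++", "Darknet")

print("Darkflow + TensorRT vs Darkflow:", darkflow_gain)
print("TensorRT C++ vs Darknet:", tensorrt_gain)
```

The first comparison reproduces the factors 1.36, 1.37 and 1.18 quoted for Darkflow, and the second
gives about 4.17 on the GT 710 and 8.0 on the GTX 950 for the TensorRT C++ implementation relative to
Darknet.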

5. Conclusion
The article considered 3 implementations of the YOLO algorithm in order to compare their performance.
One of these implementations uses the TensorRT platform. Another implementation was modified in order to
work with TensorRT. The platform accelerates the algorithm while producing the same accuracy. This
ability can be used in practice in video stream processing, where processing speed is important. With
TensorRT, the processing time was reduced by a factor of about 4 on NVIDIA GT 710 and about 8 on NVIDIA
GTX 950 in comparison with the standard implementation of the algorithm. This was achieved without
calculations on Tensor Cores, because these GPUs do not support them. The modified Darkflow worked
faster by a factor of 1.36 on NVIDIA GT 710, 1.37 on NVIDIA GTX 950 and 1.18 on NVIDIA GTX 2080 TI.

6. References
[1] Bibikov S A, Kazanskiy N L and Fursov V A 2018 Vegetation type recognition in hyperspectral
      images using a conjugacy indicator Computer Optics 42(5) 846-854 DOI:
      10.18287/2412-6179-2018-42-5-846-854
[2] Shatalin R A, Fidelman V R and Ovchinnikov P E 2017 Abnormal behavior detection method for
      video surveillance applications Computer Optics 41(1) 37-45 DOI:
      10.18287/2412-6179-2017-41-1-37-45
[3] Redmon J, Farhadi A 2017 YOLO9000: Better, Faster, Stronger (University of Washington, Allen
      Institute for AI) p 9
[4] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y and Berg A C 2016 SSD: Single Shot
      MultiBox Detector ECCV: Computer Vision 21-37
[5] Ren Sh, He K, Girshick R and Sun J 2017 Faster R-CNN: Towards Real-Time Object Detection
      with Region Proposal Networks IEEE Transactions on Pattern Analysis and Machine Intelligence
      39(6) 1137-1149 DOI: 10.1109/TPAMI.2016.2577031
[6] Amosov O S, Ivanov Y S and Zhiganov S V 2017 Human localization in video frames using a
      growing neural gas algorithm and fuzzy inference Computer Optics 41(1) 46-58 DOI:
      10.18287/2412-6179-2017-41-1-46-58
[7] Shustanov A, Yakimov P 2017 CNN Design for Real-Time Traffic Sign Recognition Procedia
      Engineering 201 718-725 DOI: 10.1016/j.proeng.2017.09.594
[8] CUDA URL: https://developer.nvidia.com/cuda-gpus (01.11.2018)
[9] Official site of TensorRT URL: https://developer.nvidia.com/tensorrt (01.11.2018)
[10] YOLO: Real-Time Object Detection URL: https://pjreddie.com/darknet/yolo/ (01.11.2018)
[11] Redmon J, Divvala S, Girshick R, Farhadi A 2015 You Only Look Once: Unified, Real-Time
      Object Detection You Look Only Once p 10
[12] TensorRT integration speeds up tensorflow inference URL:
      https://devblogs.nvidia.com/tensorrt-integration-speeds-tensorflow-inference/ (01.11.2018)
[13] Implementation of YOLO with TensorRT URL: https://github.com/vat-nvidia/deepstream-plugins/
      (01.11.2018)

Acknowledgements
This work was partly funded by the Russian Foundation for Basic Research – Project # 17-29-03112 ofi_m
and the Russian Federation Ministry of Science and Higher Education within a state contract with the
"Crystallography and Photonics" Research Center of the RAS under agreement 007-ГЗ/Ч3363/26.
