<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparison of Support Vector Machines and Deep Learning for Vehicle Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Özgür Kaplan</string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ediz Şaykol</string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>
        <aff id="aff0">
          <institution>Beykent University, Department of Computer Engineering</institution>
          ,
          <addr-line>Ayazağa Campus, 34396, Istanbul</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The main goal of this paper is to compare different vehicle detection algorithms and to provide an effective comparison technique for developers and researchers. During this study, fine tunings are suggested to improve the implementations of these algorithms. Our focus is on Support Vector Machine (SVM) based and then Deep Learning based approaches. The SVM-based vehicle detection implementation utilizes Histogram of Oriented Gradients (HOG) features. The deep learning approach we consider is the YOLO implementation. Our evaluation employs 400 random frames extracted from a real-world driving video. As the experimental results show, YOLO is more accurate, with an 81.9% success rate against SVM's 57.8%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>According to the World Health Organization (WHO),
1.25 million people die each year from traffic accidents
[WHO15]. Preventing traffic accidents and punishing
drivers who violate the rules are of great importance for
humanity. Many municipalities use automated camera
systems to detect cars that violate traffic rules. Proper
operation of these systems is crucial to avoid further
traffic accidents.</p>
      <p>It is estimated that 1 in every 10 cars in traffic will be
capable of self-driving by 2030, and 1 in every 3 by 2050
[MKG+16]. It is obvious that the safety of future traffic
will be directly linked to the quality of the software these
vehicles will be using. For both kinds of software, i.e.
detecting vehicles that violate the rules and driving
vehicles autonomously, it is critical to identify images
correctly.</p>
      <p>There are several techniques in the literature suited
for the vehicle detection task. Existing papers suggest
using acoustic sensor networks [WCH17], wavelet and
interest point based feature extraction [KB13],
edge-based constraint filters for vehicle segmentation
[Sri02] and time-spatial data [MRR10].</p>
      <p>Besides, there are also simulation-based approaches
to help drivers. For example, [PYL+15] proposed an
augmented reality system to increase the driver's
immediate attention. The system calculates the
probability of collision with the vehicles in the same
lane as the driver and colorizes the lanes according to
the risk ratio. [RSL+14] aims to help middle-aged and
older drivers make a left turn into running traffic.
Drivers were given augmented reality hints for their left
turns in different scenarios, such as heavy traffic and
flowing traffic.</p>
      <p>In this paper, we focus on an SVM-based vehicle
detection technique, which reduces the vehicle detection
problem to a classification problem by examining
individual sections of an image and classifying each as
either a vehicle or a non-vehicle. On the other side, we
consider Deep Learning approaches that automatically
learn the image features required for vehicle detection.
There are different deep learning techniques such as
Region-based Convolutional Network (R-CNN)
[GDD+14], Fast R-CNN [Gir15], Faster R-CNN
[RHG+17] and You Only Look Once (YOLO)
[RDG+16]. We choose YOLO as it is the fastest among
these techniques [Du18]. The basic aim is to provide an
effective comparison technique for developers and
researchers.</p>
      <p>The organization of the paper is as follows: In Section
2, we present vehicle detection techniques that we
employ in our study. Section 3 provides the comparative
study and Section 4 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Vehicle Detection Approaches</title>
      <p>Our study contains two major components. The first
phase is detecting vehicles, using SVM and using deep
learning with YOLO. The second phase is comparing the
accuracy and performance of these two methods.</p>
      <p>As our starting point we utilized the implementation
of [Fu17], which includes a classical SVM approach
using OpenCV and Histogram of Oriented Gradients
(HOG) feature extraction, as well as a deep neural
network based on a TensorFlow implementation of
YOLO with pre-trained weights [Cho16]. To simplify
the problem we only considered cars as vehicles.</p>
      <sec id="sec-2-1">
        <title>2.1 SVM-based Vehicle Detection using HOG</title>
        <p>In this approach, we used supervised learning with
pre-categorized images. We used the images provided
by GTI vehicle database [GTI18]. There are 3425
images of vehicle rears and 3900 images of road
sequences not containing vehicles. All images are 64x64
pixels. Figure 1 shows a vehicle image and a non-vehicle
image.</p>
        <p>The first step in preparing the images for the SVM
classifier is to extract HOG features. The main purpose of
HOG is to describe the image as a group of local histograms.</p>
        <p>HOG is not scale invariant, so all training images
must be the same size. Our training set images were
already 64x64 pixels, so we were able to use them
directly. Figure 2 shows our HOG implementation
details. The HOG technique counts the occurrences of
gradient orientations in localized regions of an image.
First, the image is divided into small connected cells,
and gradient directions are calculated for each cell.
Each cell is split into angular bins according to the
gradient orientation, and each pixel of a cell adds a
weighted gradient to the corresponding angular bin.
Groups of adjacent cells are called blocks, and the
grouping of cells into blocks is the basis for the
normalization of the histograms. The normalized group
of histograms forms the block histogram, and the set of
block histograms represents the descriptor [Int18].
Figure 3 shows a car image and the corresponding HOG
transformation.</p>
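        <p>The cell/block pipeline described above can be sketched with scikit-image's hog function. The cell size, block size and orientation bin count below are illustrative placeholders, not necessarily the parameters our implementation settled on:</p>

```python
import numpy as np
from skimage.feature import hog

# Hypothetical 64x64 grayscale patch standing in for one GTI training image.
patch = np.random.default_rng(0).random((64, 64))

# Illustrative parameters: 8x8-pixel cells, 2x2-cell blocks, 9 orientation bins.
features = hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# 7x7 block positions x 2x2 cells per block x 9 bins = 1764-dimensional descriptor.
print(features.shape)  # (1764,)
```

        <p>The block normalization (L2-Hys here) is what makes the descriptor robust to local illumination changes.</p>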
        <p>To train the model we randomly split the data into
two parts following a 5-fold scheme: 80% is used for
training and 20% is used for testing, as suggested by
most learning schemes.</p>
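        <p>A minimal sketch of this split and the SVM training step, using scikit-learn with random stand-in features (the real pipeline feeds in the HOG descriptors and GTI labels instead):</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in feature matrix: 200 hypothetical HOG descriptors, label 1 = vehicle.
rng = np.random.default_rng(1)
X = rng.random((200, 1764))
y = np.array([1] * 100 + [0] * 100)

# 80/20 split as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions
```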
        <p>We explored different parameters and color spaces
for our images to get better results. After some trial and
error we achieved an 86.7% success rate.</p>
        <p>We wanted to test the classifier on larger-scale
images. We found raw driving video data in the KITTI
Vision Benchmark Suite [KIT18]. We modified the
code to make it work on 1392x512 pixels, which is the
resolution of the raw driving video data.</p>
        <p>A 64x64 pixel search window is used to scan the
entire frame, with 50% overlap between adjacent
windows. For each window, our SVM classifier decides
whether it contains a vehicle or not. Figure 4 shows how
search windows are used. To get the exact locations of
vehicles, a heat map is generated from the windows that
may contain a vehicle. Figure 5 shows a successfully
detected vehicle using SVM.</p>
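        <p>The window scan and heat-map accumulation can be sketched as follows. The stride arithmetic matches the 50% overlap above; the detection list and heat threshold are illustrative stand-ins for the classifier's output:</p>

```python
import numpy as np

def sliding_windows(frame_w, frame_h, win=64, overlap=0.5):
    """Yield top-left corners of win x win search windows with the given overlap."""
    step = int(win * (1 - overlap))  # 32-pixel stride for 50% overlap
    for y in range(0, frame_h - win + 1, step):
        for x in range(0, frame_w - win + 1, step):
            yield (x, y)

# 1392x512 frames as in the KITTI raw driving data.
windows = list(sliding_windows(1392, 512))

# Heat-map step: every window the classifier flags adds heat; here we
# pretend the first three windows were positive detections.
detections = windows[:3]
heat = np.zeros((512, 1392))
for (x, y) in detections:
    heat[y:y + 64, x:x + 64] += 1

# Regions covered by several overlapping detections survive thresholding;
# the threshold value of 2 is illustrative.
vehicle_mask = heat >= 2
```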
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Deep Learning based Vehicle Detection</title>
        <p>We chose the YOLO technique for this purpose.
YOLO uses a deep neural network to detect objects,
approaching object detection as a single regression
problem that directly yields bounding boxes and class
probabilities. YOLO is trained on full images and
predicts multiple bounding boxes using a single neural
network. First, the YOLO model accepts an image as
input and divides it into an SxS grid. Each cell of this
grid estimates B bounding boxes and C class
probabilities. A bounding box has 5 components: x, y,
w, h and confidence. The (x, y) values are the center
coordinates of the box and (w, h) are its dimensions.
The confidence score tells us whether there is any
object in the box; if the score is zero, there should be no
object in the cell. Each grid cell makes B predictions,
so there are a total of S x S x B x 5 box outputs. The
network predicts one set of class probabilities per cell,
which results in S x S x C total class probabilities.
Figure 6 shows YOLO object detection and Figure 7
shows the YOLO execution pipeline.</p>
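        <p>Plugging in the grid settings of the original YOLO model (S=7, B=2, C=20; an assumption here, since the text leaves S, B and C symbolic) gives the output tensor size:</p>

```python
# YOLO_small-style settings (assumed): S=7 grid, B=2 boxes per cell,
# C=20 classes, matching the output breakdown described above.
S, B, C = 7, 2, 20

box_outputs = S * S * B * 5   # x, y, w, h, confidence for each predicted box
class_outputs = S * S * C     # one class distribution per grid cell
total = box_outputs + class_outputs

print(box_outputs, class_outputs, total)  # 490 980 1470
```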
        <p>We utilized the code of [Fu17], which employs a
TensorFlow implementation of YOLO [Cho16]. It uses
the pre-trained YOLO_small network, which has the
following 20 classes: "aeroplane", "bicycle", "bird",
"boat", "bottle", "bus", "car", "cat", "chair", "cow",
"dining table", "dog", "horse", "motorbike", "person",
"potted plant", "sheep", "sofa", "train", "tv monitor".
Since the network already knows vehicles as "car", we
were able to use the precomputed settings and apply
them directly to our inputs. We used a 30% threshold:
cells whose car class score is 0.3 or more are selected.
Figure 8 shows a successfully detected vehicle using
YOLO.</p>
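        <p>The thresholding step can be sketched as below, assuming the usual YOLO scoring scheme of multiplying box confidence by the class probability; the score values are made up for illustration:</p>

```python
import numpy as np

# Hypothetical per-cell scores: box confidence times the "car" class probability.
confidence = np.array([0.9, 0.5, 0.1, 0.05])
car_prob = np.array([0.8, 0.7, 0.9, 0.2])
scores = confidence * car_prob

# 30% threshold as described above: keep cells scoring 0.3 or more.
keep = scores >= 0.3
selected = scores[keep]  # keeps approximately [0.72, 0.35]
```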
      </sec>
      <sec id="sec-2-3">
        <title>3. Comparing SVM and YOLO</title>
        <p>We tested both algorithms on the same randomly
selected raw driving video data. The random data
contains 400 images with a total of 266 vehicles in them.</p>
        <p>Test results are evaluated in 3 categories:
- Positive: a vehicle is detected correctly.
- Negative: a vehicle is not detected.
- False Positive: a non-vehicle is detected as a vehicle.</p>
        <p>Figure 9 shows a sample snapshot of our
experimental study employing the SVM-based technique.
SVM generated a lot of false positives: in some frames,
SVM identified road signs, trees and pedestrians as
vehicles. These false positives might be avoided by
performing road line detection first, limiting the SVM
search window to the actual road. Both algorithms failed
to detect vehicles that are further ahead. We may be able
to detect these further vehicles with SVM using higher
resolution data; however, higher resolution data would
require more computation and therefore worsen
performance. As for YOLO, which downscales pictures
to 448x448 pixels, higher resolution data would be of
no use.</p>
        <p>Our test results show that YOLO's success rate is
81.9% (218 out of 266) while SVM's is 57.8% (154 out
of 266). YOLO also has significantly fewer false
positives, 5 against SVM's 84. When false positives are
also counted against the detectors, YOLO's success rate
drops to 80.4% and SVM's drops to 44%.</p>
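        <p>A quick check of the figures above; the denominator adjustment for false positives is our reading of how the 80.4% and 44% rates were obtained:</p>

```python
# Success rate: detected vehicles over ground-truth vehicles, with a
# variant whose denominator also counts false positive detections.
def success_rate(detected, total_vehicles, false_positives=0):
    return detected / (total_vehicles + false_positives)

yolo = success_rate(218, 266)        # ~0.819
svm = success_rate(154, 266)         # ~0.579
yolo_fp = success_rate(218, 266, 5)  # ~0.804
svm_fp = success_rate(154, 266, 84)  # 0.44
```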
        <p>One reason for SVM's high false positive count is
that, unlike YOLO, which can classify 20 different types
of objects, our SVM can only identify vehicles. YOLO
lowers the class probability when another object type is
more likely in the frame, which lowers the chance of
false positives.</p>
        <p>We processed the entire video using an Asus
NVIDIA GeForce 1060 GPU. YOLO performed at up to
42 fps while SVM reached just 4 fps. This is because
YOLO is lightweight and applies a single neural network
to a given frame, whereas SVM needs repeated
calculations over multiple sliding windows.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4. Conclusion and Future Work</title>
        <p>In this paper, we looked into two different vehicle
detection algorithm implementations and tested them
against real-world traffic data.</p>
        <p>Our results showed that the deep neural network
with YOLO produced more accurate results. YOLO also
performed faster, which makes it better suited for
real-time tasks.</p>
        <p>With our approach we were able to find the strong and
weak sides of both models. Thus, we managed to
implement and suggest fine tunings. Our approach has
the potential to be used to compare other detection
algorithms and to find further fine tunings.</p>
        <p>[WHO15] World Health Organization (2015). Global status report on road safety 2015. World Health Organization. http://www.who.int/iris/handle/10665/189242</p>
        <p>[MKG+16] D. Mohr, H. Kaas, P. Gao, D. Wee, and T. Möller, "Automotive revolution: Perspective towards 2030. How the convergence of disruptive technology-driven trends could transform the auto industry," McKinsey &amp; Company, Washington, DC, USA, Tech. Rep., Jan. 2016.</p>
        <p>[WCH17] Wang, R., Cao, W., &amp; He, Z. (2017). Vehicle recognition in acoustic sensor networks using multiple kernel sparse representation over learned dictionaries. International Journal of Distributed Sensor Networks. https://doi.org/10.1177/1550147717701435</p>
        <p>[KB13] Kumar Mishra, Pradeep &amp; Banerjee, Biplab (2013). Multiple Kernel based KNN Classifiers for Vehicle Classification. International Journal of Computer Applications, 71, 1-7. doi: 10.5120/12359-8673</p>
        <p>[Sri02] Srinivasa, Narayan (2002). Vision-based vehicle detection and tracking method for forward collision warning in automobiles. doi: 10.1109/IVS.2002.1188021</p>
        <p>[MRR10] N. C. Mithun, N. U. Rashid and S. M. M. Rahman, "Detection and Classification of Vehicles From Video Using Multiple Time-Spatial Images," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 3, pp. 1215-1225, Sept. 2012. doi: 10.1109/TITS.2012.2186128</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref3">
        <mixed-citation>[PYL+15] B. J. Park, C. Yoon, J. W. Lee and K. H. Kim, "Augmented reality based on driving situation awareness in vehicle," 2015 17th International Conference on Advanced Communication Technology (ICACT), Seoul, 2015, pp. 593-595. doi: 10.1109/ICACT.2015.7224865</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[RSL+14] Rusch, Michelle, Schall, Mark, Lee, John, Dawson, Jeffrey &amp; Rizzo, Matthew (2014). Augmented reality cues to assist older drivers with gap estimation for left-turns. Accident Analysis &amp; Prevention, 71, 210-221. doi: 10.1016/j.aap.2014.05.020</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[GDD+14] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 580-587. doi: 10.1109/CVPR.2014.81</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[Gir15] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1440-1448. doi: 10.1109/ICCV.2015.169</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[RHG+17] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 1 June 2017. doi: 10.1109/TPAMI.2016.2577031</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[RDG+16] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779-788.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[Du18] Juan Du (2018). Understanding of Object Detection Based on CNN Family and YOLO. Journal of Physics: Conference Series, 1004, 012029.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[Fu17] Junsheng Fu (2017, March 17). Vehicle Detection for Autonomous Driving. Retrieved from https://github.com/JunshengFu/vehicledetection</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[Cho16] Jinyoung Choi (2016). Tensorflow Implementation of 'YOLO: Real-Time Object Detection'. Retrieved from https://github.com/gliese581gg/YOLO_tensorflow</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[GTI18] GTI Vehicle Image Database (2018, May 12). http://www.gti.ssr.upm.es/data/Vehicle_database.html</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[Int18] Histogram of Oriented Gradients (HOG) Descriptor (2018, April 2). https://software.intel.com/en-us/ipp-devreference-histogram-of-oriented-gradientshog-descriptor</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[KIT18] KITTI Vision Benchmark Suite (2018, May 13). http://www.cvlibs.net/datasets/kitti/raw_data.php</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>