=Paper= {{Paper |id=Vol-2470/p24 |storemode=property |title=Detection of different types of vehicles from aerial imagery |pdfUrl=https://ceur-ws.org/Vol-2470/p24.pdf |volume=Vol-2470 |authors=Jonas Uus,Tomas Krilavičius |dblpUrl=https://dblp.org/rec/conf/ivus/UusK19 }} ==Detection of different types of vehicles from aerial imagery== https://ceur-ws.org/Vol-2470/p24.pdf
Detection of different types of vehicles from aerial imagery

Jonas Uus
Faculty of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania
Email: jonas.uus@bpti.lt

Tomas Krilavičius
Vytautas Magnus University, Kaunas, Lithuania
Baltic Institute of Advanced Technology, Vilnius, Lithuania
Email: tomas.krilavicius@bpti.lt


   Abstract—Accurate detection of vehicles in large amounts of imagery is one of the harder object detection tasks, as image resolution can be as high as 16K or sometimes even higher. Variation in vehicle size and orientation is another challenge that must be overcome to achieve acceptable detection quality. Vehicles can also be partially obstructed, cut off, or hard to distinguish from the background. The small size of vehicles in high resolution images complicates accurate detection even further. Convolutional neural networks (CNNs) are among the most promising methods for image processing, hence it was decided to use their implementation in YOLO V3. To deal with large high resolution images, a method for splitting/recombining images and augmenting them was developed. The proposed approach achieved 81.72% average precision for vehicle detection. The results show the practical applicability of this approach, yet to reach higher accuracy on the tractor, off-road and van categories, the counts in the different vehicle categories need to be balanced, i.e. more examples of the mentioned vehicles are required.

© 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                       I. INTRODUCTION

   Vehicle detection from aerial photography is a very important and quite difficult task, especially when it is performed in real time or when high resolution aerial or satellite images are used, such as the 18000x18000 px. images in the COWC [1] dataset. As drones are used in more and more sectors (according to CBINSIGHTS, unmanned aerial vehicles (UAVs) could currently be used in 38 different sectors [2]), the volume of video and photo material from drones keeps increasing, and the need for solutions that make use of this unprecedented amount of data has become pronounced (at the time of writing, YouTube returns more than 3.3 million results for the "aerial footage" query). For a human, annotating vehicles in videos or high resolution images takes a lot of resources. Thus the vehicle detection task needs to be automated.

   In this paper we investigate the applicability of convolutional neural networks. Due to its good performance [3], we use the YOLO V3 (You Only Look Once) [4] CNN as the tool with which the proposed image splitting/merging method is applied. Moreover, we split each image into fixed overlapping rectangular frames (a sliding window method).

   Some results show that YOLO V2 performs quite well on aerial imagery only with applied modifications: "First making the net shallower to increase its output resolution. Second changing the net shape to more closely match the aspect ratio of the data." [5].

   In another vehicle detection solution, a newer YOLO version was used [6]. Images were taken from 3 publicly available datasets: VEDAI, COWC and DOTA. The model had good test results for small objects, rotated objects, as well as compact and dense objects, with 76.7% mAP and 92% recall.

   None of these solutions used a splitting and remerging technique with overlapping image pieces; they used already pre-split images.

                        II. PROBLEM

   As computing speed increases, technology advances and neural networks are being optimised, it was decided to apply the best image augmentation/splitting/remerging methods to vehicle detection. When applying a neural network, the following set of problems becomes apparent:

   1) The dataset contains images of a variety of resolutions (HD, Full HD, 2K ...).
   2) Vehicle sizes in the dataset are uneven, influenced by different ground sample distances (GSD).
   3) Vehicle counts per category are uneven: there are more cars than all other vehicle categories combined.
   4) Almost all fully connected convolutional neural networks have a fixed-size first layer, and all images must be resized to fit that layer.
   5) Vehicles can be partially obstructed (only part of the vehicle may be visible).
   6) It is hard to differentiate vehicles from the background (for example, a black car parked in a shadow).
   7) Vehicles may face any direction, depending on the camera flight direction and rotation.
   8) Available vehicle detection solutions are limited to detecting a small number of categories.
   9) After re-merging split images, the same vehicle may be detected multiple times.

   Currently existing vehicle detection solutions are subject to company trade secrets, and companies do not openly discuss technical specifications and application results (for example
web platform Supervisely [7]). That is why it is difficult, and sometimes impossible, to adapt such solutions or add functionality; some are based on older versions of neural networks (kept for as long as they are functional), and they detect few vehicle categories. For example, one vehicle detection solution [8] distinguishes vehicles based only on their size (either a small or a large vehicle). Also, currently available solutions which use CNNs mostly work with fixed-size input images, or rescale them to a fixed size, as existing deep convolutional neural networks (CNNs) require fixed-size (e.g. 224x224) input images [9]. As rescaling is detrimental to the features of small objects, the images are instead split into smaller pieces; after vehicles are detected in each piece individually, the pieces are remerged into the full-sized image. For example, if a high resolution image such as 4K is rescaled to 608 by 608 pixels, the image width shrinks about 6 times and its height about 3.5 times, so a rear car window of about 20 by 10 px. shrinks to roughly 3 by 3 px.; as a result it becomes harder to differentiate between a van and a car, and the probability of misidentification increases. In case of multiple detections in the overlapping image pieces, NMS (Non-Maximum Suppression) [10] is used to remove duplicate detections: NMS retains only the overlapping bounding box with the highest probability (if its overlap area exceeds a preset value). The practice of YOLO application discussed herein attempts to solve all of the above problems.

   The number of vehicles used in the training images is presented in Fig. 1, and in the validation dataset in Fig. 2.

Fig. 1: Vehicles count in training dataset
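The shrink factors quoted above for a 4K image are easy to check; a minimal sketch, assuming a 3840x2160 px. 4K UHD frame and the 608x608 px. network input used later in the paper:

```python
src_w, src_h = 3840, 2160          # assumed 4K UHD frame size
net = 608                          # YOLO input resolution
sx, sy = src_w / net, src_h / net  # shrink factors: ~6.3 (width), ~3.6 (height)

# a 20x10 px. rear car window after rescaling
print(round(sx, 1), round(sy, 1))        # 6.3 3.6
print(round(20 / sx), round(10 / sy))    # 3 3
```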


                        III. DATASET

   The MAFAT tournament [11] provided the images used for training and validation, together with a csv file of boxes and classes; however, the csv file was created with a classification task in mind and was not used. The images had to be adapted for the object detection task: the original dataset was created for classification, so not every object was annotated, and some false positives [12] (also called false detections, i.e. a vehicle annotated where there is none) were present. Every image was manually annotated, and some images were removed: those not taken orthogonally to the ground (taken at an angle) were discarded, and only images with a top-down view were kept. For image augmentation, horizontal and vertical flipping and rotation at 45° intervals were used.

   The dataset image counts are as follows:
  1) 1712 images were chosen as training images, about 80% of the original training dataset.
  2) After splitting the training images into 500x500 pixel pieces, the image count rose to 9141.
  3) 1986 images were chosen for validation, about 78% of the original validation dataset.
  4) 12 227 vehicles were annotated manually in the training dataset, Fig. 1.
  5) 10 914 vehicles were annotated manually in the validation dataset, Fig. 2.

Fig. 2: Vehicles count in validation dataset

   The characteristics of the dataset images:
  1) Images were taken in a variety of locations, some in cities, others in rural areas.
  2) Images were taken at different times of day.
  3) Vehicles were lit from different sides.
  4) Image resolutions varied from 900x600 px. to 4010x3668 px.
  5) Some parts of images were darkened out (for example, one half of an image was made completely black, while the other half kept the picture).
  6) The GSD (ground sample distance) of the images varied between 5 and 15 cm.
  7) Objects in images might have been obstructed by trees or cut off, so that only part of a vehicle was visible (for example, a car parked in a garage, or a car near the edge of the image).

   A couple of example images from the dataset are shown in Fig. 3. The variation in image resolutions is shown in Table I.

   The categories of vehicles being detected:
  1) Car,
  2) Off-road vehicle,
  3) Large vehicle,
  4) Van,
  5) Tractor.

TABLE I: Distribution of images with different resolutions in the dataset

Image resolution (px) | Validation dataset | Training dataset
900 x 600             | 1975               | 1592
1057 x 800            | 2                  | 3
1332 x 1283           | 1                  | 0
2026 x 1649           | 6                  | 37
4010 x 2668           | 2                  | 40

The above dataset was considered sufficient for the evaluation of the developed method.
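The flipping part of the augmentation scheme mentioned above also has to remap the annotated bounding boxes, not just the pixels. A minimal sketch with NumPy (illustrative helper names; boxes as [x, y, w, h]); the 45° rotations would additionally need an interpolating rotate, e.g. from Pillow or SciPy:

```python
import numpy as np

def flip_horizontal(img, boxes):
    """Mirror the image left-right and remap [x, y, w, h] boxes."""
    h, w = img.shape[:2]
    return np.flip(img, axis=1), [[w - x - bw, y, bw, bh] for x, y, bw, bh in boxes]

def flip_vertical(img, boxes):
    """Mirror the image top-bottom and remap [x, y, w, h] boxes."""
    h, w = img.shape[:2]
    return np.flip(img, axis=0), [[x, h - y - bh, bw, bh] for x, y, bw, bh in boxes]

# a toy 3x4 "image" with one 2x1 box in the top-left corner
img = np.arange(12).reshape(3, 4)
flipped, boxes = flip_horizontal(img, [[0, 0, 2, 1]])
print(boxes)   # [[2, 0, 2, 1]] -- the box moved to the right edge
```

Flipping both ways is just the composition of the two helpers, and rotating by multiples of 90° can be done losslessly with `np.rot90`.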
                  IV. PROPOSED SOLUTION

   The objective was to develop a method for the identification of diverse vehicles.

Image resolution and sizes

   The use of CNNs is complicated by the dataset having images of a variety of resolutions (HD, Full HD, 2K ...) and uneven vehicle sizes, see Sect. III. The different object sizes are caused by different ground sample distances (GSD) [13]. As almost all convolutional neural networks have a fixed-size first layer [9], all images are resized to that layer's size, so if an image resolution is as high as 16K and it is resized to, for example, 608x608 px., all of the small vehicle features disappear from the resulting image. For this reason we propose to split the image into fixed overlapping rectangular frames (a sliding window method). This introduces a double detection problem, as a vehicle may be detected in both overlapping frames. To remove duplicates, NMS (Non-Maximum Suppression) is used [10]: if two or more bounding boxes of the same vehicle category overlap, the box with the highest detection probability is kept, while the others are removed. The amount of overlap is determined by finding the largest possible vehicle size in the dataset. This ensures that if a vehicle is cut off in one of the frames, it is fully visible in another.
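A minimal sketch of this split-and-merge step (hypothetical helper names; boxes as [x1, y1, x2, y2] in full-image coordinates; per the description above, NMS would be applied per vehicle category):

```python
def tile_origins(size, tile=500, overlap=100):
    """Top-left offsets of overlapping tiles covering `size` pixels along one axis."""
    step = tile - overlap
    xs = list(range(0, max(size - tile, 0) + 1, step))
    if xs[-1] + tile < size:          # make sure the far border is covered
        xs.append(size - tile)
    return xs

def iou(a, b):
    """Intersection over union (Jaccard index) of two [x1, y1, x2, y2] boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, thr=0.5):
    """Keep the highest-probability box among boxes overlapping more than `thr`."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thr for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]

print(tile_origins(1200))   # [0, 400, 700]
```

Here a 500 px. tile with a 100 px. overlap is assumed (the paper derives the overlap from the largest vehicle size); detections from each tile are shifted by the tile origin into full-image coordinates and then merged with `nms`.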
Fig. 3: Examples of images in the dataset

Image obstruction

   Another problem with vehicle detection in images is that vehicles can be partially obstructed (only part of the vehicle is visible): for example, when a car is half parked in a garage, when a car is parked alongside a tree and the branches obstruct its features, or when a car is at the edge of the image.

Orientation

   Vehicle orientation in images is not constant: vehicles may face any direction, depending on the camera flight direction and rotation. To solve the varying vehicle
orientation problem, the images are augmented with random rotation at 45° intervals, Fig. 4.

Fig. 4: Image augmentation by rotating: 0° (original), 45°, 90°, 135°, 180°, 225°, 270°, 315°

Image augmentation

   To increase the image count, images were augmented by rotating them at 45 degree intervals. Additionally, dataset images were augmented by flipping them vertically, horizontally, and both horizontally and vertically, Fig. 5.

Fig. 5: Image augmentation by flipping: original, flipped horizontally, flipped vertically, flipped vertically and horizontally

                     V. EXPERIMENTS

A. Tools

   For the experiments, the convolutional neural network YOLO V3 was used on the Darknet framework. The YOLO V3 architecture is presented in Fig. 6.

Fig. 6: YOLO V3 architecture [14]

   In the original YOLO repository the problem was that, while training, the detection loss climbed to infinity whenever any single parameter was changed; thus a forked repository [15] from GitHub, which does not have this issue, was used instead. For YOLO V3 to work with the splitting/merging workflow, the original source code was modified. To know when training had to be terminated, the average loss value was observed. It was observed that if any bigger change was to be carried out on the neural network, such as adding a new object category, the network should be trained from earlier weights, at which point the network had been more generic at detection. Training from scratch after changing parameters would be even better, but it would take longer. It was observed that YOLO detects a new class better when the previous best weights are not used.

   Also, it is hard to differentiate an off-road vehicle from a car when looking from above, as the body shape of an off-road vehicle may differ only slightly (for example, be wider); thus off-road vehicles were annotated as cars. The jeep category is hard too, as the only difference between a car and a jeep is that a jeep has a rear spare tire attached or has a truck bed (like a pickup).

B. Dataset

   Vehicle categories such as cars, jeeps, large vehicles, vans and tractors need to be detected in the aerial photographs, and their positions need to be marked by drawing a bounding box around each object. At first, the car class had been divided into hatchbacks and sedans, but during manual annotation it was observed that if a car is half obstructed and only its front part can be seen, it is impossible to tell whether it is a sedan or a hatchback, as the only differentiating factor
is the size of the rear glass and the trunk/boot, which cannot be seen in such cases. For this reason, sedans and hatchbacks were merged into one vehicle class.

   As the dataset contained mostly cars, YOLO learned that, when unsure, it should ascribe an object to the car category; that way it reaches a better mAP in the long run than by guessing rarer classes. This non-homogeneous dataset problem shows up whenever a dataset has different numbers of vehicles per class. It could be solved by adding images in which rarer-class vehicles appear, or by augmenting a larger number of rarer-class images than images with other vehicles.

   The cross-validation statistical method was used during YOLO training: the dataset was divided into training and validation images. The neural network does not see any of the validation images during training; it only sees them when its performance is validated. This method is used to prevent overfitting. The following modifications were performed on the training and validation images in the dataset:
   1) Modification of the image slicing/overlap parameter values.
   2) Fixing wrongly annotated vehicles and their bounding box locations in the datasets.
   3) Changing the vehicle class counts by adding classes or merging existing ones, then reannotating the dataset.
   4) Choosing images from the dataset for training/validation.
   5) Experimenting with image manipulations (vertical/horizontal flipping, image rotation), which drastically increased the dataset size. These manipulations were coded manually, as YOLO, unlike Tensorflow, does not have such image manipulations integrated.

   The following modifications were done on YOLO:
   1) Changing the YOLO layer resolution (mostly the first layer, as all images are resized to the first layer's resolution).
   2) Experimenting with different YOLO configurations and different layer counts.
   3) Changing network parameters (such as anchors, recalculating certain layer sizes after vehicle class modifications, and the learning rate).
   4) Adding a module to Darknet for easier work with split images and for external communication with other programs.

C. Experiments results

   To evaluate performance, the PASCAL VOC evaluation metrics were used and the results were compared using AP (average precision) [16]. This metric uses the Jaccard index [17] to calculate the IOU (intersection over union) between ground truth and detection boxes.

   After training, the YOLO V3 neural network managed to detect cars with 78.69% average precision (AP), Fig. 7, and large vehicles with 44.85% average precision (AP), Fig. 8. Other vehicle categories, such as jeeps, vans and tractors, were detected but wrongly categorised, which is why their average precision was very low. To solve this problem, the dataset needs a more uniform vehicle count in every category.

Fig. 7: Precision and recall curve for cars category

Fig. 8: Precision and recall curve for large vehicle category

   The above figures show how precision and recall are correlated: for example, if we choose precision at 95%, then 45% of cars in the validation images were detected at that level of precision, and the F-score [18] at this level is 0.61. If recall increases to 80%, the precision drops to 75%, and the F-score is then 0.77. When all categories were merged into one and the results were validated again, the average precision increased to 81.72%, Fig. 9. This indicates that, for detection precision to increase, YOLO V3 needs to classify categories more accurately.
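The F-score values quoted above follow from the harmonic mean of precision and recall; a quick check:

```python
def f_score(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.95, 0.45), 2))   # 0.61 -- precision 95%, recall 45%
print(round(f_score(0.75, 0.80), 2))   # 0.77 -- precision 75%, recall 80%
```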
Fig. 9: Precision and recall graph when all vehicles are merged into one category

                     VI. CONCLUSIONS

   This application could be used for statistics (counting how many vehicles there are in a given image), vehicle tracking, prediction of further vehicle movement direction, and real-time vehicle detection from a live video feed. A vehicle detection application was created so that users could easily configure it and make the vehicle detection task easier: the user only needs to input images and a couple of parameters to execute vehicle detection with the CNN.

   Results:
   1) A dataset was prepared for the vehicle detection task by manually annotating all of the vehicles in the dataset images.
   2) Images were augmented to increase the dataset size.
   3) A method combining the splitting and joining of images with a convolutional neural network for vehicle detection was proposed.
   4) The performance of the proposed method was tested using the YOLO V3 CNN.

   Conclusions:
   1) YOLO V3, used together with the proposed method, is capable of detecting cars with 79% accuracy and large vehicles with 45% accuracy.
   2) Even with the proposed method, YOLO V3 still has difficulty detecting the characteristics of other vehicles, such as off-road vehicles, tractors and vans, which lowers the final detection result.
   3) The proposed method helps to avoid losing vehicles and their features that would otherwise be lost by resizing high resolution images.
   4) The dataset used for training and validation should have a more uniform count of vehicle categories (more photos with tractors, large vehicles and jeeps should be added to the dataset).

   For future work, R-CNN and SSD networks will be trained on the Tensorflow framework, as they are also widely used CNNs for object detection tasks, and they will be tested using the same proposed method. Also, as the currently used image dataset is relatively small, it needs to be enlarged with freely available datasets and photos taken from drones. As the dataset should have a more uniform count of vehicle categories, more photos with tractors, large vehicles and jeeps should be added to the dataset.

                      REFERENCES

 [1] T. Nathan Mundhenk, Goran Konjevod, Wesam A. Sakla, and Kofi Boakye. A large contextual dataset for classification, detection and counting of cars with deep learning. arXiv:1609.04453, 2016.
 [2] CBINSIGHTS. 38 ways drones will impact society: From fighting war to forecasting weather, UAVs change everything. Accessed: 2019.02.22.
 [3] Joseph Redmon and Ali Farhadi. YOLO: Real-time object detection. Accessed: 2019.02.22.
 [4] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
 [5] Jennifer Carlet and Bernard Abayowa. Fast vehicle detection in aerial imagery. CoRR, abs/1709.08666, 2017.
 [6] J. Lu, C. Ma, L. Li, X. Xing, Y. Zhang, Z. Wang, and J. Xu. A vehicle detection method for aerial image based on YOLO. Journal of Computer and Communications, pages 98–107, 2018.
 [7] Supervisely. The leading platform for the entire computer vision lifecycle. Accessed: 2019.02.22.
 [8] Alexey. Object detection on satellite images. Accessed: 2019.02.22.
 [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv:1406.4729, 2014.
[10] Adrian Rosebrock. Non-maximum suppression for object detection in Python. Accessed: 2019.02.22.
[11] yuvalsh. MAFAT Challenge - fine-grained classification of objects from aerial imagery. Accessed: 2019.02.22.
[12] Google. Classification: True vs. false and positive vs. negative. Accessed: 2019.02.22.
[13] Wikipedia contributors. Ground sample distance. Accessed: 2019.02.22.
[14] Ayoosh Kathuria. What's new in YOLO v3? Accessed: 2019.02.22.
[15] Alexey. YOLO-v3 and YOLO-v2 for Windows and Linux. Accessed: 2019.02.22.
[16] Jonathan Hui. mAP (mean average precision) for object detection. Accessed: 2019.02.22.
[17] Wikipedia. Jaccard index. Accessed: 2019.02.22.
[18] Marina Sokolova, Nathalie Japkowicz, and Stan Szpakowicz. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In AI 2006: Advances in Artificial Intelligence, LNCS vol. 4304, pages 1015–1021, 2006.