Detection of different types of vehicles from aerial imagery

Jonas Uus
Applied Informatics faculty, Vytautas Magnus University, Kaunas, Lithuania
Baltic Institute of Advanced Technology, Vilnius, Lithuania
Email: jonas.uus@bpti.lt

Tomas Krilavičius
Vytautas Magnus University, Kaunas, Lithuania
Email: tomas.krilavicius@bpti.lt

Abstract—Accurate detection of vehicles in large amounts of imagery is one of the harder object detection tasks, as image resolution can be as high as 16K or sometimes even higher. Differences in vehicle size and orientation (the direction they face) are another challenge to overcome to achieve acceptable detection quality. Vehicles can also be partially obstructed or cut off, and it may be hard to differentiate between object colour and its foreground. The small size of vehicles in high resolution images complicates accurate detection even more. CNNs are among the most promising methods for image processing; hence, it was decided to use their implementation in YOLO V3. To deal with big high resolution images, a method for splitting/recombining images and augmenting them was developed. The proposed approach achieved 81.72% average precision of vehicle detection. The results show the practical applicability of such an approach for vehicle detection; yet, to reach higher accuracy on the tractor, off-road and van categories, the vehicle counts in the different categories need to be balanced, i.e. more examples of the mentioned vehicles are required.

© 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

I. INTRODUCTION

Vehicle detection from aerial photography is a very important and quite difficult task, especially when it is performed in real time or when high resolution aerial or satellite images are used, such as the 18000x18000 px images in the COWC [1] dataset. Drones are being used in more and more sectors (according to CB Insights, unmanned aerial vehicles (UAVs) could currently be used in 38 different sectors [2]), so the volume of video and photo material from drones is increasing, and the need for a solution that makes use of this unprecedented amount of data has become pronounced (at the moment of writing this paper, YouTube returns more than 3.3 million results for the "aerial footage" query). For a human, annotating vehicles in videos or high resolution images takes a lot of resources. Thus, the vehicle detection task needs to be automated.

In this paper we investigate the applicability of Convolutional Neural Networks. Due to their good performance [3], we use the YOLO V3 (You Only Look Once) [4] CNN as a tool to apply the proposed image splitting/merging method. Moreover, we split the image into fixed overlapping rectangular frames (a sliding window method).

Some results show that YOLO V2 performs quite well on aerial imagery, but only with applied modifications: "First making the net shallower to increase its output resolution. Second changing the net shape to more closely match the aspect ratio of the data." [5]. In another vehicle detection solution, a newer YOLO version was used [6]. Images were taken from 3 publicly available datasets: VEDAI, COWC and DOTA. The model had good test results for small objects, rotating objects, as well as compact and dense objects, with 76.7% mAP and 92% recall. None of these solutions used a splitting and remerging technique with overlapping images; they used already pre-split images.

II. PROBLEM

As computing speed is increasing, technology is advancing and neural networks are being optimised, it was decided to apply the best image augmentation/splitting/remerging methods for vehicle detection. In the application of a neural network, the following set of problems becomes apparent:
1) The dataset contains a variety of image resolutions (HD, Full HD, 2K, ...).
2) Vehicle sizes in the dataset are uneven, influenced by different ground sample distances (GSD).
3) Vehicle counts per category are uneven: there are more cars than all other vehicle categories combined.
4) Almost all fully connected convolutional neural networks have a fixed-size first layer, and all images must be resized to fit that layer.
5) Vehicles can be partially obstructed (only part of the vehicle may be visible).
6) It is hard to differentiate vehicles from the foreground (for example, a black car parked in a shadow).
7) Vehicles may face multiple directions, depending on the camera flight direction and its rotation.
8) Available vehicle detection solutions are limited to detecting a small number of features.
9) After re-merging split images, the same vehicle may be detected multiple times.

Currently existing vehicle detection solutions are subject to company trade secrets, and companies do not openly discuss technical specifications and application results (for example, the web platform Supervisely [7]). That is why it is difficult, or sometimes even impossible, to adapt them or add additional functionality; some solutions are based on older versions of neural networks (for as long as they are functional) and detect only a few vehicle categories. For example, one of the vehicle detection solutions [8] detects vehicles based only on their size (either a small or a large vehicle). Also, currently available solutions which use CNNs mostly work with fixed-size input images or rescale them to a fixed size, as existing deep convolutional neural networks (CNNs) require fixed-size (e.g. 224x224) input images [9]. As rescaling is detrimental to small object features, the images are instead split into smaller pieces; then, after vehicles are detected in each piece individually, the pieces are remerged into the full-sized image.
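The split/detect/remerge pipeline described above can be sketched in a few functions. This is a minimal illustration rather than the authors' Darknet modification; the tile size, overlap and IoU threshold below are placeholder values (the paper uses 500x500 px pieces and derives the overlap from the largest vehicle size in the dataset).

```python
def tile_origins(size, tile, overlap):
    """Top-left offsets of tiles covering `size` pixels, sharing `overlap` px."""
    step = tile - overlap
    origins = list(range(0, max(size - tile, 0) + 1, step))
    if origins[-1] + tile < size:        # clamp the last tile to the image edge
        origins.append(size - tile)
    return origins

def split(width, height, tile=500, overlap=60):
    """Tile rectangles (x, y, w, h) covering the whole image (assumes image >= tile)."""
    return [(x, y, tile, tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]

def iou(a, b):
    """Jaccard index of two boxes given as (x1, y1, x2, y2, ...)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def nms(boxes, iou_thr=0.5):
    """Keep only the highest-probability box among overlapping same-class boxes.
    Each box is (x1, y1, x2, y2, score, cls), in full-image coordinates."""
    kept = []
    for box in sorted(boxes, key=lambda b: -b[4]):
        if all(b[5] != box[5] or iou(b, box) < iou_thr for b in kept):
            kept.append(box)
    return kept

# Example: a 1000x1000 px image, 500 px tiles, 60 px overlap -> a 3x3 grid.
print(len(split(1000, 1000)))   # 9
```

After detection runs on each tile, per-tile boxes are shifted by the tile's (x, y) origin back into full-image coordinates and passed through nms(), which collapses the duplicates produced in the overlap zones.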
For example, if a high resolution image such as 4K is rescaled to 608 by 608 pixels, the width shrinks about 6 times and the height about 3.5 times, so the rear glass of a car that occupied about 20 by 10 px decreases to roughly 3 by 3 px; as a result, it becomes harder to differentiate between a van and a car, and the probability of misidentification increases. In the case of multiple detections in the overlapping image pieces, NMS (Non-Maximum Suppression) [10] is used to remove the duplicates, as NMS retains only the overlapping bounding box with the highest probability (if its area overlaps by more than a preset value). The practice of YOLO application discussed herein attempts to solve all of the above problems.

III. DATASET

The MAFAT tournament [11] provided the images used for training and validation, together with a csv file with boxes and classes; however, the csv file was created with a classification task in mind and was not used. The images were adapted for the object detection task, as the original dataset was initially created for classification: not every object was annotated, and some false positives [12] (also called false detections, i.e. a vehicle annotated where there is none) were present. Every image was manually annotated, and some images were removed. The removed images were not taken orthogonally to the ground but at an angle; only images with a top-down view were kept.

The numbers of vehicles used in the training images are presented in Fig. 1 and in the validation dataset in Fig. 2.

Fig. 1: Vehicles count in training dataset
Fig. 2: Vehicles count in validation dataset

The characteristics of the dataset images:
1) Images were taken from a variety of locations; some were taken in cities, others in rural areas.
2) Images were taken at different times of day.
3) Vehicles were lit from different sides.
4) Image resolutions varied from 900x600 px to 4010x3668 px.
5) Some parts of images were darkened out (for example, one half of an image was made completely black, while the other half retains the picture).
6) The GSD (ground sample distance) of the images varied between 5 and 15 cm.
7) Objects in images might be obstructed by trees or cut off, so that only part of a vehicle is visible (for example, a car parked in a garage, or a car near the edge of the image).

A couple of example images from the dataset are shown in Fig. 3; the variation in image resolutions is given in Table I.

For image augmentation, horizontal and vertical flipping and rotation at 45° intervals were used. The counts of dataset images are the following:
1) 1712 images were chosen as training images, about 80% of the original training dataset images.
2) After splitting the training images into 500x500 pixel pieces, the image count rose to 9141.
3) 1986 images were chosen for validation, about 78% of the original validation dataset images.
4) 12 227 vehicles were annotated manually in the training dataset, Fig. 1.
5) 10 914 vehicles were annotated manually in the validation dataset, Fig. 2.

TABLE I: Distribution of images with different resolutions in the dataset

  Image resolution (px) | In validation dataset | In training dataset
  900 x 600             | 1975                  | 1592
  1057 x 800            | 2                     | 3
  1332 x 1283           | 1                     | 0
  2026 x 1649           | 6                     | 37
  4010 x 2668           | 2                     | 40

The categories of vehicles being detected:
1) Car,
2) Off-road vehicle,
3) Large vehicle,
4) Van,
5) Tractor.
The above dataset was considered sufficient for the evaluation of the developed method.

IV. PROPOSED SOLUTION

The objective was to develop a method for the identification of diverse vehicles.

Image resolution and sizes. The use of CNNs is complicated by the dataset having a variety of image resolutions (HD, Full HD, 2K, ...) and uneven vehicle sizes, see Sect. III. The different vehicle sizes in the images are caused by different ground sample distances (GSD) [13]. As almost all convolutional neural networks have a fixed-size first layer [9], all images are resized to that layer's size; so if an image resolution is as high as 16K and it is resized to, for example, 608x608 px, all of the small vehicle features disappear from the resulting image. For this reason we propose to split the image into fixed overlapping rectangular frames (a sliding window method). This produces a double detection problem, as a vehicle may be detected in both overlapping frames. To remove duplicates, NMS (Non-Maximum Suppression) is used [10]: if two or more bounding boxes of the same vehicle category overlap, the box with the highest detection probability is kept, while the others are removed. The amount of overlap is determined by finding the largest possible vehicle size in the dataset; this ensures that if a vehicle is cut off in one of the frames, it is fully visible in another.

Image obstruction. Another problem with vehicle detection is that vehicles can be partially obstructed (only part of the vehicle may be visible), for example when a car is half parked in a garage, when a car is parked alongside a tree whose branches obstruct the car's features, or when a car is at the edge of the image.

Fig. 3: Examples of images in dataset

Orientation. As vehicle orientation in images is not constant, vehicles may face multiple directions depending on the camera flight direction and its rotation. To solve this problem, the images are augmented with rotation at 45° intervals, Fig. 4.

Fig. 4: Image augmentation by rotation: (a) 0° (original), (b) 45°, (c) 90°, (d) 135°, (e) 180°, (f) 225°, (g) 270°, (h) 315°
Fig. 5: Image augmentation by flipping: (a) original, (b) flipped horizontally, (c) flipped vertically, (d) flipped vertically and horizontally
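The flips and 45° rotations illustrated in Figs. 4 and 5 must be applied to the annotations as well as to the pixels. Below is a minimal sketch (not the authors' code) of the coordinate side of this augmentation, assuming box centres normalised to [0, 1]; rotating the pixels themselves would be done with an image library, and the counter-clockwise rotation direction is just a convention choice here.

```python
import math

def flip_h(cx, cy):
    """Centre of a box after a horizontal (left-right) flip."""
    return 1.0 - cx, cy

def flip_v(cx, cy):
    """Centre of a box after a vertical (top-bottom) flip."""
    return cx, 1.0 - cy

def rotate(cx, cy, steps):
    """Centre after rotating the image by steps * 45 degrees about its centre."""
    a = math.radians(45 * steps)
    x, y = cx - 0.5, cy - 0.5          # shift the origin to the image centre
    xr = x * math.cos(a) - y * math.sin(a)
    yr = x * math.sin(a) + y * math.cos(a)
    return xr + 0.5, yr + 0.5
```

Applying the eight rotation steps and the flips to every image, together with the matching box transforms, is what multiplies the dataset size as described in the text.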
Fig. 6: YOLO V3 architecture [14]

Image augmentation. To increase the image count, images were augmented by rotating them at 45-degree intervals. Additionally, dataset images were augmented by flipping them vertically, horizontally, and both horizontally and vertically, Fig. 5.

V. EXPERIMENTS

A. Tools

For the experiments, the convolutional neural network YOLO V3 was used on the Darknet framework. The YOLO V3 architecture is presented in Fig. 6. In the original YOLO repository the problem was that, while training, the detection loss climbed to infinity when any single parameter was changed; thus a forked repository [15] from GitHub was used instead, as it does not have this issue.

B. Dataset

Vehicle categories like cars, jeeps, large vehicles, vans and tractors need to be detected in the aerial photographs, and their positions need to be marked by drawing a bounding box around each object. It is hard to differentiate an off-road vehicle from a car when looking from above, as the body shape of an off-road vehicle may differ only slightly (for example, be wider); thus off-road vehicles were annotated as cars. The jeep category is hard too, as the only difference between a car and a jeep is that a jeep has a rear spare tire attached or has a truck bed (like a pickup). At first, the cars' class had been divided into hatchbacks and sedans, but during manual objects'
annotation it was observed that, if a car is half obstructed and only its front part can be seen, it is impossible to tell whether it is a sedan or a hatchback, as the only differentiating factor, the size of the rear glass and the trunk/boot, cannot be seen. For this reason, sedans and hatchbacks were merged into one vehicle class.

For YOLO V3 to work with the splitting/merging workflow, the original source code was modified. To know when training had to be terminated, the average loss value was observed. It was also observed that if any bigger change was to be carried out on the neural network, such as adding a new object category, the network should be trained from earlier weights, with which it had been more generic at detection; training from scratch after changing parameters would be even better, but that would take longer. YOLO detects a new class better when the previous best weights are not used.

A cross-validation statistical method was used during YOLO training: the dataset was divided into images for training and for validation. The neural network cannot see any of the validation images during training; it only sees them when its performance is validated. This method is used to prevent overfitting.

The following modifications were performed on the training and validation images in the dataset:
1) Modification of the images' slicing/overlapping parameter values.
2) Fixing wrongly annotated vehicle data and bounding box locations in the datasets.
3) Changing the vehicle class counts by adding and merging existing classes, then reannotating the dataset.
4) Choosing images from the dataset for training/validation.
5) Experimenting with image manipulations (vertical/horizontal flipping, image rotation), which drastically increased the dataset size. These manipulations were coded manually, as YOLO, unlike Tensorflow, does not have such image manipulations integrated.

The following modifications were done on YOLO:
1) Change of the YOLO layer resolution (mostly the first layer, as all images are resized to the same resolution as the first layer).
2) Experiments with different YOLO configurations and different layer counts.
3) Change of network parameters (such as anchors, recalculating certain layer sizes after vehicle class modifications, learning rate).
4) Adding a module to Darknet for easier work with split images and for external communication with other programs.

C. Experiments results

To evaluate performance, the PASCAL VOC evaluation metrics were used, and the results were compared using AP (average precision) [16]. This metric uses the Jaccard index [17] to calculate the IOU (intersection over union) between ground truth and detection boxes.

After training, the YOLO V3 neural network managed to detect cars with 78.69% average precision (AP), Fig. 7, and large vehicles with 44.85% average precision (AP), Fig. 8. Other vehicle categories, such as jeeps, vans and tractors, were detected but wrongly categorised; that was the reason their detection average precision was very low. To solve this problem, the dataset needs a more uniform vehicle count in every category. As the dataset contained mostly cars, YOLO learned that, if unsure, it should ascribe an object to the car category; that way it reaches a better mAP in the long run than by guessing rarer classes. This non-homogeneous dataset problem shows up whenever the dataset has different numbers of vehicles per class; it could be solved by adding images showing vehicles of the rarer classes, or by augmenting a larger number of rarer-class images than images with other vehicles.

Fig. 7: Precision and recall curve for cars category
Fig. 8: Precision and recall curve for large vehicle category

The above figures show how precision and recall are correlated: for example, if we choose a precision of 95%, then 45% of the cars in the validation images are detected at that level of precision, and the F-score [18] is 0.61; if recall increases to 80%, precision drops to 75%, and the F-score is 0.77. When all categories were merged into one and the results were validated again, average precision increased to 81.72%, Fig. 9. This indicates that, in order to increase detection precision, YOLO V3 needs to classify the categories more accurately.
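The F-score values quoted above follow directly from the stated precision/recall pairs, and the Jaccard index used for IOU matching is a few lines of arithmetic. A small illustrative sketch (not the evaluation code used in the paper):

```python
def jaccard(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, as used by PASCAL VOC matching."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

# The two operating points read off the cars curve in the text:
print(round(f_score(0.95, 0.45), 2))   # 0.61
print(round(f_score(0.75, 0.80), 2))   # 0.77
```

Under PASCAL VOC evaluation, a detection typically counts as a true positive when its IoU with an unmatched ground-truth box is at least 0.5; AP then summarises the whole precision-recall curve rather than a single operating point.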
Fig. 9: Precision and recall graph when all vehicles are merged into one category

VI. CONCLUSIONS

This application could be used for statistics (counting how many vehicles there are in a given image), vehicle tracking, prediction of further vehicle movement direction, and real-time vehicle detection from a live video feed. A vehicle detection application was created so that users could easily configure it and carry out the vehicle detection task more easily. The user only needs to input images and a couple of parameters to execute vehicle detection with the CNN.

Results:
1) A dataset was prepared for the vehicle detection task by manually annotating all of the vehicles in the dataset images.
2) Images were augmented to increase the dataset size.
3) A method combining splitting and joining of images with a convolutional neural network for vehicle detection was proposed.
4) The proposed method's performance was tested using the YOLO V3 CNN.

Conclusions:
1) When YOLO V3 is used together with the proposed method, it is capable of detecting cars with 79% accuracy and large vehicles with 45% accuracy.
2) Even with the proposed method, the YOLO V3 CNN still has difficulty detecting the characteristics of other vehicles, such as off-road vehicles, tractors and vans, which lowers the final detection result.
3) The proposed method helps to avoid losing vehicles and their features that would otherwise be lost by resizing high resolution images.
4) The dataset used for training and validation should have a more uniform count of vehicle categories (more photos with tractors, large vehicles and jeeps should be added to the dataset).

For future work, R-CNN and SSD networks will be trained on the Tensorflow framework, as they are also widely used CNNs for object detection tasks, and they will be tested with the same proposed method. Also, as the currently used image dataset is relatively small, it needs to be extended with freely available datasets and photos taken from drones; more photos with tractors, large vehicles and jeeps should be added so that the vehicle category counts become more uniform.

REFERENCES

[1] T. Nathan Mundhenk, Goran Konjevod, Wesam A. Sakla, and Kofi Boakye. A large contextual dataset for classification, detection and counting of cars with deep learning. arXiv:1609.04453, 2016.
[2] CB Insights. 38 ways drones will impact society: From fighting war to forecasting weather, UAVs change everything. Accessed: 2019.02.22.
[3] Joseph Redmon and Ali Farhadi. YOLO: Real-time object detection. Accessed: 2019.02.22.
[4] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
[5] Jennifer Carlet and Bernard Abayowa. Fast vehicle detection in aerial imagery. CoRR, abs/1709.08666, 2017.
[6] J. Lu, C. Ma, L. Li, X. Xing, Y. Zhang, Z. Wang, and J. Xu. A vehicle detection method for aerial image based on YOLO. Journal of Computer and Communications, pages 98–107, 2018.
[7] Supervisely. The leading platform for the entire computer vision lifecycle. Accessed: 2019.02.22.
[8] Alexey. Object detection on satellite images. Accessed: 2019.02.22.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv:1406.4729v1, 2014.
[10] Adrian Rosebrock. Non-maximum suppression for object detection in Python. Accessed: 2019.02.22.
[11] yuvalsh. MAFAT Challenge - fine-grained classification of objects from aerial imagery. Accessed: 2019.02.22.
[12] Google. Classification: True vs. false and positive vs. negative. Accessed: 2019.02.22.
[13] Wikipedia contributors. Ground sample distance. Accessed: 2019.02.22.
[14] Ayoosh Kathuria. What's new in YOLO v3? Accessed: 2019.02.22.
[15] Alexey. YOLO-v3 and YOLO-v2 for Windows and Linux. Accessed: 2019.02.22.
[16] Jonathan Hui. mAP (mean average precision) for object detection. Accessed: 2019.02.22.
[17] Wikipedia. Jaccard index. Accessed: 2019.02.22.
[18] Marina Sokolova, Nathalie Japkowicz, and Stan Szpakowicz. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. Vol. 4304, pages 1015–1021, 2006.