Coyote: A Dataset of Challenging Scenarios in Visual Perception for Autonomous Vehicles

Suruchi Gupta1, Ihsan Ullah2, Michael G. Madden1∗
1 School of Computer Science, National University of Ireland Galway, Galway, Ireland
2 CeADAR Ireland's Center for Applied AI, University College Dublin, Dublin, Ireland
{s.gupta9, michael.madden}@nuigalway.ie, ihsan.ullah@ucd.ie

Abstract

Recent advances in Artificial Intelligence have immense potential for the realization of self-driving applications. In particular, deep neural networks are being applied to object detection and semantic segmentation, to support the operation of semi-autonomous vehicles. While full Level 5 autonomy is not yet available, elements of these technologies are being brought to market in advanced driver assistance systems that provide partial automation at Levels 2 and 3. However, multiple studies have demonstrated that current state-of-the-art deep learning models can make high-confidence but incorrect predictions. In the context of a critical application such as understanding the scene in front of a vehicle, which must be robust, accurate and real-time, such failures raise concerns; most significantly, they may pose a substantial threat to the safety of the vehicle's occupants and other people with whom the vehicle shares the road.

To examine the challenges of current computer vision approaches in the context of autonomous and semi-autonomous vehicles, we have created a new test dataset, called Coyote1, with photographs that can be understood correctly by humans but might not be successfully parsed by current state-of-the-art image recognition systems. The dataset has 894 photographs with over 1700 ground-truth labels, grouped into 6 broad categories.

We have tested the dataset against existing state-of-the-art object detection (YOLOv3 & Faster R-CNN) and semantic segmentation (DeepLabv3) models, to measure the models' performance and identify situations that might be a source of risk to transportation safety. Our results demonstrate that these models can be confused by various adversarial examples, resulting in lower performance than expected: YOLOv3 achieves an accuracy of 49% and precision of 62%, while Faster R-CNN achieves an accuracy of 52% and precision of 60%.

∗ Contact Author
1 https://github.com/Suruchidgupta/UniversalAdversarialChallenges-AutonomousVehicles
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

An Autonomous Vehicle (AV) perceives its environment using sensors such as radar, sonar, GPS, and cameras, and uses an advanced control system to identify an appropriate navigation path [Janai et al., 2017]. For this, AV architectures make use of the field of computer vision to interpret and understand their visual inputs. Attempts to provide computers with an understanding of the visual components around them date back to the 1960s [Papert, 1966]. Before the emergence of Convolutional Neural Networks (CNNs) [Krizhevsky et al., 2017], traditional algorithms were used to extract edges and identify shapes; these extracted structural features were then used to identify elements of an image [Szeliski, 2011].

Although researchers have reported that the performance of modern computer vision systems approaches human-level performance [Russakovsky et al., 2015], other studies have conversely demonstrated that images with small perturbations, or with minor features that should be irrelevant, can adversely affect performance [Hendrycks et al., 2019; Nguyen et al., 2015; Szegedy et al., 2014]. Such images, known as adversarial examples, can occur naturally [Hendrycks et al., 2019] or be user-constructed [Nguyen et al., 2015]. In this paper, we explore similar ideas, focusing specifically on the domain of computer vision for (semi-)autonomous vehicles.

Our Contributions:
1. We have compiled and annotated a dataset of publicly available real-world photographs that are easily understood by humans but might not be parsed successfully by computer vision systems.
2. We have used this dataset to evaluate the performance of current state-of-the-art CNN-based computer vision systems, to identify challenging scenarios that can lead to erroneous performance in self-driving applications.
3. We have analysed the effects of these scenarios on the performance of autonomous vehicles.
4. We have considered the key risks associated with these challenging scenarios, and proposed some mitigations.

As we will note, improvements to computer vision models, or using them in combination with other sensor systems, can reduce the risk but may not remove it entirely.

2 Related Work

There are multiple computer vision datasets for autonomous vehicle applications. Cameras from an autonomous driving platform were used to acquire 13k images for the KITTI dataset [Geiger et al., 2013], whose scenarios include road, city, residential and campus scenes. KITTI is often used for evaluation only, due to its limited size [Janai et al., 2017]. The Cityscapes dataset [Cordts et al., 2016] contains pixel-level semantic labelling for 25k images of urban scenarios from 50 cities. It has more detailed annotations than KITTI but does not cover as many scenarios. The ApolloScape dataset [Huang et al., 2020] provides 140k labelled images of street views for lane detection, car detection, semantic segmentation, etc., and is intended to enable performance evaluation across different times of day and weather conditions [Janai et al., 2017]. The WoodScape dataset for autonomous cars [Yogamani et al., 2019] provides 10k images from 4 dedicated fisheye cameras, with semantic annotations for 40 classes.

Since images in our dataset include a wide variety of objects that might not be associated with vehicles, we use the Microsoft COCO dataset (Common Objects in Context) [Lin et al., 2014] as the basis for our trained classifiers and for comparative evaluation. COCO contains 328k images with 80 labels of commonly available objects in their surroundings, of a kind that could be recognised by a young child.

With the increasing use of deep neural network (DNN) models for image processing, there have been multiple analyses of how these models can be attacked. Experiments have shown that small but carefully chosen perturbations in data can significantly decrease the performance of models. These perturbations, known as adversarial examples, can either be naturally-occurring unseen scenarios [Hendrycks et al., 2019] or user-constructed [Nguyen et al., 2015; Szegedy et al., 2014] to induce mistakes. Szegedy et al. [2014] used a pre-trained network and derived perturbations specific to an image, by making small adjustments to specific pixels that are not noticed by the human eye but result in the image being misclassified by the network. Conversely, Nguyen et al. [2015] generated random images that do not appear recognisable to the human eye but are classified as objects with high confidence by a DNN.

In [Hendrycks et al., 2019], a set of real-world images were collected that contain object classes on which DNNs are trained, but are challenging for DNNs to classify because of features such as texture, shape and background. They also added images with unseen classes and found that the DNN model made high-confidence incorrect classifications, rather than having low confidence in recognising unseen classes, raising further concerns about the reliability of current DNN models for handling unseen examples. Some images from [Hendrycks et al., 2019] are shown in Fig. 1.

Figure 1: Natural Adversarial Examples from [Hendrycks et al., 2019]

Other work has focused on using adversarial examples to improve upon existing state-of-the-art models, by harnessing these examples to build new DNN models that are resistant to adversarial attacks [Xie et al., 2020; Madry et al., 2018]. Xie et al. [2020] have used adversarial examples as a sample space while training the model to prevent over-fitting and improve the model's overall performance. Madry et al. [2018] have laid out optimization techniques to handle "the first-order adversary" and to build adversarial robustness into models for accurate classification results. In our work, we are not concerned with adversarial attacks through image modification, but with problems arising from "edge cases" that might not be well covered in training sets but that will occur in the real world.

Of more direct relevance to our work, there are a few adversarial datasets for self-driving applications. For instance, WildDash [Zendel et al., 2018] is a test dataset containing 1800 frames that addresses natural risks in images, such as distortion, overexposure and windscreen artefacts. The dataset considers road conditions from diverse geographic locations, weather and lighting conditions, to reduce bias in training. Similarly, the Adverse Conditions Dataset with Correspondences (ACDC) for semantic driving scene understanding [Sakaridis et al., 2021] studies the effects of four conditions (fog, nighttime, rain, and snow) on semantic segmentation, using a set of 4006 images. The dataset includes a normal-condition image for each adverse-condition image, to identify the challenges posed by changing weather and lighting conditions, and it is intended to be used in conjunction with existing datasets to improve model performance under those conditions. Additionally, the FishyScapes dataset [Blum et al., 2019] places anomalous objects in front of the vehicle and evaluates various state-of-the-art semantic segmentation models on them. It uses images from the Cityscapes [Cordts et al., 2016] dataset and overlays the objects at random distances and sizes, to study model performance in the presence of anomalous objects.

While those datasets include high-quality pixel-level semantic segmentations, they do not cover the diverse range of scenarios covered in our Coyote dataset. We hope that the Coyote dataset can form a basis for testing computer vision for autonomous vehicles in edge-case scenarios. Moreover, the image collection can be extended so that a larger version could be used to train computer vision systems that yield better performance and resilience to edge cases and adversarial attacks, thereby improving automotive safety.
3 Overview of the Coyote Dataset

Our Coyote dataset consists of 894 photographs with over 1700 ground-truth labels, grouped into 6 broad categories, as briefly outlined in Fig. 2 and described in the following subsections. We have named this dataset after the cartoon character Wile E. Coyote, who sometimes used realistic murals to mislead the Road Runner. We have chosen photographs that we consider to be easily understood correctly by humans, but not necessarily parsed correctly by current state-of-the-art image recognition systems.

Figure 2: Broad categories of images in the Coyote dataset

Figure 3: (a) Sample murals re-creating road scenarios; (b) paintings on roads depicting safety threats; (c) sculptures that resemble road objects such as cars and trucks; (d) motor vehicles camouflaged as other objects; (e) textures used to decorate vehicles; (f) custom built motor vehicle; (g) photos of animals on roads; (h) road-side signs with pictures of humans; (i) challenging road scenarios for AVs; (j) street signs modified by artists; (k) street signs in Arabic; (l) custom traffic signals on the road; (m) images showing the presence of unseen objects in parking spaces; (n) non-standard signs and warnings; (o) 3-dimensional illusions that may challenge semantic segmentation; (p) additional art that might act as adversarial examples; (q) examples of animal crossing signs; (r) natural events

3.1 Collection Methodology

Initially, we collected a sample set of images that might potentially influence the performance of autonomous vehicles, and configured state-of-the-art object detection models. We evaluated these images on those models and used the outcomes to refine the collection process iteratively. As we collected images, we organised them into categories: Street Signs; Vehicle Art and Textures; Art in Surroundings and Murals; Parking Spaces; On-road Scenarios; and Advanced Scenarios. The images contain either the front or side view of the objects. In almost all cases, the images are un-edited, but we cropped 3 images in the dataset to reduce the background noise in them. The images collected are of different sizes and aspect ratios.

General Data Protection Regulation Considerations: All images selected for inclusion in the dataset are publicly available, free for distribution, and labeled for reuse under the Creative Commons license. We avoided many other images because of copyright restrictions.

3.2 Art in Surroundings and Murals

Visual art dates back to ancient civilisations and was used as an effective way of communicating without words. Streets and their surroundings, across the world, witness this form of art; some consider it to be a means of communication whereas others consider it vandalism. Either way, an autonomous vehicle must distinguish works of art from reality. Hence, this category aims to identify art that exists near roads and that might degrade self-driving applications' performance.

Re-creation of a road scenario: Some art painted on walls depicts streets with components like cars, traffic lights, cycles, pedestrians, etc. For example, Fig. 3(a) is a mural based on the Beatles' Abbey Road album, which might be interpreted as a roadway rather than a wall.

False identification of risks in surroundings: Some murals contain elements that may be misidentified as a source of threat to the occupants, e.g. pictures of accidents, wild animals, natural calamities, etc. The mural in Fig. 3(b) might be misinterpreted as a crashed car.

Art representing road objects: Sculptures and other artworks may depict objects typically found on the road, such as cars, trucks, motorbikes, etc. There is a risk that artworks such as the one in Fig. 3(c) can be confused with an actual car.

3.3 Vehicle Art and Textures

Most vehicles conform to standard makes and colours. However, some have unusual artwork or textures, either for artistic reasons or for commercial branding. This category contains images of vehicles that are unusual and as such may be challenging for autonomous vehicle object recognition.

Vehicles disguised as other objects: This category includes images of vehicles camouflaged as different objects such as a shoe, telephone, or animal (cat, peacock, dragon), or adorned with flowers, skulls, and other designs. Fig. 3(d) shows a car disguised as a cat. Interestingly, Houston hosts an Art Car Parade every year to showcase unique car designs; more examples can be found on their website (https://www.thehoustonartcarparade.com/).

Vehicles with Textures: Some companies and individuals use texture as a medium to advertise their brand or decorate their vehicles. Vehicles are either covered with a specific pattern such as grass, cow patches, tiger prints, etc., or with small assorted patterns that create a unique effect on the automobile body (Fig. 3(e)). Alternatively, some vehicles have a scene painted on their body, such as a brand image, a graphic art book image, or a movie scene.

Custom Built Vehicles: Some vehicles are uniquely designed as 'positional goods' to have distinguishing features, such as custom prints, solar panels, dual engines, etc. As shown in Fig. 3(f), some of these vehicles are hybrids of different automobiles; for instance, the car that looks similar to a helicopter might hinder object recognition.

3.4 On-Road Scenarios

Computer vision systems for autonomous vehicles are trained on datasets relevant to the road, which contain road objects and scenarios to help the model identify on-road components. However, scenarios across the world are so diverse that it is challenging to ensure that all possible scenarios are included in training datasets. The images in this category are unusual but realistic on-road scenarios that might be challenging for autonomous vehicle object recognition.

Animals on the Street: As humans expand our habitable land, there are multiple places where encounters with animals on roads are not unusual. Hence, images in this category show different species of animal, wild and domestic, wandering across the streets in rural and urban scenarios (e.g. Fig. 3(g)). Unlike other categories such as murals and those involving other vehicles, the behaviour of animals is difficult to predict and hence can be difficult for the autonomous vehicle to handle.

Billboards along the Road: Billboards are commonly seen along roads. Although they should not interfere with an autonomous vehicle's operation, their presence can potentially confuse it. The image in Fig. 3(h) shows an election poster. If an autonomous vehicle mistakes a poster for a real human, it might apply emergency brakes, leading to erratic driving behaviour.

Challenging Driving Scenarios: When deployed in the physical world, autonomous vehicles are subjected to different lighting conditions and varying weather conditions (fog, rain, snow, etc.) throughout the year. Hence, they must be aware of different weather conditions and their resulting impact on the surroundings. The image in Fig. 3(i) is an example of a change in weather conditions. If the autonomous vehicle does not correctly identify the objects in low visibility or in varying weather conditions, it might not modify its behaviour accordingly. This category also includes images of regional variations of vehicles (tricycles for public transport, cargo bicycles for delivery, etc.), challenging road scenarios (such as mountains, valleys, etc.), and images of extreme situations such as accidents, to evaluate how autonomous vehicles handle such scenarios. Other datasets such as the Yamaha-CMU Off-Road dataset (YCOR) [Maturana et al., 2017] and the PASCAL-VOC dataset also include some extreme weather scenarios, but do not include the other scenarios presented in the Coyote dataset.

3.5 Street Signs

Street signs have an important function in guiding and providing instructions to road users, but street signs that are misunderstood by an autonomous vehicle could have an adverse effect. This category includes regional variations of street signs across the world and modifications made by artists to street signs.

Art on Street Signs: Some artists have modified existing road sign elements to create interesting variations. These variations generally do not affect humans' ability to recognise the signs, but may be more challenging for autonomous vehicles. For example, if the modified speed bump sign or stop sign in Fig. 3(j) is ignored, the autonomous vehicle may fail to reduce its speed.

Regional Variations of Street Signs: While street signs are generally standardised within a region, there are many variations across regions. Most regions have street signs in the regional language (e.g. Fig. 3(k)). While autonomous vehicles sold in a region would be configured to handle these regional variations, they could cause problems for vehicles that travel between regions or that are imported by the owner into an unfamiliar region.

Custom Variations in Traffic Signals: While curating images relating to street signs, we encountered some custom traffic signals. Instead of standard circular lights, they may have custom figures in red and green colours. Signals such as Fig. 3(l) are easily understood by humans but deviate from what autonomous vehicles may have been trained on.

This category also contains images of a large mosaic artwork created from discarded street signs. Identifying any of the street signs in the art might lead to an unexpected outcome by the vehicle. The COCO dataset only identifies the stop sign; a more comprehensive study with a domain-centred dataset could provide insights into the effects of these street signs on autonomous vehicles.

3.6 Parking Spaces

This category covers parking spaces and their environments. It includes examples of animals or objects in the parking space, and non-standard environments such as rural settings without the standard parking boxes.

Unforeseen Objects in Parking Spaces: The images in this category include miscellaneous objects in parking spaces, such as animals, shopping carts, etc. (e.g. Fig. 3(m)). The presence of unidentified objects like shopping carts (left) and animals (right) in the parking spaces might lead to misjudgment by the autonomous vehicle.

Unconventional Parking Signs and Warnings: There are cases where authorities display warnings/notices or customised parking/no-parking signs that are not easily interpreted. The sample image on the left in Fig. 3(n) shows one such unconventional parking notice, while the image on the right shows a warning sign regarding icy car paths ahead, which would require significant natural language processing to interpret.

3.7 Semantic Segmentation

While collecting photos in the category of Art in Surroundings and Murals, we found images that create an illusion of 3D space. To examine how an autonomous vehicle might parse these scenarios, we have evaluated these images on a semantic segmentation model. The images depict multiple scenarios; for instance, Fig. 3(o) shows a painting of a hole with water and the presence of wild animals. Such images could result in the vehicle failing to proceed and blocking the road, as if access to the road ahead were damaged and there were no way to go forward.

3.8 Advanced Scenarios

The COCO dataset covers only 80 commonly available object classes, and not all relevant scenarios can be covered by these 80 classes. Hence, the Advanced Scenarios category contains images of objects or scenarios that may be unrecognisable for a model trained on COCO or a similar dataset.

Additional Artistic Creations: Section 3.2 described scenarios where art can potentially mislead autonomous vehicles. This category contains art images with objects that are not in the COCO dataset classes, which may be even harder to handle. The image in Fig. 3(p) shows a painting of a road; this might lead a vehicle to incorrectly drive ahead, compromising occupants' safety.

Animals Crossing: Different forms of animal crossing signs are used to ensure all intended users' safety. It may be challenging for autonomous vehicles to identify all such signs and also to understand details like the distance over which the sign applies. This category includes a variety of animal crossing signs from across the world, e.g. Fig. 3(q).

Natural Calamities: With ever-changing weather and environmental conditions, natural calamities are a risk that cannot be eliminated. These incidents often severely damage the transportation infrastructure around us. This category includes examples of natural calamities. Fig. 3(r) shows road damage that occurred as a result of a flood (left) and an earthquake (right).

3.9 Summary of Dataset

The curated dataset contains a total of 894 images across six categories. The number of images in each category is given in Table 1. The most frequent object classes in the Coyote dataset are (in descending order) person, car, truck, bicycle, motorbike, traffic light, bus, stop sign, train, bird, cow, and umbrella, followed by others with fewer than ten occurrences. We have released the images and our ground truth labels for research purposes (https://github.com/Suruchidgupta/UniversalAdversarialChallenges-AutonomousVehicles). Images have unique file names. An accompanying spreadsheet provides manually-annotated ground truth, comprising the file name and a count of objects of each class in the image, using the same set of classes as used in the COCO dataset. The database also contains an appendix with a list of links to the sources of all images.

Table 1: Summary of type and number of images in the dataset.

Category Name                    Number of Images
Advanced Scenarios               99
Art-in-surroundings and Murals   226
On-road Scenarios                210
Parking Spaces                   41
Street Signs                     99
Vehicle Art and Textures         219
Total                            894
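As an illustration of how this annotation format can be consumed, the short sketch below loads the per-image class counts. It is illustrative only: it assumes the spreadsheet has been exported to CSV with a file-name column followed by one column per COCO class holding that class's object count; the exact column names are assumptions rather than part of the released files.

```python
import csv
from typing import Dict

def load_ground_truth(csv_path: str) -> Dict[str, Dict[str, int]]:
    """Map each image file name to its per-class object counts.

    Assumed CSV layout: a 'filename' column, then one column per COCO
    class name (e.g. 'person', 'car', ...) containing an integer count.
    """
    ground_truth: Dict[str, Dict[str, int]] = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            filename = row.pop("filename")
            # Keep only the classes that actually occur in this image.
            ground_truth[filename] = {
                cls: int(count)
                for cls, count in row.items()
                if count and int(count) > 0
            }
    return ground_truth

# Hypothetical CSV export of the accompanying spreadsheet.
labels = load_ground_truth("coyote_ground_truth.csv")
print(len(labels), "annotated images")
```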
4 Experiments

4.1 Experimental Methodology

As discussed in Section 1, this project employs state-of-the-art object detection and semantic segmentation models to test the collected road scenarios. The models used are pre-trained on the COCO dataset to identify 80 common object classes in the surroundings, and are not altered during this experiment. The experiments are conducted on a MacBook Pro running macOS Catalina version 10.15.6.

After collecting the initial sample set, we configured the state-of-the-art models to run inference. The confidence threshold for the object detection models is set to 70%. Subsequently, we labeled all the object classes in the images manually, using the output classes of the MS-COCO dataset, to generate the ground truth for the data. Finally, we used the generated ground truth for the implementation of the evaluation metrics and summarised the results to infer the overall outcome of the project.

To compare the results on the Coyote dataset with benchmark datasets, we have used the MS-COCO 2017 validation set and a random subset of 1715 images from the KITTI testing set. The images from these datasets are tested in the same setup as the Coyote dataset. The KITTI dataset has different class labels from those of MS-COCO, so we mapped them to the closest matching categories in MS-COCO (e.g. Pedestrian maps to Person) and evaluated them with the same set of metrics, to enable valid comparisons.

Metrics: To evaluate the performance of the models, in line with other work such as [Hung et al., 2020; Benjdira et al., 2019], we used the following metrics: True Positives (TP), False Positives (FP), False Negatives (FN), accuracy (Acc), precision (Prec), recall (Rec), and F1-score (F1). As usual, True Negatives are not counted, since it is not useful to note that an object class does not exist in an image and that object class was not detected; since MS-COCO has 80 classes, we would have a huge number of TNs for every image. For simplicity, the ground truth labels used in the Coyote dataset identify the number of occurrences of each of the output classes, and the present version does not include bounding boxes. Hence, we cannot compute mAP or IoU for the Coyote dataset.
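The count-based evaluation described above can be summarised in a short script. The sketch below is illustrative rather than the authors' exact implementation: it assumes per-image ground truth and predictions are available as dictionaries mapping COCO class names to object counts (with KITTI labels already mapped to COCO names, e.g. Pedestrian to person), and it matches counts per class, where TP is the overlap of the two counts, FP is surplus detections and FN is missed objects, which is one reasonable reading of the metric definitions given here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def count_confusions(ground_truth: Dict[str, int],
                     predictions: Dict[str, int]) -> Tuple[int, int, int]:
    """Compare per-class object counts for one image (no bounding boxes).

    Assumed matching rule: per class, TP is the smaller of the two counts,
    FP is any surplus detections, FN is any missed objects.
    """
    tp = fp = fn = 0
    for cls in set(ground_truth) | set(predictions):
        g = ground_truth.get(cls, 0)
        p = predictions.get(cls, 0)
        tp += min(g, p)
        fp += max(p - g, 0)
        fn += max(g - p, 0)
    return tp, fp, fn

def summarise(images: Iterable[Tuple[Dict[str, int], Dict[str, int]]]) -> Dict[str, float]:
    """Aggregate TP/FP/FN over a dataset and derive Acc, Prec, Rec, F1.

    Accuracy is taken as TP / (TP + FP + FN), since TNs are not counted.
    """
    totals = Counter()
    for gt, pred in images:
        tp, fp, fn = count_confusions(gt, pred)
        totals.update(tp=tp, fp=fp, fn=fn)
    tp, fp, fn = totals["tp"], totals["fp"], totals["fn"]
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    acc = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1,
            "tp": tp, "fp": fp, "fn": fn}

# Example: one image whose mural is detected as two extra "real" people.
print(summarise([({"person": 1}, {"person": 3})]))
```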
Table 2: Cumulative results for the Faster R-CNN (FRCNN) and YOLOv3 models on the Coyote, KITTI, and MS-COCO datasets. The overall results show a significant drop in precision on the Coyote dataset. Red colour indicates the lowest performance among the three datasets in each column.

                              Accuracy       Precision      Recall         F1-score       TP             FP            FN
Category                      FRCNN  YOLOv3  FRCNN  YOLOv3  FRCNN  YOLOv3  FRCNN  YOLOv3  FRCNN  YOLOv3  FRCNN YOLOv3  FRCNN YOLOv3
On-road Scenario              0.64   0.61    0.77   0.88    0.79   0.67    0.78   0.76    550    467     161   61      149   232
Art in surroundings & Murals  0.15   0.24    0.15   0.25    0.86   0.78    0.26   0.38    84     76      460   223     14    22
Street Signs                  0.56   0.50    0.63   0.80    0.84   0.58    0.72   0.67    74     51      43    13      14    37
Parking spaces                0.78   0.70    0.82   0.91    0.94   0.75    0.88   0.82    108    86      24    8       7     29
Vehicle Art and Textures      0.59   0.61    0.72   0.90    0.76   0.65    0.74   0.75    542    460     212   49      167   249
Coyote Total                  0.52   0.49    0.60   0.62    0.79   0.71    0.68   0.66    1358   1211    900   739     351   498
MS-COCO 2017 Val Set          0.53   0.46    0.91   0.79    0.56   0.53    0.69   0.63    11211  10668   1069  2915    8977  9520
KITTI Testing Subset          0.76   0.63    0.86   0.81    0.87   0.74    0.86   0.77    3778   3213    604   735     564   129

4.2 Results with YOLOv3 & Faster R-CNN

YOLOv3: YOLOv3 internally uses a Darknet architecture trained on MS-COCO (https://pjreddie.com/darknet/yolo/). It provides a robust single-shot approach to object detection and is known to be on par with other models such as Faster R-CNN. Fig. 4 shows some cases from the Coyote dataset where YOLOv3 performs correctly. In the top-left and bottom-right images, the model distinguishes between the billboard/mural and the person with confidences of 99%. It identifies the car decorated with fruit (top-right) and the altered stop sign (bottom-left) with confidence values of 87% and 95%.

Figure 4: YOLOv3: some successful examples. (Top-left) Man standing beside a billboard; (top-right) car decorated with fruit; (bottom-left) art on a stop sign; (bottom-right) woman sitting in front of a truck mural.

However, Fig. 5 shows other examples where YOLOv3 fails. In the top-left image, a person-like sculpture is mounted on an old bicycle, and the model identifies it as a person with 95% confidence. In the top-right image, parking spaces for bikes are bike-shaped, causing false detection of a bicycle with 93% confidence. The stop sign in the centre-left and the car in the bottom-left are not detected, because of the art on the sign and the grass texture on the car. Additionally, the murals in the centre-right and bottom-right images are incorrectly identified as real objects.

Figure 5: YOLOv3: some unsuccessful cases. (Top-left) Sculpture mistaken for a person on a bicycle; (top-right) bike parking confused for a bicycle; (centre-left) stop sign not detected; (centre-right) mural with a person on a bicycle classified as real; (bottom-left) undetected car with grass texture; (bottom-right) mural containing cart and people identified as real.

The overall performance of YOLOv3 across all categories is provided in Table 2. The high FP value in the Art in Surroundings and Murals category indicates that multiple objects in the art are identified as real objects. Conversely, in other categories, the high number of FNs indicates that there are objects present that the model cannot find. YOLOv3's overall accuracy on the Coyote dataset is low at 49%, with precision=62%, recall=71%, and F1-score=66%. The statistics in the table indicate that the images are challenging for YOLOv3.

Faster R-CNN: The Faster R-CNN model trained on MS-COCO is built on the ResNet101 architecture with 1024x1024 resolution. Faster R-CNN identifies more details in the images than YOLOv3 and hence provides good results for many images in the dataset.

Fig. 6 (top-left) shows the successful identification of a bicycle mounted on the front of a bus. The top-right and bottom-right images show instances where YOLOv3 fails but Faster R-CNN successfully identifies the stop sign and the grass-textured car. The bottom-left image shows that a mural with a cyclist does not confuse the model.

Figure 6: Faster R-CNN: sample successful results. (Top-left) Bus with bicycle mounted in front; (top-right) modified stop sign; (bottom-left) mural with a cyclist; (bottom-right) car with grass texture.

Some unsuccessful results for the Faster R-CNN model are shown in Fig. 7. The top-left and centre-left images show that Faster R-CNN identified objects from billboards and murals as real with high confidence (98% and 99%, respectively). The top-right image shows that the decorated car is missed while other objects are successfully classified. The model misclassifies the car in the centre-right as a cake with 97% confidence, and the shopping cart as a bicycle with 91% confidence. The bottom-right image shows a sculpture made from discarded street signs, but the model identifies some as actual stop signs and misclassifies the arrows as a parking meter with 90% confidence.

Figure 7: Faster R-CNN: sample unsuccessful results. (Top-left) Photo in poster identified as real person; (top-right) car not detected; (centre-left) painting of motorcycle identified as real; (centre-right) decorated car identified as cake; (bottom-left) shopping cart identified as bicycle; (bottom-right) art built using discarded street signs.

As shown in Table 2, the overall performance of Faster R-CNN is not strong. Like YOLOv3, it has a high FP value in the Art in Surroundings and Murals category, implying that the model is confused by the paintings and murals. Faster R-CNN also has a high FP level for the Vehicle Art and Textures category. In the other categories, there are also significant FN levels, though they are not quite as high as for YOLOv3.

The number of TPs for Faster R-CNN is significantly higher than that of YOLOv3, leading to fewer FNs. However, Faster R-CNN has a very high number of FPs, which reduces the model's overall performance. Hence, its accuracy is very low at 52%, with precision=60%, recall=79%, and F1-score=68%. On the whole, the Coyote dataset degrades the performance of both the Faster R-CNN and YOLOv3 models.

Comparison with KITTI and MS-COCO: We performed further experiments to evaluate Faster R-CNN and YOLOv3 on the MS-COCO 2017 validation set and a subset of the KITTI dataset. The objective was to determine whether the Coyote dataset is indeed more challenging than existing benchmark datasets. The results in Table 2 show that, relative to MS-COCO and KITTI respectively, precision on the Coyote dataset drops by 31 and 26 percentage points with Faster R-CNN, and by 17 and 19 percentage points with YOLOv3. In addition, the F1-scores are lower on the Coyote dataset. Hence, a vehicle that uses such models would be at risk of accidents when it faces real-world edge-case examples as exemplified by the Coyote dataset.

4.3 Semantic Segmentation with DeepLabv3

Semantic segmentation is an extension of the proposed project. DeepLabv3 [Chen et al., 2017] is pre-trained on the MS-COCO dataset and uses the colour map from the PASCAL VOC dataset, with 21 output labels (background + 20 PASCAL VOC labels). The DeepLabv3 model used for this project operates on the MobileNetv2 architecture with a depth multiplier of 0.5 [Sandler et al., 2018]. We apply it to a small collection of 19 images and evaluate the results manually.
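For reference, the snippet below sketches how a pre-trained DeepLabv3 model can be applied to one of these images and reduced to a per-pixel class map. It is not the authors' exact pipeline: it uses torchvision's DeepLabV3 with a MobileNetV3 backbone (trained on a COCO subset with the 20 PASCAL VOC classes plus background) as a stand-in for the TensorFlow MobileNetV2 checkpoint described above, and the image path is hypothetical.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_mobilenet_v3_large

# 21 PASCAL VOC classes (background + 20), matching the pre-trained head.
VOC_CLASSES = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

# Standard ImageNet normalisation expected by torchvision's segmentation models.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = deeplabv3_mobilenet_v3_large(weights="DEFAULT").eval()

def segment(image_path: str) -> torch.Tensor:
    """Return an (H, W) tensor of per-pixel class indices for one image."""
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)        # shape: [1, 3, H, W]
    with torch.no_grad():
        logits = model(batch)["out"]              # shape: [1, 21, H, W]
    return logits.argmax(dim=1).squeeze(0)        # shape: [H, W]

# Hypothetical file name; list the classes the model believes are present.
label_map = segment("coyote/3d_illusion_example.jpg")
present = sorted({VOC_CLASSES[i] for i in label_map.unique().tolist()})
print(present)
```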
Figure 8: DeepLabv3 semantic segmentation samples. (Top) Illusion of ice blocks on the road and a person fishing; (bottom) people beside an illusion of a road and a cyclist tearing through a wall.

The output of semantic segmentation for some images shows correct results, where the model is not affected by the presence of confusing art around the real objects. For example, Fig. 8 (top) shows art creating an illusion of ice blocks and a person fishing; DeepLabv3 correctly segments the people in the image and is not affected by the art. In Fig. 8 (bottom), people are standing beside a painting of a bicycle and a road with a tearing illusion. The model classifies the painted cyclist and bicycle as real, along with the real people standing beside them.

On analysing the results for all 19 images, we observe that the reason images such as Fig. 8 (top) are correctly segmented can be explained by the limited number of output classes in the PASCAL VOC dataset, which do not include ice, for example. The most common failures in semantic segmentation are in classifying art as real objects; this can make autonomous vehicles susceptible to failure in the presence of 3-D illusions.

5 Conclusion

5.1 Analysis of Results

In this paper, we have presented a new publicly available test dataset, called Coyote, of real-world photographs that are easily understood by humans but might not be successfully parsed by computer vision systems, along with manual annotations of objects in all of the photographs. The photographs are grouped into six broad categories: (1) Art in Surroundings and Murals; (2) Vehicle Art and Textures; (3) On-Road Scenarios; (4) Street Signs; (5) Parking Spaces; and (6) Advanced Scenarios.

We have used the Coyote dataset to evaluate the performance of current state-of-the-art CNN-based computer vision systems, to identify challenging scenarios that can lead to erroneous performance in self-driving applications. We have found that the paintings in the Art in Surroundings and Murals category confuse the models the most. Both the YOLOv3 and Faster R-CNN models perform worse on this category than on any other, with a high number of FPs, showing that the models identify the paintings' components as real objects. In the case of the Street Signs category, embellishments to street signs and regional variations of road signs degrade the performance of both models. The decorated vehicles in the Vehicle Art and Textures category show contrasting failure modes for the Faster R-CNN and YOLOv3 models: while the YOLOv3 model cannot identify many of the objects in the images (high FNs), Faster R-CNN incorrectly identifies a large number of objects that are not present (high FPs).

The unseen road scenarios presented in the On-Road Scenarios category yield somewhat better performance for both models. However, both models have a high number of FNs, indicating that they are often unable to identify the objects in these new scenarios. The Parking Spaces category, which contains the smallest number of examples, shows the best performance among all of the categories. It is possible that the models are not misled because the COCO dataset has a limited set of class labels. Nonetheless, this category highlights some interesting adversarial scenarios for self-driving applications. In the Advanced Scenarios category, with 3-D art on roads, DeepLabv3 can correctly segment humans and the background, but it fails in some cases when 3-D representations of objects are painted on walls. However, these scenarios could be assessed further using a more comprehensive model trained specifically for road scenarios.

Analysing the aggregate results for both models shows that the Faster R-CNN model captures more details than the YOLOv3 model. This helps the Faster R-CNN model make better inferences; however, it also makes the model more vulnerable to failure on edge cases.

5.2 Risks and Mitigation

The key risk associated with the kinds of errors we identify in this work is that a vehicle may react inappropriately, for example braking sharply to avoid hitting a "person" that is actually an image on the back of a van, or preparing to drive ahead when the street ahead is actually a mural. Some strategies for risk mitigation are summarised below.

Sensor Fusion: For example, data fused from camera and lidar systems might indicate that an image of a person on the back of a van is flat and therefore cannot be real, thereby reducing false positives. However, if the vision system detects a person ahead with high confidence and lidar does not, must the autonomous vehicle act conservatively to preserve life?

Common Sense Reasoning: Work is being done on simulation-based reasoning systems that aim to synthesize understanding of a domain. It is conceivable that such systems could be extended to recognise, for example, that a photograph of the head of a person is not actually a person.

Better Treatment of Scale: Depth and scale are important factors for humans in distinguishing real items from artificial ones, e.g. a car decorated to look like a cat. Current CNNs rely to a large extent on textures and patterns and are designed to be scale-invariant.

Spatio-Temporal Reasoning: While a single image of a mural may look realistic to a human, when viewed over time from slightly different viewpoints it rapidly becomes clear that it is 2D. This requires spatio-temporal reasoning that is beyond the capacity of current systems.
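As a toy illustration of the sensor-fusion point above, the sketch below encodes one possible conservative fusion policy: a camera detection is treated as real unless a (hypothetical) lidar-derived flatness cue indicates that the detected "person" lies on a flat surface, and when the two sensors disagree at high camera confidence the vehicle still behaves cautiously. The data fields, thresholds, and the policy itself are illustrative assumptions, not part of the Coyote work.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # e.g. "person", from the camera-based detector
    confidence: float  # detector confidence in [0, 1]
    lidar_flat: bool   # hypothetical cue: lidar points in this region form a flat surface

def fusion_decision(det: Detection,
                    high_conf: float = 0.9,
                    act_conf: float = 0.7) -> str:
    """Toy policy for fusing a camera detection with a lidar flatness cue."""
    if det.lidar_flat and det.confidence < high_conf:
        # Both cues suggest it is probably a picture (e.g. on the back of a van).
        return "ignore"
    if det.lidar_flat and det.confidence >= high_conf:
        # Sensors disagree strongly: err on the side of caution.
        return "slow_down"
    if det.confidence >= act_conf:
        return "brake"
    return "monitor"

# A confident "person" detection that lidar says is flat -> slow down rather than emergency-brake.
print(fusion_decision(Detection("person", 0.98, lidar_flat=True)))
```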
5.3 Future Work

There is substantial scope for building on the work presented here. Most obviously, the Coyote dataset can be extended by adding more images. For example, [Mufson, 2017] presents art created by Filip Piekniewski containing scenarios that can confuse autonomous vehicles. In addition, the images could be annotated with bounding boxes and pixel-level semantic annotations. Such fine-grained ground truth would support localization in object detection, which is critical information for autonomous vehicles.

Further work will be done on increasing the number of images by adding more challenging scenarios, e.g. occluded objects such as a person partially hidden under an umbrella, different fields of view for each object, and images captured from a camera within a vehicle itself. Additionally, other forms of data (such as LiDAR, multiple cameras, fisheye cameras, GPS, and temporal consistency between frames) could be brought into the analysis, to assess how much the additional information can support autonomous vehicles in such challenging scenarios, as is done in, e.g., the nuScenes dataset [Caesar et al., 2020].

References

[Benjdira et al., 2019] B. Benjdira, T. Khursheed, A. Koubaa, A. Ammar, and K. Ouni. Car detection using unmanned aerial vehicles: Comparison between Faster R-CNN and YOLOv3. In 2019 1st International Conference on Unmanned Vehicle Systems-Oman (UVS), pages 1–6, 2019.

[Blum et al., 2019] Hermann Blum, Paul-Edouard Sarlin, Juan Nieto, Roland Siegwart, and Cesar Cadena. Fishyscapes: A benchmark for safe semantic segmentation in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

[Caesar et al., 2020] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

[Chen et al., 2017] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.

[Cordts et al., 2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

[Geiger et al., 2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32(11):1231–1237, 2013.

[Hendrycks et al., 2019] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. CoRR, abs/1907.07174, 2019.

[Huang et al., 2020] Xinyu Huang, Peng Wang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The ApolloScape open dataset for autonomous driving and its application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2702–2719, 2020.

[Hung et al., 2020] Goon Li Hung, Mohamad Safwan Bin Sahimi, Hussein Samma, Tarik Adnan Almohamad, and Badr Lahasan. Faster R-CNN deep learning model for pedestrian detection from drone images. SN Computer Science, 1(2):1–9, 2020.

[Janai et al., 2017] Joel Janai, Fatma Güney, Aseem Behl, and Andreas Geiger. Computer vision for autonomous vehicles: Problems, datasets and state of the art. arXiv e-prints, arXiv:1704.05519, April 2017.

[Krizhevsky et al., 2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[Madry et al., 2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations (ICLR), Canada, April 30 – May 3, 2018.

[Maturana et al., 2017] Daniel Maturana, Po-Wei Chou, Masashi Uenoyama, and Sebastian A. Scherer. Real-time semantic mapping for autonomous off-road navigation. In Field and Service Robotics, 11th International Conference (FSR), Zurich, Switzerland, 12–15 September, volume 5, pages 335–350. Springer, 2017.

[Mufson, 2017] Beckett Mufson. An engineer is painting surreal scenarios that confuse self-driving cars, 2017.

[Nguyen et al., 2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.

[Papert, 1966] Seymour A. Papert. The Summer Vision Project. 1966.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[Sakaridis et al., 2021] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The Adverse Conditions Dataset with Correspondences for semantic driving scene understanding. arXiv preprint arXiv:2104.13395, 2021.

[Sandler et al., 2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[Szegedy et al., 2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada, April 14–16, 2014.

[Szeliski, 2011] Richard Szeliski. Computer Vision: Algorithms and Applications. Texts in Computer Science. Springer, 2011.

[Xie et al., 2020] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L. Yuille, and Quoc V. Le. Adversarial examples improve image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 819–828, 2020.

[Yogamani et al., 2019] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O'Dea, Michal Uricár, Stefan Milz, Martin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. In IEEE International Conference on Computer Vision, pages 9308–9318, 2019.

[Zendel et al., 2018] Oliver Zendel, Katrin Honauer, Markus Murschitz, Daniel Steininger, and Gustavo Fernandez Dominguez. WildDash – creating hazard-aware benchmarks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 402–416, 2018.