Coyote: A Dataset of Challenging Scenarios in Visual Perception for Autonomous
                                   Vehicles
                            Suruchi Gupta1 , Ihsan Ullah2 , Michael G. Madden1∗
1 School of Computer Science, National University of Ireland Galway, Galway, Ireland
2 CeADAR Ireland’s Center for Applied AI, University College Dublin, Dublin, Ireland
                         {s.gupta9, michael.madden}@nuigalway.ie, ihsan.ullah@ucd.ie

Abstract

Recent advances in Artificial Intelligence have immense potential for the realization of self-driving applications. In particular, deep neural networks are being applied to object detection and semantic segmentation to support the operation of semi-autonomous vehicles. While full Level 5 autonomy is not yet available, elements of these technologies are being brought to market in advanced driver assistance systems that provide partial automation at Levels 2 and 3. However, multiple studies have demonstrated that current state-of-the-art deep learning models can make high-confidence but incorrect predictions. In the context of a critical application such as understanding the scene in front of a vehicle, which must be robust, accurate, and operate in real time, such failures raise concerns; most significantly, they may pose a substantial threat to the safety of the vehicle’s occupants and other people with whom the vehicle shares the road.

To examine the challenges of current computer vision approaches in the context of autonomous and semi-autonomous vehicles, we have created a new test dataset, called Coyote (https://github.com/Suruchidgupta/UniversalAdversarialChallenges-AutonomousVehicles), with photographs that can be understood correctly by humans but might not be successfully parsed by current state-of-the-art image recognition systems. The dataset has 894 photographs with over 1700 ground-truth labels, grouped into 6 broad categories.

We have tested the dataset against existing state-of-the-art object detection (YOLOv3 & Faster R-CNN) and semantic segmentation (DeepLabv3) models to measure the models’ performance and identify situations that might be a source of risk to transportation safety. Our results demonstrate that these models can be confused by various adversarial examples, resulting in lower performance than expected: YOLOv3 achieves an accuracy of 49% and precision of 62%, while Faster R-CNN achieves an accuracy of 52% and precision of 60%.

1   Introduction

An Autonomous Vehicle (AV) perceives its environment using sensors such as radar, sonar, GPS, and cameras, and uses an advanced control system to identify an appropriate navigation path [Janai et al., 2017]. For this, AV architectures make use of the field of computer vision to interpret and understand their visual inputs. Attempts to provide computers with an understanding of the visual components around them date back to the 1960s [Papert, 1966]. Before the emergence of Convolutional Neural Networks (CNNs) [Krizhevsky et al., 2017], traditional algorithms were used to extract edges and identify shapes; these extracted structural features were then used to identify elements of an image [Szeliski, 2011].

Although researchers have reported that modern computer vision systems approach human-level performance [Russakovsky et al., 2015], other studies have conversely demonstrated that images with small perturbations, or with minor features that should be irrelevant, can adversely affect performance [Hendrycks et al., 2019; Nguyen et al., 2015; Szegedy et al., 2014]. Such images, known as adversarial examples, can occur naturally [Hendrycks et al., 2019] or be user-constructed [Nguyen et al., 2015]. In this paper, we explore similar ideas, focusing specifically on the domain of computer vision for (semi-)autonomous vehicles.

Our Contributions:
1. We have compiled and annotated a dataset of publicly available, real-world photographs that are easily understood by humans but might not be parsed successfully by computer vision systems.
2. We have used this dataset to evaluate the performance of current state-of-the-art CNN-based computer vision systems, to identify challenging scenarios that can lead to erroneous performance in self-driving applications.
3. We have analysed the effects of these scenarios on the performance of autonomous vehicles.
4. We have considered the key risks associated with these challenging scenarios, and proposed some mitigations. As we will note, improvements to computer vision models, or using them in combination with other sensor systems, can reduce the risk but may not remove it entirely.

* Contact Author. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2   Related Work

There are multiple computer vision datasets for autonomous vehicle applications. Cameras from an autonomous driving platform were used to acquire 13k images for the KITTI dataset [Geiger et al., 2013], where scenarios include road, city, residential, campus, etc. KITTI is often used for evaluation only, due to its limited size [Janai et al., 2017]. The Cityscapes dataset [Cordts et al., 2016] contains pixel-level semantic labelling for 25k images of urban scenarios from 50 cities. It has more detailed annotations than KITTI but does not cover as many scenarios. The ApolloScape dataset [Huang et al., 2020] provides 140k labelled images of street views for lane detection, car detection, semantic segmentation, etc., and is intended to enable performance evaluation across different times of day and weather conditions [Janai et al., 2017]. The WoodScape dataset for autonomous cars [Yogamani et al., 2019] provides 10k images from 4 dedicated fisheye cameras with semantic annotations for 40 classes.

Since images in our dataset include a wide variety of objects that might not be associated with vehicles, we use the Microsoft COCO dataset (Common Objects in Context) [Lin et al., 2014] as the basis for our trained classifiers and for comparative evaluation. COCO contains 328k images with 80 labels of commonly available objects in their surroundings, objects that could be recognised by a young child.

With the increasing use of deep neural network (DNN) models for image processing, there have been multiple analyses of how these models can be attacked. Experiments have shown that small but carefully chosen perturbations in data can significantly decrease the performance of models. These perturbations, known as adversarial examples, can either be naturally-occurring unseen scenarios [Hendrycks et al., 2019] or user-constructed [Nguyen et al., 2015; Szegedy et al., 2014] to induce mistakes. Szegedy et al. [2014] used a pre-trained network and derived perturbations specific to an image by making small adjustments to specific pixels; the adjustments are not noticed by the human eye but result in the images being misclassified by the network. Conversely, Nguyen et al. [2015] generated random images that do not appear recognisable to the human eye but are classified as objects with high confidence by a DNN.

In [Hendrycks et al., 2019], a set of real-world images were collected that contain object classes on which DNNs are trained, but that are challenging for DNNs to classify because of features such as texture, shape and background. They also added images with unseen classes and found that the DNN model made high-confidence incorrect classifications, rather than having low confidence in recognising unseen classes, raising further concerns about the reliability of current DNN models for handling unseen examples. Some images from [Hendrycks et al., 2019] are shown in Fig. 1.

Figure 1: Natural Adversarial Examples from [Hendrycks et al., 2019]

Other work has focused on the use of adversarial examples to improve upon existing state-of-the-art models by harnessing these examples to build new DNN models that are resistant to adversarial attacks [Xie et al., 2020; Madry et al., 2018]. Xie et al. [2020] have used adversarial examples as part of the training sample space to prevent over-fitting and improve the overall performance of the model. Madry et al. [2018] have laid out optimization techniques to handle “the first-order adversary” and to build adversarial robustness into models for accurate classification results. In our work, we are not concerned with adversarial attacks through image modification, but with problems arising from “edge cases” that might not be well covered in training sets but that will occur in the real world.

Of more direct relevance to our work, there are a few adversarial datasets for self-driving applications. For instance, WildDash [Zendel et al., 2018] is a test dataset containing 1800 frames addressing natural risks in images, such as distortion, overexposure, windscreen effects, etc. The dataset considers road conditions from diverse geographic locations, weather and lighting conditions to reduce the bias in training. Similarly, the Adverse Conditions Dataset with Correspondences (ACDC) for semantic driving scene understanding [Sakaridis et al., 2021] studies the effects of four conditions (fog, nighttime, rain, and snow) on semantic segmentation using a set of 4006 images. The dataset includes a normal-condition image for each adverse-condition image to identify the challenges arising from changing weather and lighting conditions, and it is intended to be used in conjunction with existing datasets to improve model performance under those conditions. Additionally, the FishyScapes dataset [Blum et al., 2019] places anomalous objects in front of the vehicle and evaluates various state-of-the-art semantic segmentation models on them. It uses images from the Cityscapes dataset [Cordts et al., 2016], overlaying the objects at random distances and sizes to study model performance in the presence of anomalous objects.

While those datasets include high-quality pixel-level semantic segmentations, they do not cover the diverse range of scenarios covered in our Coyote dataset. We hope that the Coyote dataset can form a basis for testing computer vision for autonomous vehicles in edge-case scenarios. Moreover, the image collection can be extended so that a larger version could be used to train computer vision systems that yield better performance and resilience to edge cases and adversarial attacks, thereby improving automotive safety.

3   Overview of the Coyote Dataset

Our Coyote dataset consists of 894 photographs with over 1700 ground-truth labels, grouped into 6 broad categories, as briefly outlined in Fig. 2 and described in the following subsections. We have named this dataset after the cartoon character Wile E. Coyote, who sometimes used realistic murals to mislead the Road Runner. We have chosen photographs that we consider to be easily understood correctly by humans,
but not necessarily parsed correctly by current state-of-the-art image recognition systems.

3.1   Collection Methodology

Initially, we collected a sample set of images that might potentially influence the performance of autonomous vehicles and configured the state-of-the-art object detection models. We evaluated these images on those models and used the outcomes to refine the collection process iteratively. As we collected images, we organised them into categories: Street Signs; Vehicle Art and Textures; Art in Surroundings and Murals; Parking Spaces; On-road Scenarios; and Advanced Scenarios. The images contain either the front or side view of the objects. In almost all cases, the images are un-edited, but we cropped 3 images in the dataset to reduce the background noise in them. The images collected are of different sizes and aspect ratios.

General Data Protection Regulation Considerations: All images selected for inclusion in the dataset are publicly available, free for distribution, and labeled for reuse under the Creative Commons license. We avoided many other images because of copyright restrictions.

3.2   Art in Surroundings and Murals

Visual art dates back to ancient civilisations and was used as an effective way of communicating without words. Streets and their surroundings across the world witness this form of art; some consider it to be a means of communication whereas others consider it vandalism. Either way, an autonomous vehicle must distinguish works of art from reality. Hence, this category aims to identify art that exists near roads and that might deteriorate the performance of self-driving applications.

Re-creation of a road scenario: Some art painted on walls depicts streets with components like cars, traffic lights, cycles, pedestrians, etc. For example, Fig. 3(a) is a mural based on the Beatles’ Abbey Road album, which might be interpreted as a roadway rather than a wall.

False identification of risks in surroundings: Some murals contain elements that may be misidentified as a source of threat to the occupants, e.g. pictures of accidents, wild animals, natural calamities, etc. The mural in Fig. 3(b) might be misinterpreted as a crashed car.

Art representing road objects: Sculptures and other artworks may depict objects typically found on the road, such as cars, trucks, motorbikes, etc. There is a risk that artworks such as the one in Fig. 3(c) can be confused with an actual car.

3.3   Vehicle Art and Textures

Most vehicles conform to standard makes and colours. However, some have unusual artwork or textures, either for artistic reasons or for commercial branding. This category contains images of vehicles that are unusual and, as such, may be challenging for autonomous vehicle object recognition.

Vehicles disguised as other objects: This category includes images of vehicles camouflaged as different objects such as a shoe, telephone, or animal (cat, peacock, dragon), or adorned with flowers, skulls, and other designs. Fig. 3(d) shows a car disguised as a cat. Interestingly, Houston hosts an Art Car Parade every year to showcase unique car designs; more examples can be found on its website (https://www.thehoustonartcarparade.com/).

Vehicles with Textures: Some companies and individuals use texture as a medium to advertise their brand or decorate their vehicles. Vehicles are either covered with a specific pattern such as grass, cow patches, tiger prints, etc., or with small assorted patterns to create a unique effect on the automobile body (Fig. 3(e)). Alternatively, some vehicles have a scene painted on their body, such as a brand image, a graphic art book image, or a movie scene.

Custom Built Vehicles: Some vehicles are uniquely designed as ‘positional goods’ to have distinguishing features, such as custom prints, solar panels, dual engines, etc. As shown in Fig. 3(f), some of these vehicles are hybrids of different automobiles; for instance, the car that looks similar to a helicopter might hinder object recognition.

3.4   On-Road Scenarios

Computer vision systems for autonomous vehicles are trained on datasets relevant to the road, which contain road objects and scenarios to help the model identify on-road components. However, scenarios across the world are so diverse that it is challenging to ensure that all possible scenarios are included in training datasets. The images in this category are unusual but realistic on-road scenarios that might be challenging for autonomous vehicle object recognition.

Animals on the Street: As humans expand our habitable land, there are multiple places where encounters with animals on roads are not unusual. Hence, images in this category show different species of animals, wild and domestic, wandering across the streets in rural and urban scenarios (e.g. Fig. 3(g)). Unlike other categories such as murals and those involving other vehicles, the behaviour of animals is difficult to predict and hence can be difficult for the autonomous vehicle to handle.

Billboards along the Road: Billboards are commonly seen along roads. Although they should not interfere with an autonomous vehicle’s operation, their presence can potentially confuse it. The image in Fig. 3(h) shows an election poster. If an autonomous vehicle mistakes a poster for a real human, it might apply emergency brakes, leading to erratic driving behaviour.

Challenging Driving Scenarios: When employed in the physical world, autonomous vehicles are subjected to different lighting conditions and varying weather conditions (fog, rain, snow, etc.) throughout the year. Hence, they must be aware of different weather conditions and their resulting impact on the surroundings. The image in Fig. 3(i) is an example of a change in weather conditions. If the autonomous vehicle does not correctly identify objects in low visibility or in varying weather conditions, it might not modify its behaviour accordingly. This category also includes images of regional variations of vehicles (tricycles for public transport, cargo bicycles for delivery, etc.), challenging road scenarios (such as mountains, valleys, etc.), and images of extreme situations such as accidents, to evaluate how autonomous vehicles handle such scenarios.
                                           Figure 2: Broad categories of images in Coyote dataset




Figure 3: (a) Sample murals re-creating road scenarios; (b) Paintings on road depicting safety threat; (c) Sculptures that resemble road objects
such as cars and trucks; (d) Motor vehicles camouflaged as other objects; (e) Textures to decorate vehicles; (f) Custom built motor vehicle;
(g) Photos of animals on roads; (h) Road-side signs with pictures of humans; (i) Challenging road scenarios for AVs; (j) Street signs modified
by artists; (k) Street signs in Arabic; (l) Custom traffic signals on the road; (m) Images showing presence of unseen objects in the parking
space; (n) Non-standard signs and warnings; (o) 3-dimensional illusions that may challenge semantic segmentation; (p) Additional art that
might act as adversarial examples; (q) Examples of animal crossing signs; (r) Natural events
Other datasets, such as the Yamaha-CMU Off-Road dataset (YCOR) [Maturana et al., 2017] and the PASCAL-VOC dataset, also include some extreme weather scenarios but do not include the other scenarios presented in the Coyote dataset.

3.5   Street Signs

Street signs have an important function in guiding and providing instructions to road users, but street signs that are misunderstood by an autonomous vehicle would potentially have an adverse effect. This category includes regional variations of street signs across the world and modifications made by artists to street signs.

Art on Street Signs: Some artists have modified existing road sign elements to create interesting variations. These variations generally do not affect humans’ ability to recognise the signs but may be more challenging for autonomous vehicles. For example, if the modified speed bump sign or stop sign in Fig. 3(j) is ignored, the autonomous vehicle may fail to reduce its speed.

Regional Variations of Street Signs: While street signs are generally standardised within a region, there are many variations across regions. Most regions have street signs in the regional language (e.g. Fig. 3(k)). While autonomous vehicles sold in a region would be configured to handle these regional variations, they could cause problems for vehicles that travel between regions or that are imported by the owner into an unfamiliar region.

Custom Variations in Traffic Signals: While curating images relating to street signs, we encountered some custom traffic signals. Instead of standard circular lights, they may have custom figures in red and green colours. Signals such as the one in Fig. 3(l) are easily understood by humans but deviate from what autonomous vehicles may have been trained on.

This category also contains images of a large mosaic artwork created from discarded street signs. Identifying any of the street signs in the artwork might lead to an unexpected outcome by the vehicle. The COCO dataset only identifies the stop sign; a more comprehensive study with a domain-centred dataset could provide insights into the effects of these street signs on autonomous vehicles.

3.6   Parking Spaces

This category covers parking spaces and their environments. It includes examples of animals or objects in the parking space and non-standard environments, such as rural settings without the standard parking boxes.
Unforeseen Objects in Parking Spaces: The images in this category include miscellaneous objects in parking spaces, such as animals, shopping carts, etc. (e.g. Fig. 3(m)). The presence of unidentified objects like shopping carts (left) and animals (right) in parking spaces might lead to misjudgment by the autonomous vehicle.

Unconventional Parking Signs and Warnings: There are cases where authorities display warnings/notices or customised parking/no-parking signs that are not easily interpreted. The sample image on the left in Fig. 3(n) shows one such unconventional parking notice, while the image on the right shows a warning sign regarding icy car paths ahead, which would require significant natural language processing to interpret.

3.7   Semantic Segmentation

While collecting photos in the category of Art in Surroundings and Murals, we found images that create an illusion of 3D space. To examine how an autonomous vehicle might parse these scenarios, we have evaluated these images on a semantic segmentation model. The images depict multiple scenarios; for instance, Fig. 3(o) shows a painting of a hole with water and the presence of wild animals. Such images could result in the vehicle failing to proceed and blocking the road, as if access to the road were damaged and there were no way to go forward.

3.8   Advanced Scenarios

The COCO dataset covers only 80 commonly available object classes, and not all relevant scenarios can be covered by these 80 classes. Hence, the Advanced Scenarios category contains images of objects or scenarios that may be unrecognisable for a model trained on COCO or a similar dataset.

Additional Artistic Creations: Section 3.2 described scenarios where art can potentially mislead autonomous vehicles. This category contains art images with objects that are not in the COCO dataset classes, which may be even harder to handle. The image in Fig. 3(p) shows a painting of a road; this might lead a vehicle to incorrectly drive ahead, compromising occupants’ safety.

Animals Crossing: Different forms of animal crossing signs are used to ensure all intended users’ safety. It may be challenging for autonomous vehicles to identify all such signs and also to understand details like the distance over which the sign applies. This category includes a variety of animal crossing signs from across the world, e.g. Fig. 3(q).

Natural Calamities: With ever-changing weather and environmental conditions, natural calamities are a risk that cannot be eliminated. These incidents often severely damage the transportation infrastructure around us. This category includes examples of natural calamities. Fig. 3(r) shows road damage that occurred as a result of a flood (left) and an earthquake (right).

3.9   Summary of Dataset

The curated dataset contains a total of 894 images across six categories. The number of images in each category is given in Table 1. The most frequent objects in the Coyote dataset are (in descending order) person, car, truck, bicycle, motorbike, traffic light, bus, stop sign, train, bird, cow, and umbrella, followed by others with fewer than ten occurrences. We have released the images and our ground truth labels for research purposes (https://github.com/Suruchidgupta/UniversalAdversarialChallenges-AutonomousVehicles). Images have unique file names. An accompanying spreadsheet provides manually-annotated ground truth, comprising the file name and a count of objects of each class in the image, using the same set of classes as used in the COCO dataset. The database also contains an appendix with a list of links to the sources of all images.

4   Experiments

4.1   Experimental Methodology

As discussed in Section 1, this project employs state-of-the-art object detection and semantic segmentation models to test the collected road scenarios. The models used are pre-trained on the COCO dataset to identify 80 common object classes in the surroundings and are not altered during this experiment. The experiments are conducted on a MacBook Pro running macOS Catalina version 10.15.6.

After collecting the initial sample set, we configured the state-of-the-art models to run inference. The confidence threshold for the object detection models is set at 70%. Subsequently, we labeled all the object classes in the images manually, using the output classes of the MS-COCO dataset, to generate the ground truth for the data. Finally, we used the generated ground truth data to implement the evaluation metrics and summarised the results to infer the overall outcome of the project.

To compare the results on the Coyote dataset with benchmark datasets, we have used the MS-COCO 2017 Validation set and a random subset of 1715 images from the KITTI testing set. The images from these datasets are tested in the same setup as the Coyote dataset. The KITTI dataset has different class labels from those of MS-COCO, so we mapped them to the closest matching categories in MS-COCO (e.g. Pedestrian maps to Person) and evaluated them with the same set of metrics, to enable valid comparisons.

Metrics: To evaluate the performance of the models in line with what is done in other work such as [Hung et al., 2020; Benjdira et al., 2019], we used the following metrics: True Positives (TP), False Positives (FP), False Negatives (FN), Accuracy (Acc), Precision (Prec), Recall (Rec), and F1-score (F1). As usual, True Negatives are not counted, since it is not useful to note that an object class does not exist in an image and was not detected; since MS-COCO has 80 classes, we would have a huge number of TNs for every image. For simplicity, the ground truth labels used in the Coyote dataset identify the number of occurrences of each of the output classes, and the present version does not include bounding boxes. Hence, we cannot compute mAP or IoU for the Coyote dataset.
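To make this count-based evaluation concrete, the following is a minimal Python sketch of how per-image TP/FP/FN could be derived from class counts and summarised. It is illustrative rather than the authors' released code: the detections_to_counts wrapper, the class-by-class matching rule (minimum as TP, excess detections as FP, excess ground truth as FN), and all KITTI_TO_COCO entries except Pedestrian-to-person are assumptions; the 70% threshold and the omission of TNs follow the description above.

```python
from collections import Counter

CONF_THRESHOLD = 0.70  # detection confidence threshold used in the experiments

# Label mapping applied to KITTI annotations before comparison.
# Only Pedestrian -> person is stated in the paper; the rest are illustrative.
KITTI_TO_COCO = {"Pedestrian": "person", "Car": "car", "Truck": "truck", "Cyclist": "bicycle"}

def detections_to_counts(detections):
    """Reduce detector output (a list of (class_name, confidence) pairs,
    from YOLOv3 or Faster R-CNN) to per-class counts at the threshold."""
    return Counter(cls for cls, conf in detections if conf >= CONF_THRESHOLD)

def compare_counts(gt_counts, pred_counts):
    """Per-image TP/FP/FN from class counts only (no bounding boxes, no TNs)."""
    classes = set(gt_counts) | set(pred_counts)
    tp = sum(min(gt_counts.get(c, 0), pred_counts.get(c, 0)) for c in classes)
    fp = sum(max(pred_counts.get(c, 0) - gt_counts.get(c, 0), 0) for c in classes)
    fn = sum(max(gt_counts.get(c, 0) - pred_counts.get(c, 0), 0) for c in classes)
    return tp, fp, fn

def summarise(tp, fp, fn):
    """Summary metrics; with no TNs, accuracy reduces to TP / (TP + FP + FN)."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    acc = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"Acc": acc, "Prec": prec, "Rec": rec, "F1": f1}

# Example: ground truth lists 2 persons and 1 stop sign; the detector reports
# 2 persons and 1 car above threshold (the low-confidence dog is discarded).
gt = Counter({"person": 2, "stop sign": 1})
pred = detections_to_counts([("person", 0.98), ("person", 0.91), ("car", 0.83), ("dog", 0.40)])
print(compare_counts(gt, pred))   # -> (2, 1, 1): 2 TPs, 1 FP (car), 1 FN (stop sign)
```

Under this matching rule, the cumulative totals over all images can be passed to summarise to obtain dataset-level figures; defining accuracy as TP/(TP+FP+FN) in this way is consistent with the totals and percentages reported in Table 2.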
                                            Table 1: Summary of type and number of images in the dataset.

 Category Name         Advanced Scenarios   Art-in-surrounding and Murals   On-road Scenario   Parking spaces   Street Signs   Vehicle Art and Textures   Total
 Number of Images             99                         226                      210                41              99                  219               894

Table 2: Cumulative results for the Faster R-CNN (FRCNN) model vs. YOLOv3 on the Coyote, KITTI, and MS-COCO datasets. The overall results
  show a significant drop in precision for the Coyote dataset. Red colour marks the lowest performance among the three datasets in each column.

 Category Name                       Accuracy         Precision        Recall             F1-score        True Positive       False Positive      False Negative
                                FRCNN YOLOv3     FRCNN YOLOv3     FRCNN YOLOv3       FRCNN YOLOv3      FRCNN YOLOv3        FRCNN YOLOv3        FRCNN YOLOv3
 On-road Scenario               0.64      0.61   0.77      0.88   0.79    0.67       0.78      0.76    550       467       161       61        149        232
 Art in surroundings & Murals   0.15      0.24   0.15      0.25   0.86    0.78       0.26      0.38    84        76        460       223       14         22
 Street Signs                   0.56      0.5    0.63      0.8    0.84    0.58       0.72      0.67    74        51        43        13        14         37
 Parking spaces                 0.78      0.7    0.82      0.91   0.94    0.75       0.88      0.82    108       86        24        8         7          29
 Vehicle Art and Textures       0.59      0.61   0.72      0.9    0.76    0.65       0.74      0.75    542       460       212       49        167        249
 Coyote Total                   0.52      0.49   0.60      0.62   0.79    0.71       0.68      0.66    1358      1211      900       739       351        498
 —                              —         —      —         —      —       —          —         —       —         —         —         —         —          —
 MS-COCO 2017 Val Set           0.53      0.46   0.91      0.79   0.56    0.53       0.69      0.63    11211     10668     1069      2915      8977       9520
 KITTI Testing Subset           0.76      0.63   0.86      0.81   0.87    0.74       0.86      0.77    3778      3213      604       735       564        129



4.2   Results with YOLOv3 & Faster R-CNN

YOLOv3: YOLOv3 internally uses a Darknet architecture trained on MS-COCO (https://pjreddie.com/darknet/yolo/). It provides a robust single-shot approach to object detection and is known to be on par with other models such as Faster R-CNN. Fig. 4 shows some cases from the Coyote dataset where YOLOv3 performs correctly. In the top-left and bottom-right images, the model distinguishes between the billboard/mural and the person with confidences of 99%. It identifies the car decorated with fruit (top-right) and the altered stop sign (bottom-left) with confidence values of 87% and 95%.

Figure 4: YOLOv3: some successful examples. (Top-left) Man standing beside a billboard; (top-right) car decorated with fruit; (bottom-left) art on a stop sign; (bottom-right) woman sitting in front of a truck mural.

However, Fig. 5 shows other examples where YOLOv3 fails. In the top-left image, a person-like sculpture is mounted on an old bicycle, and the model identifies it as a person with 95% confidence. In the top-right image, parking spaces for bikes are bike-shaped, causing false detection of a bicycle with 93% confidence. The stop sign in the centre-left and the car in the bottom-left are not detected because of the art on the sign and the grass texture on the car. Additionally, the murals in the centre-right and bottom-right images are incorrectly identified as real objects.

Figure 5: YOLOv3: some unsuccessful cases. (Top-left) Sculpture mistaken for a person on a bicycle; (top-right) bike parking confused for a bicycle; (centre-left) stop sign not detected; (centre-right) mural with a person on a bicycle classified as real; (bottom-left) undetected car with grass texture; (bottom-right) mural containing cart and people identified as real.

The overall performance of YOLOv3 across all categories is provided in Table 2. The high FP value in the Art in Surroundings and Murals category indicates that multiple objects in the art are identified as real objects. Conversely, in other categories, the high number of FNs indicates that there are objects present that the model cannot find. YOLOv3's overall accuracy on the Coyote dataset is low at 49%, with precision=62%, recall=71%, and F1-score=66%. The statistics in the table indicate that the images are challenging for YOLOv3.
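As a sanity check on how these summary statistics relate to the raw counts, the short calculation below reproduces YOLOv3's headline figures from its cumulative Coyote totals in Table 2, assuming accuracy is computed as TP/(TP+FP+FN) because true negatives are not counted (Section 4.1); the same arithmetic also matches the Faster R-CNN row.

```python
tp, fp, fn = 1211, 739, 498           # YOLOv3 cumulative counts on Coyote (Table 2)

precision = tp / (tp + fp)            # 1211 / 1950 ~= 0.62
recall    = tp / (tp + fn)            # 1211 / 1709 ~= 0.71
f1        = 2 * precision * recall / (precision + recall)   # ~= 0.66
accuracy  = tp / (tp + fp + fn)       # 1211 / 2448 ~= 0.49, matching the reported 49%

print(f"Acc={accuracy:.2f} Prec={precision:.2f} Rec={recall:.2f} F1={f1:.2f}")
```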
Faster R-CNN: The Faster R-CNN model trained on MS-COCO is built on the ResNet101 architecture with 1024x1024 resolution. Faster R-CNN identifies more details in the images than YOLOv3 and hence provides good results for many images in the dataset.

Fig. 6 top-left shows the successful identification of a bicycle mounted on the front of a bus. The top-right and bottom-right images show instances where YOLOv3 fails but Faster R-CNN successfully identifies the stop sign and the grass-textured car. The bottom-left image shows that a mural with a cyclist does not confuse the model. Some unsuccessful results for the Faster R-CNN model are shown in Fig. 7. The top-left and centre-left images show that Faster R-CNN identified the objects from billboards and murals as real with high confidence (98% and 99%, respectively). The top-right image shows that the decorated car is not detected, while other objects are successfully classified. The model misclassifies the car in the centre-right as a cake with 97% confidence and the shopping cart as a bicycle with 91% confidence. The bottom-right image shows a sculpture made from discarded street signs, but the model identifies some of the signs as actual stop signs and misclassifies the arrows as a parking meter with 90% confidence.

Figure 6: Faster R-CNN: sample successful results. (Top-left) Bus with bicycle mounted in front; (top-right) modified stop sign; (bottom-left) mural with a cyclist; (bottom-right) car with grass texture.

Figure 7: Faster R-CNN: sample unsuccessful results. (Top-left) Photo in poster identified as real person; (top-right) car not detected; (centre-left) painting of motorcycle identified as real; (centre-right) decorated car identified as cake; (bottom-left) shopping cart identified as bicycle; (bottom-right) art built using discarded street signs.

As shown in Table 2, the overall performance of Faster R-CNN is not strong. Like YOLOv3, it has a high FP value in the Art in Surroundings and Murals category, implying that the model is confused by the paintings and murals. Faster R-CNN also has a high FP level for the Vehicle Art and Textures category. In the other categories, there are also significant FN levels, though they are not quite as high as for YOLOv3.

The number of TPs for Faster R-CNN is significantly higher than that of YOLOv3, leading to fewer FNs. However, Faster R-CNN has a very high number of FPs, which reduces the model's overall performance. Hence, its accuracy is very low at 52%, with precision=60%, recall=79%, and F1-score=68%. On the whole, the Coyote dataset affects the performance of both the Faster R-CNN and YOLOv3 models.

Comparison with KITTI and MS-COCO: We performed further experiments to evaluate Faster R-CNN and YOLOv3 on the MS-COCO 2017 validation set and a subset of the KITTI dataset. The objective was to determine whether the Coyote dataset is indeed more challenging than existing benchmark datasets. The results in Table 2 show that the precision of the DNN models on the Coyote dataset drops by 31% and 26% with Faster R-CNN, and by 17% and 8% with YOLOv3, relative to MS-COCO and KITTI, respectively. In addition, the F1-scores are lower on the Coyote dataset. Hence, a vehicle that uses such models would be at risk of accidents when it faces real-world edge-case examples such as those exemplified by the Coyote dataset.

4.3   Semantic Segmentation with DeepLabv3

Semantic segmentation is an extension of the proposed project. DeepLabv3 [Chen et al., 2017] is pre-trained on the MS-COCO dataset and uses the colour map from the PASCAL VOC dataset, with 21 output labels (background + 20 PASCAL VOC labels). The DeepLabv3 model used for this project operates on the MobileNetv2 architecture with a depth multiplier of 0.5 [Sandler et al., 2018]. We apply it to a small collection of 19 images and evaluate the results manually.

The output of semantic segmentation for some images shows correct results, where the model is not affected by the presence of confusing art around the objects. For example, Fig. 8(top) shows art creating an illusion of ice blocks and a person fishing. DeepLabv3 correctly segments the people in the image and is not affected by the art. In Fig. 8(bottom), people are standing beside a painting of a bicycle and road with a tearing illusion. The model classifies the painted cyclist and bicycle as real, along with the real people standing beside them.

Figure 8: DeepLabv3 semantic segmentation samples. (Top) Illusion of ice blocks on the road and a person fishing; (bottom) people beside an illusion of a road and a cyclist tearing through a wall.

On analysing the results for all 19 images, we observe that the reason why images such as Fig. 8(top) are correctly segmented can be explained by the limited number of output classes in the PASCAL VOC dataset, which do not include ice, for example. The most common failures in semantic segmentation are in classifying art as real objects; this can make autonomous vehicles susceptible to failure in the presence of 3-D illusions.
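For readers who want to reproduce a comparable qualitative check, the sketch below applies a pre-trained semantic segmentation model to a Coyote image and lists the PASCAL-VOC classes it predicts. It is an illustrative sketch only: it uses torchvision's DeepLabv3 with a MobileNetV3-Large backbone as a stand-in for the MobileNetv2-based model described above, and the image file name is a placeholder.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_mobilenet_v3_large

# PASCAL VOC class names for the 21 output channels (background + 20 classes).
VOC_CLASSES = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
    "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

# Stand-in for the MobileNetv2-based DeepLabv3 used in the paper
# (newer torchvision versions use weights="DEFAULT" instead of pretrained=True).
model = deeplabv3_mobilenet_v3_large(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("coyote_sample.jpg").convert("RGB")          # placeholder file name
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))["out"][0]    # shape: (21, H, W)
labels = logits.argmax(0)                                        # per-pixel class indices

# List which VOC classes the model believes are present, for manual inspection
# against what a human sees in the photograph (e.g. painted cyclists vs. real people).
present = sorted({VOC_CLASSES[i] for i in labels.unique().tolist()})
print(present)
```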
5   Conclusion

5.1   Analysis of Results

In this paper, we have presented a new, publicly available test dataset, called Coyote, of real-world photographs that are easily understood by humans but might not be successfully parsed by computer vision systems, along with manual annotations of objects in all of the photographs. The photographs are grouped into six broad categories: (1) art in surroundings and murals; (2) vehicle art and textures; (3) on-road scenarios; (4) street signs; (5) parking spaces; and (6) advanced scenarios.

We have used the Coyote dataset to evaluate the performance of current state-of-the-art CNN-based computer vision systems, to identify challenging scenarios that can lead to erroneous performance in self-driving applications. We have found that the paintings in the Art in Surroundings and Murals category confuse the models the most. Both the YOLOv3 and Faster R-CNN models perform worse on this category than on any other, with a high number of FPs, showing that the models identify the paintings' components as real objects. In the Street Signs category, embellishments to street signs and regional variations of road signs degrade the performance of both models. The decorated vehicles in the Vehicle Art and Textures category show contrasting failure modes for the Faster R-CNN and YOLOv3 models: while the YOLOv3 model cannot identify many of the objects in the images (high FNs), Faster R-CNN incorrectly identifies a large number of objects that are not present (high FPs).

The unseen road scenarios presented in the On-Road Scenarios category yield somewhat better performance for both models. However, both models have a high number of FNs, indicating that they are often unable to identify the objects in these new scenarios. The Parking Spaces category, which contains the smallest number of examples, shows the best performance among all of the categories. It is possible that the models are not misled because the COCO dataset has a limited set of class labels. Nonetheless, this category highlights some interesting adversarial scenarios for self-driving applications. In the Advanced Scenarios category, with 3-D art on roads, DeepLab can correctly segment humans and the background, but it fails in some cases where 3-D representations of objects are painted on walls. However, these scenarios could be assessed further using a more comprehensive model trained specifically for road scenarios.

Analysing the aggregate results for both models shows that the Faster R-CNN model captures more details than the YOLOv3 model. This helps the Faster R-CNN model make better inferences; however, it also makes the model more vulnerable to failure on edge cases.

5.2   Risks and Mitigation

The key risk associated with the kinds of errors we identify in this work is that a vehicle may react inappropriately, for example braking sharply to avoid hitting a “person” that is actually an image on the back of a van, or preparing to drive ahead when the street ahead is actually a mural. Some strategies for risk mitigation are summarised below.

Sensor Fusion: For example, data fused from camera and lidar systems might indicate that an image of a person on the back of a van is flat and therefore cannot be real, thereby reducing false positives. However, if the vision system detects a person ahead with high confidence and lidar does not, must the autonomous vehicle act conservatively to preserve life?

Common Sense Reasoning: Work is being done on simulation-based reasoning systems that aim to synthesize understanding of a domain. It is conceivable that such systems could be extended to recognise, for example, that a photograph of the head of a person is not actually a person.

Better Treatment of Scale: Depth and scale are important factors for humans in distinguishing real items from artificial ones, e.g. a car decorated to look like a cat. Current CNNs rely to a large extent on textures and patterns and are designed to be scale-invariant.

Spatio-Temporal Reasoning: While a single image of a mural may look realistic to a human, when viewed over time from slightly different viewpoints, it rapidly becomes clear that it is 2D. This requires spatio-temporal reasoning that is beyond the capacity of current systems.

5.3   Future Work

There is substantial scope for building on the work presented here. Most obviously, the Coyote dataset can be extended by adding more images. For example, [Mufson, 2017] presents art created by Filip Piekniewski containing scenarios that can confuse autonomous vehicles. In addition, the images could be annotated with bounding boxes and pixel-level semantic annotations. Such fine-grained ground truth would support localisation of detected objects, which is critical information for autonomous vehicles.

Further work will also be done on increasing the number of images with more challenging scenarios, e.g. occluded objects such as a person occluded under an umbrella, varying fields of view for each object, and images captured from a camera within a vehicle itself. Additionally, other forms of data (such as LiDAR, multiple cameras, fisheye cameras, GPS, temporal consistency between frames, etc.) can be brought into context to analyse how much the additional information can support autonomous vehicles in such challenging scenarios, as is being done in, e.g., the nuScenes dataset [Caesar et al., 2020].

References

[Benjdira et al., 2019] B. Benjdira, T. Khursheed, A. Koubaa, A. Ammar, and K. Ouni. Car detection using unmanned aerial vehicles: Comparison between Faster R-CNN and YOLOv3. In 2019 1st International Conference on Unmanned Vehicle Systems-Oman (UVS), pages 1–6, 2019.

[Blum et al., 2019] Hermann Blum, Paul-Edouard Sarlin, Juan Nieto, Roland Siegwart, and Cesar Cadena. Fishyscapes: A benchmark for safe semantic segmentation in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

[Caesar et al., 2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

[Chen et al., 2017] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
[Cordts et al., 2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

[Geiger et al., 2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. Int. J. Robotics Res., 32(11):1231–1237, 2013.

[Hendrycks et al., 2019] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. CoRR, abs/1907.07174, 2019.

[Huang et al., 2020] Xinyu Huang, Peng Wang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The ApolloScape open dataset for autonomous driving and its application. IEEE Trans. Pattern Anal. Mach. Intell., 42(10):2702–2719, 2020.

[Hung et al., 2020] Goon Li Hung, Mohamad Safwan Bin Sahimi, Hussein Samma, Tarik Adnan Almohamad, and Badr Lahasan. Faster R-CNN deep learning model for pedestrian detection from drone images. SN Computer Science, 1(2):1–9, 2020.

[Janai et al., 2017] Joel Janai, Fatma Güney, Aseem Behl, and Andreas Geiger. Computer vision for autonomous vehicles: Problems, datasets and state of the art. arXiv e-prints, arXiv:1704.05519, April 2017.

[Krizhevsky et al., 2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[Madry et al., 2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR, Canada, April 30 - May 3, 2018.

[Maturana et al., 2017] Daniel Maturana, Po-Wei Chou, Masashi Uenoyama, and Sebastian A. Scherer. Real-time semantic mapping for autonomous off-road navigation. In Field and Service Robotics, 11th International Conference, FSR, Zurich, Switzerland, 12-15 September, volume 5, pages 335–350. Springer, 2017.

[Mufson, 2017] Beckett Mufson. An engineer is painting surreal scenarios that confuse self-driving cars, 2017.

[Nguyen et al., 2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.

[Papert, 1966] Seymour A Papert. The summer vision project. 1966.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[Sakaridis et al., 2021] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. arXiv preprint arXiv:2104.13395, 2021.

[Sandler et al., 2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[Szegedy et al., 2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR, Banff, AB, Canada, April 14-16, 2014.

[Szeliski, 2011] Richard Szeliski. Computer Vision - Algorithms and Applications. Texts in Computer Science. Springer, 2011.

[Xie et al., 2020] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L Yuille, and Quoc V Le. Adversarial examples improve image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 819–828, 2020.

[Yogamani et al., 2019] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O’Dea, Michal Uricár, Stefan Milz, Martin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. In IEEE International Conference on Computer Vision, pages 9308–9318, 2019.

[Zendel et al., 2018] Oliver Zendel, Katrin Honauer, Markus Murschitz, Daniel Steininger, and Gustavo Fernandez Dominguez. WildDash - creating hazard-aware benchmarks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 402–416, 2018.