CSK-SNIFFER: Commonsense Knowledge for Sniffing Object Detection Errors

Anurag Garg (1), Niket Tandon (2) and Aparna S. Varde (3)
(1) PQRS Research, Dehradun, India
(2) Allen Institute for AI, Seattle, USA
(3) Montclair State University, Montclair, USA

Abstract

This paper demonstrates a system called CSK-SNIFFER that automatically predicts failures of an object detection model on images in big data sets from a target domain, identifying errors based on commonsense knowledge. CSK-SNIFFER acts as an assistant to a human, much as sniffer dogs assist police searching for problems at airports. To cut through the clutter after deployment, this "sniffer" identifies where a given model is probably wrong. Thus alerted, users can visually explore, within our demo, the model's explanation based on spatial correlations that make no sense. Without such a sniffer, it is practically impossible for a human to flag false positives in such large data sets, because the ground truth is unknown until after deployment. CSK-SNIFFER embodies human-AI collaboration. The AI role is realized by embedding commonsense knowledge in the system, while an important human role is played by domain experts providing labeled data for training (in addition to the human commonsense deployed by the AI). Equally significant, the human-in-the-loop can improve the AI system using the feedback received from visualizing object detection errors, while the AI provides concrete assistance to the human in object detection. CSK-SNIFFER exemplifies visualization in big data analytics through spatial commonsense and a visually rich demo with numerous complex images from target domains. This paper provides excerpts of the CSK-SNIFFER system demo with a synopsis of its approach and experiments.
1. Introduction

Human-AI collaboration, the realm of humans and AI systems working together, typically achieves better performance than either working alone [1]. Big data visualization and analytics can be used to foster such interaction [2]. These areas receive considerable attention, e.g., NEIL (Never Ending Image Learner) [3], active learning approaches [4], and human-in-the-loop learning [5]. To that end, we demonstrate a system, CSK-SNIFFER, exemplifying human-AI collaboration: it enhances object detection by visualizing potential errors in large, complex data sets, harnessing spatial commonsense. The system "sniffs" errors in object detection via spatial collocation anomalies, assisting humans much as sniffer dogs aid police at airports. The process (Figure 1) involves the CSK-SNIFFER system (C), the human-in-the-loop (H), and an inference model (M) for object detection.

Figure 1: CSK-SNIFFER and the human-in-the-loop. A car detected in the image was flagged bad by CSK-SNIFFER based on its spatial knowledge w.r.t. the KB, which the human-in-the-loop can update after visualizing the errors.

System C interacts with human H and provides object detection output over model M, visualizing potential errors by deploying commonsense knowledge through a spatial knowledge base (KB). The KB is derived by capturing spatial commonsense, especially as collocation anomalies. H then sees the visualized errors (output by C) and can thereby enhance C by increasing its precision and recall, based on the spatial KB. Hence, the two directions in this learning loop are as follows.

• H to C: Feedback-based interactive learning
• C to H: Assistance in object detection

Thus, the human and the AI work together with the goal of enhancing object detection in big data. Further, the inference model M can potentially improve, as an added benefit of this adversarial learning via human-AI collaboration. The obtained information can be used to supply more examples to M for the misclassified categories, making it more robust. If certain labels are consistently inappropriate across examples, that in itself is a valuable insight. As data sets grow in volume and variety, such automation becomes even more significant in assisting with object detection errors.

The use case in our work focuses on the smart mobility domain [6]. It entails autonomous vehicles, self-operating traffic signals, energy-saving street lights that dim or brighten as per pedestrian usage, etc. In such AI systems, it is crucial to detect objects accurately, especially due to issues such as safety.

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK.
anuraggarg1209@gmail.com (A. Garg); nikett@allenai.org (N. Tandon); vardea@montclair.edu (A. S. Varde)
Copyright 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
CSK-SNIFFER plays an important role here, generating large adversarial training data sets by sniffing object detection errors.

2. The CSK-SNIFFER Approach

We summarize the CSK-SNIFFER approach as per its design and execution [7]. In this approach, we represent and construct a KB, along with a function f(bbox) that generates a triple <o_i, v̂_ij, o_j> from the predicted bounding boxes of objects o_i and o_j, such that v̂_ij is a binary vector over the relations rel(KB).

Gather X_T using action-vocab(T): Table 1 presents some examples from action-vocab(T), where T = smart mobility domain. These entries represent mostly typical and some unique scenes in this domain. The list was manually compiled by a domain expert. Images X_T are compiled using Web queries from action-vocab(T) (on an image search engine), and an object detector predicts bounding boxes over each x_T in X_T.

Table 1: ~10% of the examples from action-vocab(T), where T = smart mobility domain. These are used as queries to compile the input to the object detector; CSK-SNIFFER can then flag images in T where the detector failed to predict the correct bounding boxes.

People crossing city streets on pedestrian crossings
Vehicles coming to a full halt at red signals
Vehicles stopping or slowing down at stop signs
Street lights dimming when occupants are few
Street lights brightening when occupants are many
Buses running on traffic-optimal routes
Service dogs helping blind people
People charging phones at WiFi stations
People reading useful information at roadside kiosks
People parking bikes at share-ride spots
Vehicles flashing turning lights for L/R turns
Bikes riding on bike routes only
Traffic cops making hand signals in regular operations
Vehicles driving beneath an overpass
Dogs on a leash walking with their owners
People jogging on sidewalks
People entering and leaving trains when doors open
People using prams for kids in buses
Trees existing on sidewalks
Ropeways carrying passengers to tourist spots
Bikers wearing smart watches
Maglev trains running between airports and cities
Grass existing on freeway sides and city streets
Solar panels existing on roofs of buildings
People using smartphones for talking anytime anywhere
Canal lights dimming when occupants are few
Canal lights brightening with many occupants
People wheeling shopping carts in grocery stores

KB construction: While in principle we could directly use existing KBs, these contain errors, as elaborated in prior work [8]. CSK-SNIFFER isolates the effect of such errors by instead manually creating a KB at very low cost. The KB is defined over a set of objects O and relations rel(KB). The relation set rel(KB) comprises 5 relations (isAbove, isBelow, isInside, isNear, overlapsWith). We are inspired by other works in the literature, such as [9], in picking these relations, and because our initial analysis suggested their suitability for relative relations between bounding boxes. An entry in the KB comprises a pair of objects o_i, o_j in O and a binary vector v_ij denoting the spatial relations of o_i and o_j over rel(KB). These spatial relations are manually annotated by a domain expert according to general likelihood; e.g., it is more likely that a dog is observed near a human, and much less likely that it is observed near a whale. The k most popular objects in the MSCOCO training data make up O; in our experiments k = 10, which leads to k^2 entries in the KB that need to be annotated with v_ij. Remarkably, our experiments demonstrate that even with k = 10, the KB allows CSK-SNIFFER to achieve good performance. We infer that selecting a popular subset helps, even if it is small. An entry in the KB is denoted as <o_i, v_ij, o_j>. The KB is publicly available at https://tinyurl.com/kb-for-csksniffer

Function f(bbox): Similar to the triples in the KB, we define a function f(bbox) to construct triples <o_i, v̂_ij, o_j> using the predicted bounding boxes of image x_T. The input of f(bbox) consists of the predicted bounding boxes on an image, and the output is a list of triples in the format <o_i, v̂_ij, o_j>, one for every pair of objects o_i, o_j among the objects detected in the image. For every such pair, f(bbox) compares the coordinates of the bounding boxes of o_i and o_j (a known process, e.g., [9]). We illustrate this for the isInside relation. Let the coordinates of a bounding box be x1, y1, x2, y2, so that o_i.y1 denotes the y1 coordinate of o_i. If o_i.y2 <= o_j.y1 and o_j.x2 > o_i.x1 and o_j.x1 < o_i.x2, then o_i is inside o_j. The other relations in rel(KB) are computed similarly; compiling them yields v̂_ij. For anomaly detection, we compare the vectors v̂_ij and v_ij for overlapping object pairs o_i, o_j that are detected in the image and present in the KB.

Based on this discussion, the following algorithm summarizes the execution of CSK-SNIFFER.

Algorithm 1: CSK-SNIFFER Approach
Input:
  Object detector M trained on source domain S
  Manually compiled action-vocab(T) in target domain T
  Images X_T compiled using Web queries from action-vocab(T)
1. Define rel(KB) comprising 5 relations: isAbove, isBelow, isInside, isNear, overlapsWith.
2. Define the commonsense KB, each entry <o_i, v_ij, o_j>, where v_ij is a binary vector over rel(KB).
3. Generate triples <o_i, v̂_ij, o_j> from the predicted bounding boxes of each x_T in X_T using the function f(bbox).
4. For each image x_T in X_T, compare v̂_ij and v_ij, from the bounding box triples <o_i, v̂_ij, o_j> and the KB triples <o_i, v_ij, o_j>.
5. For each x_T, if v̂_ij != v_ij, then flag x_T as wrong and add x_T to X'_T.
Output: Subset X'_T where M failed.

Figure 2: A sample from the gathered images (X_T), obtained by searching the Web for "People crossing city streets on pedestrian crossings" in action-vocab(T).

3. Excerpts from System Demo

We have built a live demo to depict the working of CSK-SNIFFER. The demo illustrates the functioning of CSK-SNIFFER so as to enhance comprehension and broaden its usage. In addition, this demo paper presents the principles behind the human-in-the-loop functioning of CSK-SNIFFER for sniffing object detection errors in large, complex data sets, which are added contributions over our earlier work [7]. While this human-in-the-loop functioning is explained in the introduction with an illustration and theoretical justification, its detailed empirical validation with respect to interactive KB updates constitutes ongoing work, based on CSK-SNIFFER being actively deployed in real-world settings. This demo paves the way for such interactive KB updates by broadening the usage of CSK-SNIFFER in suitable applications that provide human-in-the-loop feedback.

Output files generated by CSK-SNIFFER are illustrated as follows. Table 2 shows the first output file, "Collocations Map", with the triples predicted by CSK-SNIFFER in images, together with their respective counts.

Table 2: Distribution of spatial relations in X_T. Spatial relations are of the form <o_i, v̂_ij, o_j>.

Inferred spatial relation on predicted bounding boxes | Frequency
person, overlapsWith, person | 18
car, is_near, car | 10
person, overlapsWith, car | 8
car, overlapsWith, person | 8
car, overlapsWith, car | 8
person, is_near, backpack | 4
backpack, is_near, person | 4
car, is_near, backpack | 4
backpack, is_near, car | 4
traffic light, is_near, traffic light | 4
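The triples counted above are produced by f(bbox) and checked against the KB as per Algorithm 1 (steps 3-5). The sketch below is an illustrative reconstruction, not the authors' released code: the geometric tests use standard box-containment and overlap conditions, which are simplified assumptions and may differ in detail from the paper's exact rules, and the sample KB entry is hypothetical.

```python
from itertools import combinations

# rel(KB): the 5 spatial relations, in a fixed order for the binary vector.
RELATIONS = ["isAbove", "isBelow", "isInside", "isNear", "overlapsWith"]

def relation_vector(a, b):
    """Binary vector over RELATIONS for boxes a, b = (x1, y1, x2, y2).

    Image coordinates: y grows downward. These geometric tests are
    simplified assumptions for illustration.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    overlaps = ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2
    inside = bx1 <= ax1 and by1 <= ay1 and ax2 <= bx2 and ay2 <= by2
    above = ay2 <= by1 and bx1 < ax2 and ax1 < bx2
    below = by2 <= ay1 and bx1 < ax2 and ax1 < bx2
    near = not overlaps  # placeholder: any disjoint pair counts as "near"
    return [int(above), int(below), int(inside), int(near), int(overlaps)]

def f_bbox(detections):
    """Step 3: triples <o_i, v^_ij, o_j> from predicted boxes.

    detections: list of (label, box) pairs from the object detector.
    """
    return [(li, relation_vector(bi, bj), lj)
            for (li, bi), (lj, bj) in combinations(detections, 2)]

def flag_image(detections, kb):
    """Steps 4-5: flag the image if any predicted vector contradicts the KB."""
    for oi, v_hat, oj in f_bbox(detections):
        v_expected = kb.get((oi, oj))
        if v_expected is not None and v_hat != v_expected:
            return True  # image goes to X'_T: model M likely failed here
    return False

# Hypothetical KB entry: a person is expected near/overlapping a car,
# never inside one. The detected person box lies fully inside the car box,
# so the image is flagged.
kb = {("person", "car"): [0, 0, 0, 1, 1]}
dets = [("person", (40, 40, 60, 80)), ("car", (10, 20, 100, 120))]
print(flag_image(dets, kb))  # → True
```

Objects absent from the KB are simply skipped by `kb.get`, which mirrors the recall limitation discussed in the error analysis: wrongly detected objects outside O never reach the error set.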
We present some screenshots illustrating the demo; many more can be provided in a live setting. The user enters any search query related to smart mobility [6]. Images are downloaded from Google Images based on this query. Object detection is then performed on the images using YOLO [10], and triples are predicted over each image using the f(bbox) function. Once the triples are predicted, the demo moves to the home page, which contains details on the output files generated. The "Images" option displays the downloaded images, as shown in Figure 2.

The final output file, "Error Set" (Table 3), contains the names of images with odd visual collocations. It also indicates the triple that was actually predicted versus the expected relation in the KB. These files help users fathom the functioning of CSK-SNIFFER.

Table 3: Canonical examples of errors flagged by CSK-SNIFFER. If the inferred spatial relations over model-generated bounding boxes are not consistent with the expected spatial relations between the objects, the predicted bounding boxes are flagged as erroneous.

Image id | Inferred spatial relation on predicted bounding boxes | Expected spatial relation between these objects present in KB
i1, i3 | person, overlapsWith, is_inside, car | person, is_near, car
i1, i3 | car, overlapsWith, person | car, is_near, person
i1, i4 | backpack, is_inside, person | backpack, is_near, overlapsWith, person
i1 | backpack, is_near, car | backpack, is_inside, car
i2 | traffic light, is_inside, traffic light | traffic light, is_near, is_above, traffic light
i2 | traffic light, overlapsWith, traffic light | traffic light, is_near, is_above, traffic light
i3 | truck, overlapsWith, car | truck, is_near, car
i4 | person, is_above, backpack | person, is_near, overlapsWith, backpack
i4 | backpack, is_below, person | backpack, is_near, overlapsWith, person
i4 | backpack, is_inside, backpack | backpack, is_near, backpack
i4 | backpack, overlapsWith, backpack | backpack, is_near, backpack

Figure 3: Images actually bad, flagged bad. (In the 1st image, "buildings" are detected as "truck" and "TV monitor"; in the 2nd image, "buildings" are detected as "bus" and "truck".)

Figure 4: Images actually good, flagged good. CSK-SNIFFER has a high success rate in not flagging images with meaningful bounding box collocations.

4. Experimental Evaluation

We present examples from our experiments, showing the correct and wrong predictions made by CSK-SNIFFER, along with an error analysis. Here, "bad" refers to images containing object detection errors, while "good" refers to correctly identified images with no such errors.

4.1. Appropriate Identifications

Actually bad, flagged bad: Figure 3 illustrates examples in this category. Experimental evaluation shows that our model is good at identifying odd bounding boxes.

Actually good, flagged good: Figure 4 portrays examples of this type. Experimental evaluation shows that CSK-SNIFFER is able to distinguish good predictions.

Other benefits: Interestingly, while analyzing the mistakes of CSK-SNIFFER, we find that ~10% of the reference data on which M is trained (MSCOCO, expected to be of high quality) contains wrong bounding boxes. This provides insights into potentially improving MSCOCO, constituting an added benefit of this work.

On the whole, the human and the AI collaborate with each other, such that the AI (CSK-SNIFFER) provides a visual demo of the object detection errors sniffed via spatial CSK, thus generating large adversarial data sets to assist object detection, while the human can use this feedback to enhance the performance of CSK-SNIFFER, thereby playing its role in the learning loop.

4.2. Error Analysis

We now discuss the shortcomings in precision and recall.

Recall issues (actually bad, flagged good): Figure 5 depicts examples of such images. The reason CSK-SNIFFER predicts these images as good instead of bad is that the wrongly detected objects are not present in our KB, so their locations are not checked. Since they do not appear in any of the predicted triples, they are skipped and never make their way to the error set.

Precision issues (actually good, flagged bad): Our investigation of the source of these errors (see Figure 6) concluded that the KB relations are authored with 3D space in perspective, while the images only contain 2D information.
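The mitigation described under "Addressing 2D vs. 3D errors" below — treating small bounding boxes as background objects and skipping their triples — can be sketched as a simple pre-filter before f(bbox). This is an illustrative sketch; the relative-area threshold `min_area_frac` is a hypothetical value standing in for the paper's empirically estimated threshold.

```python
def box_area(box):
    """Area of a bounding box (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def foreground_detections(detections, image_w, image_h, min_area_frac=0.01):
    """Drop likely-background objects before triple prediction.

    A detection whose box covers less than min_area_frac of the image
    (a hypothetical threshold; the paper estimates it empirically) is
    treated as background, so no spatial triples are formed for it.
    """
    min_area = min_area_frac * image_w * image_h
    return [(label, box) for label, box in detections
            if box_area(box) >= min_area]

# A tiny distant car (below threshold) is filtered out; the person is kept,
# so no spurious "car above person" triple can be produced for this image.
dets = [("person", (200, 300, 400, 700)), ("car", (10, 10, 30, 22))]
kept = foreground_detections(dets, image_w=800, image_h=800)
print([label for label, _ in kept])  # → ['person']
```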
Therefore, relations such as "above" and "below" may be confused with "farther" and "nearer". For example, if a car is detected in the background, the f(bbox) function makes an incorrect interpretation that the car is above the person. The KB flags this as unlikely and hence an erroneous detection, leading to a possibly good prediction being flagged as an error.

Addressing 2D vs. 3D errors: We calculate the area covered by a bounding box; if the area is less than an empirically estimated threshold, that object is considered to be detected in the background, and the corresponding triple (e.g., "car above person" in that image) is not predicted. This helps increase accuracy to ~80%.

Figure 5: Images with bounding boxes actually bad, flagged good. (In the 1st image, "suitcase" is also detected as "microwave"; in the 2nd image, "car" is detected as "cell phone".)

Figure 6: Images actually good, flagged bad. (In the 1st image, CSK-SNIFFER predicts "person inside person"; in the 2nd image, it predicts "car above car", hence flagged as bad.)

5. Conclusions and Roadmap

This paper synopsizes the demo (with approach and experiments) of a system, CSK-SNIFFER, that "sniffs" object detection errors in big data on an unseen target domain using spatial commonsense, with high accuracy and at no additional annotation cost. In this human-AI collaboration, the AI angle entails spatial CSK imbibed in the system and deployed via visual analytics to assist humans, while an important human role comes from the domain expert perspective in image tagging and task identification for training the system (in addition to the obvious human contribution of commonsense knowledge in the system). More significantly, the human and the AI contribute to the learning loop through feedback-based interactive learning and assistance in object detection, respectively. It is promising that our approach, based on simplicity, can automatically discover errors in data of significant volume and variety, and can be potentially useful in this learning setting. We demonstrate that, with high quality, we can generate large, complex adversarial datasets on target domains such as smart mobility.

Future work includes harnessing existing, potentially noisy and incomplete commonsense KBs in CSK-SNIFFER. Another direction is to study whether automatic adversarial datasets compiled with assistance from CSK-SNIFFER help train better models on novel target domains. Our work presents interesting facets of big data visualization and analytics along with human-AI collaboration.

6. Acknowledgments

A. Varde has NSF grants 2018575 (MRI: Acquisition of a High-Performance GPU Cluster for Research & Education) and 2117308 (MRI: Acquisition of a Multimodal Collaborative Robot System (MCROS) to Support Cross-Disciplinary Human-Centered Research & Education). She is a visiting researcher at the Max Planck Institute for Informatics, Germany.

References

[1] D. Wang, E. Churchill, P. Maes, X. Fan, B. Shneiderman, Y. Shi, Q. Wang, From human-human collaboration to human-AI collaboration: Designing AI systems that can work together with people, in: CHI, 2020, pp. 1–6.
[2] F. Zuo, J. Wang, J. Gao, K. Ozbay, X. J. Ban, Y. Shen, H. Yang, S. Iyer, An interactive data visualization and analytics tool to evaluate mobility and sociability trends during COVID-19, arXiv:2006.14882 (2020).
[3] X. Chen, A. Shrivastava, A. Gupta, NEIL: Extracting visual knowledge from web data, in: ICCV, 2013, pp. 1409–1416.
[4] K. Konyushkova, R. Sznitman, P. Fua, Learning active learning from data, Advances in Neural Information Processing Systems 30 (2017).
[5] D. Xin, L. Ma, J. Liu, S. Macke, S. Song, A. Parameswaran, Accelerating human-in-the-loop machine learning: Challenges and opportunities, in: ACM SIGMOD (DEEM workshop), 2018, pp. 1–4.
[6] A. Orlowski, P. Romanowska, Smart cities concept: Smart mobility indicator, Cybernetics and Systems (Taylor & Francis) 50 (2019) 118–131.
[7] A. Garg, N. Tandon, A. S. Varde, I am guessing you can't recognize this: Generating adversarial images for object detection using spatial commonsense, in: AAAI, 2020, pp. 13789–13790.
[8] N. Tandon, A. S. Varde, G. de Melo, Commonsense knowledge in machine intelligence, ACM SIGMOD Record 46 (2017) 49–52.
[9] M. Yatskar, V. Ordonez, A. Farhadi, Stating the obvious: Extracting visual common sense, NAACL (2016).
[10] J. Redmon, A. Farhadi, YOLO9000: Better, faster, stronger, CVPR (2016).