=Paper=
{{Paper
|id=Vol-3135/bigvis_short2
|storemode=property
|title=CSK-SNIFFER: Commonsense Knowledge for Sniffing Object Detection Errors
|pdfUrl=https://ceur-ws.org/Vol-3135/bigvis_short2.pdf
|volume=Vol-3135
|authors=Anurag Garg,Niket Tandon,Aparna Varde
|dblpUrl=https://dblp.org/rec/conf/edbt/GargTV22
}}
==CSK-SNIFFER: Commonsense Knowledge for Sniffing Object Detection Errors==
Anurag Garg¹, Niket Tandon², Aparna S. Varde³

¹ PQRS Research, Dehradun, India
² Allen Institute for AI, Seattle, USA
³ Montclair State University, Montclair, USA
Abstract

This paper showcases the demonstration of a system called CSK-SNIFFER that automatically predicts failures of an object detection model on images in big data sets from a target domain by identifying errors based on commonsense knowledge. CSK-SNIFFER can be an assistant to a human (as sniffer dogs are assistants to police searching for problems at airports). To cut through the clutter after deployment, this "sniffer" identifies where a given model is probably wrong. Alerted thus, users can visually explore, within our demo, the model's explanation based on spatial collocations that make no sense. In other words, it is impossible for a human, without the help of a sniffer, to flag false positives in such large data sets without knowing the ground truth (unknown earlier, since it is found only after deployment). CSK-SNIFFER spans human-AI collaboration. The AI role is harnessed via embedding commonsense knowledge in the system, while an important human part is played by domain experts providing labeled data for training (besides the human commonsense deployed by the AI). Another highly significant aspect is that the human-in-the-loop can improve the AI system through the feedback received from visualizing object detection errors, while the AI provides actual assistance to the human in object detection. CSK-SNIFFER exemplifies visualization in big data analytics through spatial commonsense and a visually rich demo with numerous complex images from target domains. This paper provides excerpts of the CSK-SNIFFER system demo with a synopsis of its approach and experiments.
1. Introduction

Human-AI collaboration, the realm of humans and AI systems working together, typically achieves better performance than either one working alone [1]. Big data visualization and analytics can be used to foster interaction [2]. Such areas receive attention, e.g. NEIL (Never Ending Image Learner) [3], active learning approaches [4], human-in-the-loop learning [5], etc. To that end, we demonstrate a system "CSK-SNIFFER" exemplifying human-AI collaboration via enhancing object detection by visualizing potential errors in large complex data sets, harnessing spatial commonsense. This system "sniffs" errors in object detection using spatial collocation anomalies, assisting humans analogous to sniffer dogs aiding police at airports. The process (Figure 1) is as follows, with the CSK-SNIFFER system (C), human-in-the-loop (H), and inference model (M) for object detection.

Figure 1: CSK-SNIFFER and the human-in-the-loop. A car was detected in the image, which was flagged bad by CSK-SNIFFER based on its spatial knowledge w.r.t. the KB, which the human-in-the-loop can update after visualizing errors. (Diagram: an input image x goes to the trained object detection model M, which predicts bounding boxes; CSK-SNIFFER C retrieves relevant spatial knowledge and flags spatially implausible object pairs; the visualized errors go to human H, who collaborates and updates C.)
System C interacts with human H and provides object detection output visualizing potential errors, over model M, by deploying commonsense knowledge through a spatial knowledge base (KB). The KB is derived by capturing spatial commonsense, especially as collocation anomalies. Then H sees the visualized errors (output by C) and can thereby enhance C by increasing its precision and recall, based on the spatial KB. Hence, the two directions in this learning loop are as follows.

- H to C: Feedback-based interactive learning
- C to H: Assistance in object detection

Thus, the human and the AI work together with the goal of enhancing object detection in big data. Further, the inference model M can potentially improve, as an added benefit of this adversarial learning via human-AI collaboration. The obtained information can be used to supply more examples to M on the misclassified categories to make it more robust. If certain labels are inappropriate consistently across examples, it is a valuable insight. As data sets get bigger in volume and variety, such automation is even more significant in assisting with object detection errors.

The use case in our work focuses on the smart mobility domain [6]. It entails autonomous vehicles, self-operating traffic signals, energy-saving street lights dimming / brightening as per pedestrian usage, etc. In such AI systems, it is crucial to detect objects accurately, especially due to issues such as safety. CSK-SNIFFER plays an important role here, generating large adversarial training data sets by sniffing object detection errors.

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK. Contact: anuraggarg1209@gmail.com (A. Garg); nikett@allenai.org (N. Tandon); vardea@montclair.edu (A. S. Varde). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. The CSK-SNIFFER Approach

We summarize the CSK-SNIFFER approach as per its design and execution [7]. In this approach, we represent and construct a KB, along with a function f(bbox) that generates a triple, i.e. <o_i, v̂_ij, o_j>, from the predicted bounding boxes of objects o_i and o_j, such that v̂_ij is a binary vector over the relations rel(KB).

Gather I_D using action-vocab(D): Table 1 presents some examples from action-vocab(D), where D = smart mobility domain. These entries represent mostly typical and some unique scenes in this domain. This list was manually compiled by a domain expert. Images I_D are compiled using Web queries ∈ action-vocab(D) (on an image search engine), and an object detector predicts bounding boxes over x_i ∈ I_D.

Table 1: ~10% of the examples from action-vocab(D), where D = smart mobility domain. These are used as queries to compile the input to the object detector, and then CSK-SNIFFER can flag images in I_D where the detector failed to predict the correct bounding boxes.

- People crossing city streets on pedestrian crossings
- Vehicles coming to a full halt at red signals
- Vehicles stopping or slowing down at stop signs
- Street lights dimming when occupants are few
- Street lights brightening when occupants are many
- Buses running on traffic-optimal routes
- Service dogs helping blind people
- People charging phones at WiFi stations
- People reading useful information at roadside kiosks
- People parking bikes at share-ride spots
- Vehicles flashing turning lights for L/R turns
- Bikes riding on bike routes only
- Traffic cops making hand signals in regular operations
- Vehicles driving beneath an overpass
- Dogs on a leash walking with their owners
- People jogging on sidewalks
- People entering and leaving trains when doors open
- People using prams for kids in buses
- Trees existing on sidewalks
- Ropeways carrying passengers to tourist spots
- Bikers wearing smart watches
- Maglev trains running between airports and cities
- Grass existing on freeway sides and city streets
- Solar panels existing on roofs of buildings
- People using smartphones for talking anytime anywhere
- Canal lights dimming when occupants are few
- Canal lights brightening with many occupants
- People wheeling shopping carts in grocery stores

KB construction: While in principle we can directly use existing KBs, these have errors, as elaborated in some works [8]. CSK-SNIFFER isolates the effect of these errors by instead manually creating a KB at a very low cost. The KB is defined over a set of objects O and relations rel(KB). The relation set rel(KB) comprises 5 relations (isAbove, isBelow, isInside, isNear, overlapsWith). We are inspired by other works in the literature, such as [9], in picking these relations, and because our initial analysis indicated their suitability as relative relations between bounding boxes. An entry in the KB comprises a pair of objects o_i, o_j ∈ O and a binary vector v_ij denoting o_i and o_j's spatial relations over rel(KB). These spatial relations are manually annotated by a domain expert, according to general likelihood; e.g., it is more likely that a dog is observed near a human, and much less likely that it is observed near a whale. The n most popular objects in the MSCOCO training data make up O; in our experiments n = 10, and this leads to n² entries in the KB that need to be annotated with v_ij. It is remarkable that our experiments demonstrate that even with n = 10, the KB allows CSK-SNIFFER to achieve good performance. We can infer that selecting a popular subset helps, even if it is small. An entry in the KB is denoted as <o_i, v_ij, o_j>. The KB is publicly available at https://tinyurl.com/kb-for-csksniffer

Function f(bbox): Similar to the triples in the KB, we define a function f(bbox) to construct triples <o_i, v̂_ij, o_j> using the predicted bounding boxes of image x_i. The f(bbox) input consists of the predicted bounding boxes on an image, and the output is a list of triples in the format <o_i, v̂_ij, o_j>, for every pair of objects o_i, o_j among the objects detected in the image. For every such pair, f(bbox) compares the coordinates of the bounding boxes of o_i and o_j (this is a known process, e.g. [9]). We illustrate this for the isInside relation. Let the coordinates of a bounding box be x1, y1, x2, y2; then o_i.y1 denotes the y1 coordinate of o_i. If o_j.y2 ≤ o_i.y1 and o_j.x2 > o_i.x1 and o_j.x1 < o_i.x2, then o_j is inside o_i. Similarly, the other relations in rel(KB) are built, the compilation of which provides v̂_ij. For anomaly detection, we compare the vectors v̂_ij and v_ij for overlapping object-pairs o_i, o_j detected in the image and present in the KB.

Based on this discussion, the following algorithm summarizes the execution of CSK-SNIFFER.
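To make the pairwise bounding-box tests behind f(bbox) concrete, here is a minimal Python sketch. This is not the released CSK-SNIFFER code: the (x1, y1, x2, y2) box format with y increasing downward follows the text, but the exact comparison rules (e.g. the containment test for isInside and the pixel threshold for isNear) are illustrative assumptions using standard box geometry.

```python
# Sketch of the bounding-box relation tests behind f(bbox).
# A box is (x1, y1, x2, y2) in image coordinates, y increasing downward.

REL_KB = ["isAbove", "isBelow", "isInside", "isNear", "overlapsWith"]

def is_above(a, b):
    """a's bottom edge lies above b's top edge (y grows downward)."""
    return a[3] <= b[1]

def is_inside(inner, outer):
    """inner's box lies entirely within outer's box (an assumption;
    the paper states its own coordinate conditions)."""
    return (outer[0] <= inner[0] and inner[2] <= outer[2] and
            outer[1] <= inner[1] and inner[3] <= outer[3])

def overlaps_with(a, b):
    """The two boxes intersect with nonzero area."""
    return (a[0] < b[2] and b[0] < a[2] and
            a[1] < b[3] and b[1] < a[3])

def is_near(a, b, tol=20):
    """Boxes within `tol` pixels of each other; `tol` is illustrative."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return (dx ** 2 + dy ** 2) ** 0.5 <= tol

def relation_vector(box_i, box_j):
    """Binary vector v̂_ij over rel(KB) for one ordered pair (o_i, o_j)."""
    return [int(is_above(box_i, box_j)),
            int(is_above(box_j, box_i)),   # isBelow of o_i w.r.t. o_j
            int(is_inside(box_j, box_i)),  # o_j inside o_i
            int(is_near(box_i, box_j)),
            int(overlaps_with(box_i, box_j))]
```

For instance, a small box fully contained in a larger one yields the vector [0, 0, 1, 1, 1] (inside, near, and overlapping, but neither above nor below).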
Algorithm 1: CSK-SNIFFER Approach

Input:
- Object detector M trained on source domain S
- Manually compiled action-vocab(D) in target domain D
- Images I_D compiled using Web queries ∈ action-vocab(D)

1. Define rel(KB) comprising 5 relations: isAbove, isBelow, isInside, isNear, overlapsWith.
2. Define the commonsense KB, each entry <o_i, v_ij, o_j>, where v_ij is a binary vector over rel(KB).
3. Generate triples <o_i, v̂_ij, o_j> from the predicted bounding boxes of x_i ∈ I_D using the function f(bbox).
4. For each image x_i ∈ I_D, compare v̂_ij and v_ij from the bounding box triples <o_i, v̂_ij, o_j> and the KB triples <o_i, v_ij, o_j>.
5. For each x_i, if v̂_ij ≠ v_ij, then flag x_i as wrong and add x_i to I'_D.

Output: Subset I'_D where M failed.

3. Excerpts from System Demo

We have built a live demo to depict the working of CSK-SNIFFER. This demo illustrates the functioning of CSK-SNIFFER to enhance its actual comprehension and augment its usage. In addition, this demo paper presents the principles behind the human-in-the-loop functioning of CSK-SNIFFER for sniffing object detection errors in large, complex data sets, thereby being an added contribution over our earlier work [7]. While this human-in-the-loop functioning is explained in the introduction with an illustration and theoretical justification, its detailed empirical validation with respect to interactive KB updates constitutes ongoing work, based on CSK-SNIFFER being actively deployed in real-world settings. In fact, this demo paves the way for such interactive KB updates by augmenting the usage of CSK-SNIFFER in suitable applications that provide the human-in-the-loop feedback.

We present some screenshots illustrating the demo. Many more can be provided in a live setting. The user enters any search query related to smart mobility [6]. Images are downloaded from Google Images based on this query. Object detection is then performed on the images using YOLO [10] to start predicting triples in each image using the f(bbox) function. Once the triples are predicted, the demo moves to the home page. This contains details on the output files generated. The "Images" option displays the downloaded images, as shown in Figure 2 herewith.

Figure 2: A sample from the gathered images (I_D), obtained by searching the Web for "People crossing city streets on pedestrian crossings" in action-vocab(D).

Output files generated by CSK-SNIFFER are illustrated as follows. Table 2 shows the first output file, "Collocations Map", with the triples predicted by CSK-SNIFFER in images, along with their respective counts. The final output file, "Error Set" (Table 3), contains the names of images with some odd visual collocations. It also indicates the triple that actually got predicted versus the expectation from the model. These files help fathom the functioning of CSK-SNIFFER.

Table 2: Distribution of spatial relations in I_D. Spatial relations are of the form <o_i, v̂_ij, o_j>; frequencies in parentheses.

- person, overlapsWith, person (18)
- car, is_near, car (10)
- person, overlapsWith, car (8)
- car, overlapsWith, person (8)
- car, overlapsWith, car (8)
- person, is_near, backpack (4)
- backpack, is_near, person (4)
- car, is_near, backpack (4)
- backpack, is_near, car (4)
- traffic light, is_near, traffic light (4)
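The flagging loop that produces the "Collocations Map" and "Error Set" outputs (steps 3-5 of Algorithm 1) can be sketched as follows. This is an illustrative sketch, not the deployed code: the `detections` structure, the `kb` dictionary, and the `relation_vector` helper (which computes v̂_ij from two boxes) are assumed interfaces.

```python
from collections import Counter

# Sketch of the CSK-SNIFFER flagging loop (Algorithm 1, steps 3-5).
# `detections` maps an image id to a list of (label, bbox) predictions
# from the detector M; `kb` maps ordered object-label pairs to the
# expected binary relation vector v_ij over rel(KB).

def sniff_errors(detections, kb, relation_vector):
    collocations = Counter()  # "Collocations Map": triple frequencies
    error_set = []            # flagged images, i.e. the subset I'_D
    for image_id, objects in detections.items():
        flagged = False
        for a in range(len(objects)):
            for b in range(len(objects)):
                if a == b:
                    continue
                label_i, box_i = objects[a]
                label_j, box_j = objects[b]
                v_hat = tuple(relation_vector(box_i, box_j))
                collocations[(label_i, label_j, v_hat)] += 1
                expected = kb.get((label_i, label_j))
                # Only pairs present in the KB are checked; unknown
                # objects are skipped (a source of recall errors, cf.
                # the error analysis in Section 4.2).
                if expected is not None and v_hat != tuple(expected):
                    flagged = True
        if flagged:
            error_set.append(image_id)
    return collocations, error_set
```

Note the design choice mirrored from the text: an image is added to the error set as soon as any object pair detected in it contradicts the KB, while the collocation counts are accumulated for all pairs regardless.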
Table 3: Canonical examples of errors flagged by CSK-SNIFFER. If the inferred spatial relations over model-generated bounding boxes are not consistent with the expected spatial relations between objects, then the predicted bounding boxes are flagged as erroneous.

Image id | Inferred spatial relation on predicted bounding boxes | Expected spatial relation between these objects present in KB
I1, I3 | person, overlapsWith, car | person, is_near, is_inside, car
I1, I3 | car, overlapsWith, person | car, is_near, person
I1, I4 | backpack, is_inside, person | backpack, is_near, overlapsWith, person
I1 | backpack, is_near, car | backpack, is_inside, car
I2 | traffic light, is_inside, traffic light | traffic light, is_near, is_above, traffic light
I2 | traffic light, overlapsWith, traffic light | traffic light, is_near, is_above, traffic light
I3 | truck, overlapsWith, car | truck, is_near, car
I4 | person, is_above, backpack | person, is_near, overlapsWith, backpack
I4 | backpack, is_below, person | backpack, is_near, overlapsWith, person
I4 | backpack, is_inside, backpack | backpack, is_near, backpack
I4 | backpack, overlapsWith, backpack | backpack, is_near, backpack

4. Experimental Evaluation

We present examples from our experiments, showing the correct and wrong predictions made by CSK-SNIFFER, along with the error analysis. Here, "bad" refers to images containing object detection errors, while "good" refers to correctly identified images with no such errors.

4.1. Appropriate Identifications

Actually bad, flagged bad: Figure 3 illustrates examples in this category. Experimental evaluation shows that our model is good at identifying odd bounding boxes.

Figure 3: Images actually bad, flagged bad (in the 1st image "buildings" are detected as "truck" and "TV monitor"; in the 2nd image "buildings" are detected as "bus" and "truck").

Actually good, flagged good: Figure 4 portrays examples of this type. Experimental evaluation shows that CSK-SNIFFER is able to distinguish good predictions.

Figure 4: Images actually good, flagged good. CSK-SNIFFER has a high success rate in not flagging images with meaningful bounding box collocations.

Other benefits: Interestingly, while analyzing the mistakes of CSK-SNIFFER, we find that ~10% of the reference data on which M is trained (MSCOCO, expected to be of high quality) contains wrong bounding boxes. This provides insights into potentially improving MSCOCO, constituting an added benefit of this work.

On the whole, the human and the AI collaborate with each other, such that the AI (CSK-SNIFFER) provides a visual demo of the object detection errors sniffed by spatial CSK, thus generating large adversarial data sets to assist object detection, while the human can use this feedback to enhance the performance of CSK-SNIFFER, thereby playing its role in the learning loop.

4.2. Error Analysis

We now present the precision and recall shortcomings.

Recall issues: Actually bad, flagged good: Figure 5 depicts examples of this type of image. The reason for CSK-SNIFFER predicting these images as good instead of bad is that the objects wrongly detected in the image are not present in our KB, hence it does not check their locations. Thus, since they are not found in any of the predicted triples, they are skipped and do not make their way to the error set.

Figure 5: Images with bounding boxes actually bad, flagged good (in the 1st image "suitcase" is also detected as "microwave"; in the 2nd image, "car" is detected as "cell phone").

Precision issues: Actually good, flagged bad: Our investigation of the source of these errors (see Figure 6) concluded that the KB relations are authored with 3D space in perspective, while the images only contain 2D information. Therefore, relations such as above and below may be confused with farther and nearer. For example, if a car is detected in the background, the f(bbox) function makes an incorrect interpretation, such as "car is above person". The KB will flag this as unlikely and hence an erroneous detection, leading to a possibly good prediction flagged as an error.

Figure 6: Images actually good, flagged bad (in the 1st image CSK-SNIFFER predicts "person inside person"; in the 2nd image it predicts "car above car", hence flagged as bad).

Addressing 2D vs. 3D errors: We calculate the area covered by a bounding box, such that if the area is less than an empirically estimated threshold, that object is considered to be detected in the background, and therefore f(bbox) does not predict the triple, e.g. "car above person", in that image. This helps to increase accuracy to ~80%.

5. Conclusions and Roadmap

This paper synopsizes the demo (with approach and experiments) of a system "CSK-SNIFFER" that "sniffs" object detection errors in big data on an unseen target domain using spatial commonsense, with high accuracy at no additional annotation cost. Based on human-AI collaboration, the AI angle entails spatial CSK imbibed in the system, deployed via visual analytics to assist humans, while an important human role comes from the domain expert perspective in image tagging and task identification for training the system (in addition to the obvious human contribution of commonsense knowledge in the system). More significantly, the human and the AI make contributions to the learning loop by feedback-based interactive learning and assistance in object detection, respectively. It is promising to note that our approach, based on simplicity, can automatically discover errors in data of significant volume and variety, and be potentially useful in this learning setting. We demonstrate that, with high quality, we can generate large complex adversarial datasets on target domains such as smart mobility.

Future work includes harnessing existing, potentially noisy and incomplete commonsense KBs in CSK-SNIFFER. Another direction is to study whether automatic adversarial datasets compiled with assistance from CSK-SNIFFER help train better models on novel target domains. Our work presents interesting facets from big data visualization and analytics along with human-AI collaboration.

6. Acknowledgments

A. Varde has NSF grants 2018575 (MRI: Acquisition of a High-Performance GPU Cluster for Research & Education) and 2117308 (MRI: Acquisition of a Multimodal Collaborative Robot System (MCROS) to Support Cross-Disciplinary Human-Centered Research & Education). She is a visiting researcher at Max Planck Institute for Informatics, Germany.

References

[1] D. Wang, E. Churchill, P. Maes, X. Fan, B. Shneiderman, Y. Shi, Q. Wang, From human-human collaboration to human-AI collaboration: Designing AI systems that can work together with people, in: CHI, 2020, pp. 1–6.
[2] F. Zuo, J. Wang, J. Gao, K. Ozbay, X. J. Ban, Y. Shen, H. Yang, S. Iyer, An interactive data visualization and analytics tool to evaluate mobility and sociability trends during COVID-19, arXiv:2006.14882 (2020).
[3] X. Chen, A. Shrivastava, A. Gupta, NEIL: Extracting visual knowledge from web data, in: ICCV, 2013, pp. 1409–1416.
[4] K. Konyushkova, R. Sznitman, P. Fua, Learning active learning from data, Advances in Neural Information Processing Systems 30 (2017).
[5] D. Xin, L. Ma, J. Liu, S. Macke, S. Song, A. Parameswaran, Accelerating human-in-the-loop machine learning: Challenges and opportunities, in: ACM SIGMOD (DEEM workshop), 2018, pp. 1–4.
[6] A. Orlowski, P. Romanowska, Smart cities concept: Smart mobility indicator, Cybernetics and Systems (Taylor & Francis) 50 (2019) 118–131.
[7] A. Garg, N. Tandon, A. S. Varde, I am guessing you can't recognize this: Generating adversarial images for object detection using spatial commonsense, in: AAAI, 2020, pp. 13789–13790.
[8] N. Tandon, A. S. Varde, G. de Melo, Commonsense knowledge in machine intelligence, ACM SIGMOD Record 46 (2017) 49–52.
[9] M. Yatskar, V. Ordonez, A. Farhadi, Stating the obvious: Extracting visual common sense, NAACL (2016).
[10] J. Redmon, A. Farhadi, YOLO9000: Better, faster, stronger, CVPR (2016).