CSK-SNIFFER: Commonsense Knowledge for Sniffing Object Detection Errors

Anurag Garg (1), Niket Tandon (2) and Aparna S. Varde (3)
(1) PQRS Research, Dehradun, India
(2) Allen Institute for AI, Seattle, USA
(3) Montclair State University, Montclair, USA

Abstract

This paper demonstrates a system called CSK-SNIFFER that automatically predicts failures of an object detection model on images in big data sets from a target domain, identifying errors based on commonsense knowledge. CSK-SNIFFER acts as an assistant to a human, much as sniffer dogs assist police searching for problems at airports. To cut through the clutter after deployment, this "sniffer" identifies where a given model is probably wrong. Thus alerted, users can visually explore, within our demo, the model's explanation based on spatial correlations that make no sense. Without such a sniffer, it is practically impossible for a human to flag false positives in such large data sets, because the ground truth is unknown until after deployment. CSK-SNIFFER embodies human-AI collaboration. The AI role is realized by embedding commonsense knowledge in the system, while an important human role is played by domain experts providing labeled data for training (in addition to the human commonsense deployed by the AI). Equally significant, the human-in-the-loop can improve the AI system using the feedback received from visualizing object detection errors, while the AI provides concrete assistance to the human in object detection. CSK-SNIFFER exemplifies visualization in big data analytics through spatial commonsense and a visually rich demo with numerous complex images from target domains. This paper provides excerpts of the CSK-SNIFFER system demo with a synopsis of its approach and experiments.
1. Introduction

Human-AI collaboration, the realm of humans and AI systems working together, typically achieves better performance than either working alone [1]. Big data visualization and analytics can be used to foster such interaction [2]. These areas receive considerable attention, e.g., NEIL (Never Ending Image Learner) [3], active learning approaches [4], and human-in-the-loop learning [5]. To that end, we demonstrate a system, CSK-SNIFFER, exemplifying human-AI collaboration: it enhances object detection by visualizing potential errors in large, complex data sets, harnessing spatial commonsense. The system "sniffs" errors in object detection via spatial collocation anomalies, assisting humans much as sniffer dogs aid police at airports. The process (Figure 1) involves the CSK-SNIFFER system (C), the human-in-the-loop (H), and an inference model (M) for object detection.

Figure 1: CSK-SNIFFER and the human-in-the-loop. A car detected in the image was flagged bad by CSK-SNIFFER based on its spatial knowledge w.r.t. the KB, which the human-in-the-loop can update after visualizing the errors.

System C interacts with human H and provides object detection output over model M, visualizing potential errors by deploying commonsense knowledge through a spatial knowledge base (KB). The KB is derived by capturing spatial commonsense, especially as collocation anomalies. H then sees the visualized errors (output by C) and can thereby enhance C by increasing its precision and recall, based on the spatial KB. Hence, the two directions in this learning loop are as follows.

• H to C: Feedback-based interactive learning
• C to H: Assistance in object detection

Thus, the human and the AI work together with the goal of enhancing object detection in big data. Further, the inference model M can potentially improve, as an added benefit of this adversarial learning via human-AI collaboration. The obtained information can be used to supply more examples to M for the misclassified categories, making it more robust. If certain labels are consistently inappropriate across examples, that in itself is a valuable insight. As data sets grow in volume and variety, such automation becomes even more significant in assisting with object detection errors.

The use case in our work focuses on the smart mobility domain [6]. It entails autonomous vehicles, self-operating traffic signals, energy-saving street lights that dim or brighten as per pedestrian usage, etc. In such AI systems, it is crucial to detect objects accurately, especially due to issues such as safety.

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK.
anuraggarg1209@gmail.com (A. Garg); nikett@allenai.org (N. Tandon); vardea@montclair.edu (A. S. Varde)
Copyright 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
CSK-SNIFFER plays an important role here, generating large adversarial training data sets by sniffing object detection errors.

2. The CSK-SNIFFER Approach

We summarize the CSK-SNIFFER approach as per its design and execution [7]. In this approach, we represent and construct a KB, along with a function f(bbox) that generates a triple <o_i, v̂_ij, o_j> from the predicted bounding boxes of objects o_i and o_j, such that v̂_ij is a binary vector over the relations rel(KB).

Gather X_T using action-vocab(T): Table 1 presents some examples from action-vocab(T), where T = smart mobility domain. These entries represent mostly typical and some unique scenes in this domain. The list was manually compiled by a domain expert. Images X_T are compiled using Web queries from action-vocab(T) (on an image search engine), and an object detector predicts bounding boxes over each x_T in X_T.

Table 1: ~10% of the examples from action-vocab(T), where T = smart mobility domain. These are used as queries to compile the input to the object detector; CSK-SNIFFER can then flag images in T where the detector failed to predict the correct bounding boxes.

People crossing city streets on pedestrian crossings
Vehicles coming to a full halt at red signals
Vehicles stopping or slowing down at stop signs
Street lights dimming when occupants are few
Street lights brightening when occupants are many
Buses running on traffic-optimal routes
Service dogs helping blind people
People charging phones at WiFi stations
People reading useful information at roadside kiosks
People parking bikes at share-ride spots
Vehicles flashing turning lights for L/R turns
Bikes riding on bike routes only
Traffic cops making hand signals in regular operations
Vehicles driving beneath an overpass
Dogs on a leash walking with their owners
People jogging on sidewalks
People entering and leaving trains when doors open
People using prams for kids in buses
Trees existing on sidewalks
Ropeways carrying passengers to tourist spots
Bikers wearing smart watches
Maglev trains running between airports and cities
Grass existing on freeway sides and city streets
Solar panels existing on roofs of buildings
People using smartphones for talking anytime anywhere
Canal lights dimming when occupants are few
Canal lights brightening with many occupants
People wheeling shopping carts in grocery stores

KB construction: While in principle we could directly use existing KBs, these contain errors, as elaborated in prior work [8]. CSK-SNIFFER isolates the effect of such errors by instead manually creating a KB at very low cost. The KB is defined over a set of objects O and relations rel(KB). The relation set rel(KB) comprises 5 relations (isAbove, isBelow, isInside, isNear, overlapsWith). We are inspired by other works in the literature, such as [9], in picking these relations, and because our initial analysis suggested their suitability for relative relations between bounding boxes. An entry in the KB comprises a pair of objects o_i, o_j in O and a binary vector v_ij denoting the spatial relations of o_i and o_j over rel(KB). These spatial relations are manually annotated by a domain expert according to general likelihood; e.g., it is more likely that a dog is observed near a human, and much less likely that it is observed near a whale. The k most popular objects in the MSCOCO training data make up O; in our experiments k = 10, which leads to k^2 entries in the KB that need to be annotated with v_ij. Remarkably, our experiments demonstrate that even with k = 10, the KB allows CSK-SNIFFER to achieve good performance. We infer that selecting a popular subset helps, even if it is small. An entry in the KB is denoted as <o_i, v_ij, o_j>. The KB is publicly available at https://tinyurl.com/kb-for-csksniffer

Function f(bbox): Similar to the triples in the KB, we define a function f(bbox) to construct triples <o_i, v̂_ij, o_j> using the predicted bounding boxes of image x_T. The input of f(bbox) consists of the predicted bounding boxes on an image, and the output is a list of triples in the format <o_i, v̂_ij, o_j>, one for every pair of objects o_i, o_j among the objects detected in the image. For every such pair, f(bbox) compares the coordinates of the bounding boxes of o_i and o_j (a known process, e.g., [9]). We illustrate this for the isInside relation. Let the coordinates of a bounding box be x1, y1, x2, y2, so that o_i.y1 denotes the y1 coordinate of o_i. If o_i.y2 <= o_j.y1 and o_j.x2 > o_i.x1 and o_j.x1 < o_i.x2, then o_i is inside o_j. The other relations in rel(KB) are computed similarly; compiling them yields v̂_ij. For anomaly detection, we compare the vectors v̂_ij and v_ij for overlapping object pairs o_i, o_j that are detected in the image and present in the KB.

Based on this discussion, the following algorithm summarizes the execution of CSK-SNIFFER.

Algorithm 1: CSK-SNIFFER Approach
Input:
  Object detector M trained on source domain S
  Manually compiled action-vocab(T) in target domain T
  Images X_T compiled using Web queries from action-vocab(T)
1. Define rel(KB) comprising 5 relations: isAbove, isBelow, isInside, isNear, overlapsWith.
2. Define the commonsense KB, each entry <o_i, v_ij, o_j>, where v_ij is a binary vector over rel(KB).
3. Generate triples <o_i, v̂_ij, o_j> from the predicted bounding boxes of each x_T in X_T using the function f(bbox).
4. For each image x_T in X_T, compare v̂_ij and v_ij, from the bounding box triples <o_i, v̂_ij, o_j> and the KB triples <o_i, v_ij, o_j>.
5. For each x_T, if v̂_ij != v_ij, then flag x_T as wrong and add x_T to X'_T.
Output: Subset X'_T where M failed.

Figure 2: A sample from the gathered images (X_T), obtained by searching the Web for "People crossing city streets on pedestrian crossings" in action-vocab(T).

3. Excerpts from System Demo

We have built a live demo to depict the working of CSK-SNIFFER. The demo illustrates the functioning of CSK-SNIFFER so as to enhance comprehension and broaden its usage. In addition, this demo paper presents the principles behind the human-in-the-loop functioning of CSK-SNIFFER for sniffing object detection errors in large, complex data sets, which are added contributions over our earlier work [7]. While this human-in-the-loop functioning is explained in the introduction with an illustration and theoretical justification, its detailed empirical validation with respect to interactive KB updates constitutes ongoing work, based on CSK-SNIFFER being actively deployed in real-world settings. This demo paves the way for such interactive KB updates by broadening the usage of CSK-SNIFFER in suitable applications that provide human-in-the-loop feedback.

Output files generated by CSK-SNIFFER are illustrated as follows. Table 2 shows the first output file, "Collocations Map", with the triples predicted by CSK-SNIFFER in images, together with their respective counts.

Table 2: Distribution of spatial relations in X_T. Spatial relations are of the form <o_i, v̂_ij, o_j>.

Inferred spatial relation on predicted bounding boxes | Frequency
person, overlapsWith, person | 18
car, is_near, car | 10
person, overlapsWith, car | 8
car, overlapsWith, person | 8
car, overlapsWith, car | 8
person, is_near, backpack | 4
backpack, is_near, person | 4
car, is_near, backpack | 4
backpack, is_near, car | 4
traffic light, is_near, traffic light | 4
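The triples counted above are produced by f(bbox) and checked against the KB as per Algorithm 1 (steps 3-5). The sketch below is an illustrative reconstruction, not the authors' released code: the geometric tests use standard box-containment and overlap conditions, which are simplified assumptions and may differ in detail from the paper's exact rules, and the sample KB entry is hypothetical.

```python
from itertools import combinations

# rel(KB): the 5 spatial relations, in a fixed order for the binary vector.
RELATIONS = ["isAbove", "isBelow", "isInside", "isNear", "overlapsWith"]

def relation_vector(a, b):
    """Binary vector over RELATIONS for boxes a, b = (x1, y1, x2, y2).

    Image coordinates: y grows downward. These geometric tests are
    simplified assumptions for illustration.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    overlaps = ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2
    inside = bx1 <= ax1 and by1 <= ay1 and ax2 <= bx2 and ay2 <= by2
    above = ay2 <= by1 and bx1 < ax2 and ax1 < bx2
    below = by2 <= ay1 and bx1 < ax2 and ax1 < bx2
    near = not overlaps  # placeholder: any disjoint pair counts as "near"
    return [int(above), int(below), int(inside), int(near), int(overlaps)]

def f_bbox(detections):
    """Step 3: triples <o_i, v^_ij, o_j> from predicted boxes.

    detections: list of (label, box) pairs from the object detector.
    """
    return [(li, relation_vector(bi, bj), lj)
            for (li, bi), (lj, bj) in combinations(detections, 2)]

def flag_image(detections, kb):
    """Steps 4-5: flag the image if any predicted vector contradicts the KB."""
    for oi, v_hat, oj in f_bbox(detections):
        v_expected = kb.get((oi, oj))
        if v_expected is not None and v_hat != v_expected:
            return True  # image goes to X'_T: model M likely failed here
    return False

# Hypothetical KB entry: a person is expected near/overlapping a car,
# never inside one. The detected person box lies fully inside the car box,
# so the image is flagged.
kb = {("person", "car"): [0, 0, 0, 1, 1]}
dets = [("person", (40, 40, 60, 80)), ("car", (10, 20, 100, 120))]
print(flag_image(dets, kb))  # → True
```

Objects absent from the KB are simply skipped by `kb.get`, which mirrors the recall limitation discussed in the error analysis: wrongly detected objects outside O never reach the error set.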
We present some screenshots illustrating the demo; many more can be provided in a live setting. The user enters any search query related to smart mobility [6]. Images are downloaded from Google Images based on this query. Object detection is then performed on the images using YOLO [10], and triples are predicted over each image using the f(bbox) function. Once the triples are predicted, the demo moves to the home page, which contains details on the output files generated. The "Images" option displays the downloaded images, as shown in Figure 2.

The final output file, "Error Set" (Table 3), contains the names of images with odd visual collocations. It also indicates the triple that was actually predicted versus the expected relation in the KB. These files help users fathom the functioning of CSK-SNIFFER.

Table 3: Canonical examples of errors flagged by CSK-SNIFFER. If the inferred spatial relations over model-generated bounding boxes are not consistent with the expected spatial relations between the objects, the predicted bounding boxes are flagged as erroneous.

Image id | Inferred spatial relation on predicted bounding boxes | Expected spatial relation between these objects present in KB
i1, i3 | person, overlapsWith, is_inside, car | person, is_near, car
i1, i3 | car, overlapsWith, person | car, is_near, person
i1, i4 | backpack, is_inside, person | backpack, is_near, overlapsWith, person
i1 | backpack, is_near, car | backpack, is_inside, car
i2 | traffic light, is_inside, traffic light | traffic light, is_near, is_above, traffic light
i2 | traffic light, overlapsWith, traffic light | traffic light, is_near, is_above, traffic light
i3 | truck, overlapsWith, car | truck, is_near, car
i4 | person, is_above, backpack | person, is_near, overlapsWith, backpack
i4 | backpack, is_below, person | backpack, is_near, overlapsWith, person
i4 | backpack, is_inside, backpack | backpack, is_near, backpack
i4 | backpack, overlapsWith, backpack | backpack, is_near, backpack

Figure 3: Images actually bad, flagged bad. (In the 1st image, "buildings" are detected as "truck" and "TV monitor"; in the 2nd image, "buildings" are detected as "bus" and "truck".)

Figure 4: Images actually good, flagged good. CSK-SNIFFER has a high success rate in not flagging images with meaningful bounding box collocations.

4. Experimental Evaluation

We present examples from our experiments, showing the correct and wrong predictions made by CSK-SNIFFER, along with an error analysis. Here, "bad" refers to images containing object detection errors, while "good" refers to correctly identified images with no such errors.

4.1. Appropriate Identifications

Actually bad, flagged bad: Figure 3 illustrates examples in this category. Experimental evaluation shows that our model is good at identifying odd bounding boxes.

Actually good, flagged good: Figure 4 portrays examples of this type. Experimental evaluation shows that CSK-SNIFFER is able to distinguish good predictions.

Other benefits: Interestingly, while analyzing the mistakes of CSK-SNIFFER, we find that ~10% of the reference data on which M is trained (MSCOCO, expected to be of high quality) contains wrong bounding boxes. This provides insights into potentially improving MSCOCO, constituting an added benefit of this work.

On the whole, the human and the AI collaborate with each other, such that the AI (CSK-SNIFFER) provides a visual demo of the object detection errors sniffed via spatial CSK, thus generating large adversarial data sets to assist object detection, while the human can use this feedback to enhance the performance of CSK-SNIFFER, thereby playing its role in the learning loop.

4.2. Error Analysis

We now discuss the shortcomings in precision and recall.

Recall issues (actually bad, flagged good): Figure 5 depicts examples of such images. The reason CSK-SNIFFER predicts these images as good instead of bad is that the wrongly detected objects are not present in our KB, so their locations are not checked. Since they do not appear in any of the predicted triples, they are skipped and never make their way to the error set.

Precision issues (actually good, flagged bad): Our investigation of the source of these errors (see Figure 6) concluded that the KB relations are authored with 3D space in perspective, while the images only contain 2D information.
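The mitigation described under "Addressing 2D vs. 3D errors" below — treating small bounding boxes as background objects and skipping their triples — can be sketched as a simple pre-filter before f(bbox). This is an illustrative sketch; the relative-area threshold `min_area_frac` is a hypothetical value standing in for the paper's empirically estimated threshold.

```python
def box_area(box):
    """Area of a bounding box (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def foreground_detections(detections, image_w, image_h, min_area_frac=0.01):
    """Drop likely-background objects before triple prediction.

    A detection whose box covers less than min_area_frac of the image
    (a hypothetical threshold; the paper estimates it empirically) is
    treated as background, so no spatial triples are formed for it.
    """
    min_area = min_area_frac * image_w * image_h
    return [(label, box) for label, box in detections
            if box_area(box) >= min_area]

# A tiny distant car (below threshold) is filtered out; the person is kept,
# so no spurious "car above person" triple can be produced for this image.
dets = [("person", (200, 300, 400, 700)), ("car", (10, 10, 30, 22))]
kept = foreground_detections(dets, image_w=800, image_h=800)
print([label for label, _ in kept])  # → ['person']
```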
Therefore, relations such as "above" and "below" may be confused with "farther" and "nearer". For example, if a car is detected in the background, the f(bbox) function makes an incorrect interpretation that the car is above the person. The KB flags this as unlikely and hence an erroneous detection, leading to a possibly good prediction being flagged as an error.

Addressing 2D vs. 3D errors: We calculate the area covered by a bounding box; if the area is less than an empirically estimated threshold, that object is considered to be detected in the background, and the corresponding triple (e.g., "car above person" in that image) is not predicted. This helps increase accuracy to ~80%.

Figure 5: Images with bounding boxes actually bad, flagged good. (In the 1st image, "suitcase" is also detected as "microwave"; in the 2nd image, "car" is detected as "cell phone".)

Figure 6: Images actually good, flagged bad. (In the 1st image, CSK-SNIFFER predicts "person inside person"; in the 2nd image, it predicts "car above car", hence flagged as bad.)

5. Conclusions and Roadmap

This paper synopsizes the demo (with approach and experiments) of a system, CSK-SNIFFER, that "sniffs" object detection errors in big data on an unseen target domain using spatial commonsense, with high accuracy and at no additional annotation cost. In this human-AI collaboration, the AI angle entails spatial CSK imbibed in the system and deployed via visual analytics to assist humans, while an important human role comes from the domain expert perspective in image tagging and task identification for training the system (in addition to the obvious human contribution of commonsense knowledge in the system). More significantly, the human and the AI contribute to the learning loop through feedback-based interactive learning and assistance in object detection, respectively. It is promising that our approach, based on simplicity, can automatically discover errors in data of significant volume and variety, and can be potentially useful in this learning setting. We demonstrate that, with high quality, we can generate large, complex adversarial datasets on target domains such as smart mobility.

Future work includes harnessing existing, potentially noisy and incomplete commonsense KBs in CSK-SNIFFER. Another direction is to study whether automatic adversarial datasets compiled with assistance from CSK-SNIFFER help train better models on novel target domains. Our work presents interesting facets of big data visualization and analytics along with human-AI collaboration.

6. Acknowledgments

A. Varde has NSF grants 2018575 (MRI: Acquisition of a High-Performance GPU Cluster for Research & Education) and 2117308 (MRI: Acquisition of a Multimodal Collaborative Robot System (MCROS) to Support Cross-Disciplinary Human-Centered Research & Education). She is a visiting researcher at the Max Planck Institute for Informatics, Germany.

References

[1] D. Wang, E. Churchill, P. Maes, X. Fan, B. Shneiderman, Y. Shi, Q. Wang, From human-human collaboration to human-AI collaboration: Designing AI systems that can work together with people, in: CHI, 2020, pp. 1–6.
[2] F. Zuo, J. Wang, J. Gao, K. Ozbay, X. J. Ban, Y. Shen, H. Yang, S. Iyer, An interactive data visualization and analytics tool to evaluate mobility and sociability trends during COVID-19, arXiv:2006.14882 (2020).
[3] X. Chen, A. Shrivastava, A. Gupta, NEIL: Extracting visual knowledge from web data, in: ICCV, 2013, pp. 1409–1416.
[4] K. Konyushkova, R. Sznitman, P. Fua, Learning active learning from data, Advances in Neural Information Processing Systems 30 (2017).
[5] D. Xin, L. Ma, J. Liu, S. Macke, S. Song, A. Parameswaran, Accelerating human-in-the-loop machine learning: Challenges and opportunities, in: ACM SIGMOD (DEEM workshop), 2018, pp. 1–4.
[6] A. Orlowski, P. Romanowska, Smart cities concept: Smart mobility indicator, Cybernetics and Systems (Taylor & Francis) 50 (2019) 118–131.
[7] A. Garg, N. Tandon, A. S. Varde, I am guessing you can't recognize this: Generating adversarial images for object detection using spatial commonsense, in: AAAI, 2020, pp. 13789–13790.
[8] N. Tandon, A. S. Varde, G. de Melo, Commonsense knowledge in machine intelligence, ACM SIGMOD Record 46 (2017) 49–52.
[9] M. Yatskar, V. Ordonez, A. Farhadi, Stating the obvious: Extracting visual common sense, NAACL (2016).
[10] J. Redmon, A. Farhadi, YOLO9000: Better, faster, stronger, CVPR (2016).