Knowledge Engineering with Image Data in Real-World Settings

Margaret Warren (a), David A. Shamma (b), and Patrick Hayes (a)

(a) Institute for Human and Machine Cognition, Pensacola, Florida, USA
(b) Centrum Wiskunde en Informatica (CWI), Amsterdam, The Netherlands

Abstract
We report on experiences in adding ML-trained visual recognition modules to a human-oriented image semantic annotation tool which creates RDF descriptions of images and scene contents. We conclude that ML cannot replace expert humans but can aid them in various ways, some unexpected. Semantic markup systems can be designed to align human and machine blind spots. Finally, we briefly outline directions for future work.

Keywords
human-centered, knowledge engineering, image annotation, AI, ML, computer vision, HCI

1. Introduction

We have a mature semantic markup system for images that allows subject-matter expert users to construct RDF knowledge graphs as image annotations, intended for use in domains where objects and relationships are specialized and require expertise to identify. With a view to improving the functionality of this system, we recently extended it by adding modern pre-trained visual classifiers and object recognition software to automate bounding box creation and suggest classification labels for objects in the image. While this automation has its advantages, principally the rapid localization of items in a photo, we see the addition of automatic vision systems as a technique for assisting rather than replacing human annotation.

2. Structured Relationship Annotations

There is no shortage of tooling for annotating images with object bounding boxes that enclose specific classes for training. Much of the work on this class of tools seeks to speed up the task of drawing or specifying the points around the target object [4]. Beyond simple boxes, relationships become important for scene understanding; for example, knowing that a coffee cup is on a table expresses a specific relationship (in this case, "is on") that provides more information than simply knowing that an image contains both objects. Annotating relationships also requires a distinctly different set of tooling than bounding-box labeling alone. These visual relationships are principally represented in the 2016 Visual Genome project [8], which contains over 100,000 images with 3.8 million object instances and 2.3 million relationships. Beyond the project's overall scale, the relationships in the dataset are dense: multiple relationships can exist between the same set of objects. However, while its labels are mapped to WordNet synsets, Visual Genome lacks commonly used structures, such as RDF or OWL, in its representations and provides no tooling for creating annotations.

In contrast, the ImageSnippets tool (a demonstration is available at http://imagesnippets.com/) was designed to experiment with ways to produce structured semantic image markup by allowing users with minimal training to create machine-readable, ontology-based image and scene descriptions in RDF. Image descriptions, referred to as semantic markup or image graphs, are created as RDF triple stores rooted at the image identifier and built on a core Lightweight Image Ontology (LIO) vocabulary of 11 relations. This vocabulary lets users quickly describe a variety of relations between objects and the scene, including the importance of an object to the scene and whether the object is in the foreground, in the background, or serves some other function, along with several other relations [6]. Users can also use the tool to add additional properties, describe spatial relations between objects, or engineer new ontologies, including those with OWL-type structures, based on image contents. Scene objects and image properties are mapped to entities found in DBpedia, Wikidata, and other publicly accessible linked-data corpora, or are custom-created if no existing concept can be found. Entity lookup is semi-automatic but guided by users through an intuitive interface.
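As a concrete illustration of the kind of markup this produces, the sketch below uses rdflib in Python to assemble a small image graph rooted at an image identifier. This is a minimal sketch, not the ImageSnippets implementation: the lio: namespace URI, the exact property names, and the Wikidata identifier are placeholders chosen for illustration, and the actual LIO vocabulary and entity mappings may differ.

```python
# Minimal sketch of an RDF image graph in the spirit of LIO-style markup.
# The lio: namespace URI, property names, and the Wikidata QID below are
# illustrative placeholders, not the published vocabulary.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

LIO = Namespace("http://example.org/lio#")          # placeholder namespace
DBR = Namespace("http://dbpedia.org/resource/")
WD = Namespace("http://www.wikidata.org/entity/")

g = Graph()
image = URIRef("http://example.org/images/example-001")  # the graph's root

# Scene-level context plus foreground/background structure.
g.add((image, LIO.hasSetting, DBR["Office"]))
g.add((image, LIO.hasInForeground, DBR["Desk"]))
g.add((image, LIO.hasInBackground, DBR["Motorcycle"]))

# A spatial relation between depicted objects (schematic; real markup
# would typically relate region instances rather than DBpedia classes).
g.add((DBR["Desk"], LIO.isUnder, DBR["Wall"]))

# Entities can additionally be linked to Wikidata (placeholder QID).
g.add((DBR["Motorcycle"], OWL.sameAs, WD["Q000000"]))

print(g.serialize(format="turtle"))
```

Serialized as Turtle, such a graph remains human-readable while also being queryable with standard SPARQL tooling.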
3. Experts Throughout the Loop

Figure 1: Remington bullets that the classifier predicted as lipstick; a human editor later incorrectly verified the prediction.

Our annotation applications typically involve specialist domain knowledge and create structured data backed by formal ontologies. In many annotation settings, image classification and object recognition are error-prone even with human verification (see Figure 1). It is essential that the outputs of automated image classification and recognition tools be evaluated by how they aid and support, rather than replace, expert human users. Moreover, human expertise is necessary not only as a final verification check but throughout the entire process [2].

Our experiences in adding automated annotations have highlighted several findings. First, locating and isolating items of potential interest in complex images is useful largely independently of the predicted annotation label. Second, the predicted label of a targeted bounding box may be helpful as a base qualification at a high level of generality in a typical formal ontological classification. This can aid the human annotator by directing their attention to the relevant topic and guiding the search for formal concepts. In other words, even trivial detections have utility in expert domains. We note that image recognition and human visual abilities often complement each other in these situations when working quickly with complex or crowded images. For these reasons, we assert that human experts must be involved throughout the annotation lifecycle in specialized domains.

4. Domain Example

Figure 2: Annotating an image of a room in a hospital after an airstrike.

Figure 2 shows an example using images collected after an airstrike on a hospital, where the user's goal is to engineer a knowledge graph from the accumulated evidence of war crimes. Beyond identifying objects like oxygen bottles, hospitals, classified aircraft, people, and damage types, one must also account for knowledge of the terrain and context. The figure also shows the system's interface at the point where the user has called on the object detector, which has found a chair, a TV, and a dining table in the image of the hospital room.

At this point, the user can decide whether each detected region should serve as the subject of a triple in the RDF annotation. If so, then regardless of whether the object is accurately identified, the user can accept the region and either accept the object label provided by the computer vision system (in which case the detected label is automatically mapped to the corresponding DBpedia and Wikidata entities, which become the object of the triple) or ignore the offered label and manually insert a correct one. In this example, a copier was located as a region of interest but misidentified as a television. However, even in this misidentified case, the vision detector plays a significant role in 'noticing' the object of interest and locating it in the image with a bounding box far more rapidly and reliably than a human user. The result is a correctly identified object, accurately located in the image through a synergistic collaboration between human expertise and ML-trained identification and classification, each compensating for the other's weaknesses. The user can then further adjust the triple by altering the relationship of the objects to the overall image using other terms in the LIO vocabulary, perhaps by specifying spatial relationships such as 'desk isUnder wall', adding context such as 'image hasSetting Office', or establishing scene relationships such as 'this image hasInBackground motorcycle'.
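The sketch below illustrates this accept-or-override decision. It is schematic rather than the actual ImageSnippets code: the Detection class and review_detection function are hypothetical names introduced only to show the shape of the review step.

```python
# Schematic sketch of the human-in-the-loop review step described above.
# Detection and review_detection are hypothetical names for illustration.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Detection:
    label: str                       # label predicted by the vision model
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in image pixels
    confidence: float

def review_detection(det: Detection,
                     accept_region: bool,
                     corrected_label: Optional[str] = None):
    """Return the (bbox, label) to assert as a triple, or None to discard.

    accept_region   -- the expert decides the region is worth annotating
    corrected_label -- the expert's label when the predicted one is wrong
    """
    if not accept_region:
        return None                          # region ignored entirely
    label = corrected_label or det.label     # keep or override the prediction
    return det.bbox, label                   # label is then mapped to DBpedia/Wikidata

# Example: the detector localized the copier correctly but called it a 'tv'.
suggestion = Detection(label="tv", bbox=(640, 210, 180, 230), confidence=0.71)
print(review_detection(suggestion, accept_region=True,
                       corrected_label="photocopier"))
# -> ((640, 210, 180, 230), 'photocopier')
```

The key design point is that the detector's output is always treated as a suggestion: the region and the label can each be accepted, corrected, or discarded independently by the expert.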
5. Future Work

To date, our work has primarily focused on integrating contemporary classifiers and detectors into a human-centered semantic annotation system. This work has, however, illuminated several new pathways. By observing first-hand problems such as underspecification in ML pipelines [1], we see utility for semantic annotation methods to become part of environments where data excellence can be incentivized [7] and where machine learning algorithms can be examined by people as part of internal AI auditing frameworks [5]. Future work will include the generation of expert-created training and test sets that can be fed back, using transfer and active learning methods, to create models that return increasingly precise object suggestions, as well as the production of test sets for research on spatial-relation scene graphs [3] and the creation of adversarial training sets through the rapid human identification of machine blind spots in current image classifiers. Combined with the already apparent utility of the machine for finding human blind spots [9], we feel we are working towards a beneficial synergy of blind-spot alignment, visual learning, and knowledge representation.
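One possible shape of that feedback loop is sketched below, under the assumption that expert-verified annotations are exported as simple training records. The record layout and the to_training_record name are illustrative assumptions, not an existing ImageSnippets export format.

```python
# Illustrative only: exporting expert-verified regions as training records
# that could later be used to fine-tune a detector via transfer or active
# learning. The record layout and function name are assumptions.
def to_training_record(image_uri: str, bbox, label: str, verified_by: str) -> dict:
    """Package one expert-verified annotation as a training example."""
    return {
        "image": image_uri,
        "bbox": list(bbox),          # (x, y, width, height) in pixels
        "label": label,              # the expert's final label, not the model's
        "verified_by": verified_by,  # provenance of the human verification
    }

record = to_training_record("http://example.org/images/example-001",
                            (640, 210, 180, 230), "photocopier", "expert-01")
print(record)
```

Records in which the expert overrode the model's prediction could equally serve as adversarial or test-set material, since each one documents a case the current classifier got wrong.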
Pease, "Representation and Retrieval of Images by Means of Spatial Relations Between Objects" AAAI Spring Symposium on Combining Machine Learning with Knowledge Engineering AAAI-MAKE (2019) http://ceur-ws.org/Vol-2350/paper7.pdf [4] D. P. Papadopoulos, J. R. Uijlings, F. Keller, & V. Ferrari, (2017). Training object class detectors with click supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6374-6383). [5] I. D. Raji, A. Smart., R. N. White, M. Mitchell, T. Gebru, B. Hutchinson, J. Smith-Loud, D. Theron, and P. Barnes, 2020, January. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 33-44). [6] M. Warren, P. J. Hayes, "Bounding Ambiguity: Experiences with an Image Annotation System." 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing SAD/CrowdBias@HCOMP. (2018). http://ceur-ws.org/Vol-2276/paper5.pdf [7] N. Sambasivan, S.Kapania, H. Highfill, D. Akrong, P. Paritosh, L.Aroyo. "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21). Association for Computing Machinery, New York, NY, USA, To appear. [8] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, J. Li, D. A. Shamma, M. Bernstein, F. Li, "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations." International Journal of Computer Vision 123, 32–73 (2017). https://doi.org/10.1007/s11263-016-0981-7 [9] Ramya Ramakrishnan, Ece Kamar, Besmira Nushi, Debadeepta Dey, Julie Shah, Eric Horvitz, 2019, "Overcoming Blind Spots in the Real World: Leveraging Complementary Abilities for Joint Execution." Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), 6137-6145