A Concept for Automated Polarized Web Content Annotation based on Multimodal Active Learning Marek Herde1[0000−0003−4908−122X] , Denis Huseljic1[0000−0001−6207−1494] , Jelena Mitrović2[0000−0003−3220−8749] , Michael Granitzer2[0000−0003−3566−5507] , and Bernhard Sick1[0000−0001−9467−656X] 1 University of Kassel, Wilhelmshöher Allee 73, 34121 Kassel, Germany {marek.herde | dhuseljic | bsick}@uni-kassel.de 2 University of Passau, Innstrasse 43, 94032 Passau, Germany {michael.granitzer | jelena.mitrovic}@uni-passau.de Abstract. Active learning (AL) techniques hardly cope with complex annotations tasks, where, for example, annotations might express rela- tionships across data modalities. As a use case, we consider the task of automatically detecting and reporting multimodal, polarized web con- tent (PWC). Samples of this content type emerge dynamically, covering a broad spectrum of topics. Thus, training machine learning systems for detecting PWC is challenging, particularly if it needs to be done with minimum annotation cost. In this article, we propose the concept of mul- timodal AL for complex annotations in the context of PWC detection and formulate the resulting challenges as questions for future research. Keywords: Active Learning · Multimodal Data · Semantic Annotation · Polarized Web Content · Hateful Memes. 1 Motivation Supervised machine learning (ML) relies on vast amounts of annotated data of- ten provided by human annotators in a labor-intensive process. Active learning (AL) addresses this problem of costly data annotation by intelligently querying annotators [2]. The goal is to maximize an ML system’s performance while min- imizing the annotation cost. Although AL techniques have shown their benefit for classification and regression tasks [7], they hardly cope with more complex annotation tasks, where annotations might – express relationships across data modalities (A1), – describe (semantic) relationships between concepts (A2), – come along with a high level of error-proneness and potential disagreement among annotators due to an ambiguous context (A3), – or require modeling background knowledge and sociodemographic factors of annotators to estimate the quality of annotations (A4). © 2022 for this paper by its authors. Use permitted under CC BY 4.0. 2 Herde M. et al.D. Huseljic, J. Mitrović, M. Granitzer, B. Sick Herde, As a use case, we consider the task of automatically detecting and reporting potential multimodal [9], abusive web content in political communication, which is in most cases strongly polarized. We use polarized web content (PWC) instead of related expressions such as hateful memes [4,5] to highlight this polarized nature. Generally, PWC comes in many forms, is subjective, depends on the context, and frequently requires background knowledge to be understood [13]. In this article, we refer to PWC as multimodal online content, mainly text and images, that can be found on social media and has, e.g., defamatory or abusive characteristics (at least from the viewpoint of certain groups of persons). The left side of Fig. 1 shows a PWC sample composed of an image of the burning World Trade Center on 09/11 and an image of a Muslim congresswoman, Mrs. Ilhan Abdullahi Omar. These two images are combined with a textual contradiction of “never forget” and “you have forgotten”. The polarized context arises from combining images and text (A1), which relates the concepts Twin Towers to Muslims and terrorism (A2). Identifying this polarization requires knowledge about American history and politics (A4) or otherwise may result in erroneous annotations (A3). Such PWC samples emerge dynamically and unforeseeably, covering a broad spectrum of concepts. Thus, training ML systems for detecting PWC is challenging, particularly if it needs to be done annotation cost-efficiently. Within this article, we view PWC detection as a challenging sample appli- cation with real-world impact [11] to initiate research on extending AL systems toward complex annotations of multimodal data. Therefore, we propose our con- cept of multimodal active learning for complex annotations (MALCOM) and formulate the associated challenges as questions for future research. 2 Concept We envision MALCOM as an extension of traditional AL [2], which assumes a single omniscient annotator providing categorical labels as annotations, toward (1) semantic annotation graphs (SAGs) [15] as complex, multimodal annotations and (2) an AL strategy selecting pairs of annotators and queries, e.g., samples. The objective is to semi-automatically build models that can identify PWC and analyze it by annotating a potential PWC sample with an SAG. Such an SAG describes the PWC samples’ contents, explains why its contents can be seen as polarized, and reflects the potential uncertainty in that analysis. Fig. 1 shows a PWC sample and its SAG to illustrate this objective. In the following, we outline our two envisioned extensions of AL and PWC detection in more detail. Extension 1 – Complex, Multimodal Annotations: Existing PWC de- tection approaches focus on standard supervised learning settings with cate- gorical labels as annotations [1,6,16]. The outputs or embeddings of vision and language models are typically combined as input for a final decision model. Our proposed SAGs represent an alternative combination strategy for the two modal- ities of images and text. SAGs allow decisions on a higher semantic level, which fosters explainability and decouples objective annotation tasks such as concept analysis of images and texts from more subjective decisions on polarization. We Polarized Web Content Annotation based on Multimodal Active Learning 3 ( ( ( ( ( ( Fig. 1. PWC sample1 with racist motive (left) and corresponding SAG (right) obtained by combined image and text analysis: Rounded rectangles represent concepts, arrows represent relations, and rectangular boxes represent inferred concepts. As a typical indicator of PWC, a contradicting relation is highlighted in red. AL (center) is applied for (1) unimodal image and text analysis and for (2) inferring whether a sample is polarized from the SAG through multimodal semantic analysis. In this simplified figure, we do not show additional information that is provided with the SAG, e.g., uncertainty regarding object classes or positions in images, relations beyond contradictions, etc. argue that this is a more efficient way of generating precise automatic classifica- tions of PWC. Methodologically, we have to go far beyond annotating images or text individually but considering their relationships. Annotations may describe positions of objects in images (regions of interest), comparisons of two images or texts, the importance of specific contexts for decisions, a degree of polarization, confidence estimates regarding decisions, etc. We need to develop a proper se- mantic model, e.g., ontologies [8,12], covering the different modalities and being understandable for annotators. This also includes the ability to express very dif- ferent PWC concepts over different modalities that go beyond contradictions but include more fuzzy concepts such as antitheses or correlations between concepts. Extension 2 – Query and Annotator Selection: Identifying PWC re- quires contextual knowledge of (very recent) events, e.g., pandemics [14]. So instead of building one generic model, we aim at building specialized models for different kinds of PWC, which use pre-trained models (per modality), and fine- tune them in an AL cycle. Extending the AL cycle towards complex annotations of multimodal data, as sketched in Fig. 2, starts with the question of integrating different modalities. First, we consider a pool of annotated unimodal data, i.e., texts and images, which we use to create unimodal models that can annotate 1 Image above is a compilation of assets, including ©Getty Images/Spencer Platt and ©Getty Images/Adam Bettcher, used under the “Hateful Memes Dataset License Agreement”. It is taken from “The Hateful Memes Challenge” [5] for illustrative purposes only and any person depicted in the content is a model. 4 Herde M. et al.D. Huseljic, J. Mitrović, M. Granitzer, B. Sick Herde, Error-prone Annotators useful queries annotated queries (1) selects extends training set (2) query-annotator pairs and ontology Annotator Query Set Model Training Set Image Data Text Data SAGs Image Data Text Data SAGs Text 49: We blame ... Text 67: We blame ... Text 1: We blame ... Active Text 2: We blame ... Text 3: We blame ... Learning Annotations ... Text N: We blame ... Strategy assesses utilities ML Models controls training/ of queries fine-tuning procedure (4) (3) information through models updated PWC corpus Fig. 2. AL cycle for MALCOM with four main steps: (1) Useful queries are selected from a set of all possible queries regarding potential PWC. For example, we may query annotations for the objects in an image or ask whether an SAG is polarized. (2) Selected queries are presented to a subset of annotators with possibly different (e.g., educational) backgrounds. This subset is determined through an ML-based annotator model estimating the annotators’ qualifications. Subsequently, the annotated queries update the training set. (3) The training set representing the current PWC corpus is used to (re-)train several ML models, e.g., an object detection model. (4) The trained models provide information regarding the query set such that the AL cycle starts again using this information for query selection. the unimodal data semantically. This process in each case results in an SAG, i.e., a typed, attributed graph defined through an ontology-based annotation scheme. Later, the SAGs are merged into a joint, multimodal SAG. Similar to traditional AL strategies, we need to identify promising candidates – initially images and texts, later multimodal SAGs – to be annotated. To consider the problem’s multimodal nature, the annotations’ semantic properties, and the an- notators’ diverse backgrounds, we must develop new AL selection strategies that account not only for the respective data sample but also for the different kinds of queries and the qualifications of certain annotators regarding the PWC sample at hand. These qualifications (also referred to as annotator performance [2]) may depend on various aspects such as the respective PWC category (e.g., politics) or educational background (e.g., Master’s degree in political sciences). The an- notator model predicting such qualifications needs to be sensitive to annotator minorities, e.g., by estimating similarities between annotators. Otherwise, we risk ignoring annotator minorities’ opinions regarding PWC. Moreover, we must consider that answers regarding the degree to which content is polarized may be highly subjective, i.e., uncertain from an ML perspective [3]. Establishing an objective definition of PWC, similar to hate speech research [10], is a possible way of reducing the subjectivity of PWC annotation. Polarized Web Content Annotation based on Multimodal Active Learning 5 3 Research Questions We conclude this article with the following six research questions derived from the above key research objective and the required extensions. – How can we define ontology-based annotation schemes to express a human’s reasoning over classifying web content as (gradually) polarized or not? – How can we extract image descriptions (part of the SAG) from potentially polarized images (part of the PWC) considering different uncertainty types? – How can we extend AL for object detection in potentially polarized images? – How can we extend AL over text extracted from the images to identify rhetorical figures and automatically analyze textual content to create se- mantic annotations automatically? – How can we merge unimodal SAGs and extend AL to train models, e.g., graph convolutional networks [17], assessing PWC via multimodal SAGs? – How can we evaluate the above techniques and build or extend data cor- pora [5] for research? Acknowledgements SPONSORED BY THE The project on which this article is based was partly funded by the German Federal Ministry of Education and Research (BMBF) under the funding code 01|S20049. The authors are responsible for the content of this publication. Furthermore, the authors thank Chandana Priya Nivarthi, Stephan Vogt, Mohammad Wazed Ali, and the anonymous reviewers for their insightful comments to improve this article. References 1. Gomez, R., Gibert, J., Gomez, L., Karatzas, D.: Exploring Hate Speech Detection in Multimodal Publications. In: WACV. pp. 1470–1478. Snowmass Village, CO (2020) 2. Herde, M., Huseljic, D., Sick, B., Calma, A.: A Survey on Cost Types, Interaction Schemes, and Annotator Performance Models in Selection Algorithms for Active Learning in Classification. IEEE Access 9, 166970–166989 (2021) 3. Huseljic, D., Sick, B., Herde, M., Kottke, D.: Separation of Aleatoric and Epistemic Uncertainty in Deterministic Deep Neural Networks. In: ICPR. pp. 9172–9179. Virtual (2021) 4. Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Fitzpatrick, C.A., Bull, P., Lipstein, G., Nelli, T., Zhu, R., et al.: The Hateful Memes Challenge: Com- petition Report. In: NeurIPS 2020 Competition and Demonstration Track. pp. 344–360. Virtual (2021) 5. Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., Testuggine, D.: The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes. In: NeurIPS. pp. 2611–2624. Virtual (2020) 6 Herde M. et al.D. Huseljic, J. Mitrović, M. Granitzer, B. Sick Herde, 6. Kumar, A., Sachdeva, N.: Multimodal cyberbullying detection using capsule net- work with dynamic routing and deep convolutional neural network. Multimed. Syst. (2021) 7. Kumar, P., Gupta, A.: Active Learning Query Strategies for Classification, Regres- sion, and Clustering: A Survey. JCST 35(4), 913–945 (2020) 8. Kühn, R., Mitrović, J., Granitzer, M.: GRhOOT: Ontology of Rhetorical Figures in German. In: LREC. Marseille, France (2022) 9. Lahat, D., Adali, T., Jutten, C.: Multimodal Data Fusion: An Overview of Meth- ods, Challenges, and Prospects. Proceedings of the IEEE 103(9), 1449–1477 (2015) 10. MacAvaney, S., Yao, H.R., Yang, E., Russell, K., Goharian, N., Frieder, O.: Hate speech detection: Challenges and solutions. PLOS ONE 14(8), 1–16 (2019) 11. Mishra, P., Yannakoudakis, H., Shutova, E.: Tackling Online Abuse: A Survey of Automated Abuse Detection Methods. arXiv:1908.06024 (2019) 12. Mitrović, J., O’Reilly, C., Mladenović, M., Handschuh, S.: Ontological represen- tations of rhetorical figures for argument mining. Argument & Computat. 8(3), 267–287 (2017) 13. Sood, S.O., Antin, J., Churchill, E.: Using Crowdsourcing to Improve Profanity Detection. In: AAAI Spring Symposium 2012 – Wisdom of the Crowd. pp. 69–74. Palo Alto, CA (2012) 14. Uyheng, J., Carley, K.M.: Bots and online hate during the COVID-19 pandemic: case studies in the United States and the Philippines. JCSS 3(2), 445–468 (2020) 15. Vidal, J.C., Lama, M., Otero-Garcı́a, E., Bugarı́n, A.: Graph-based semantic an- notation for enriching educational content with linked data. KBS 55, 29–42 (2014) 16. Yang, F., Peng, X., Ghosh, G., Shilon, R., Ma, H., Moore, E., Predovic, G.: Ex- ploring Deep Multimodal Fusion of Text and Photo for Hate Speech Classification. In: ALW. pp. 11–18. Florence, Italy (2019) 17. Zhang, S., Tong, H., Xu, J., Maciejewski, R.: Graph convolutional networks: a comprehensive review. Comput. Soc. Netw. 6(1), 1–23 (2019)