<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Concept for Automated Polarized Web Content Annotation based on Multimodal Active Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name><given-names>Marek</given-names> <surname>Herde</surname></string-name>
          <email>marek.herde@uni-kassel.de</email>
          <xref ref-type="aff" rid="aff1" />
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Denis</given-names> <surname>Huseljic</surname></string-name>
          <email>dhuseljic@uni-kassel.de</email>
          <xref ref-type="aff" rid="aff1" />
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Bernhard</given-names> <surname>Sick</surname></string-name>
          <email>bsick@uni-kassel.de</email>
          <xref ref-type="aff" rid="aff1" />
        </contrib>
        <aff id="aff1">
          <institution>University of Kassel</institution>
          ,
          <addr-line>Wilhelmshöher Allee, Kassel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff0">
          <institution>University of Passau</institution>
          ,
          <addr-line>Innstrasse 43, 94032 Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <permissions>
        <copyright-statement>© 2022 for this paper by its authors. Use permitted under CC BY 4.0.</copyright-statement>
      </permissions>
      <abstract>
        <p>Active learning (AL) techniques hardly cope with complex annotation tasks, where, for example, annotations might express relationships across data modalities. As a use case, we consider the task of automatically detecting and reporting multimodal, polarized web content (PWC). Samples of this content type emerge dynamically, covering a broad spectrum of topics. Thus, training machine learning systems for detecting PWC is challenging, particularly if it needs to be done with minimum annotation cost. In this article, we propose the concept of multimodal AL for complex annotations in the context of PWC detection and formulate the resulting challenges as questions for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Active Learning</kwd>
        <kwd>Multimodal Data</kwd>
        <kwd>Semantic Annotation</kwd>
        <kwd>Polarized Web Content</kwd>
        <kwd>Hateful Memes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Supervised machine learning (ML) relies on vast amounts of annotated data, often provided by human annotators in a labor-intensive process. Active learning (AL) addresses this problem of costly data annotation by intelligently querying annotators [<xref ref-type="bibr" rid="ref2">2</xref>]. The goal is to maximize an ML system’s performance while minimizing the annotation cost. Although AL techniques have shown their benefit for classification and regression tasks [<xref ref-type="bibr" rid="ref7">7</xref>], they hardly cope with more complex annotation tasks, where annotations might:
      </p>
      <list list-type="simple">
        <list-item><p>– express relationships across data modalities (A1),</p></list-item>
        <list-item><p>– describe (semantic) relationships between concepts (A2),</p></list-item>
        <list-item><p>– come along with a high level of error-proneness and potential disagreement among annotators due to an ambiguous context (A3),</p></list-item>
        <list-item><p>– or require modeling background knowledge and sociodemographic factors of annotators to estimate the quality of annotations (A4).</p></list-item>
      </list>
      <p>
        As a use case, we consider the task of automatically detecting and reporting potential multimodal [<xref ref-type="bibr" rid="ref9">9</xref>], abusive web content in political communication, which is in most cases strongly polarized. We use polarized web content (PWC) instead of related expressions such as hateful memes [<xref ref-type="bibr" rid="ref4 ref5">4,5</xref>] to highlight this polarized nature. Generally, PWC comes in many forms, is subjective, depends on the context, and frequently requires background knowledge to be understood [<xref ref-type="bibr" rid="ref13">13</xref>]. In this article, we refer to PWC as multimodal online content, mainly text and images, that can be found on social media and has, e.g., defamatory or abusive characteristics (at least from the viewpoint of certain groups of persons). The left side of Fig. 1 shows a PWC sample composed of an image of the burning World Trade Center on 09/11 and an image of a Muslim congresswoman, Mrs. Ilhan Abdullahi Omar. These two images are combined with a textual contradiction of “never forget” and “you have forgotten”. The polarized context arises from combining images and text (A1), which relates the concepts Twin Towers to Muslims and terrorism (A2). Identifying this polarization requires knowledge about American history and politics (A4) or otherwise may result in erroneous annotations (A3). Such PWC samples emerge dynamically and unforeseeably, covering a broad spectrum of concepts. Thus, training ML systems for detecting PWC is challenging, particularly if it needs to be done annotation cost-efficiently.
      </p>
      <p>
        Within this article, we view PWC detection as a challenging sample
application with real-world impact [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to initiate research on extending AL systems
toward complex annotations of multimodal data. Therefore, we propose our
concept of multimodal active learning for complex annotations (MALCOM) and
formulate the associated challenges as questions for future research.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Concept</title>
      <p>
        We envision MALCOM as an extension of traditional AL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which assumes a
single omniscient annotator providing categorical labels as annotations, toward
(1) semantic annotation graphs (SAGs) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] as complex, multimodal annotations
and (2) an AL strategy selecting pairs of annotators and queries, e.g., samples.
The objective is to semi-automatically build models that can identify PWC and
analyze it by annotating a potential PWC sample with an SAG. Such an SAG
describes the PWC sample’s contents, explains why they can be seen as
polarized, and reflects the potential uncertainty in that analysis. Fig. 1 shows a
PWC sample and its SAG to illustrate this objective. In the following, we outline
our two envisioned extensions of AL and PWC detection in more detail.
      </p>
      <p>
        Extension 1 – Complex, Multimodal Annotations: Existing PWC detection approaches focus on standard supervised learning settings with categorical labels as annotations [<xref ref-type="bibr" rid="ref1 ref16 ref6">1,6,16</xref>]. The outputs or embeddings of vision and language models are typically combined as input for a final decision model. Our proposed SAGs represent an alternative combination strategy for the two modalities of images and text. SAGs allow decisions on a higher semantic level, which fosters explainability and decouples objective annotation tasks, such as concept analysis of images and texts, from more subjective decisions on polarization. We argue that this is a more efficient way of generating precise automatic classifications of PWC. Methodologically, we have to go far beyond annotating images or text individually and instead consider their relationships. Annotations may describe positions of objects in images (regions of interest), comparisons of two images or texts, the importance of specific contexts for decisions, a degree of polarization, confidence estimates regarding decisions, etc. We need to develop a proper semantic model, e.g., ontologies [<xref ref-type="bibr" rid="ref12 ref8">8,12</xref>], covering the different modalities and being understandable for annotators. This also includes the ability to express very different PWC concepts over different modalities that go beyond contradictions and include fuzzier concepts such as antitheses or correlations between concepts.
      </p>
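      <p>To make the structure of such annotations concrete, the following minimal sketch (in Python; all class and field names, as well as the coordinates and confidence values, are hypothetical choices of ours, not an existing library or implementation) models an SAG as a typed, attributed graph whose nodes are ontology concepts anchored in a modality and whose edges carry relation types and confidence estimates, using the Fig. 1 sample:</p>
      <preformat>
from dataclasses import dataclass, field

@dataclass
class SAGNode:
    node_id: str
    concept: str               # ontology concept, e.g., "Twin Towers"
    modality: str              # "image", "text", or "multimodal"
    region: tuple = None       # optional region of interest in an image
    attributes: dict = field(default_factory=dict)

@dataclass
class SAGEdge:
    source: str                # node_id of the source node
    target: str                # node_id of the target node
    relation: str              # ontology relation, e.g., "contradicts"
    confidence: float = 1.0    # uncertainty of the analysis (A3)

@dataclass
class SAG:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# Fig. 1 example: combining images and text creates the polarized
# context (A1) and relates concepts on a semantic level (A2).
# Regions of interest are made-up coordinates for illustration.
sag = SAG(
    nodes=[
        SAGNode("n1", "Twin Towers", "image", region=(0, 0, 320, 240)),
        SAGNode("n2", "Ilhan Omar", "image", region=(0, 240, 320, 480)),
        SAGNode("n3", "never forget", "text"),
        SAGNode("n4", "you have forgotten", "text"),
    ],
    edges=[
        SAGEdge("n3", "n4", "contradicts", confidence=0.9),
        SAGEdge("n1", "n3", "refers_to", confidence=0.8),
    ],
)
      </preformat>
      <p>A degree of polarization or further context could be attached via the attribute dictionaries, while the ontology would constrain the admissible concept and relation types.</p>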
      <p>
        Extension 2 – Query and Annotator Selection: Identifying PWC requires contextual knowledge of (very recent) events, e.g., pandemics [<xref ref-type="bibr" rid="ref14">14</xref>]. So instead of building one generic model, we aim at building specialized models for different kinds of PWC, which use pre-trained models (per modality) and fine-tune them in an AL cycle. Extending the AL cycle towards complex annotations of multimodal data, as sketched in Fig. 2, starts with the question of integrating different modalities. First, we consider a pool of annotated unimodal data, i.e., texts and images, which we use to create unimodal models that can annotate the unimodal data semantically.
      </p>
      <p><sup>1</sup> The image above is a compilation of assets, including ©Getty Images/Spencer Platt and ©Getty Images/Adam Bettcher, used under the “Hateful Memes Dataset License Agreement”. It is taken from “The Hateful Memes Challenge” [<xref ref-type="bibr" rid="ref5">5</xref>] for illustrative purposes only, and any person depicted in the content is a model.</p>
      <p>[Fig. 2 (schematic of the envisioned AL cycle): an Active Learning Strategy assesses the utilities of queries and selects query-annotator pairs; error-prone annotators answer useful queries; the annotated queries, i.e., SAGs over text and image data, extend the training set and ontology; ML models and an Annotator Model control the training/fine-tuning procedure and pass information back to the strategy, yielding an updated PWC corpus.]</p>
      <p>
        This process in each case results in an SAG, i.e., a typed, attributed graph defined through an ontology-based annotation scheme. Later, the SAGs are merged into a joint, multimodal SAG. Similar to traditional AL strategies, we need to identify promising candidates – initially images and texts, later multimodal SAGs – to be annotated. To consider the problem’s multimodal nature, the annotations’ semantic properties, and the annotators’ diverse backgrounds, we must develop new AL selection strategies that account not only for the respective data sample but also for the different kinds of queries and the qualifications of certain annotators regarding the PWC sample at hand. These qualifications (also referred to as annotator performance [<xref ref-type="bibr" rid="ref2">2</xref>]) may depend on various aspects such as the respective PWC category (e.g., politics) or educational background (e.g., a Master’s degree in political science). The annotator model predicting such qualifications needs to be sensitive to annotator minorities, e.g., by estimating similarities between annotators. Otherwise, we risk ignoring annotator minorities’ opinions regarding PWC. Moreover, we must consider that answers regarding the degree to which content is polarized may be highly subjective, i.e., uncertain from an ML perspective [<xref ref-type="bibr" rid="ref3">3</xref>]. Establishing an objective definition of PWC, similar to hate speech research [<xref ref-type="bibr" rid="ref10">10</xref>], is a possible way of reducing the subjectivity of PWC annotation.
      </p>
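      <p>As a minimal illustration of such a selection strategy (a sketch of ours under simplifying assumptions, not an established algorithm; all function and field names are hypothetical), the utility of a query-annotator pair could combine the current model’s uncertainty about the query with the annotator’s predicted qualification for the query’s PWC category:</p>
      <preformat>
import math

def query_utility(class_probs):
    # Shannon entropy of the model's prediction: high entropy means
    # the model is uncertain about the candidate query.
    return -sum(p * math.log(p) for p in class_probs if p > 0)

def annotator_qualification(annotator, query):
    # Hypothetical annotator model: predicted probability that this
    # annotator annotates the query's PWC category correctly, e.g.,
    # derived from category expertise and educational background.
    return annotator["performance_per_category"].get(query["category"], 0.5)

def select_pair(queries, annotators):
    # Greedily pick the query-annotator pair with maximal utility.
    best, best_score = None, -math.inf
    for query in queries:
        for annotator in annotators:
            score = (query_utility(query["class_probs"])
                     * annotator_qualification(annotator, query))
            if score > best_score:
                best, best_score = (query, annotator), score
    return best

queries = [
    {"id": 17, "category": "politics", "class_probs": [0.55, 0.45]},
    {"id": 42, "category": "pandemics", "class_probs": [0.95, 0.05]},
]
annotators = [
    {"id": "a1", "performance_per_category": {"politics": 0.9}},
    {"id": "a2", "performance_per_category": {"pandemics": 0.8}},
]
query, annotator = select_pair(queries, annotators)  # query 17 goes to a1
      </preformat>
      <p>A full strategy would additionally trade off annotation costs [<xref ref-type="bibr" rid="ref2">2</xref>] and remain sensitive to annotator minorities, e.g., via estimated similarities between annotators.</p>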
    </sec>
    <sec id="sec-3">
      <title>Research Questions</title>
      <p>We conclude this article with the following six research questions derived from
the above key research objective and the required extensions.</p>
      <list list-type="simple">
        <list-item><p>– How can we define ontology-based annotation schemes to express a human’s reasoning over classifying web content as (gradually) polarized or not?</p></list-item>
        <list-item><p>– How can we extract image descriptions (part of the SAG) from potentially polarized images (part of the PWC) considering different uncertainty types?</p></list-item>
        <list-item><p>– How can we extend AL for object detection in potentially polarized images?</p></list-item>
        <list-item><p>– How can we extend AL over text extracted from the images to identify rhetorical figures and analyze textual content to create semantic annotations automatically?</p></list-item>
        <list-item><p>– How can we merge unimodal SAGs and extend AL to train models, e.g., graph convolutional networks [<xref ref-type="bibr" rid="ref17">17</xref>], assessing PWC via multimodal SAGs? (A minimal merging sketch follows this list.)</p></list-item>
        <list-item><p>– How can we evaluate the above techniques and build or extend data corpora [<xref ref-type="bibr" rid="ref5">5</xref>] for research?</p></list-item>
      </list>
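      <p>To make the merging question more tangible, the following sketch (our own illustration, reusing the hypothetical SAG classes from the sketch in Sect. 2, not a method from the literature) joins two unimodal SAGs into a multimodal SAG by unifying nodes that refer to the same ontology concept:</p>
      <preformat>
def merge_sags(image_sag, text_sag):
    # Merge two unimodal SAGs into a joint, multimodal SAG: nodes that
    # refer to the same ontology concept are unified, all edges are kept,
    # and cross-modal relations (A1) can be added to the result afterwards.
    merged = SAG()
    concept_to_node = {}
    for node in image_sag.nodes + text_sag.nodes:
        if node.concept in concept_to_node:
            # Concept appears in both modalities: mark it as multimodal.
            concept_to_node[node.concept].modality = "multimodal"
        else:
            concept_to_node[node.concept] = node
            merged.nodes.append(node)
    # Redirect edges to the unified representative of each concept.
    remap = {node.node_id: concept_to_node[node.concept].node_id
             for node in image_sag.nodes + text_sag.nodes}
    for edge in image_sag.edges + text_sag.edges:
        merged.edges.append(SAGEdge(remap[edge.source], remap[edge.target],
                                    edge.relation, edge.confidence))
    return merged
      </preformat>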
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>The project on which this article is based was partly funded by the German Federal Ministry of Education and Research (BMBF) under the funding code 01IS20049. The authors are responsible for the content of this publication. Furthermore, the authors thank Chandana Priya Nivarthi, Stephan Vogt, Mohammad Wazed Ali, and the anonymous reviewers for their insightful comments, which helped to improve this article.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karatzas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exploring Hate Speech Detection in Multimodal Publications</article-title>
          . In: WACV. pp.
          <fpage>1470</fpage>
          -
          <lpage>1478</lpage>
          . Snowmass Village, CO (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Herde</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huseljic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sick</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A Survey on Cost Types, Interaction Schemes, and Annotator Performance Models in Selection Algorithms for Active Learning in Classification</article-title>
          .
          <source>IEEE Access 9</source>
          ,
          <fpage>166970</fpage>
          -
          <lpage>166989</lpage>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Huseljic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sick</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herde</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kottke</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Separation of Aleatoric and Epistemic Uncertainty in Deterministic Deep Neural Networks</article-title>
          .
          <source>In: ICPR</source>
          . pp.
          <fpage>9172</fpage>
          -
          <lpage>9179</lpage>
          .
          <string-name>
            <surname>Virtual</surname>
          </string-name>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kiela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firooz</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goswami</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fitzpatrick</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bull</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipstein</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nelli</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <source>The Hateful Memes Challenge: Competition Report. In: NeurIPS 2020 Competition and Demonstration Track</source>
          . pp.
          <fpage>344</fpage>
          -
          <lpage>360</lpage>
          .
          <string-name>
            <surname>Virtual</surname>
          </string-name>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kiela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firooz</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goswami</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ringshia</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Testuggine</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes</article-title>
          . In: NeurIPS. pp.
          <fpage>2611</fpage>
          -
          <lpage>2624</lpage>
          .
          <string-name>
            <surname>Virtual</surname>
          </string-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sachdeva</surname>
          </string-name>
          , N.:
          <article-title>Multimodal cyberbullying detection using capsule network with dynamic routing and deep convolutional neural network</article-title>
          .
          <source>Multimed. Syst</source>
          . (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Active Learning Query Strategies for Classification, Regression, and Clustering: A Survey</article-title>
          .
          <source>JCST</source>
          <volume>35</volume>
          (
          <issue>4</issue>
          ),
          <fpage>913</fpage>
          -
          <lpage>945</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Ku¨hn, R., Mitrovi´c, J.,
          <string-name>
            <surname>Granitzer</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>GRhOOT: Ontology of Rhetorical Figures in German</article-title>
          . In: LREC. Marseille, France (
          <year>2022</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lahat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adali</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jutten</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Multimodal Data Fusion: An Overview of Methods, Challenges, and Prospects</article-title>
          .
          <source>Proceedings of the IEEE</source>
          <volume>103</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1449</fpage>
          -
          <lpage>1477</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>MacAvaney</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goharian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frieder</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Hate speech detection: Challenges and solutions</article-title>
          .
          <source>PLOS ONE 14(8)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mishra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yannakoudakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shutova</surname>
          </string-name>
          , E.:
          <article-title>Tackling Online Abuse: A Survey of Automated Abuse Detection Methods</article-title>
          . arXiv:
          <year>1908</year>
          .
          <volume>06024</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Mitrovi´c, J.,
          <string-name>
            <given-names>O</given-names>
            <surname>'Reilly</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          , Mladenovi´c,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Handschuh</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          :
          <article-title>Ontological representations of rhetorical figures for argument mining</article-title>
          .
          <source>Argument &amp; Computat. 8</source>
          (
          <issue>3</issue>
          ),
          <fpage>267</fpage>
          -
          <lpage>287</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sood</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Churchill</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Using Crowdsourcing to Improve Profanity Detection</article-title>
          . In: AAAI Spring Symposium 2012 -
          <article-title>Wisdom of the Crowd</article-title>
          . pp.
          <fpage>69</fpage>
          -
          <lpage>74</lpage>
          . Palo Alto, CA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Uyheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carley</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          :
          <article-title>Bots and online hate during the COVID-19 pandemic: case studies in the United States and the Philippines</article-title>
          .
          <source>JCSS</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <fpage>445</fpage>
          -
          <lpage>468</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vidal</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lama</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Otero-Garc´ıa, E., Bugar´ın, A.:
          <article-title>Graph-based semantic annotation for enriching educational content with linked data</article-title>
          .
          <source>KBS 55</source>
          ,
          <fpage>29</fpage>
          -
          <lpage>42</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shilon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Ma, H.,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Predovic</surname>
          </string-name>
          , G.:
          <article-title>Exploring Deep Multimodal Fusion of Text and Photo for Hate Speech Classification</article-title>
          . In: ALW. pp.
          <fpage>11</fpage>
          -
          <lpage>18</lpage>
          . Florence,
          <string-name>
            <surname>Italy</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maciejewski</surname>
          </string-name>
          , R.:
          <article-title>Graph convolutional networks: a comprehensive review</article-title>
          .
          <source>Comput. Soc. Netw</source>
          .
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>