=Paper=
{{Paper
|id=Vol-3180/paper-252
|storemode=property
|title=Aramis at Touché 2022: Argument Detection in Pictures using Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-252.pdf
|volume=Vol-3180
|authors=Jan Braker,Lorenz Heinemann,Tobias Schreieder
|dblpUrl=https://dblp.org/rec/conf/clef/BrakerHS22
}}
==Aramis at Touché 2022: Argument Detection in Pictures using Machine Learning==
Aramis at Touché 2022: Argument Detection in Pictures using Machine Learning
Notebook for the Touché Lab on Argument Retrieval at CLEF 2022

Jan Braker¹, Lorenz Heinemann² and Tobias Schreieder³
¹ Student at Leipzig University, Computer Science (M.Sc.)
² Student at Leipzig University, Computer Science (M.Sc.)
³ Student at Leipzig University, Data Science (M.Sc.)

Abstract
This work deals with classifying and retrieving images in a data set. Images that argue for or against a topic should be recognized and ranked according to their argumentativeness. For this purpose, different approaches are tested and compared with each other. The best results are provided by a neural network that was trained to recognize argumentative images on a total of 10,000 labeled images. The model receives various features as input, including color, image text and further image properties. In addition, initial attempts are made to classify the images and their websites, in relation to a given question, according to their stance into "pro" (the thesis of the question is supported), "con" (the thesis of the question is attacked) and "neutral" (the thesis of the question is supported to the same extent as it is attacked).

Keywords
machine learning, search engine, argument retrieval, neural network, image search

1. Introduction
The availability of information on the Internet is constantly growing. All topics are represented on the Internet with public statements from various points of view. Due to the unequal distribution of opinions and one-sided reporting, it is often difficult to guarantee a neutral and balanced search result. Even the evaluation of such a search result causes problems, and retrieval for controversial search queries is particularly demanding. For this purpose, there are special argument search engines that filter arguments related to a topic [1, 2, 3]. However, at the moment this is usually only possible for arguments in text form [4]. Although there are opinions that images cannot be argumentative on their own [5], there are also hypotheses to the contrary. Kjeldsen et al. describe in their work the argumentative character of images and graphics and their various functions [6]. Through the visual component, they can clarify a problem to the viewer and highlight arguments given in text form. Certain facts can be presented more convincingly by means of a picture than would be possible in written form [7].

CLEF’22: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
jb64vyso@studserv.uni-leipzig.de (J. Braker); lh31gyzy@studserv.uni-leipzig.de (L. Heinemann); fp83rusi@studserv.uni-leipzig.de (T. Schreieder)

Figure 1: Accident at the finish sprint of a professional bike race; six riders were seriously injured. Source: Tomasz Markowski/Associated Press [8]

For example, in professional cycling there is a debate about whether the barriers in the finish area should be improved and replaced, as too many accidents are caused by the old fences [8]. A presentation of last year's accident statistics can be a useful and convincing argument for increasing safety in cycling. It is also possible to describe the consequences of an accident to highlight its scale.
“In a professional race, six riders crashed heavily, with three of them suffering brain and bone damage.“ [8] People can generally imagine text worse than a photo [6]. A picture of the accident increases the visualization and may reinforce the importance of renewing safety measures. A photo of the accident can be seen in Figure 1. The effective interplay of text and visual graphics is also exploited by the law of mandatory pictorial warnings on cigarette packs in Germany. The dissuasive images of long-term conse- quences of smoking are intended to discourage people from smoking. According to research by the German Bundestag, these warnings are more effective than a simple warning in text form. Especially the combination of picture and text achieves high efficiency [7]. For this reason, it is of great interest to highlight such arguments and argumentative images in a search query. This would be a useful extension, especially for special argument search engines. A method that can classify pixel-based representations as argumentative has not yet been extensively researched in the literature. The aim of this work is to develop a system that can assess and evaluate images according to their argumentative power in order to drive development forward. One possible application would be to obtain arguments from a search query not only in text but also in image form, to gain an even more detailed overview of the searched problem. Finally, an attempt should be made to assign these arguments to a supportive, neutral or negative stance, whereby the focus of this work is clearly on the recognition of argumentativeness. 2. Related Work Many previous works have dealt with the problem of finding arguments in text collections. The three largest search engines specifically designed for this task are args.me [1], IBM-debater [2] and ArgumenText [3]. They all make it possible to search for arguments in texts for a controversial issue and provide arguments in an ordered and clear manner. Wachsmuth et al. have already dealt with this problem in many works [9, 10, 4] and have successfully shown that it is basically possible to extract text excerpts reliably and determine their stance to the topic. However, they are limited to arguments in text form. Currently there is no published argument search engine that includes images in a search, but conventional image search engines can also achieve adequate results if clearly structured search queries are used [11]. This work will investigate whether it is possible to integrate images as a result of argument search queries. 2.1. Image search and image features Compared to working with texts, the steps of indexing and feature extraction have to be adjusted in the retrieval process for images. Latif et al. summarize in their review article the current technologies and procedures very well [12]. The different features of images are presented. Particularly important for this work are the color-based properties. The concept of dominant colors is presented and referred to the work of Shao et al. [13]. They reduce the countless colors of an image to a few representative colors to search for color-level images faster and more effectively. From this it can be concluded that only a few colors are enough to represent the color conditions of a picture. According to Solli et al. it is also possible to establish a connection between emotions and colors. [14]. They show that people experience similar emotions when looking at certain shades in pictures. 
Emotions are an important part of arguments [14]. From this it can be concluded that colors might also play a central role in assessing the argumentativeness of an image. In addition, the objects represented are important for the message of the image. These can be reliably detected by object recognition. Mokshin et al. use distinctive structures and shapes in the images for this purpose [15]. However, detection often requires specially trained neural networks, which are trained on large training data sets to reliably detect only a few objects [16]. This method is not suitable for this work, which has to assess many different kinds of images, since every object to be recognized would require a training data set with several images of it.

2.2. Can pictures contain an argument?
Whether arguments can be found in pictures at all is also viewed critically by many works. Fleming et al. hold the opinion that images cannot contain an argument on their own and can only be supportive [17]. According to Fleming et al., an argument must consist of a claim and a support for that claim, and an image cannot perform both functions at the same time. This view is also supported by Champagne et al. [5]. Both complain that a visual representation lacks a textual component to be a clear argument. However, many pictures also contain texts and diagrams. This would allow them to meet the requirements of Fleming et al. for an argument. The text on images gives the visual elements a context in which to understand and classify them, which according to Kjeldsen et al. is an important part of looking at pictures [18]. This fact must be taken into account in the evaluation and creation of the retrieval system and will be dealt with later in this work. However, it can be clearly stated that it is of great importance to examine the additional elements on a picture, such as texts and diagrams, in order to be able to assess its argumentativeness.

2.3. Argument search in pictures
Image search has been widespread for many years and has been enhanced by several algorithms and technologies. The most common methods are described and summarized by Meharban et al. [19]. The search for argumentative images is much less researched than the search for arguments in text form. One approach for image search is to use a search query extension [11], whereby the index for the elements to be searched was created only on the texts of the pages embedding the images. In a search query, only the words "good" and "anti" were added and matches were searched for in the documents. The results were already at a good level. However, no information from the image itself was used, and the process is based on the assumption that the text used on a website is representative of each image embedded in it. This assumption has to be questioned critically: several images with different content can be embedded on one page, which makes it difficult to assign the texts clearly. This paper aims to take a different approach and to focus exclusively on the picture when assessing argumentativeness. Information from the website should only be used later, when classifying the stance of an argument.

3. Retrieval System
To determine the relevance of an image to a given query, Kiesel et al. [11] distinguish three levels of relevance. An image is considered relevant for the search engine if it is topic-relevant, argumentative, and stance-relevant [11]. Stance relevance refers to the attitude within the discussion.
The tripartite division was adopted for this work and a separate model was developed for each of the three levels. The goal of the retrieval system is to find the top k relevant images for both the "pro" and the "con" side. Figure 2 shows the retrieval system in a simplified form. The starting point is always a query as input. The query and most of the texts processed by the system first go through a preprocessing step. It uses the spaCy (en-core-web-sm) language model [20] to tokenize the text. Subsequently, punctuation marks and stopwords are removed. The remaining tokens are finally lemmatized before they are passed on to the appropriate part of the retrieval model.

A preprocessed query is first entered into the topic model, which calculates the affiliation to a topic for each image in the data set by using a DirichletLM model. According to the studies of Potthast et al., DirichletLM performs significantly better in argument retrieval than TF-IDF and BM25 [21]. The text of the HTML page of the image is used as input.

Figure 2: Overview of the retrieval system with the sketched sequence of the image search via the IDs. Argument and stance models evaluate the images independently. After combining the scores, the pictures are sorted accordingly. The images with the highest score can be considered as pro, and those with the lowest score as con.

Since topic relevance retrieval is not the focus of this work, further information provided by the underlying data set is taken into account. This includes the fact that each image has already been assigned to at least one topic. Additionally, each topic has a handful of example queries. With these and the user query, the topic of the query is determined and a DirichletLM retrieval is performed on the images of the determined topic. It is therefore assumed that the topic model returns topic-related images; no further evaluation of it takes place. As a result, the topic model returns a list of image IDs with a calculated score representing the topic relevance in relation to the query.

The next step is to pass the list of image IDs in parallel to the argument model and the stance model. The argument model calculates a score for the argumentativeness of the respective image. At the same time, the stance model classifies the same images into the classes "pro", "con" and "neutral". Note, however, that only the image IDs that have been classified as either "pro" or "con" are returned by the model, in two separate lists; neutral images are ignored. In some cases, this can lead to an unbalanced retrieval result of the search engine if fewer images are classified for one of the sides. This approach was deliberately chosen because the number of arguments per side gives the user additional value in the search query. In the final step, both the "pro" and "con" lists of the stance model are sorted according to the scores calculated by the argument model. The results are two lists of image IDs in descending order that ideally contain the images which most strongly support or attack the thesis in question. The argument and stance models are discussed in much more detail in the following two chapters.

4. Argument Model
To be able to rank images according to their argumentativeness, each image must be given a score. The higher this score is, the better an image argues in relation to a topic. An argument that is critical and/or supportive results in a high score. The position for which it argues is not considered.
This score makes it possible to search for argumentative images without having to specifically understand their content and assign it to the issue. A diagram for example often has an argumentative supporting character. This can be considered in the argument model and evaluated accordingly. In this case, only the diagram must be recognized, since the expression of the stance is unimportant for the argument model. There are several such features that can be searched for on an image. Whether and how strongly the occurrence and use of these features correlates with the argumentativeness of an image is to be examined later. 4.1. Image Features The first thing to look at are the features that can be obtained from the image alone. All images are in PNG format and have a rather low resolution, since they are downloads of embedded images on web pages. 4.1.1. Color Features Colors can be represented on a computer by the RGB color model. It uses three numerical values between 0 and 255 for the colors red, green and blue. From them, a color is described exactly by means of an additive color model. Each color value was normalized between 0 and 1 by means of the maximum value equal to 255. Average Color The first feature used for the argument score is the average color of an image. For this purpose, all color values of the pixels of an image are averaged. Possibly, a general color mood and also an emotion when looking at the image, can be detected via this. One hypothesis to be tested is that certain colors are used more often in argumentative images than others. For example, the colors red and green, following the colors of a traffic light, could be used to highlight positive and negative elements as indicators [14]. This feature consists of three values, respectively for the red, green and blue values. Dominant Color However, the average color has a decisive disadvantage. If an image has many red and green elements, which could possibly stand for a strong argumentativeness, this cannot be detected by this feature. The average color mixes the colors together to a new color. Thus, in the additive color model, red and green make yellow, making it impossible to distinguish whether the image is yellow, or red and green. One solution is to use dominant colors. Here, the most used colors of an image are considered. The color values of the pixels are grouped and the most used color is output as the dominant color. To avoid grouping by exact colors, the image can optionally be reduced to fewer colors beforehand, creating color intervals, which results in higher accuracy, especially for photographs and color gradients. As a feature, only the first dominant color is used, which in turn consists of the three RGB values. Consequently, the second most used color is no longer considered. The effects of this decision could be analyzed in further studies. Percentage Color In order to investigate the hypothesis that different colors are used more frequently in argumen- tative images than in non-argumentative images, the color proportions are determined. The colors red, green and blue as the three colors of the color model were considered, additionally yellow as a neutral color between red and green was added. The color proportion is determined by examining each pixel color to see whether it lies within the specified color interval of the color to be examined. The interval is necessary because, for example, the color green describes a variety of hues. 
These intervals were specified in the HSV model because the brightness and saturation of a color can be considered independently of the hue. A binary image mask is then created in which a pixel is colored white if its color value lies in the interval and black if not. The ratio of white pixels to the total number of pixels then gives the portion of the color being searched for. The following color intervals were defined (hue, saturation, value):

• Red: (0, 50, 80) to (20, 255, 255) and (160, 50, 80) to (255, 255, 255)
• Green: (36, 50, 80) to (70, 255, 255)
• Blue: (100, 50, 80) to (130, 255, 255)
• Yellow: (20, 50, 80) to (36, 255, 255)

Since the color red lies around the zero point of the hue scale, two color intervals are necessary to capture it entirely. The same procedure can be applied to the brightness of the image. The HSV color model makes it very easy to filter the light and dark color areas. The hypothesis in this context is that light colors might be used more in positive contexts and dark colors more in negative contexts.

• Light: (0, 0, 200) to (255, 60, 255)
• Dark: (0, 0, 0) to (255, 255, 60)

4.1.2. Optical Character Recognition
Optical character recognition (OCR) is a common technique; Google Inc. develops the open-source OCR engine Tesseract [22], which is used in this work to recognize text on images. Each image was checked for recognizable text using Tesseract. It is noticeable that handwriting is not recognized. Standard fonts with high contrast to the background are found best on images. This is problematic for demonstrations with handwritten posters or for photographs with low-contrast fonts. Text in languages other than English is also recognized more poorly. Normally, Tesseract only outputs letters that could be identified with high probability. However, for this work, the threshold for the probability of a match was lowered, which allowed more text to be detected on the images. The output contains many single letters and symbols, making it advisable to filter for words longer than two letters. In addition, the words found were checked against a lexicon, and words not in the English lexicon were filtered out as well.

Text Length
Text length is a feature that corresponds to the number of words found after post-processing. The assumption is that a long text has a higher potential to be argumentative. The text length is normalized with the following function, where x is the number of words found:

textLengthFeature = 1 − e^(−0.01 · x)    (1)

This ensures that a text twice as long does not result in a feature value twice as large. Furthermore, from a text length of about 200 words onwards there is hardly any change and the value approaches one asymptotically, because the subtracted exponential term tends towards zero.

Sentiment Score
Furthermore, the text is passed to a sentiment analysis, which returns a sentiment score between -1 and 1. Sentiment analysis is a text-mining technique that examines a text snippet for a positive or negative stance. Basically, different words are associated with a certain stance; for example, the word "anti" might indicate a negative stance. The average stance of a text is returned as a continuous numerical value, where -1 indicates a negative text and 1 a positive text. In this work, a lexicon-based approach was taken using the VADER lexicon, with image texts evaluated using the NLTK sentiment analysis [23].
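To make the two text features more concrete, the following is a minimal sketch of how they could be computed with NLTK's VADER sentiment analyzer. The function names and the example OCR output are illustrative assumptions, not the authors' actual implementation.

```python
import math

from nltk.sentiment import SentimentIntensityAnalyzer  # requires the "vader_lexicon" resource

def text_length_feature(words: list[str]) -> float:
    """Normalized text length, Eq. (1): 1 - e^(-0.01 * number of words)."""
    return 1.0 - math.exp(-0.01 * len(words))

def sentiment_score(text: str) -> float:
    """Compound VADER sentiment in [-1, 1]; the argument model later uses its absolute value."""
    if not text:
        return 0.0
    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]

# Hypothetical OCR output after post-processing
ocr_words = ["smoking", "causes", "cancer"]
print(text_length_feature(ocr_words))             # ~0.03 for three words
print(abs(sentiment_score(" ".join(ocr_words))))  # absolute sentiment of the image text
```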
However, many images do not include text, or it was not detected. In addition, sometimes individual words can strongly influence the sentiment score, even though the context would suggest the opposite influence. For the argument model, the absolute value of the score was used, since it is irrelevant whether the argument is for the positive or negative side. Text Area In addition to the individual letters, Tesseract also returns the position and size of the text. Two points on the image plane describe the upper left and lower right point of the smallest possible rectangle that can be placed around the font elements. These areas can be added and put into a ratio with the total area of the image. The resulting percentage value describes how much area of the image is taken up by the text (seen in Figure 3). Possibly, the area taken up by the text should be understood as a kind of weighting of the sentiment score. For example, logos or source citations are often included, which are displayed small at the bottom. These may be less important than large headlines. Text Position Furthermore, the text position on the image is examined. The theory behind this is that many images contain text captions or headings, with many texts found on the edges of the images. These texts could possibly be attributed a different meaning than the text found in the middle of the image. This was realized using a heat map, with the image divided into an even 8x8 grid. For each field, the amount of text is determined and the 64 values are stored in a two-dimensional array. This can be seen as an example in Figure 3. Figure 3: (left) Text Area calculated based on the area occupied by the text. (right) Text Position calculated as an example of a heat map over a 4x4 grid. In the work, a 8x8 grid was used. A red color represents a high text content. 4.1.3. Additional features Two additional features were created after an analysis of the data set revealed that many images were graphics and not photographs: Image Type and Diagram Detection. The assumption was made that graphics have a higher potential to be argumentative than photographs. This was said to be due to the more frequent presence of text and diagrams, which form assertion and/or support of the argument [17]. To do this, it must be recognized what type of image is involved. Also, it would be reasonable to assume that multiple and larger diagrams have a higher argumentative character. Image Type A distinction is made between graphics (cartoons, clipart) and photographs. Abd Zaid et al. showed in their work that cartoons generally consist of significantly fewer colors than photographs [24]. This can be used to build a classifier which distinguish between cartoons and photographs. The classifier looks at the ten dominant colors and their image proportion. If these take up more than 30% of the image, it is assumed to be a graphic. Diagram Detection To recognize diagrams, the image is preprocessed in several steps. The procedure was described this way by the user nathancy on the page Stackoverflow.com [25]. First, the image is converted to a binary image by a threshold-value. Now the image consists only of black and white pixels. If the contrast between text and background is high enough, the text is clearly visible. The text can be removed using an extended horizontal kernel. The kernel removes all elements and image-areas, which looks like small horizontal lines, just like a line of text. 
By assuming that the text was written horizontally, all lines of text are removed and colored black. All remaining image elements, i.e. the leftover white areas, are not text; they are combined and extracted. The ratio of the size of these elements to the total image size forms the actual feature. It becomes problematic if several diagrams are contained in the image and are recognized as one by the algorithm: since the smallest possible bounding box is determined, the area portion can have a large error. Also problematic are logos in the corners of the images, which make the bounding box unnecessarily large if additional diagrams are included in the image. Colored diagrams are recognized less well due to the binary filter if they do not differ sufficiently from the background. This filter is necessary, however, because otherwise the kernel cannot recognize the horizontal structures. Overall, the diagram recognition works well enough to add a benefit to the project. However, the percentage value as a feature would imply that a diagram taking up the entire image is best. Because of the described error, the feature value should definitely be 0 for a diagram area share of 100%; the optimal area share is assumed to be 80%. From these requirements, a function can be derived that converts the area share accordingly. For this purpose, a log-normal distribution was used. The variable x is the determined value of the diagram recognition between 0 and 1, and the value range of the function is also between 0 and 1.

4.2. Argument Standard Model
The first attempt was to build a formula from the normalized features whose output is a numerical value describing the argumentativeness of an image. For this purpose, an attempt was made to implement the assumptions made above within the formula. A higher result should be associated with a higher argumentativeness.

argumentScore = α · colorScore + β · textScore + γ · diagramFeature    (2)

The factors α, β, γ describe the influence of the individual scores on the argumentScore. They should lie between 0 and 1, whereby the optimal weighting is to be determined later by an evaluation. The colorScore implements the assumption that light and green colors are more likely to be found on positive images, and dark and red colors on negative images. However, this only applies to photographs, as cliparts contradict the assumption due to their often white background. For this reason, the image type is taken into account when calculating the colorScore. Since the stance of the image does not matter here, the positive and negative assumptions do not need to be distinguished and are therefore added up in the formula.

if Image Type == 'photo':    colorScore = σ · (%green + %red) + (1/σ) · (%bright + %dark)    (3)

σ is the weighting ratio between the color component (red, green) and the brightness (light, dark). In this work, a σ of 0.8 is used, which weights the colors significantly higher.

if Image Type == 'clipArt':    colorScore = %green + %red    (4)

Figure 4: Topology of the used neural network for argument detection.

The textScore describes the assumption that longer texts have a higher potential to be argumentative. Furthermore, the sentiment score is included.

textScore = textLength · |textSentiment|    (5)

The formula thus formed for calculating the argumentScore evaluates images in the value range 0 to 3, where each term can become at most 1. This formula is based on many assumptions and theories that could not be substantiated in this work due to time constraints.
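As an illustration of the standard model, the following is a minimal sketch of Equations (2)–(5), assuming OpenCV and NumPy for the HSV color shares. The function names, the default weights of 1.0 and the example setup are illustrative assumptions rather than the authors' actual implementation; note also that OpenCV's default 8-bit hue range is 0–179, so the intervals from Section 4.1.1 may need rescaling depending on the library used.

```python
import cv2
import numpy as np

# HSV intervals from Section 4.1.1, given as (H, S, V) lower/upper bounds.
HSV_RANGES = {
    "red":    [((0, 50, 80), (20, 255, 255)), ((160, 50, 80), (255, 255, 255))],
    "green":  [((36, 50, 80), (70, 255, 255))],
    "bright": [((0, 0, 200), (255, 60, 255))],
    "dark":   [((0, 0, 0), (255, 255, 60))],
}

def color_share(image_bgr: np.ndarray, name: str) -> float:
    """Fraction of pixels whose HSV value falls into the interval(s) of the given color."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    mask = np.zeros(hsv.shape[:2], dtype=np.uint8)
    for lower, upper in HSV_RANGES[name]:
        mask |= cv2.inRange(hsv, np.array(lower), np.array(upper))
    return float(np.count_nonzero(mask)) / mask.size

def color_score(image_bgr: np.ndarray, image_type: str, sigma: float = 0.8) -> float:
    """Eq. (3) for photos and Eq. (4) for clip art."""
    green_red = color_share(image_bgr, "green") + color_share(image_bgr, "red")
    if image_type == "photo":
        bright_dark = color_share(image_bgr, "bright") + color_share(image_bgr, "dark")
        return sigma * green_red + (1.0 / sigma) * bright_dark
    return green_red  # 'clipArt'

def argument_score(color: float, text_length: float, text_sentiment: float, diagram: float,
                   alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Eq. (2), with the textScore of Eq. (5) computed inline; alpha/beta/gamma are tuned later."""
    text_score = text_length * abs(text_sentiment)
    return alpha * color + beta * text_score + gamma * diagram
```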
Substantiating these assumptions would require an extensive analysis of a data set labeled for this purpose. For this reason, another approach was tested, which is based on a neural network and is described below.

4.3. Argument NeuralNet
The goal in using a neural network is not to have to set the parameters α, β and γ manually. In addition, the network should ignore possibly incorrect assumptions, such as the color assumptions about argumentativeness. It first needs a topology that makes all features reasonably available to the network. The simplest variant is a fully-connected network, in which all neurons of one layer are connected to the neurons of the next layer. The ReLU function serves as the activation function of the neurons. However, it is not advisable to feed all features directly into the input layer: although they are all already normalized, the color features would receive a presumably higher importance due to their number. For this reason, it is advisable to use a combined network in which the color features are reduced in number by an upstream network. In addition, it is natural to analyze the text position by means of a convolutional network. The 64 values of the heat map are reduced to three values by three filters and a small fully-connected layer. The color features are also reduced to three values. Together with the remaining features, these six values are given to a larger fully-connected network (illustrated in Figure 4).

Figure 5: Evaluation of the significance of the input features. If the accuracy of the network is low without a certain feature, this feature has a high influence on the predictive performance of the network. For the baseline, all features are used.

The evaluated images in the data set serve as training data, where a "strong" rating is mapped to a value of 1 and a "weak" rating to 0.5. The rating scale is defined in more detail in chapter 6.1. To prevent overfitting, automatic early stopping was used, which terminates the training as soon as the validation accuracy has not increased for 10 epochs.

Since a total of 10 features are given to the network, it is advisable to evaluate which features actually add value and which do not. Each feature is based on an assumption about the detection of argumentativeness. Since these assumptions are not proven, the evaluation also serves as a hypothesis test. The features were evaluated by training a baseline model that contains all the features presented. Subsequently, each feature was omitted once during training. The accuracy was averaged over 10 trained and evaluated models. The results are shown in Figure 5. It can be seen that the network relies heavily on the Text Length feature, as the accuracy decreases when this feature is left out. Furthermore, it can be seen that the accuracies increase when Percentage blue&yellow, Image Type and Text Position are excluded during training. In terms of the assumptions, this means that argumentativeness does not depend on the yellow and blue content; the same is true for the text position on the image. The argumentativeness of an image, however, seems to depend relatively strongly on the length of the text found on the image. The result of the evaluation is that Percentage blue&yellow, Image Type and Text Position are no longer considered. Excluding Average Color as well was also tried, but the accuracy decreased relative to the baseline when all four features were excluded.
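The combined topology described above could look as follows in Keras. This is only a structural sketch: the input sizes, hidden-layer widths, loss and output activation are assumptions, while the upstream reduction of the color features to three values, the convolutional branch with three filters over the 8x8 heat map, the ReLU activations and the early stopping with a patience of 10 epochs follow the description in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Three inputs: color features, the 8x8 text-position heat map, and the remaining scalar features.
color_in = layers.Input(shape=(10,), name="color_features")   # size is an assumption
heatmap_in = layers.Input(shape=(8, 8, 1), name="text_position")
other_in = layers.Input(shape=(4,), name="other_features")    # size is an assumption

# Upstream sub-network reduces the color features to three values.
color_branch = layers.Dense(3, activation="relu")(color_in)

# Small convolutional branch: three filters, then a small dense layer down to three values.
conv = layers.Conv2D(3, kernel_size=3, activation="relu")(heatmap_in)
conv = layers.Flatten()(conv)
heat_branch = layers.Dense(3, activation="relu")(conv)

# The six reduced values plus the remaining features feed a larger fully-connected network.
merged = layers.Concatenate()([color_branch, heat_branch, other_in])
x = layers.Dense(32, activation="relu")(merged)                # width is an assumption
x = layers.Dense(16, activation="relu")(x)
score = layers.Dense(1, activation="sigmoid", name="argument_score")(x)

model = Model([color_in, heatmap_in, other_in], score)
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])  # loss/metrics are assumptions

# Early stopping as described: stop once validation accuracy has not improved for 10 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10,
                                              restore_best_weights=True)
```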
5. Stance Model
Even though the focus of this work is on the recognition of argumentativeness, an attempt was made to recognize the stance relevance within the discussion as well.

5.1. Stance detection features
An image can be supportive, neutral or opposed towards a question. To recognize this, information about the question itself is required. Since it is difficult to make the content of the question understandable to an algorithm, the following features try to establish a connection between image text and query. However, on many images no text is recognized, or they do not contain any, so that a comparison is not possible and no attitude towards the question can be detected. For this reason, the HTML page of the image was included: the surrounding text was extracted, with special emphasis placed on captions. Thus, both the HTML text and the image text can be compared with the query when computing the features.

Query Equality
The Query Equality describes the equality between the query and the image or HTML text. This is based on the assumption that if the query itself, or parts of it, are found in the text, there is a high correlation between them. This is implemented by a simple term index, for which the occurrences of the preprocessed query terms in the text are counted.

Query Alignment
Similar to the Query Equality, the Query Alignment also searches for matches between query and text. The Needleman-Wunsch algorithm is used to find optimal alignments [26]. Due to the computation time and the sometimes very long HTML texts, this feature was only computed between the query and the image text.

Query Context
However, finding query terms in the text is not enough to make statements about the attitude of the text. Negations, which can reverse a statement with a single word, pose a major problem. The Query Context feature therefore looks at query term occurrences in the text and evaluates the surrounding words within a window of σ words with respect to the sentiment score. σ is to be chosen depending on the text at hand. Since in this work the text was preprocessed, σ is chosen quite low, because filler and stop words are no longer included; a value of σ = 6 is recommended. The assumption underlying the feature is that if a negative sentiment score is found around query term occurrences, there is also a negative correlation between the text excerpt and the query term. For each occurrence in the text, the sentiment score is determined and the average is calculated at the end. This results in a total of 5 features, since Query Equality and Query Context are each calculated between the query and either the image text or the HTML text.
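As an illustration, the Query Context feature could be sketched as follows, reusing NLTK's VADER analyzer and assuming that the preprocessed text is available as a list of tokens; the function name and the example data are hypothetical.

```python
from nltk.sentiment import SentimentIntensityAnalyzer

def query_context(query_terms: list[str], tokens: list[str], sigma: int = 6) -> float:
    """Average compound sentiment of the sigma-word window around each query-term occurrence."""
    analyzer = SentimentIntensityAnalyzer()
    scores = []
    for i, token in enumerate(tokens):
        if token in query_terms:
            window = tokens[max(0, i - sigma): i + sigma + 1]
            scores.append(analyzer.polarity_scores(" ".join(window))["compound"])
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical preprocessed image text and query terms
print(query_context(["nuclear", "energy"],
                    ["nuclear", "energy", "dangerous", "accident", "risk"]))
```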
Figure 6: Topology of the neural network for stance determination.

5.2. Stance NeuralNet
Similar to the argument model, an attempt was first made to determine the attitude of an image with respect to a search query by means of a formula. This formula included the described features from section 5.1 as well as some features from the standard argument model. However, the accuracy of this formula could not be distinguished from that of a random assignment. For this reason, a neural network was also trained here, which has similarities to the argument model. It uses the stance detection features from section 5.1 and some argument model features. The stance NeuralNet model can be seen in Figure 6.

The three classes occur with different frequencies in the data set. Because of this, they were weighted differently during training. The weight of the pro and con classes is the number of neutral images divided by the number of pro/con images, while the neutral class has a weight of 1. These weights ensure that the network does not prefer the neutral class due to its frequency. It should be emphasized that, unlike the argument model, no continuous value is to be predicted: the stance model functions as a classifier. Each output neuron represents one of the three stance expressions of the image, "con", "neutral" and "pro". During training, this information is converted into binary vectors, (0, 0, 1) for "pro", (0, 1, 0) for "neutral", and (1, 0, 0) for "con". In prediction, the largest value of the three neurons is interpreted as the predicted class.

Figure 7: Evaluation of the significance of the input features. If the accuracy of the network is low without a certain feature, this feature has a high influence on the predictive power of the network. For the baseline, all features are used.

An evaluation of the features can also be made for this network. Only the sentiment score of the query seems to be disadvantageous for the network: when this feature is excluded, the accuracy increases significantly relative to the baseline (see Figure 7). Consequently, the feature is not used in the network any further.

6. Evaluation
Following the presentation of the various models and features, they are now evaluated. For this purpose, the labeling process and the resulting outcomes are presented first. Subsequently, both the argument model and the stance model are evaluated and their performance is determined using suitable metrics.

6.1. Data Set
This work is based on a data set with a total of over 23,000 images, divided into 50 different topics. The data set contains both the images and the HTML text in which each image appears, as well as some additional information. In order to be able to train models on this data, the data must first be labeled. For texts, there are ready-made platforms for semantic annotation, such as the INCEpTION platform [27]. In order to also label images in terms of their topic relevance, argumentativeness and stance relevance, a separate web frontend was developed first, as can be seen in Figure 8. This allowed a large portion of the data set to be annotated as efficiently as possible. The recognizable tripartition of the labels follows the considerations of Kiesel et al. [11], whereas the different expressions were adapted to the questions and problems of this work.

Figure 8: HTML frontend for labeling the images of a selected topic. Different topics can be labeled by several users at the same time in terms of their topic relevance, their argumentativeness and their stance relevance.

Below is an explanation of the different labels [11]:

• Topic
  – True: The topic can be recognized from the image (its recognizable content).
  – False: The topic cannot be recognized from the image (its recognizable content).
• Argumentativeness
  – None: There is no argument recognizable in the image that argues for any position in a topic. If the image does not belong to the topic, the argumentativeness must be set to "none".
  – Weak: Few arguments are recognizable in the image and/or the arguments are not clear.
  – Strong: Several arguments are recognizable in the image and/or a clear stance is recognizable in each argument.
• Stance
  – Pro: A clear attitude can be seen throughout the image, which supports the thesis of the topic.
  – Neutral: The picture is not argumentative or arguments are made in equal measure for and against a topic. If the image does not belong to the topic, the stance must be "neutral".
  – Con: A clear attitude can be seen throughout the picture, which attacks the thesis of the topic.

When assigning labels, it is important to note that images that are not topic relevant (topic = false) cannot be argumentative and are assigned a neutral stance. The underlying assumption is that both argumentativeness and stance relevance are directly dependent on topic relevance. For example, an image is labeled "neutral" in terms of stance relevance if the topic reference is missing, even if the image could be "pro" or "con" for another topic. With regard to argumentativeness, a distinction is made between strongly and weakly argumentative images. This is intended to train the model in such a way that strongly argumentative images get a higher score and can therefore ideally be shown first by the search engine. Later in the evaluation, the addition "strong" indicates that only strongly argumentative images are considered.

A total of almost 10,000 images were annotated through the labeling process. There are clear differences between the various topics in the distribution of the label characteristics. Table 1 shows the label results averaged across all annotated images. First of all, it can be deduced from the results that only 72% of the images in the data set that have already been assigned to a topic are actually topic relevant. Since argumentative images must also be topic relevant, the proportion of argumentative images is necessarily lower, at 46%. However, a clear difference can also be seen for the strongly argumentative images: at 14% of the images, these are rather rare in the data set. A clear stance relevance is recognisable in 34% of the images, whereby "con" images are represented less frequently in the data set (14%) than "pro" images (20%). In conclusion, 34% of the images are topic relevant, argumentative and stance-relevant and are therefore relevant for the search engine. When considering only the strongly argumentative images, this proportion is reduced to 13%.

Even more serious are the differences between the various topics. For example, in the topic "Should the penny stay in circulation?" only 9% of the images are argumentative and only 2% of the images have a stance relevance of the type "pro" (see Table 8 in the appendix). These outliers mean that the retrieval system can only find very few relevant images. In a rank-based evaluation, the Precision@k evaluates the performance of the model for the best k results. If k is chosen to be greater than the number of relevant images contained, the performance of the model is underestimated. For some topics, this is already the case from a k of 20.

Category | Percentage in data set
--- | ---
Topic Relevance | 72%
Argumentativeness | 46%
Argumentativeness (Strong) | 14%
Stance Relevance | 34%
Stance Pro | 20%
Stance Con | 14%
Stance Neutral | 66%
Relevant Images | 34%
Relevant Images (Strong) | 13%

Table 1: The analysis shows the ratio of the number of images that fulfil the respective property to the total number of images. All 20 labeled topics are taken into account. The addition "strong" for the argumentative and relevant images means that only the strongly argumentative images are taken into account, whereas without the addition both strongly and weakly argumentative images are meant. Relevant images are all images that are topic relevant, argumentative and have a stance relevance of "pro" or "con". In the best case, these should be displayed by the search engine.
6.2. Evaluation Argument Model
In the following, the argument model is evaluated. A distinction is made between the standard model and the NeuralNet model presented above. All calculated and presented values were cross-validated and averaged over multiple runs to obtain representative data. For the training of the model, topics are needed that contain enough relevant images. Based on the analysis of the labels, skip topics were defined, which are ignored for the training of the models and partly also for the subsequent testing. The analysis of the data set has shown that some topics (the skip topics) degrade the model performance when they are used, because they contain too few argumentative images for the network to train on. There are two separate skip lists for the argument model and the stance model. For the argument model, all topics that do not have at least 20 strongly argumentative images are added to the list and later ignored during training. The stance skip list contains topics that do not have at least 20 "pro" and 20 "con" images. The choice of the value 20 is taken into account in the following subsections when calculating the model performance. There are two topics in the argument skip list. If the images of these topics are removed from the data, the remaining data is called valid.

For the evaluation of the standard model, a separate examination takes place with all and with the valid data. A Precision@20 is calculated in each case. Strong@20 evaluates how many of the best 20 search results were actually labeled as strong. Since the valid data contains at least 20 strongly argumentative images for each topic, in the optimal case these images would also be at the top of the ranking. For Both@20, both strongly and weakly argumentative images are considered correct and are not distinguished.

Precision@k = |{r ∈ I | r ≤ k}| / |I|    (6)

According to Berrendorf et al. [28], the Precision@k describes the proportion of hits, i.e. instances for which the true entity appears among the first k entries of a sorted list. Here r represents the respective rank from the set of individual rank scores I. The presented model assumes that its inputs are topic relevant. Since there is no further check for this, images without topic relevance can also be found in the output. Therefore, only the first k topic relevant images are considered when calculating the Precision@k.

6.2.1. Argument Standard Model
The results in Table 2 show that the model performance decreases slightly when all topics are considered. This is because the topics in the skip list do not contain 20 strongly argumentative images, so the Precision@k for these topics can never become 1. A precision Both@20 of 0.875 means that among the 20 top-rated images, on average 17.5 images are actually strongly or weakly argumentative. The other predicted images were classified as "none" in the labeling process. The predictions can thus be considered acceptable, but not particularly good, since the precision decreases especially with increasing k.

Topics | Strong@20 | Both@20
--- | --- | ---
all | 0.5225 | 0.8475
valid | 0.5472 | 0.8750

Table 2: Results of the argument standard model. To calculate the results, the features described in chapter 4.2 are included in the model with equal weighting. Strong@20 means Precision@k with k=20, whereby only the strongly argumentative images are considered; with Both@20, both strongly and weakly argumentative images are considered. With topics "all", all labeled topics are used, while with topics "valid" all topics excluding the previously defined skip topics are used.
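A minimal sketch of the metric in Equation (6); the variable names and example rank scores are illustrative.

```python
def precision_at_k(rank_scores: list[int], k: int = 20) -> float:
    """Eq. (6): share of individual rank scores r in I that are at most k."""
    if not rank_scores:
        return 0.0
    return len([r for r in rank_scores if r <= k]) / len(rank_scores)

# Hypothetical rank scores I for one topic
print(precision_at_k([1, 3, 7, 25, 40], k=20))  # 3 of 5 ranks are <= 20, so 0.6
```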
Figure 9: Histograms for the argument standard model over a selected sample of valid topics with the distributions of the scores assigned by the model and the calculated Precision@20 values. PBoth@20 shows the determined precision for weakly and strongly argumentative images. With PStrong@20, only strongly argumentative images are considered. ImageCount shows the absolute number of images that are available in the data set for the category ("Strong" or "Both") under consideration.

Figure 9 shows the scores assigned by the standard argument model for two representative topics in the form of histograms. Ideally, strongly argumentative images should be found in the right part of each plot (green), weakly argumentative images in the middle part (red) and non-argumentative images in the left part (blue). Three distributions would thus be visible, spatially separated from each other on the x-axis. The argument standard model does not produce such clearly separated distributions. The right-skewed distributions produce good Precision@20 values, but this performance drops sharply as k is increased.

6.2.2. Argument NeuralNet
For the NeuralNet model, data splits for training and testing must be defined. This is realised in two different ways. On the one hand, there is a split at the image level, where all labeled topics are taken into account and the images of all topics are divided into 2/3 training data and 1/3 test data. This split allows the model to learn the characteristics of each topic. On the other hand, in order to investigate possible overfitting to the given topics, another split is made at the topic level. Here, individual topics are left out of the training and only used for testing. These test topics contain enough argumentative and strongly argumentative images and are taken from the valid topics.

The retrieval system is set up in such a way that individual topics, but not individual images, can be processed in the evaluation. As a result, a correct evaluation on only the test data is not possible in the case of a split at the image level. Since the training data is then also included in the evaluation data, overfitting is difficult to detect. For this reason, the ratio of images serving as training and test data was varied and different models were trained with it. The test data serves the neural network as a validation data set and, with the accuracy calculated on it, determines when the training is stopped in order to avoid overfitting. It was found that from a training share of only 10% of the data onwards, no remarkable changes in model performance can be observed (for further results see Figure 12 in the appendix). Accordingly, this proportion is already sufficient for the model to learn the essential features of argumentative images. If, in the worst case, overfitting on the training data occurred, the test data would be predicted significantly worse, and there would be a change in overall precision when varying the split between the two. Since this is not the case, overfitting on the data can be excluded with sufficient certainty, and an evaluation can also take place with training data contained in the test data.
Data Split | Topics | Strong@20 | Both@20
--- | --- | --- | ---
Image Level | all | 0.5572 | 0.8472
Image Level | valid | 0.5850 | 0.8764
Topic Level | all | 0.5420 | 0.8568
Topic Level | test | 0.5550 | 0.8729

Table 3: Results of the argument NeuralNet model. The table shows the Precision@20 values determined on the one hand for exclusively strongly argumentative images (Strong@20) and on the other hand for strongly and weakly argumentative images (Both@20). A distinction is made as to whether the data split was applied at the image level or at the topic level, and whether all topics, the valid topics or only the test topics were taken into account.

Table 3 shows the results for the NeuralNet model for both an image-level and a topic-level data split. The column Topics describes which data was used for the evaluation. The model was the same for the data split at image and at topic level and was trained with the valid data (skip topics excluded). As with the standard model, the results are slightly better when only the valid topics are considered. Furthermore, it can be seen that the results of the data split at the image level are roughly comparable with those at the topic level. This indicates that no overfitting takes place and that the features of certain topics are not simply memorized. With the split at the topic level, it can be seen that when predicting unknown data (topics = test), approximately the same precision is achieved as when predicting data that has already been partially seen in training (topics = all).

Compared to the standard model, the results of the NeuralNet model in Figure 10 show that separate distributions can be seen for all three classes. The majority of the non-argumentative images are found on the left of the plots and the strongly argumentative images on the right. Thus, good Precision@k results can be expected even for larger k values.

Figure 10: Histograms for the argument NeuralNet model over a selected section of valid topics with the distributions of the scores assigned by the model and the calculated Precision@20 values. The topics shown were not trained by the model, but are used exclusively as test data. PBoth@20 shows the determined precision for weakly and strongly argumentative images. With PStrong@20, only strongly argumentative images are considered. ImageCount shows the absolute number of images that are available in the data set for the category ("Strong" or "Both") under consideration.

6.3. Evaluation Stance Model
Analogous to the argument NeuralNet model, the stance NeuralNet model is now evaluated as well. For this purpose, the valid topics are determined first, whereby seven topics are no longer considered. Accuracy is used as the metric for the stance NeuralNet model, since the images are to be classified into the classes "pro", "con" and "neutral".

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (7)

In this application, TP (true positives) and TN (true negatives) represent the number of images that were correctly classified. The number of incorrectly classified images, FP (false positives) and FN (false negatives), is added in the denominator, so that the total number of all classified images is used there. The quotient gives the accuracy.

Data Split | Topics | Accuracy
--- | --- | ---
Image Level | all | 0.4723
Image Level | valid | 0.4961
Topic Level | all | 0.4812
Topic Level | test | 0.4215

Table 4: Results of the stance NeuralNet model. The table shows the calculated accuracy values. A distinction is made as to whether the data split was applied at the image level or at the topic level, and whether all topics, the valid topics or only the test topics were taken into account.
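To make the evaluation of the classifier concrete, the following is a minimal sketch that combines the prediction rule and class weighting from section 5.2 with the accuracy of Equation (7); all names and example numbers are illustrative.

```python
import numpy as np

CLASSES = ["con", "neutral", "pro"]  # order of the three output neurons

def predict_stance(output_neurons: np.ndarray) -> str:
    """The largest of the three output values is interpreted as the predicted class."""
    return CLASSES[int(np.argmax(output_neurons))]

def accuracy(predicted: list[str], actual: list[str]) -> float:
    """Eq. (7): correctly classified images divided by all classified images."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def class_weights(n_pro: int, n_con: int, n_neutral: int) -> dict[str, float]:
    """Training weights from section 5.2: neutral has weight 1, pro/con are up-weighted
    by how much rarer they are than the neutral class."""
    return {"pro": n_neutral / n_pro, "con": n_neutral / n_con, "neutral": 1.0}

# Hypothetical example
print(predict_stance(np.array([0.1, 0.3, 0.6])))                          # "pro"
print(accuracy(["pro", "con", "neutral"], ["pro", "neutral", "neutral"])) # ~0.67
print(class_weights(n_pro=200, n_con=140, n_neutral=660))
```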
Similar conclusions as for the argument models can be drawn from the results in Table 4 for the stance NeuralNet model. The model performs better with the split at the image level if only valid topics are considered. This is not the case with the split at the topic level: here, the model is 6 percentage points better when all topics are considered than when predicting only unknown topics. This difference could be due to the variance of the trained models. However, it is also conceivable that the model increasingly classifies "neutral", which results in a higher accuracy for the non-valid topics and thus in a noticeable difference on average across all topics. A slight overfitting would also be conceivable.

Figure 11: Confusion matrices for the stance NeuralNet model over a selected section of valid topics with the predicted classes and the calculated accuracy values. The topics shown were not trained by the model, but are used exclusively as test data.

Figure 11 shows the results of the out-of-sample prediction on the test data in the form of confusion matrices. Ideally, a dark blue diagonal from top-left to bottom-right should be visible; this would be the case if the predicted stance corresponded to the actual stance for the majority of the images. This diagonal is not visible in the plots shown. Mostly one or two of the classes "pro", "con" or "neutral" are predicted well, but the neural network is not able to predict all stances of a topic well. It can also be seen that "neutral" is classified more often, which can explain the differences at topic level in Table 4. Overall, the stance model is not able to distinguish "pro" from "con". The features do not seem to make it possible to establish a connection between query and stance. The results are significantly worse than those of the argument model.

6.4. Evaluation of the overall system
First of all, it should be noted once again that the focus of this work was on the argument model. No solution was found to improve the stance model and bring it to a similar quality. Since the overall system is based on both models, the results of the overall system are not satisfactory due to the poor stance model. Table 5 shows that, among the first 20 images returned for a search query, the system provides on average only 5 strongly argumentative images that were also assigned to the correct stance ("pro" or "con"). If the weakly argumentative images are also counted, the value increases to 8 out of 20. A more detailed evaluation of the overall system is considered to make little sense, since the main focus was on the argument model; the stance model is clearly the limiting factor for a better precision here.

Metric | Precision
--- | ---
Strong@20 | 0.2510
Both@20 | 0.4030

Table 5: Results of the overall system consisting of the argument NeuralNet model and the stance NeuralNet model, trained with all topics and a data split at the image level. A Precision@20 was calculated for strongly argumentative images only, as well as for strongly and weakly argumentative images.

6.5. External evaluation with Tira
In addition to our own evaluation, an external evaluation of the approaches via Tira took place. Tira is a software platform that addresses the problem of reproducibility of scientific work, especially for shared tasks [29]. Table 6 and Table 7 show the results for the submitted runs.
The four runs each come from combining one of the two argument models (NeuralNet or standard) with one of the two stance models (NeuralNet or standard).

ID | Tira Timestamp | Models used
--- | --- | ---
1 | 2022-02-25-11-07-15 | NeuralNet argument model and NeuralNet stance model
2 | 2022-02-25-11-49-41 | NeuralNet argument model and standard stance model
3 | 2022-02-25-19-11-54 | standard argument model and NeuralNet stance model
4 | 2022-02-25-09-41-56 | standard argument model and standard stance model

Table 6: The table shows the Tira timestamps and the argument and stance models (standard or NeuralNet) used in each case.

ID | Topic | Argument | Stance | Argument (adj) | Stance (adj)
--- | --- | --- | --- | --- | ---
Minsc - Baseline | 0.736 | 0.686 | 0.407 | 0.932 | 0.553
1 | 0.673 | 0.624 | 0.354 | 0.927 | 0.526
2 | 0.687 | 0.632 | 0.365 | 0.920 | 0.531
3 | 0.664 | 0.609 | 0.344 | 0.917 | 0.518
4 | 0.701 | 0.634 | 0.381 | 0.904 | 0.544

Table 7: Results of the Tira runs compared to the Minsc baseline. The quality of topic relevance, argumentativeness and stance relevance is calculated with a Precision@10. To relate the values to the evaluation shown in the previous subsections, the argumentativeness and stance relevance have to be adjusted by dividing them by the topic precision. This is indicated by the addition "adj".

To measure the performance, a Precision@10 was calculated for topic relevance, argumentativeness, and stance relevance. Comparing only these precision values with the baseline, they are significantly lower than the precision values for the argument and stance model from the previous subsections. This is mainly because the topic model was not evaluated in this work, and therefore only topic relevant images were used for the evaluation of the argument and stance model. For better comparability, the Tira Precision@10 values for the argument and stance model were therefore additionally adjusted by dividing the respective value by the topic precision (e.g., for run 1: 0.624/0.673 ≈ 0.927 and 0.354/0.673 ≈ 0.526). This is possible because the values depend on each other in a chain: an image can only be argumentative if it is also topic relevant, and an image can only have a positive or negative stance if it is also argumentative. Looking at the adjusted values, we can see, as in the previous evaluation, that the NeuralNet argument model performs better than the standard argument model, although the deviations between the different runs are small. Furthermore, with the adjusted precision values, the four runs achieve roughly the same performance as the Minsc baseline. The opposite is evident for the stance models: here, the standard stance model performs better than the NeuralNet model, although the precision can generally be rated as low. Thus, the Tira results also confirm the conclusions of the evaluations in the previous subsections.

7. Conclusion
This work has shown that it is in principle possible to recognise argumentative images. Two feature-based approaches were tested. Both involve information derived exclusively from the images themselves, including colour, image text and structural features such as the recognition of diagrams. The standard argument model, which is based on a formula, achieves worse results in this work than a trained neural network with the same features. This is because too many assumptions have to be made in a single formula to accommodate the complexity of an image. The interaction of colours and text cannot be considered because each feature is added up individually as part of the formula.
The neural network, on the other hand, achieves remarkable results and delivers on average more than 17 matching images among the first 20 results for a search query; more than 10 of these are even strongly argumentative. As the number of considered search results increases, the neural network also performs significantly better than the standard model. It was shown that not all information in the images is equally important for the network. The presence of text seems to be particularly decisive for an argumentative image. This can be seen as positive, since, according to the considerations in Section 2.2, an image can only be argumentative in interaction with text or diagrams. The word count of the recognised text is the most important feature for the argument model. However, this feature is based on the unreliable text recognition of Tesseract, which can recognise neither handwriting nor low-contrast writing. An improvement of this optical character recognition step could therefore also lead to an improvement of the model. In further work, additional features could be integrated and examined for their usefulness; it might, for example, be conceivable to recognise simple symbols that indicate an argumentative character.

Besides the argument model, a further result of this work is the analysis and labeling of the data set provided for the Touché Lab task. It showed that many topics are not suitable for evaluating an argument search engine: only 46% of the images are actually argumentative, and 28% of the images that were indicated as topic-relevant are in fact not topic-relevant. With only 13% of the images being both strongly argumentative and topic-relevant, the data set is of very limited use for training a neural network well and for evaluating this task. A better data set could therefore also lead to better results.

Even though it was not the focus of this work, some attempts were made to create a stance model. However, only slightly less than 50% of the images are correctly classified as "pro", "neutral" or "con". This is probably due to the difficulty of distinguishing "pro" from "con": the neural network was not able to establish the connection between the given query and the image. Even after including the HTML pages of the images, the performance improved only slightly. This classification requires a deep understanding of the question, including negations and other rhetorical devices. For this reason, the overall system cannot achieve good results. On average, only 5 of the first 20 images for a search query are strongly argumentative and assigned to the correct side ("pro" or "con") and thus of great interest to the user. For a good overall system, improvements must therefore primarily be made to the stance model. One possibility would be to use a language model such as BERT [30]. This might make it possible to process the complexities of the language and establish a connection between query and text.
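A possible starting point for such a BERT-based stance model is sketched below (a minimal, hypothetical sketch using the Hugging Face transformers library; the base model, label order, and input pairing are illustrative assumptions, and the classification head would still have to be fine-tuned on stance-labeled data before the predictions become meaningful):

```python
# Hypothetical sketch: encode the query together with the OCR/HTML text of an
# image as a sentence pair and classify the stance into pro / neutral / con.
# The classifier head is randomly initialised here and must be fine-tuned
# on stance-labeled examples before the output is useful.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "bert-base-uncased"          # assumed base model
LABELS = ["pro", "neutral", "con"]        # assumed label order

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))

query = "Is vaping with e-cigarettes safe?"                       # topic 2
image_text = "Study finds severe lung damage in e-cigarette users."  # illustrative

inputs = tokenizer(query, image_text, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])
```

Fine-tuning such a pair classifier on the labeled images and their HTML context might help the model pick up negations and other rhetorical devices that the current feature set cannot capture.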
Acknowledgments

We would like to thank all the people who helped with the evaluation; without them, we would not have been able to use nearly 10,000 labeled images as a data set. We would like to thank Lena, Yasmin, Sören and Roman. Additionally, we would like to thank Theresa Elstner for the detailed and fast review of our paper.

References

[1] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data acquisition for argument search: The args.me corpus, in: KI 2019: Advances in Artificial Intelligence, Springer International Publishing, 2019, pp. 48–59. URL: https://doi.org/10.1007/978-3-030-30179-8_4. doi:10.1007/978-3-030-30179-8_4.
[2] R. Levy, B. Bogin, S. Gretz, R. Aharonov, N. Slonim, Towards an argumentative content search engine using weak supervision, in: COLING, 2018, pp. 2066–2081. URL: https://aclanthology.org/C18-1176/.
[3] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, I. Gurevych, ArgumenText: Searching for arguments in heterogeneous sources, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 21–25. URL: https://aclanthology.org/N18-5005. doi:10.18653/v1/N18-5005.
[4] H. Wachsmuth, M. Potthast, K. A. Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an argument search engine for the web (2017). URL: https://doi.org/10.18653/v1/w17-5106. doi:10.18653/v1/w17-5106.
[5] M. Champagne, A.-V. Pietarinen, Why images cannot be arguments, but moving ones might, Argumentation 34 (2019) 207–236. URL: https://doi.org/10.1007/s10503-019-09484-0. doi:10.1007/s10503-019-09484-0.
[6] J. E. Kjeldsen, The rhetoric of thick representation: How pictures render the importance and strength of an argument salient, Argumentation 29 (2014) 197–215. URL: https://doi.org/10.1007/s10503-014-9342-2. doi:10.1007/s10503-014-9342-2.
[7] Deutscher Bundestag, Wirksamkeit von bildlichen Warnhinweisen auf Zigarettenpackungen (2017). URL: https://www.bundestag.de/resource/blob/511122/8ae51b807ef2d0ebd58e4f4747c4bee7/wd-5-024-17-pdf-data.pdf.
[8] N. Busca, How a horrifying cycling crash set up a battle over safety, 2021. URL: https://www.nytimes.com/2021/01/30/sports/cycling/riders-crashes-uci-safety.html.
[9] H. Wachsmuth, N. Naderi, Y. Hou, Y. Bilu, V. Prabhakaran, T. A. Thijm, G. Hirst, B. Stein, Computational argumentation quality assessment in natural language, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association for Computational Linguistics, 2017. URL: https://doi.org/10.18653/v1/e17-1017. doi:10.18653/v1/e17-1017.
[10] M. Potthast, L. Gienapp, F. Euchner, N. Heilenkötter, N. Weidmann, H. Wachsmuth, B. Stein, M. Hagen, Argument search, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2019. URL: https://doi.org/10.1145/3331184.3331327. doi:10.1145/3331184.3331327.
[11] J. Kiesel, N. Reichenbach, B. Stein, M. Potthast, Image retrieval for arguments using stance-aware query expansion, in: Proceedings of the 8th Workshop on Argument Mining, Association for Computational Linguistics, 2021. URL: https://doi.org/10.18653/v1/2021.argmining-1.4. doi:10.18653/v1/2021.argmining-1.4.
[12] A. Latif, A. Rasheed, U. Sajid, J. Ahmed, N. Ali, N. I. Ratyal, B. Zafar, S. H. Dar, M. Sajid, T. Khalil, Content-based image retrieval and feature extraction: A comprehensive review, Mathematical Problems in Engineering 2019 (2019) 1–21. URL: https://doi.org/10.1155/2019/9658350. doi:10.1155/2019/9658350.
[13] H. Shao, Y. Wu, W. Cui, J. Zhang, Image retrieval based on MPEG-7 dominant color descriptor, in: 2008 The 9th International Conference for Young Computer Scientists, IEEE, 2008. URL: https://doi.org/10.1109/icycs.2008.89. doi:10.1109/icycs.2008.89.
[14] M. Solli, R. Lenz, Color emotions for multi-colored images, Color Research & Application 36 (2011) 210–221. URL: https://doi.org/10.1002/col.20604. doi:10.1002/col.20604.
[15] V. Mokshin, I. Sayfudinov, S. Yudina, L. Sharnin, Object detection in the image using the method of selecting significant structures, International Journal of Engineering & Technology 7 (2018) 1187. URL: https://doi.org/10.14419/ijet.v7i4.38.27759. doi:10.14419/ijet.v7i4.38.27759.
[16] J. K. L., Image classification and object detection algorithm based on convolutional neural network, Science Insights 31 (2019) 85–100. URL: https://doi.org/10.15354/si.19.re117. doi:10.15354/si.19.re117.
[17] D. Fleming, Can pictures be arguments?, Argumentation and Advocacy 33 (1996) 11–22.
[18] J. E. Kjeldsen, Virtues of visual argumentation: How pictures make the importance and strength of an argument salient, 2013.
[19] M. Meharban, D. Priya, A review on image retrieval techniques, Bonfring International Journal of Advances in Image Processing 6 (2016) 07–10. URL: https://doi.org/10.9756/bijaip.8136. doi:10.9756/bijaip.8136.
[20] M. Honnibal, I. Montani, spaCy - industrial-strength natural language processing in Python, 2022. URL: https://spacy.io/.
[21] M. Potthast, L. Gienapp, F. Euchner, N. Heilenkötter, N. Weidmann, H. Wachsmuth, B. Stein, M. Hagen, Argument search: Assessing argument relevance, Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (2019) 1117–1120. doi:10.1145/3331184.3331327.
[22] R. Smith, An overview of the Tesseract OCR engine, in: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, IEEE, 2007. URL: https://doi.org/10.1109/icdar.2007.4376991. doi:10.1109/icdar.2007.4376991.
[23] V. Bonta, N. Kumaresh, N. Janardhan, A comprehensive study on lexicon based approaches for sentiment analysis, Asian Journal of Computer Science and Technology 8 (2019) 1–6. URL: https://doi.org/10.51983/ajcst-2019.8.s2.2037. doi:10.51983/ajcst-2019.8.s2.2037.
[24] M. Zaid, L. George, G. Al-Khafaji, Distinguishing cartoons images from real-life images, International Journal of Advanced Research in Computer Science and Software Engineering 5 (2015) 91–95.
[25] user12526469, nathancy, How to detect diagram region and extract (crop) it from a research paper's image, 2022. URL: https://stackoverflow.com/a/59315026.
[26] R. A. Wagner, M. J. Fischer, The string-to-string correction problem, Journal of the ACM 21 (1974) 168–173. URL: https://doi.org/10.1145/321796.321811. doi:10.1145/321796.321811.
[27] J.-C. Klie, M. Bugert, B. Boullosa, R. Eckart de Castilho, I. Gurevych, The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation, Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations (2018) 5–9. URL: https://aclanthology.org/C18-2002/.
[28] M. Berrendorf, E. Faerman, L. Vermue, V. Tresp, On the ambiguity of rank-based evaluation of entity alignment or link prediction methods, 2020. URL: https://arxiv.org/pdf/2002.06914.
[29] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA integrated research architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

A. Reproducibility

Our complete code base and the results of the labeling process can be found in our GitLab repository.1 We used the dataset from the Touché 2022 Task 3: Image Retrieval for Arguments, which can be found on the official task website.2

1 https://git.informatik.uni-leipzig.de/jb64vyso/aramis-image-argument-search
2 https://webis.de/events/touche-22/shared-task-3.html

B. Complete analysis of the labeled data set

  Topic    Topic        Argumen-     Arg.         Stance       Stance       Relevant     Relevant
  number   Relevance    tativeness   (Strong)     Pro          Con                       (Strong)
  1        0.5 (247)    0.36 (178)   0.11 (52)    0.16 (79)    0.06 (28)    0.22 (106)   0.1 (47)
  2        0.85 (466)   0.73 (400)   0.11 (60)    0.48 (266)   0.17 (91)    0.64 (353)   0.1 (54)
  4        0.64 (294)   0.58 (268)   0.15 (70)    0.05 (22)    0.21 (99)    0.26 (121)   0.15 (68)
  8        0.68 (333)   0.61 (300)   0.14 (68)    0.3 (149)    0.14 (69)    0.44 (217)   0.12 (58)
  9        0.78 (380)   0.21 (103)   0.1 (48)     0.08 (38)    0.08 (41)    0.16 (76)    0.07 (33)
  10       0.79 (348)   0.72 (317)   0.14 (60)    0.53 (231)   0.13 (58)    0.66 (289)   0.12 (53)
  15       0.87 (466)   0.54 (290)   0.18 (97)    0.03 (15)    0.3 (160)    0.32 (173)   0.16 (88)
  20       0.77 (393)   0.34 (171)   0.19 (95)    0.19 (95)    0.06 (30)    0.24 (124)   0.14 (71)
  21       0.34 (157)   0.29 (135)   0.14 (64)    0.18 (83)    0.05 (23)    0.23 (106)   0.12 (58)
  22       0.65 (263)   0.42 (169)   0.05 (19)    0.1 (41)     0.06 (24)    0.16 (65)    0.05 (19)
  27       0.93 (447)   0.88 (421)   0.37 (177)   0.26 (123)   0.34 (165)   0.6 (288)    0.35 (170)
  31       0.76 (402)   0.63 (333)   0.21 (110)   0.45 (241)   0.03 (14)    0.48 (253)   0.18 (94)
  33       0.61 (345)   0.57 (317)   0.22 (121)   0.4 (222)    0.1 (57)     0.5 (279)    0.21 (118)
  36       0.79 (413)   0.14 (75)    0.07 (35)    0.11 (60)    0.02 (10)    0.13 (70)    0.06 (32)
  37       0.79 (367)   0.46 (213)   0.16 (76)    0.01 (6)     0.3 (138)    0.31 (144)   0.15 (72)
  40       0.79 (392)   0.49 (245)   0.09 (45)    0.14 (71)    0.32 (157)   0.46 (227)   0.06 (32)
  43       0.93 (438)   0.26 (121)   0.1 (45)     0.18 (84)    0.04 (17)    0.21 (101)   0.07 (35)
  45       0.48 (174)   0.09 (31)    0.05 (18)    0.02 (7)     0.06 (21)    0.08 (28)    0.04 (16)
  47       0.69 (329)   0.25 (119)   0.09 (41)    0.04 (20)    0.2 (98)     0.2 (98)     0.06 (27)
  48       0.58 (193)   0.56 (186)   0.22 (74)    0.26 (88)    0.05 (15)    0.31 (103)   0.21 (69)

Table 8
Results of the analysis of the data set for all labeled topics. The addition "strong" means that only strongly argumentative images are counted; otherwise, strongly and weakly argumentative images are considered. Relevant images are images which are topic relevant, argumentative, and stance-relevant.

  Topic number   Topic name
  1              Should teachers get tenure?
  2              Is vaping with e-cigarettes safe?
  4              Should corporal punishment be used in schools?
  8              Should abortion be legal?
  9              Should students have to wear school uniforms?
  10             Should any vaccines be required for children?
  15             Should animals be used for scientific or commercial testing?
  20             Is drinking milk healthy for humans?
  21             Is human activity primarily responsible for global climate change?
  22             Is a two-state solution an acceptable solution to the Israeli-Palestinian conflict?
  27             Should more gun control laws be enacted?
  31             Is obesity a disease?
  33             Should people become vegetarian?
  36             Is golf a sport?
  37             Is cell phone radiation safe?
  40             Should the death penalty be allowed?
  43             Should bottled water be banned?
  45             Should the penny stay in circulation?
  47             Is homework beneficial?
  48             Should the voting age be lowered?

Table 9
Assignment of topic numbers to topic names for all labeled topics.

C. Results for different data splits

Figure 12: Precision@20 values for the argument NeuralNet model with a data split at image level; the x-axis shows the proportion of training data. Each data point is the arithmetic mean over the results of 10 trained models. Already from a training share of 10%, no remarkable changes in the model performance can be observed.
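The sweep behind Figure 12 could be reproduced with a loop of the following shape (a minimal, self-contained sketch on synthetic data; the real image features, labels, and the Precision@20 evaluation come from the repository linked in Appendix A, and plain accuracy is used here only as a placeholder score):

```python
# Hypothetical sketch of the Figure 12 experiment: for each training share,
# train 10 models on random subsets at image level and average a score.
# Synthetic data stands in for the real image features and argument labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))                  # stand-in image features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # stand-in argument labels

for train_share in (0.05, 0.1, 0.2, 0.4, 0.6, 0.8):
    scores = []
    for seed in range(10):                       # 10 models per data point
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_share, random_state=seed)
        model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                              random_state=seed).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))   # accuracy as placeholder
    print(f"train share {train_share:.2f}: mean score {np.mean(scores):.3f}")
```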