Boromir at Touché 2022: Combining Natural Language Processing and Machine Learning Techniques for Image Retrieval for Arguments

Notebook for the Touché Lab on Argument Retrieval at CLEF 2022

Thilo Brummerloh, Miriam Louise Carnot, Shirin Lange and Gregor Pfänder
Leipzig University, Augustusplatz 10, 04109 Leipzig, Germany

Abstract
With the frequent information overload when scrolling the web, little information sticks with the reader. In argumentation, images are often used to leave a formative impression. Until now, there has been little research on search engines specifically devoted to finding argumentative images. Argumentative images help the viewer form an opinion by implicitly or explicitly giving an argument that either supports or invalidates a thesis. We built a search engine that assists users in getting an overview of a controversial topic with supporting and opposing images. With this goal in mind, we compare different techniques from the fields of Natural Language Processing and Machine Learning to cluster the images, extract the text from the images, and evaluate the sentiment of the page an image appears on. The best retrieval system uses a BERT model to determine the stance, together with query preprocessing, optical character recognition, and image clustering to detect the image content. Over 50% of the images found by this retrieval system are relevant, argumentative, and assigned to the correct stance according to automatic and manual evaluation.

Keywords
argument retrieval, images, information retrieval, search engines

1. Introduction
The internet contains a vast assortment of opinions and arguments in social media posts, discussion forums, and news pages that are presented, challenged, and evaluated by contributors and readers. However, these textual argumentations are mostly not structured as arguments [1]. Such arguments are commonly expressed verbally, yet it is also possible to present some of them as images, with the argument written on them or conveyed simply through visual communication, e.g., symbolism [2]. A common example of visual arguments are memes, which, among other things, became popular as a method to influence the 2016 presidential primaries in the U.S. [3].

Our research addresses the above-stated problems with the implementation of an image-based argument search engine in the course of Shared Task 3 of Touché 2022. The topic of the task is "Image retrieval to corroborate and strengthen textual arguments and to provide a quick overview of public opinions on controversial topics" [4].

Our work pursues the fulfillment of the evaluation criteria defined by Shared Task 3 (https://webis.de/events/touche-22/shared-task-3.html):

• topic relevance,
• argumentativeness, and
• stance relevance

of the retrieved images regarding the query. These three evaluation criteria represent the assessment of the relevance of an image to a given query that should be considered by an image argument search engine. Topic relevance indicates the extent to which an image matches the content of the entered search query. Argumentativeness states whether the image is suitable for defending a position within the debate on the searched topic. Finally, stance relevance evaluates whether the image supports the stance (pro or con) it is assigned to [5].
For each of the three evaluation criteria, we compare several approaches. To retrieve a high number of topic-relevant images, our focus is on preprocessing the query and on optical character recognition (OCR). OCR is also used to filter the most argumentative images, together with clustering the images according to the type of image (e.g., statistic, text image, image with people). To assign the images to the correct stance, we use sentiment analysis. We assume that the sentiment of the web page text the image appears on correlates with supporting (positive sentiment) or opposing (negative sentiment) a controversial question.

In Section 2, we review related work covering argument search, OCR, image clustering, and sentiment analysis. Section 3 presents our methodology and Section 4 discusses our results. Section 5 summarizes this work and points out the limitations of our approach.

2. Related Work
In this section, we clarify the context of argument search, optical character recognition, image clustering, and sentiment analysis. It should allow the reader to better comprehend the following sections.

2.1. Argument Search
Wachsmuth et al. [6] propose a framework for argument search. An argument is saved with an ID, the URL of the web page it is found on, and the page's full text. The index is created with Apache Lucene using the argument representations as input. Elasticsearch can be used via a REST API as an interface to Apache Lucene to create the index and to search it. To allow fast search responses, Elasticsearch searches an index instead of the whole text. It creates a so-called inverted index with keywords from the document's text using Apache Lucene [7]. To provide results ranked by relevance to the query, Lucene offers several standard ranking functions such as Okapi BM25, which relies on term frequency-inverse document frequency (TF-IDF). With slight adaptations, this framework can be applied to results represented as images.

2.2. Optical Character Recognition
OCR is used to retrieve text from images automatically. Hamad et al. [8] discuss OCR's challenges, important steps in the pipeline, use cases, and historical details. One of the major challenges is called scene complexity, where the intricate image content makes it difficult to distinguish the text from the rest. One widely-used OCR system, called "Rosetta", was developed by Borisyuk et al. [9] in 2018. It is deployed on Facebook and Instagram to evaluate memes in real time and at scale. First, the system detects the text regions of an image using a Faster R-CNN model [10], and afterwards recognizes the letters contained in those regions by applying a CNN model. The authors point out that high-quality training data containing a variety of fonts, styles, and font sizes is very important for training the OCR model. Even though the procedure is described in detail, the source code has not been made available, so we cannot test its performance on our dataset.

Memes are a very popular image type on the internet. They often contain an opinion and take a stance with one side or the other [11]. Beskow et al. [12] worked on characterizing memes and elaborating families of memes. Among other things, they used meme-specific OCR for identifying the text written in the image. The biggest challenge when using OCR algorithms is that most are trained on text with a black font on a white background, which is usually not the case for images from the web. To solve this problem, Beskow et al.
preprocess the images before using Google's Tesseract OCR (https://github.com/tesseract-ocr/tesseract). As our dataset also contains many memes, we take their preprocessing approach as a guideline and also decide to use Google Tesseract.

2.3. Image Clustering
In our work, we use image clustering to group together images of similar types, such as bar plots, pie charts, images of persons demonstrating, or persons doing sport. From these image types we derive the importance of the images as arguments, depending on their type. Omran et al. [13] give a general overview of clustering methods and address some of the most important similarity measures and clustering algorithms. We use k-means clustering, which aims to minimize the intra-cluster distance by adding new data points to the cluster whose centroid has the shortest distance to the new point. Since image data requires some special considerations, Rahmani et al. [14] give a short introduction to how k-means clustering works with image data. Clustering methods for images mostly use image feature vectors as input data, which are created to represent the contents of an image in the form of a multidimensional numerical vector. These vectors can be created in multiple ways, for example using machine learning methods such as VGG16 [15]. VGG16 is a convolutional network designed for image classification. It was among the top-performing methods in the 2014 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), winning the localization track [16]. The network consists of 16 weight layers, of which 13 are convolutional layers with filters of a 3 × 3 receptive field. We use a pre-trained VGG16 model in our clustering approach to create image feature vectors.

Another noteworthy approach for creating image feature vectors is the Scale Invariant Feature Transform (SIFT) by Lowe [17]. This method finds and localizes key points in an image and transforms them into image features. Csurka et al. [18] and Yanai [19] use SIFT to generate a so-called bag-of-keypoints. A bag-of-keypoints can be seen as a histogram of the number of occurrences of particular image patterns in an image, similar to the bag-of-words representation for documents. The bag-of-keypoints can then be used to classify an image with a classifier such as an SVM. The strength of bag-of-keypoints lies in object detection. We decided against a bag-of-keypoints approach because the aim of our clustering is to separate the different types of images (e.g., pie chart, bar plot, persons demonstrating) rather than identifying the objects pictured on them.

In many cases, the scale of the image collection is too large for computing on a standard processor. Liu et al. [20] propose a method that allows nearest neighbor clustering for a large number of images with the help of parallel distributed hybrid spill trees, which are implemented by combining MapReduce operations with an intelligent data partitioning. Even though this is not relevant for the amount of data the Touché task involves, the aspect of scalability must be kept in mind when working with even larger datasets. As an illustration, modern image search engines can contain billions of images.

2.4. Sentiment Analysis
We determine an image's stance towards a topic by performing a sentiment analysis on the associated web page's text. We assume that sentiments reveal whether the image supports or opposes a given query. Nielsen [21] proposes a lexicon-based approach to determine the sentiment of a text.
The proposed lexicon "AFINN" consists of words that are labeled with a certain sentiment score. The latest version of AFINN consists of 2477 unique words (including phrases) that were manually labeled with an integer value between -5 (very negative) and 5 (very positive). The overall sentiment of a text is the sum of the sentiment scores of all words contained in that text.

Another approach to sentiment analysis uses a machine learning classifier with a pre-trained BERT model (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. [22]. BERT is a language representation model that can be customized to a variety of tasks by adding only one extra output layer. Possible applications include question answering and sentiment analysis [23].

3. Methodology
In this section, we describe our approach guided by the three evaluation criteria: topic relevance, argumentativeness, and stance relevance. We start with a summary of the data we used and the workflow. Afterward, we explain the methods used in depth, namely document and query preprocessing, optical character recognition, image clustering, and sentiment analysis. At the end of the chapter, we propose an evaluation method for comparing the implemented techniques.

3.1. Data
The first step for solving Touché Task 3 is to build an understanding of the data. It can be downloaded from https://files.webis.de/corpora/corpora-webis/corpus-touche-image-search-22/. The download contains data for 23,841 images. For each image, there are 12 different files that can be used to develop retrieval methods. The files provide, for example, detailed information about the image itself, the web page on which the image appears, and the Google retrieval for that image. Additionally, with topics.xml and training-qrels.txt, there are two files containing topics and qrels for the entire dataset. As shown in Table 1, we do not need all files but only the following subset in our retrieval approach: image.png to perform OCR and clustering, rankings.jsonl for the evaluation, text.txt for indexing, retrieval, and sentiment analysis, dom.html for sentiment analysis, and topics.xml for evaluation and development.

Table 1: Data used in our project

Name            Content                                                     Usage
One file for each image in the dataset:
image.png       Image in PNG format                                         OCR / Clustering
rankings.jsonl  JSON objects describing a query to Google that retrieved    Evaluation
                the image, including the rank the image got in the Google
                retrieval
text.txt        Text content of the web page with the image                 Indexing / Sentiment / Retrieval
dom.html        HTML source of the web page                                 Sentiment
One file for the entire dataset:
topics.xml      Title, description, and narrative of Touché topics 1 to 50  Evaluation / Development

3.2. Workflow
Our workflow begins by building the index. As can be seen in Figure 1, we use the crawled web pages' titles, their visible text, and the image associated with each web page. The title is fed into a machine learning algorithm to derive a sentiment that can be either positive or negative. The pre-processed text of the web page is also used to calculate a sentiment, though here we use a dictionary-based approach to determine the score of the text. Furthermore, the text is fed directly into the index to execute queries on it. The image of a web page is likewise used twofold. We scan it for text and put any readable text into the index to be queried with a higher weighting than the text of the web page. Additionally, we analyze the image with a neural network trained for image classification and use the output of the network to assign the image to an image type.

Figure 1: Methods to build the index and query it. (The diagram shows the website title feeding a machine-learning-based sentiment analysis, the website text feeding a dictionary-based sentiment analysis and the index, and the image feeding optical character recognition and image clustering; sentiment scores, image text, and image type are added to the index, which is then queried.)
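To make the index construction concrete, the following is a minimal sketch of how one image could be indexed with the fields from Figure 1 using the Elasticsearch Python client. The index name, field names, and the function signature are our own illustrative assumptions, not the exact schema of our system.

```python
from elasticsearch import Elasticsearch


def index_image(es: Elasticsearch, image_id: str, page_text: str, ocr_text: str,
                title_sentiment: str, text_sentiment: float, cluster_id: int) -> None:
    """Index one image with the fields sketched in Figure 1 (names are assumptions)."""
    doc = {
        "page_text": page_text,              # preprocessed visible text of the web page
        "image_text": ocr_text,              # text extracted from the image via OCR
        "title_sentiment": title_sentiment,  # ML-based sentiment label of the page title
        "text_sentiment": text_sentiment,    # dictionary-based sentiment score of the page text
        "cluster": cluster_id,               # image type assigned by k-means clustering
    }
    # elasticsearch-py 8.x uses the `document` keyword; older clients use `body` instead.
    es.index(index="touche-images", id=image_id, document=doc)


if __name__ == "__main__":
    client = Elasticsearch("http://localhost:9200")
    index_image(client, "I0001", "some page text", "text on the image", "positive", 3.0, 10)
```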
3.3. Document and Query Preprocessing
Our preprocessing starts with the text of the web page, as this document is used for building the index. We use the Natural Language Toolkit (NLTK) [24] for the desired adjustments: First, we convert the document to lowercase and remove URLs. We keep only letters and tokenize the text. Next, we lemmatize all tokens and remove tokens consisting of only one letter as well as tokens included in the stop word list of the NLTK library. Finally, we eliminate the tokens that appear only once or twice in the text, which is a method based on Zipf's law [25]. We find this step necessary because many web pages contain additional information such as navigation elements or a footer. Words from those sections do not appear often on the web page, whereas words important to the main topic of the page show up frequently.

In the next step, we pre-process the query. Not all words contained in a query have the same importance for the topic and thereby for the retrieval. Less important words can negatively influence the search: if web pages are found that use only these less important words, those pages might be less relevant to the topic. Therefore, we decide to create our own stop word list with selected words that do not contribute to the statement of the query, e.g., "be", "for", "in". We eliminate these words from the query. Furthermore, we lemmatize the remaining words as we did with the text of the web pages. We believe that the retrieval is more effective when the words in the query and the page text are reduced to their non-inflected form in the same way.

3.4. Optical Character Recognition
Our research team shares the impression that images containing text are in the majority of cases more argumentative than images without text, because the text directly indicates what the image is meant to express. We expect images which contain text from a query to be more relevant than other images. Optical character recognition is used to identify image text. We decide to build our pipeline using Google's Tesseract OCR [26], as it is one of the most popular freely available OCR engines and is widely used [27][12]. The implemented text identification pipeline is depicted in Figure 2. First, we binarize the image, as OCR techniques work better on black-and-white pictures [12]. Afterward, we extract the text using Tesseract. It often occurs that random symbols and letters are extracted from parts of the image where there is no text. Therefore, we decide to keep only letters and to check for each word of the extracted text whether it is included in an English dictionary.

Figure 2: The constructed OCR pipeline (image → binarize image → extract text from image using Tesseract → keep only letters → keep only words that can be found in an English dictionary → image text).

There are clear limitations for handwritten text, e.g., on images containing demonstration posters. However, our goal is not to identify the text of each image perfectly but to get an impression of whether OCR can generally improve image retrieval for arguments.
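As an illustration of the document preprocessing of Section 3.3 and the OCR pipeline of Figure 2, the following sketch combines NLTK, Pillow, and pytesseract. It is a minimal sketch under our own assumptions, using a fixed binarization threshold and NLTK's word corpus as a stand-in for "an English dictionary"; it is not the exact code of our system.

```python
import re
from collections import Counter

import pytesseract                      # requires a local Tesseract installation
from PIL import Image
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Assumes nltk.download() was run for "punkt", "stopwords", "wordnet", and "words".
LEMMATIZER = WordNetLemmatizer()
STOPWORDS = set(stopwords.words("english"))
ENGLISH_WORDS = {w.lower() for w in words.words()}   # stand-in for an English dictionary


def preprocess_page_text(text: str) -> list[str]:
    """Document preprocessing of Section 3.3: lowercase, strip URLs, keep letters,
    tokenize, lemmatize, drop stop words and one-letter tokens, drop rare tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)                # keep only letters
    tokens = [LEMMATIZER.lemmatize(t) for t in word_tokenize(text)]
    tokens = [t for t in tokens if len(t) > 1 and t not in STOPWORDS]
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] > 2]          # drop tokens seen only once or twice


def extract_image_text(image_path: str, threshold: int = 128) -> str:
    """OCR pipeline of Figure 2: binarize, run Tesseract, keep letters,
    keep only words found in the (stand-in) English dictionary."""
    grey = Image.open(image_path).convert("L")
    binary = grey.point(lambda p: 255 if p > threshold else 0)
    raw = pytesseract.image_to_string(binary)
    letters_only = re.sub(r"[^A-Za-z\s]", " ", raw)
    kept = [w for w in letters_only.split() if w.lower() in ENGLISH_WORDS]
    return " ".join(kept)
```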
It is possible to use commercial OCR engines that claim to achieve more accurate results, or to train neural networks for text extraction. The latter would be much more time-consuming and is an ongoing field of research in its own right. Based on our results, Tesseract OCR improves the retrieval, and it would be interesting to find out in the future whether different OCR systems could improve it even further.

After extracting the text of each image, we add it to the index. For each query, we retrieve images based on the web page text and the text extracted from the images via OCR. The image text gets a boost of five, i.e., its score is multiplied by five. We chose five after trying different values. For each image, we compare the boosted image text score with the page text score and use the higher one for the retrieval.

3.5. Image Clustering
Another approach to increase the argumentativeness of the results is to categorize images into classes of different image types. There are plots, memes, landscape photos, portraits, and many more possible classes. The idea behind using image clustering is that some of the classes are more argumentative than others. Therefore, images of these classes should receive more attention than images of less argumentative classes. This can be illustrated with the exemplary topic "Should people become vegetarian?". Figure 3 (a) shows a bar plot of the protein intake of vegetarians and non-vegetarians. Images like this can easily be used to argue for or against a topic. In comparison, Figure 3 (b) shows different kinds of fruits and vegetables in the shape of a heart. Compared to the argumentative benefit of a bar plot and other statistics, the argumentativeness of such a symbolic image is limited, because it does not provide additional facts with which to support a textual argument.

Figure 3: Retrieval candidates for the topic "Should people become vegetarian?": (a) bar plot of protein intake by diet (very argumentative); (b) image of fruits and vegetables in the shape of a heart (not as argumentative as Figure 3 (a)).

Our goal is to find clusters containing images with the same level of argumentativeness. We do this by using an image clustering method based on feature vectors and k-means clustering (cf. https://towardsdatascience.com/how-to-cluster-images-based-on-visual-similarity-cd6e7209fe34). The processing is as follows (a code sketch of these steps is given below):

1. Load a pre-trained VGG16 model for image classification.
2. Cut off the original output layer and use the model to calculate feature vectors, which represent the contents of the images as numerical vectors.
3. Reduce the dimension of the feature vectors with Principal Component Analysis (PCA) for faster computing.
4. Make an elbow plot for k-means clustering to find possible numbers of clusters.
5. Run k-means clustering with the chosen number of clusters.

Figure 4 shows the elbow plot from which we derive the number of clusters to be found by k-means. The plot shows the Within-Cluster Sum of Squared Errors (WCSS) when using different numbers of clusters [28]. In theory, the optimal number of clusters is the number at which the WCSS stops decreasing significantly with the addition of one more cluster. This point is usually recognizable by a clear inflection point in the displayed curve. Our plot shows no clear indication of one optimal number of clusters, but because of the flattening of the curve in the range between 10 and 20, anywhere between 10 and 20 clusters seems reasonable.
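The sketch referenced in the list above could look as follows with Keras and scikit-learn. The choice of the "fc2" layer as feature output, the PCA dimensionality, the range of candidate cluster numbers, and the default of 14 clusters mirror our description but are otherwise illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image as keras_image

# Pre-trained VGG16 with the classification layer cut off: the second fully
# connected layer ("fc2") serves as a 4096-dimensional feature vector.
vgg = VGG16(weights="imagenet", include_top=True)
feature_model = Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)


def feature_vector(image_path: str) -> np.ndarray:
    """Turn one image into a VGG16 feature vector."""
    img = keras_image.load_img(image_path, target_size=(224, 224))
    batch = preprocess_input(np.expand_dims(keras_image.img_to_array(img), axis=0))
    return feature_model.predict(batch, verbose=0)[0]


def cluster_images(image_paths: list[str], n_clusters: int = 14) -> np.ndarray:
    """VGG16 features -> PCA -> elbow data -> k-means labels (steps 1-5 above)."""
    features = np.stack([feature_vector(p) for p in image_paths])
    reduced = PCA(n_components=min(100, len(image_paths))).fit_transform(features)

    # Elbow plot data: within-cluster sum of squared errors (inertia) per candidate k.
    wcss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(reduced).inertia_
            for k in range(2, 21)}
    print(wcss)  # plot these values against k to locate the elbow

    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reduced)
```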
To find the final number, we need a deeper inspection of the differences in the clustering results when choosing different numbers of clusters. We repeat the k-means clustering for all possible numbers of clusters between 10 and 20. Then, for each candidate number, we look at examples of images that are assigned to the different clusters and compare them to the other results. After manually inspecting all possibilities between 10 and 20 clusters, we found 14 clusters to be a good basis for our image clustering.

Figure 4: The elbow method graph indicates that the optimal number of clusters of image types in the argumentative image dataset is between 10 and 20. After manually inspecting the clusters, we decided on 14 (red).

After the identification of the clusters, we define a weight for each cluster based on the level of argumentativeness of the image type the cluster represents. A strength of this approach is that the determination of the weights is variable. However, it is difficult to find an objective heuristic for determining a cluster's weight, because the perceived level of argumentativeness of each cluster is subjective. In our case, we assign each cluster a weight between 1 (lowest level of argumentativeness) and 5 (highest level of argumentativeness). Table 2 gives an overview of the clusters and the weights we use in our retrieval system. It shows that we give the highest weight of 5.0 to the clusters 5, 8, 10, and 13, which contain a high amount of statistical graphics or other graphics with text on them. We give a weight of 3.0 to the clusters 3, 6, 9, and 12, which contain cartoons, memes, or other graphics with text. Cluster 7 is the only one with a weight of 2.0 and contains photos of protesters with posters. The remaining clusters 0, 1, 2, 4, and 11 get a weight of 1.0 and mostly contain some sort of photo without text.

Table 2: Identified clusters by k-means, the weights specified by us, and the number N of images within each cluster

Cluster  Description                                                          Weight  N
0        photos with round objects (e.g., pills, coins, bottle caps)          1.0     1,886
1        mostly photos but no real context identified                         1.0     2,027
2        photos of objects people hold in their hands (e.g., guns, vapes,     1.0     1,631
         syringes)
3        cartoons, cartoon-like memes, and maps                               3.0     2,216
4        photos of groups of people (e.g., athletes, police, protesters)      1.0     1,452
5        graphics with text (e.g., memes, quotes, Twitter posts)              5.0     860
6        statistical graphics (e.g., horizontal bar plots, line plots,        3.0     1,465
         scatter plots) and thesis covers
7        photos of protesters with posters                                    2.0     2,012
8        graphics with round forms and text (e.g., pie charts)                5.0     2,155
9        memes with faces as a component                                      3.0     1,458
10       statistical graphics similar to cluster 6 but with better quality    5.0     2,384
         (e.g., bar plots, tables, line plots)
11       photos of children (doing homework, getting vaccinated, eating in    1.0     1,500
         the canteen)
12       graphics with text (e.g., newspaper headlines, quotes, information   3.0     1,309
         graphics)
13       statistical plots (bar plots and line plots)                         5.0     1,485

The information about which cluster each image belongs to is added to the index. When retrieving images, the weight of each cluster is incorporated into the query in such a way that it is multiplied with the relevance score of the retrieval. Hence, the method influences the relevance ranking of an image depending on the cluster the image belongs to.

3.6. Sentiment Analysis
Our approach to stance classification assumes an underlying sentiment of the web page's content that can be used to determine whether the web page supports or opposes a query. In a first experiment, we use dictionary-based sentiment analysis on the whole text content of the web pages embedding the images. In a second experiment, we use a BERT-based method to label the titles of web pages that embed the images with positive and negative sentiments.

Dictionary-Based Approach: We use the complete textual content of the web page from the text.txt file, compare each word with the AFINN dictionary, and get a corresponding score that is either positive (from 1 to 5), negative (from -1 to -5), or neutral (0). The sum of the scores of all words gives the overall score of a page's content. We assume that a positive score represents a pro stance, a negative score a contra stance, and zero no stance. We add the scores to the index and modify the query to give two result sets, one for positive scores (higher than zero) as the pro side and one for negative scores (lower than zero) as the con side. Images with a score of zero are not considered in the results.

Machine-Learning-Based Approach: For this approach, we fine-tune a pre-trained BERT model with movie reviews from the Internet Movie Database (IMDB) to specialize the model on sentiment [29]. This model accepts input phrases of up to 512 words only. Because the entire text of a web page usually exceeds this limit, we use the web page's title to determine its sentiment. The assumption is that the title represents a web page's content and thus its sentiment. The BERT model we use is the BERT-base-uncased model from Hugging Face (cf. https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671).
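A minimal sketch of both stance-scoring variants is given below. It uses the afinn package for the dictionary-based score and a Hugging Face sentiment-analysis pipeline as a stand-in for our fine-tuned BERT-base-uncased model; the default pipeline checkpoint, its label names, and the pro/con mapping shown here are assumptions for illustration.

```python
from afinn import Afinn
from transformers import pipeline

afinn = Afinn()
# Stand-in for the BERT-base-uncased model fine-tuned on IMDB reviews described above;
# the pipeline's default checkpoint and its POSITIVE/NEGATIVE labels are assumptions.
bert_sentiment = pipeline("sentiment-analysis")


def dictionary_stance(page_text: str) -> str:
    """Dictionary-based approach: sum of AFINN word scores over the page text."""
    score = afinn.score(page_text)
    if score > 0:
        return "pro"
    if score < 0:
        return "con"
    return "none"   # score of zero: the image is not considered in the results


def ml_stance(page_title: str) -> str:
    """Machine-learning-based approach: the sentiment of the page title decides the stance.
    Titles are short, so the 512-token input limit is rarely an issue here."""
    label = bert_sentiment(page_title)[0]["label"]
    return "pro" if label == "POSITIVE" else "con"


if __name__ == "__main__":
    print(dictionary_stance("This is a wonderful, convincing argument."))  # pro
    print(ml_stance("Why bottled water bans are a terrible idea"))         # likely con
```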
3.7. Evaluation Method
Corresponding to the three evaluation criteria (topic relevance, argumentativeness, and stance), our evaluation method is threefold. The easiest case is evaluating whether an image is relevant for the topic. The dataset is provided by the organizers of the shared task. They obtained the images by conducting Google Image searches with all topics from the provided topics.xml file as queries. To verify whether a retrieved image fits the entered query, we check from which Google Image search query it was obtained. This information is given in the rankings.jsonl file. Our retrieval systems retrieve ten images supporting and ten images opposing a query. We perform the retrieval for the 50 given topics and calculate the percentage of fitting images based on their ranking in the Google Image search results for the same query.
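The automatic topic-relevance check described above could be sketched as follows; the rankings.jsonl field name "query" and the example path are assumptions about the corpus format rather than its documented schema.

```python
import json


def google_queries(rankings_path: str) -> set[str]:
    """Collect the Google queries that retrieved this image from its rankings.jsonl
    (one JSON object per line). The field name "query" is an assumption."""
    queries = set()
    with open(rankings_path, encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            queries.add(record.get("query", "").strip().lower())
    return queries


def is_topic_relevant(topic_title: str, rankings_path: str) -> bool:
    """An image counts as topic-relevant if Google returned it for the same topic query."""
    return topic_title.strip().lower() in google_queries(rankings_path)


if __name__ == "__main__":
    # The path below is illustrative only.
    print(is_topic_relevant("Should people become vegetarian?", "corpus/I0001/rankings.jsonl"))
```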
There is no information given on whether an image is argumentative or whether it has the correct stance. Therefore, we evaluate these two evaluation criteria manually. We pick five topics to evaluate. The team members evaluate the retrieved images for the five topics independently. For each run, we indicate how many of the images on the pro side are argumentative and how many of those are on the correct side. We repeat the same procedure for the con side. We then form the average across all topics and individual ratings and calculate the percentages of retrieved images that are argumentative and stance relevant, respectively.

To get an idea of our overall consensus in the evaluation, we calculate several statistics used for ratio data and more than two raters: Fleiss' kappa, Krippendorff's alpha, and Gwet's coefficients AC1 and AC2 [30]. Table 3 shows the calculated coefficients. The percent agreement increases from left (Fleiss' kappa with 5.9%) to right (AC2 with 84.9%). The general form of the chance-corrected percent agreement is shown in (1) and can be summarized as the actual agreement that was not caused by chance divided by the perfect agreement not caused by chance:

    (A_w - A_c) / (1 - A_c)    (1)

where A_w is the weighted percent agreement and A_c is the percent chance agreement.

Table 3: Inter-annotator agreement

Coefficient name      Fleiss' kappa    Krippendorff's alpha  AC1 (Gwet's),     AC2 (Gwet's),
                                                             identity weights  quadratic weights
Coefficient value     5.9%             10.6%                 13.2%             84.9%
Confidence interval   (-0.265, 0.384)  (-0.218, 0.431)       (-0.066, 0.33)    (0.613, 1)
p-value               0.638            0.414                 0.137             0.00057

It is known that Fleiss' kappa, Krippendorff's alpha, and AC1 "overstate the percent chance agreement", resulting in an understated percent agreement [30]. We can also see that the p-values of those methods are higher than a 5% alpha level, while the p-value of AC2, at 0.00057, is even lower than a 1% alpha level. That is why we assume an 84.9% agreement within our evaluation.

4. Results
The following chapter compares the performance of the implemented methods. We aim to find out which methods work best together, considering the three evaluation criteria and the overall performance. We opt for six different retrieval systems, which we compare to the method proposed by Kiesel et al. [5]. We replicate their approach, which expands the query with the term "pro" for the supporting side and the term "anti" for the opposing side. This retrieval system is denoted with the number 0 in the following graphics. System number 1 uses only the dictionary-based sentiment analysis for stance allocation. The second system extends system 1 by the use of OCR, whereas number 3 uses image clustering on top of the dictionary-based sentiment analysis. System number 4 combines the same sentiment analysis, the OCR, and the image clustering. Additionally, system number 5 adds query preprocessing. For retrieval system number 6, we use the same components as number 5 but exchange the dictionary-based sentiment analysis for the machine-learning-based sentiment analysis.

We deployed our systems on the TIRA platform that was set up by the workshop organizers [31]. There, we initially submitted five different approaches, which can be recognized by their timestamps in Table 4; we include a short description of every system. A sixth system was submitted that we did not anticipate to perform as well as it did. This sixth combination of methods ultimately performed best of all in Touché Task 3. We did not analyze this system further because we did not expect it to perform better without the clustering.

Table 4: Systems submitted as runs on the TIRA platform

No.  Description                                             Timestamp
1    Afinn-Sentiment, OCR                                    2022-02-26-21-13-41
2    Afinn-Sentiment, Clustering                             2022-05-03-07-54-56
3    Afinn-Sentiment, OCR, Clustering                        2022-02-26-21-45-32
4    Afinn-Sentiment, OCR, Clustering, Query Preprocessing   2022-02-26-21-59-50
5    Afinn-Sentiment, OCR, Clustering, Query Preprocessing   2022-02-27-18-02-37
6    Afinn-Sentiment, OCR, Query Preprocessing               2022-06-17-21-01-29

Figure 5 reveals the results of the seven retrieval systems regarding the topic relevance of the retrieved images. Only the replicated system 0 obtains less than 80% topic-relevant images. The retrieval systems 1, 2, 3, and 5 perform best, with over 84%.
However, the distances are not large. Retrieval system 6 performs better for the con side than any other retrieval system.

Figure 5: Percentage of topic-relevant images retrieved by the different retrieval systems for the pro (blue) and the contra (red) side. (0 - query expansion, 1 - dictionary-based sentiment analysis, 2 - dictionary-based sentiment analysis + OCR, 3 - dictionary-based sentiment analysis + Image Clustering, 4 - dictionary-based sentiment analysis + OCR + Image Clustering, 5 - dictionary-based sentiment analysis + OCR + Image Clustering + Query Preprocessing, 6 - machine-learning-based sentiment analysis + OCR + Image Clustering + Query Preprocessing)

The next evaluation criterion is to retrieve argumentative images. Figure 6 shows how well the retrieval systems perform regarding this criterion. It is important to note that, in our view, only topic-relevant images can also be argumentative for that topic. This means that the reached percentage of topic-relevant images corresponds to 100% in the evaluation of the argumentativeness. By doing so, we can assess the evaluation criteria independently from each other and find out which retrieval systems are best regarding each evaluation criterion separately. The retrieval systems' performances for retrieving argumentative images are far apart. As expected, the retrieval systems 0 and 1 perform poorly; both of them are based on techniques made for boosting the topic relevance and not the argumentativeness. The more techniques we add to the retrieval systems, the better the results. The retrieval systems 4 and 6 perform best. We record an improvement from 64% with retrieval system 0 to 89% with system 6, a clear sign that the applied methods have a positive effect on retrieving argumentative images.

Figure 6: Percentage of argumentative images retrieved by the different retrieval systems for the pro (blue) and the con (red) side. (0 - query expansion, 1 - dictionary-based sentiment analysis, 2 - dictionary-based sentiment analysis + OCR, 3 - dictionary-based sentiment analysis + Image Clustering, 4 - dictionary-based sentiment analysis + OCR + Image Clustering, 5 - dictionary-based sentiment analysis + OCR + Image Clustering + Query Preprocessing, 6 - machine-learning-based sentiment analysis + OCR + Image Clustering + Query Preprocessing)

The most difficult evaluation criterion to achieve is assigning the correct stance (pro or contra) to the images. As can be observed in Figure 7, all retrieval systems struggle to figure out whether an image supports the query or not. On the pro side, the retrieval systems do similarly well, with about four out of five correct assignments. But the images on the contra side often do not oppose the query. Retrieval system 6, using the machine-learning-based sentiment analysis, managed to classify a majority of contra images correctly.

Figure 7: Percentage of correctly classified images retrieved by the different retrieval systems for the pro (blue) and the con (red) side. (0 - query expansion, 1 - dictionary-based sentiment analysis, 2 - dictionary-based sentiment analysis + OCR, 3 - dictionary-based sentiment analysis + Image Clustering, 4 - dictionary-based sentiment analysis + OCR + Image Clustering, 5 - dictionary-based sentiment analysis + OCR + Image Clustering + Query Preprocessing, 6 - machine-learning-based sentiment analysis + OCR + Image Clustering + Query Preprocessing)
We accomplish an improvement from 48% with system 0 to 71% using system 6.

Regarding all three evaluation criteria, we obtain the results depicted in Figure 8. To understand the plot, it is important to know that only images that correspond to the topic can be argumentative for that topic, and only images that are argumentative for a topic can be assigned a correct side. Images in the red part at the bottom of the plot achieve all three evaluation criteria, whereas the brown part in the middle is argumentative but does not have the correct stance. The beige top part contains images that match the topic but are not argumentative. Retrieval systems 0 and 1, which use only the query expansion and the dictionary-based sentiment analysis, respectively, perform weakest, with around 26% and 21% of images fulfilling all evaluation criteria. The three following retrieval systems 2, 3, and 4 score 39%, 39%, and 42%. Finally, the best results are produced by retrieval systems 5 with 48% and 6 with 52%. Comparing the replicated system designed by Kiesel et al. [5] to our best retrieval system, we achieve an improvement of the retrieval performance by 26 percentage points, which doubles the result of system 0.

Figure 8: Percentage of images that meet all three evaluation criteria (pink) retrieved by the different retrieval systems. (0 - query expansion, 1 - dictionary-based sentiment analysis, 2 - dictionary-based sentiment analysis + OCR, 3 - dictionary-based sentiment analysis + Image Clustering, 4 - dictionary-based sentiment analysis + OCR + Image Clustering, 5 - dictionary-based sentiment analysis + OCR + Image Clustering + Query Preprocessing, 6 - machine-learning-based sentiment analysis + OCR + Image Clustering + Query Preprocessing)

5. Conclusion
Intending to retrieve argumentative images supporting or opposing an entered query, we built six retrieval systems using Elasticsearch. The implemented retrieval systems use different combinations of the following techniques: query preprocessing, optical character recognition, image clustering, and dictionary-based and machine-learning-based sentiment analysis. All six retrieval systems were evaluated regarding topic relevance, argumentativeness, and stance relevance of the retrieved images. The last two evaluation criteria were evaluated manually by four independent annotators. Our best retrieval system used a combination of all mentioned techniques except the dictionary-based sentiment analysis. We compared our retrieval systems to the only search engine we found that was implemented for this sort of image retrieval, designed by Kiesel et al. [5]. We replicated their approach; our best retrieval system improves on it by 26 percentage points regarding all three evaluation criteria.

5.1. Limitations
Nevertheless, our approach is constrained by some limitations. We observed that several images were informative and thus useful for building an opinion on the topic but did not clearly represent either side (pro or con). This is often the case with statistics showing the results of a survey: depending on the viewer's interpretation, the image can support the pro or the con side. As an example, Figure 9 presents people's opinions on abortion depending on the time of the abortion and the political orientation of the respondents. The plot opens up different interpretation possibilities. In particular, the viewer's political opinion may influence whether the plot is interpreted as supporting or opposing abortion.
Nevertheless, this graph can contribute to answering the question of whether abortion should be legal, as it contains a lot of information.

Figure 9: Legality of abortion depending on the time of abortion and the political opinion of the respondents.

Another observation was that, even though the retrieval systems work well on many topics, some topics do not have as many supporting arguments in the form of images as opposing ones, or the other way around. An example of this is the ban on bottled water. Many arguments support the ban, such as environmental friendliness and the equal quality of tap and bottled water, while the opposing side does not reveal any relevant arguments. We conclude that not all arguments are equally represented on the internet and thus in our database and results. This limits the representativeness of an argument search engine to topics and opinions that are frequently discussed with a diversity of arguments. Also, more annotators would improve the significance of the results because the agreement would be more reliable. A more diverse group of annotators could help represent the view of a bigger part of society.

5.2. Future Work
In the future, we want to add more approaches to our list and work on the limitations. One idea is to train convolutional neural networks to perform the OCR to improve text identification. Another idea is to analyze an image's colors and their distribution to find out how argumentative the image is and which stance it supports. We want to introduce an additional stance category besides pro and con that contains informative images that cannot be assigned to either pro or con. We would also like to have more people conduct the manual evaluation to increase the validity of the results. We would expect the retrieval performance to improve further by adding synonyms to the query. Thereby, we could also retrieve images that appear on web pages that treat the same topic but use different words. Instead of applying sentiment analysis only to the title or the entire web page, we want to apply it also to the image text or to an excerpt of the web page that is of higher importance to the image. Also, we would like to find out whether the retrieval performance could be enhanced by classifying the images into predefined classes instead of clustering them.

Acknowledgments
We would like to thank the Webis research group for giving helpful advice and always being available for upcoming questions. A special thanks goes to Theresa Elstner, who took time for us every week to discuss the current status.

References
[1] I. Rahwan, F. Zablith, C. Reed, Laying the foundations for a world wide argument web, Artificial Intelligence 171 (2007) 897–921. doi:10.1016/j.artint.2007.04.015.
[2] J. A. Blair, The possibility and actuality of visual arguments, in: Groundwork in the Theory of Argumentation, Springer, Dordrecht, 2012, pp. 205–223. doi:10.1007/978-94-007-2363-4_16.
[3] B. Heiskanen, Meme-ing electoral participation, European Journal of American Studies 12 (2017). doi:10.4000/ejas.12158.
[4] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument retrieval, in: Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022), Springer, 2022.
[5] J. Kiesel, N. Reichenbach, B. Stein, M. Potthast, Image retrieval for arguments using stance-aware query expansion, in: K. Al-Khatib, Y. Hou, M.
Stede (Eds.), 8th Workshop on Argument Mining (ArgMining 2021) at EMNLP, Association for Computational Linguistics, 2021, pp. 36–45. URL: https://aclanthology.org/2021.argmining-1.4/. doi:10.18653/v1/2021.argmining-1.4.
[6] H. Wachsmuth, M. Potthast, K. Al Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an argument search engine for the web, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, 2017, pp. 49–59. doi:10.18653/v1/w17-5106.
[7] M. S. Divya, S. K. Goyal, Elasticsearch: An advanced and quick search technique to handle voluminous data, Compusoft, An International Journal of Advanced Computer Technology 2 (2013) 171–175.
[8] K. A. Hamad, M. Kaya, A detailed analysis of optical character recognition technology, International Journal of Applied Mathematics Electronics and Computers (2016) 244–249.
[9] F. Borisyuk, A. Gordo, V. Sivakumar, Rosetta: Large scale system for text detection and recognition in images, Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018) 71–79.
[10] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).
[11] L. Shifman, Memes in a digital world: Reconciling with a conceptual troublemaker, Journal of Computer-Mediated Communication 18 (2013) 362–377.
[12] D. M. Beskow, S. Kumar, K. M. Carley, The evolution of political memes: Detecting and characterizing internet memes with multi-modal deep learning, Information Processing & Management 57 (2020). doi:10.1016/j.ipm.2019.102170.
[13] M. G. Omran, A. P. Engelbrecht, A. Salman, An overview of clustering methods, Intelligent Data Analysis 11 (2007) 583–605. doi:10.3233/IDA-2007-11602.
[14] M. K. I. Rahmani, N. Pal, K. Arora, Clustering of image data using k-means and fuzzy k-means, International Journal of Advanced Computer Science and Applications 5 (2014) 160–163. doi:10.14569/IJACSA.2014.050724.
[15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014. URL: http://arxiv.org/pdf/1409.1556v6.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015) 211–252. doi:10.1007/s11263-015-0816-y.
[17] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004) 91–110. doi:10.1023/B:VISI.0000029664.99615.94.
[18] G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Workshop on Statistical Learning in Computer Vision, ECCV, volume 1, Prague, 2004, pp. 1–2.
[19] K. Yanai, Image Collector III: A web image-gathering system with bag-of-keypoints, in: Proceedings of the 16th International Conference on World Wide Web, ACM Press, 2007, pp. 1295–1296. doi:10.1145/1242572.1242816.
[20] T. Liu, C. Rosenberg, H. Rowley, Clustering billions of images with large scale nearest neighbor search, in: 2007 IEEE Workshop on Applications of Computer Vision (WACV '07), IEEE, 2007, pp. 28–28. doi:10.1109/WACV.2007.18.
[21] F. Å. Nielsen, A new ANEW: Evaluation of a word list for sentiment analysis in microblogs, 2011.
[22] J. Devlin, M.-W. Chang, K. Lee, K.
Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL: https://arxiv.org/pdf/1810.04805.
[23] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, 2019. URL: http://arxiv.org/pdf/1905.05583v3.
[24] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Inc., 2009.
[25] H. Saif, M. Fernandez, Y. He, H. Alani, On stopwords, filtering and data sparsity for sentiment analysis of Twitter (2014).
[26] R. Smith, An overview of the Tesseract OCR engine, in: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, 2007.
[27] A. Amalia, A. Sharif, F. Haisar, D. Gunawan, B. B. Nasution, Meme opinion categorization by using optical character recognition (OCR) and naïve Bayes algorithm, in: 2018 Third International Conference on Informatics and Computing (ICIC), IEEE, 2018, pp. 1–5. doi:10.1109/iac.2018.8780410.
[28] C. Yuan, H. Yang, Research on k-value selection method of k-means clustering algorithm, J 2 (2019) 226–235. doi:10.3390/j2020016.
[29] S. Alaparthi, M. Mishra, BERT: A sentiment analysis odyssey, Journal of Marketing Analytics 9 (2021) 118–126. URL: https://link.springer.com/article/10.1057/s41270-021-00109-8. doi:10.1057/s41270-021-00109-8.
[30] K. L. Gwet, Handbook of Inter-Rater Reliability, 4th Edition: The Definitive Guide to Measuring the Extent of Agreement Among Raters, Advanced Analytics, LLC, 2014.
[31] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.