<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>arXiv</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>HTW-DIL at Touché: Multimodal Dense Information Retrieval for Arguments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tamás Janusko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aaron Kämpf</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Keiling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jessica Knick</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Schäfer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maik Thiele</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HTW Dresden</institution>
          ,
          <addr-line>Friedrich-List-Platz 1, Dresden, 01069</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>15343</volume>
      <abstract>
        <p>Retrieving images for arguments poses many of the problems of traditional information retrieval, with the added challenge of being inherently multimodal. We adapt a dense retrieval approach to address this issue and acquire synthetic training data to fine-tune a multimodal model as part of our retriever. Furthermore, we conduct ablation studies to examine the impact of different modalities and benchmark our approach against state-of-the-art methods. While the task itself is laden with ambiguity, there appears to be a benefit in using only textual information for retrieving argumentative images.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Multimodal Retrieval</kwd>
        <kwd>Image Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>We are provided with 136 arguments, each consisting of topic, premise, claim, stance and type. The task is to
retrieve supporting images from a web crawl of approx. 9,000 samples, where each image is accompanied
by additional information, among others the text content of the encompassing website and the search
query used to obtain that image. A particularly difficult aspect of the task is that several arguments
share a topic and have similar premises and claims, while their differing stances are grounded in subtle
deviations of the lines of reasoning.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Multimodal DPR</title>
        <p>As presented in the original DPR paper, we frame retrieval as a metric learning problem, aiming
to maximize the dot product similarity of matching queries and targets. While DPR retrieves passages
(chunks derived from larger documents), we treat each image and its accompanying textual information
(website text summary and web search query) as a unit during training, although only the image is
evaluated eventually.</p>
        <p>
          The method employs an in-batch negative training scheme where, for a given image, a matching
argument (positive) doubles as a negative example when paired with any other image within the batch,
thus efficiently increasing the data size. Additionally, a randomly sampled argument is passed along
with the positive argument to function as yet another negative for each image in the batch. This way
we yield n positive pairings and n(2n-1) negative pairings for a batch size of n. On this basis we compute the
negative log-likelihood loss as implemented in the original DPR code (https://github.com/facebookresearch/DPR/blob/main/dpr/models/biencoder.py#L254), using the multimodal
large language model (MLLM) Moondream2 (https://huggingface.co/vikhyatk/moondream2) to facilitate operations in a joint embedding space. Phi 1.5 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is the underlying LLM, with SigLIP [7]
providing the vision capabilities in Moondream2. This model choice is motivated by its favorable
reported performance as well as its moderate size, which is manageable with the hardware available to
us.
        </p>
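        <p>As a minimal PyTorch sketch of this scheme (the tensor names are ours, and this is a reconstruction of the described setup rather than the verbatim DPR implementation), the loss over a batch can be computed as follows:</p>
        <preformat>
import torch
import torch.nn.functional as F

def in_batch_nll_loss(image_emb, positive_arg_emb, random_arg_emb):
    """Negative log-likelihood over in-batch negatives.

    image_emb:        (n, d) embeddings of the batch images
    positive_arg_emb: (n, d) embeddings of their matching arguments
    random_arg_emb:   (n, d) embeddings of the randomly sampled arguments
    """
    # (n, 2n) score matrix: dot products against all candidate arguments,
    # so each image sees 1 positive and 2n-1 negatives
    candidates = torch.cat([positive_arg_emb, random_arg_emb], dim=0)
    scores = image_emb @ candidates.T
    # the positive argument for image i sits at candidate index i
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return F.nll_loss(F.log_softmax(scores, dim=1), targets)
        </preformat>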
        <p>We use a peak learning rate of 1e-5 with linear warm-up from 10% of the peak over the first 10% of training,
followed by linear decay back to 10% of the peak over the remaining steps. With a batch size of 16, training is performed for two epochs
(Ep2), with the exception of one approach with image and text input that we train for three epochs (Ep3)
to probe the onset of possible overfitting.</p>
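        <p>A minimal sketch of this schedule, assuming it is expressed as a multiplicative factor on the peak learning rate (the exact training code may differ):</p>
        <preformat>
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, total_steps, warmup_frac=0.1, floor=0.1):
    warmup_steps = int(total_steps * warmup_frac)

    def factor(step):
        if step &lt; warmup_steps:
            # linear warm-up from 10% of the peak up to the peak
            return floor + (1.0 - floor) * step / max(1, warmup_steps)
        # linear decay from the peak back down to 10% of the peak
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 1.0 - (1.0 - floor) * progress

    return LambdaLR(optimizer, lr_lambda=factor)
        </preformat>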
        <p>The fine-tuned model is then used to embed the query arguments as well as the image/text pairs.
A FAISS [8] index is computed over all embedded image/text pairs, and the top-k most similar instances
for each argument embedding are retrieved.</p>
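        <p>A sketch of the indexing and lookup step with FAISS; the embedding dimensionality and file names here are illustrative assumptions:</p>
        <preformat>
import faiss
import numpy as np

d = 2048                               # embedding dimensionality (assumed)
pair_embs = np.load("pair_embeddings.npy").astype("float32")      # (N, d) image/text pairs
arg_embs = np.load("argument_embeddings.npy").astype("float32")   # (136, d) arguments

index = faiss.IndexFlatIP(d)           # exact search, dot-product similarity
index.add(pair_embs)
scores, ids = index.search(arg_embs, 10)   # top-10 pair IDs per argument
        </preformat>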
        <p>Note that the final evaluation is based only on the retrieved image, rendering the textual website
content merely supportive context information for the retrieval task that is not directly considered
by the judges.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Pre-Processing</title>
        <p>Since the input length of any large language model (LLM) is finite, we perform several pre-selection and
pre-processing steps. Firstly, we use only the image, the website content text and the query string to represent
an image and its website context. Additionally, images are scaled to 256 pixels at their largest dimension,
and content text is summarized with BART fine-tuned on CNN Daily Mail (https://huggingface.co/facebook/bart-large-cnn) for summarization [9].
Inputs too large for the summarization model are chunked into suitable sizes, condensed separately,
and then re-concatenated. If a website's content consists mostly of structured text such as lists, no
summarization is performed.</p>
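        <p>A sketch of the summarization step with chunking; the chunk size and generation lengths are assumptions, as the exact values are not stated above:</p>
        <preformat>
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize(text, max_chunk_chars=3000):
    # split over-long inputs, condense each chunk separately, re-concatenate
    chunks = [text[i:i + max_chunk_chars]
              for i in range(0, len(text), max_chunk_chars)]
    parts = [summarizer(c, max_length=130, min_length=30,
                        truncation=True)[0]["summary_text"]
             for c in chunks]
    return " ".join(parts)
        </preformat>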
        <p>Since arguments are given in XML format, we join the argument elements and the topic into a concise
natural sentence, omitting the type information.</p>
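        <p>As an illustration (the element names are taken from the task description; the sentence template is a hypothetical example, not our exact wording):</p>
        <preformat>
import xml.etree.ElementTree as ET

def argument_to_sentence(xml_string):
    arg = ET.fromstring(xml_string)
    get = lambda tag: arg.findtext(tag, default="").strip()
    # the type element is deliberately ignored, as described above
    return (f"On the topic of {get('topic')}, the {get('stance')} side "
            f"claims that {get('claim')}, because {get('premise')}.")
        </preformat>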
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Synthetic Train Data</title>
        <p>In order to train a model in the first place, we need a training set for the task at hand. Using the
multimodal capabilities of OpenAI's GPT-4 [10], we generate synthetic arguments by inferring plausible
argument elements from the available image/website data. Each image/summary pair is used to derive one
argument topic, for which in turn a premise and a claim are generated for pro and con stances. The
resulting argument is given in valid XML format. We do not distinguish between anecdotal and study
types, as there were no examples of the latter at the time of development.</p>
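        <p>A sketch of the generation step, assuming the current OpenAI Python SDK; the model name and prompt wording are illustrative placeholders, not the exact ones used here:</p>
        <preformat>
import base64
from openai import OpenAI

client = OpenAI()

def synthesize_argument(image_path, summary):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = ("From the image and the website summary below, infer one "
              "argument topic, then generate a premise and a claim for both "
              "a pro and a con stance. Answer in valid XML.\n\n"
              "Summary: " + summary)
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; a vision-capable GPT-4 model
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]}],
    )
    return response.choices[0].message.content  # XML argument string
        </preformat>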
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Ablative Approaches</title>
        <p>In addition to using image and text data jointly (Moondream Default, Ep2, Ep3), we experiment with
using the images (Moondream Image) and the textual information (Moondream Text) separately in
order to examine whether a unimodal approach represents a feasible alternative within our DPR-like
setup. For this purpose, Moondream models are fine-tuned using only images or only website content text
(including the query string), with the same hyperparameters as the multimodal approach. Moreover, we
employ Ada embeddings (https://platform.openai.com/docs/models/embeddings) from OpenAI to represent the case of simple text-based retrieval with proven
off-the-shelf technology. The rationale for this is that images found on websites are usually placed there
deliberately by human authors with the intention of supporting the written content. Following that
assumption, and given robust text embeddings, one can leverage this relation to obtain relevant images
without taking them into account explicitly.</p>
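        <p>A sketch of this baseline, assuming OpenAI's embedding endpoint and cosine ranking; the input lists are placeholders:</p>
        <preformat>
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-ada-002"):
    res = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in res.data])

argument_sentences = ["..."]   # flattened arguments (see Section 3.2)
document_texts = ["..."]       # website summary plus query string per image

arg_vecs = embed(argument_sentences)
doc_vecs = embed(document_texts)

# Ada embeddings are unit-normalized, so dot product equals cosine similarity
scores = arg_vecs @ doc_vecs.T
top10 = np.argsort(-scores, axis=1)[:, :10]   # top-10 image IDs per argument
        </preformat>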
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Manual Evaluation Results</title>
      <p>To obtain a better understanding of our approaches, we manually evaluate the retrieval results. For
each approach we examine the top-3 retrieved images, assign them to the mutually exclusive
categories supports argument, on-topic and off-topic, and compute the inter-annotator agreement
using Cohen's kappa. Considering all 136 arguments, two annotators and top-3 results, we yield
816 annotations per approach. In Figure 1, bars represent the number of class instances from all top-3
runs and both annotators, with the bold marking representing the mean and the thinner upper and
lower markings showing the min/max values found in annotations of single runs. We find the majority of
retrieved images classified as not supportive for the corresponding argument. Only Ada and text-only
Moondream show parity of the supporting and off-topic classes, with on-topic instances being the majority,
which can be interpreted as a trend from the low-performing image-only approaches to the best-performing
text-only ones. This may be because only shallow semantics of images are captured by the model. This is
also supported by the kappa values found in Table 1, where we find the highest inter-annotator agreement
for image-only Moondream, our worst performing model. In contrast, the best performing approaches,
Ada and text-only Moondream, have the lowest kappa values. This can be interpreted as evidence of
the inherent ambiguity of the task at hand and the many ways an image can support an argument.
Additionally, we compute the Jaccard index to quantify the similarity of the results given by the
different methods we employ. For this we use the top-10 results for all arguments and compare the
IDs of the retrieved images. From Figure 2 we take that the approaches differ substantially in their choice
of relevant images, with the Moondream-based approaches being fairly similar to each other and Ada-based retrieval far
behind, with Moondream text-only as its closest approach. The findings from Figure 1 that text-based
approaches differ significantly from the image-only Moondream approach are reaffirmed. But while
the high similarity of Moondream image/text fine-tuned for two and three epochs is obvious, the
non-similarity of Ada and text-only Moondream is surprising and again speaks for the high ambiguity
of the challenge.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Inter-annotator agreement (Cohen's kappa) per approach.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Method</th><th>Kappa</th></tr>
          </thead>
          <tbody>
            <tr><td>Ada</td><td>0.231</td></tr>
            <tr><td>MD Std.</td><td>0.489</td></tr>
            <tr><td>MD Ep2</td><td>0.439</td></tr>
            <tr><td>MD Ep3</td><td>0.525</td></tr>
            <tr><td>MD Image</td><td>0.66</td></tr>
            <tr><td>MD Text</td><td>0.371</td></tr>
          </tbody>
        </table>
      </table-wrap>
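      <p>A sketch of the two statistics used above (the label arrays and ID lists are placeholders):</p>
      <preformat>
from sklearn.metrics import cohen_kappa_score

# one label per retrieved image: "supports", "on-topic" or "off-topic"
annotator_a = ["supports", "on-topic", "off-topic", "on-topic"]
annotator_b = ["supports", "off-topic", "off-topic", "on-topic"]
kappa = cohen_kappa_score(annotator_a, annotator_b)

def jaccard(ids_a, ids_b):
    """Overlap of two methods' top-10 result sets of image IDs."""
    a, b = set(ids_a), set(ids_b)
    return len(a.intersection(b)) / len(a.union(b))
      </preformat>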
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work we explored possibilities of multimodal dense image retrieval for arguments. We adapted
the DPR technique and fine-tuned a multimodal base model for different input modalities. Ablation
studies suggest that text-only input is the most favorable input format and that fine-tuning on images alone
causes retrieval to stray off-topic. This is underlined by the similarly good results of both text-only
approaches despite their differences in size and purpose. However, this calls for further examination,
preferably on a larger data set and with additional annotators, since 136 samples and two annotators
facilitate only moderately robust statistical analysis.</p>
      <p>The main areas of interest are the contribution of information from text vs. image inputs, as well as the
role and extent of ambiguity when mapping arguments to images. As a natural first step, the possibility
of bottlenecks caused by too-small models and low image resolutions has to be ruled out.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcıa Seco de Herrera</surname>
          </string-name>
          , L. Bloch,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karpenka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , Overview of ImageCLEF 2024:
          <article-title>Multimedia retrieval in medical applications</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          , Ç. Çöltekin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alshomary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Longueville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Handke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mirzakhmedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Morkevičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Reitis-Münstermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scharfbillig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stefanovitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , Overview of Touché 2024:
          <article-title>Argumentation Systems</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elstner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          , in: J.
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Kruschwitz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          . Caputo (Eds.),
          <source>Advances in Information Retrieval. 45th European Conference on IR Research (ECIR</source>
          <year>2023</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          . doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>The probabilistic relevance framework: Bm25 and beyond</article-title>
          ,
          <source>Found. Trends Inf. Retr</source>
          . (
          <year>2009</year>
          ). URL: https://doi.org/10.1561/1500000019. doi:10.1561/1500000019.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W.-t. Yih,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          , in:
          <source>Proceedings of the 2020 EMNLP</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          . doi:10.18653/v1/2020.emnlp-main.550.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Giorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Textbooks are all you need II: phi-1.5 technical report</article-title>
          ,
          <year>2023</year>
          . URL: https://www.microsoft.com/en-us/research/publication/textbooks-are-all-you-need-ii-phi-1-5-technical-report/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>