HTW-DIL at Touché: Multimodal Dense Information Retrieval for Arguments
Notebook for the Touché Lab at CLEF 2024

Tamás Janusko¹, Aaron Kämpf¹,†, Denis Keiling¹,†, Jessica Knick¹,†, David Schäfer¹,† and Maik Thiele¹
¹ HTW Dresden, Friedrich-List-Platz 1, Dresden, 01069, Germany

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† Authors contributed equally
tamas.janusko@htw-dresden.de (T. Janusko); aaron.kaempf@stud.htw-dresden.de (A. Kämpf); denis.keiling@stud.htw-dresden.de (D. Keiling); jessica.knick@stud.htw-dresden.de (J. Knick); david.schaefer@stud.htw-dresden.de (D. Schäfer); maik.thiele@htw-dresden.de (M. Thiele)
0000-0002-1665-977X (M. Thiele)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Retrieving images for arguments poses many of the problems of traditional information retrieval with the added challenge of being inherently multimodal. We adapt a dense retrieval approach to address this issue and acquire synthetic training data to fine-tune a multimodal model as part of our retriever. Furthermore, we conduct ablation studies to examine the impact of different modalities and benchmark our approach against state-of-the-art methods. While the task itself is ambiguity-laden, there appears to be a benefit to using only textual information for retrieving argumentative images.

Keywords
Machine Learning, Information Retrieval, Multimodal Retrieval, Image Retrieval

1. Introduction
In information retrieval tasks, the key objective is to find the most relevant documents for a given query. This also holds for image retrieval for arguments as performed in Task 3 of Touché@CLEF [1, 2] on the TIRA shared task platform [3]. Although methods like BM25 [4] are strong baselines, they are limited by considering only the surface forms of the information-carrying elements in a document. While this leads to diminished recall when dealing with synonyms, for example, multimodal retrieval raises the question of how to operate on the surface level at all. Joint text and image embeddings address both the challenge of deeper semantic understanding and that of relating textual to visual information. In the case of image retrieval for arguments, we are confronted with an additional layer of implicitness which out-of-the-box embeddings are not equipped to handle. We therefore propose a method inspired by Facebook Research's dense passage retrieval (DPR) system [5], in which we fine-tune a multimodal large language model (MLLM) to maximize similarity scores of matching argument and image/text pairs.

2. Task Description
We are provided with 136 arguments, each consisting of topic, premise, claim, stance and type. The task is to retrieve supporting images from a web crawl of approx. 9,000 samples, where each image is accompanied by additional information, among others the text content of the encompassing website and the search query used to obtain that image. A particularly difficult aspect of the task is that several arguments share a topic and have similar premises and claims, while their differing stances are grounded in subtle deviations of the lines of reasoning.

3. Methodology

3.1. Multimodal DPR
As in the original DPR paper, we frame retrieval as a metric learning problem aiming to maximize the dot-product similarity of matching queries and targets.
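The scoring itself can be sketched in a few lines. The snippet below is illustrative only: it assumes precomputed query and candidate embeddings as PyTorch tensors, and the dimensionality and variable names are arbitrary rather than taken from our code.

import torch

def dot_product_scores(query_emb: torch.Tensor, cand_embs: torch.Tensor) -> torch.Tensor:
    # query_emb: (d,) embedding of one argument; cand_embs: (N, d) embeddings
    # of all image/text candidates. Returns the (N,) vector of similarities.
    return cand_embs @ query_emb

def retrieve_top_k(query_emb: torch.Tensor, cand_embs: torch.Tensor, k: int = 10):
    # Rank all candidates by dot-product similarity and return the k best.
    scores = dot_product_scores(query_emb, cand_embs)
    top = torch.topk(scores, k=min(k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()

# Toy usage with random vectors standing in for encoder outputs.
query = torch.randn(512)
candidates = torch.randn(9000, 512)
indices, scores = retrieve_top_k(query, candidates, k=5)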
While DPR retrieves passages (chunks derived from larger documents), we treat each image and the accompanying textual information (website text summary and web search query) as a unit during training, although only the image is evaluated eventually. The method employs an in-batch negative training scheme in which, for a given image, a matching argument (positive) doubles as a negative example when paired with any other image within the batch, thus efficiently increasing the effective data size. Additionally, a randomly sampled argument is passed along with the positive argument to serve as yet another negative for each image in the batch. This way we obtain 𝑛 positive pairings and 𝑛² negative pairings for a batch size of 𝑛. On this basis we compute the negative log-likelihood loss as implemented in the original DPR code¹, using the MLLM Moondream2² to facilitate operations in a joint embedding space. Phi 1.5 [6] is the underlying LLM in Moondream2, with SigLIP [7] providing the vision capabilities. This model choice is motivated by its favorable reported performance as well as its moderate size, which is manageable with the hardware available to us. We use a learning rate of 1e-5 with linear warm-up from 10% of the peak value over the first 10% of training, followed by decay back to 10% over the remaining run. With a batch size of 16, training is performed for two epochs (Ep2), with the exception of an image-and-text approach that we train for three epochs (Ep3) to probe the onset of possible overfitting. The fine-tuned model is then used to embed the query argument as well as the image/text pairs. FAISS [8] indices are computed for all embedded image/text pairs, and the top-k most similar instances for each argument embedding are retrieved. Note that the final evaluation is based only on the retrieved image, rendering the textual website content merely supportive context information that is not directly considered by the judges.

¹ https://github.com/facebookresearch/DPR/blob/main/dpr/models/biencoder.py#L254
² https://huggingface.co/vikhyatk/moondream2

3.2. Data Pre-Processing
Since the input length of any large language model (LLM) is finite, we perform several pre-selection and pre-processing steps. Firstly, we use only the image, the website content text and the query string to represent an image and its website context. Additionally, images are scaled to 256 pixels at their largest dimension, and content text is summarized with BART fine-tuned on CNN Daily Mail³ for summarization [9]. Inputs too large for the summarization model are chunked into suitable sizes, condensed separately and then re-concatenated. If a website's content consists mostly of structured text such as lists, no summarization is performed. Since arguments are given in XML format, we join the argument elements and the topic into a concise natural sentence, omitting the type information.

³ https://huggingface.co/facebook/bart-large-cnn

3.3. Synthetic Train Data
In order to train a model in the first place, we need a training set for the task at hand. Using the multimodal capabilities of OpenAI's GPT-4 [10], we generate synthetic arguments by inferring plausible argument elements from the available image/website data. Each image/summary is used to derive one argument topic, for which in turn a premise and a claim are generated for pro and con stances. The resulting argument is given in valid XML format. We do not distinguish between anecdotal and study types as there were no examples of the latter at the time of development.
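As an illustration of this generation step, the sketch below requests a synthetic argument from the OpenAI Chat Completions API for one image/summary pair. The model name, prompt wording and XML tags are assumptions made for the example, not a reproduction of our actual prompt.

import base64
from openai import OpenAI  # official openai>=1.0 Python client

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Given the attached image and the website summary below, infer one plausible argument "
    "topic and generate a premise and a claim for both a pro and a con stance. "
    "Answer in valid XML using <topic>, <premise>, <claim> and <stance> elements.\n\n"
    "Summary: {summary}"
)

def generate_synthetic_argument(image_path: str, summary: str, model: str = "gpt-4-turbo") -> str:
    # Encode the (already downscaled) image and send it together with the summary.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,  # assumed vision-capable GPT-4 variant
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT.format(summary=summary)},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    # The returned message is expected to contain the XML argument.
    return response.choices[0].message.content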
3.4. Ablative Approaches
In addition to using image and text data jointly (Moondream Default, Ep2, Ep3), we experiment with using the images (Moondream Image) and the textual information (Moondream Text) separately, in order to examine whether a unimodal approach represents a feasible alternative within our DPR-like setup. For this purpose, Moondream models are fine-tuned using only images or only website content text (including the query string), with the same hyperparameters as the multimodal approach. Moreover, we employ Ada embeddings⁴ from OpenAI to represent the case of simple text-based retrieval with proven off-the-shelf technology. The rationale for this is that images found on websites are usually placed there deliberately by human authors with the intention of supporting the written content. Following that assumption, and given robust text embeddings, one can leverage this relation to obtain relevant images without taking them into account explicitly.

⁴ https://platform.openai.com/docs/models/embeddings

4. Manual Evaluation Results

Figure 1: Class attributions per approach for retrieved images averaged over top-3 results

To obtain a better understanding of our approaches, we manually evaluate the retrieval results. For each approach we examine the top-3 retrieved images, assign them to the mutually exclusive categories supports argument, on-topic and off-topic, and compute the inter-annotator agreement using Cohen's kappa. Considering all 136 arguments, two annotators and top-3 results, this yields 816 annotations per approach. In Figure 1, the bars represent the number of class instances from all top-3 runs and both annotators; the bold marking represents the mean, while the thinner upper and lower markings show the minimum and maximum values found in annotations of single runs. We find the majority of retrieved images classified as not supportive of the corresponding argument. Only Ada and text-only Moondream show parity between the supporting and off-topic classes, with on-topic instances being the majority, which can be interpreted as a trend from low-performing image-only approaches to best-performing text-only ones. This may be because the model captures only shallow semantics of images. It is also supported by the kappa values in Table 1, where we find the highest inter-annotator agreement for image-only Moondream, our worst-performing model. In contrast, the best-performing approaches, Ada and text-only Moondream, have the lowest kappa values. This can be interpreted as evidence of the inherent ambiguity of the task at hand and the many ways an image can support an argument.

Additionally, we compute the Jaccard index to quantify the similarity of the results given by the different methods we employ. For this we use the top-10 results from all arguments and compare the IDs of the retrieved images. From Figure 2 we gather that the approaches differ substantially in their choice of relevant images: the Moondream-based approaches are fairly similar to each other, while Ada-based retrieval stands apart from all of them, with text-only Moondream as its closest neighbor. The finding from Figure 1 that the text-based approaches differ significantly from the image-only Moondream approach is reaffirmed. But while the high similarity of Moondream image/text fine-tuned for two and three epochs is obvious, the dissimilarity of Ada and text-only Moondream is surprising and again speaks for the high ambiguity of the challenge.

Figure 2: Jaccard indices quantifying similarities between approaches for top-10 retrieved images

Table 1
Cohen's kappa inter-annotator agreement for two annotators and top-3 retrieved images

Method      Kappa
Ada         0.231
MD Std.     0.489
MD Ep2      0.439
MD Ep3      0.525
MD Image    0.66
MD Text     0.371
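For concreteness, a minimal sketch of both measures is given below. The per-argument averaging of Jaccard indices is an assumption (the text only states that top-10 results from all arguments are compared), the toy data is invented, and scikit-learn's cohen_kappa_score merely stands in for whichever implementation was actually used.

from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def jaccard(a, b) -> float:
    # Jaccard index of two collections of retrieved image IDs.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mean_jaccard(runs_a, runs_b) -> float:
    # Average the per-argument Jaccard index of two approaches' top-k ID lists.
    return sum(jaccard(runs_a[arg], runs_b[arg]) for arg in runs_a) / len(runs_a)

# Toy data: per-argument top-k image IDs for three hypothetical approaches.
approaches = {
    "Ada":      {"arg1": ["i1", "i2", "i3"], "arg2": ["i7", "i8", "i9"]},
    "MD Text":  {"arg1": ["i1", "i4", "i3"], "arg2": ["i7", "i5", "i6"]},
    "MD Image": {"arg1": ["i5", "i6", "i2"], "arg2": ["i4", "i3", "i2"]},
}
for (name_a, runs_a), (name_b, runs_b) in combinations(approaches.items(), 2):
    print(f"{name_a} vs. {name_b}: {mean_jaccard(runs_a, runs_b):.3f}")

# Inter-annotator agreement over the three manual categories (cf. Table 1).
annotator_1 = ["supports", "on-topic", "off-topic", "on-topic"]
annotator_2 = ["supports", "off-topic", "off-topic", "on-topic"]
print("kappa:", cohen_kappa_score(annotator_1, annotator_2))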
5. Conclusion
In this work we explored possibilities of multimodal dense image retrieval for arguments. We adapted the DPR technique and fine-tuned a multimodal base model for different input modalities. Ablation studies suggest that text-only input is the most favorable input format and that fine-tuning on images alone causes retrieval to stray off-topic. This is underlined by the similarly good results of both text-only approaches despite their differences in size and purpose. However, this calls for further examination, preferably on a larger data set and with additional annotators, since 136 samples and two annotators facilitate only moderately robust statistical analysis. The main areas of interest are the contribution of information from text vs. image inputs, as well as the role and extent of ambiguity when mapping arguments to images. As a natural first step, the possibility of bottlenecks caused by models that are too small and by low image resolutions has to be ruled out.

References
[1] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[2] J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M. Alshomary, B. D. Longueville, T. Erjavec, N. Handke, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-Münstermann, M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, B. Stein, Overview of Touché 2024: Argumentation Systems, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[3] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.
[4] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Found. Trends Inf. Retr. (2009). URL: https://doi.org/10.1561/1500000019. doi:10.1561/1500000019.
[5] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020. doi:10.18653/v1/2020.emnlp-main.550.
[6] Y. Li, S. Bubeck, R. Eldan, A. D. Giorno, S. Gunasekar, Y. T. Lee, Textbooks are all you need II: phi-1.5 technical report, 2023. URL: https://www.microsoft.com/en-us/research/publication/textbooks-are-all-you-need-ii-phi-1-5-technical-report/.
[7] X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer, Sigmoid loss for language image pre-training, 2023. arXiv:2303.15343.
[8] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, 2017. arXiv:1702.08734.
[9] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461.
[10] OpenAI, J. Achiam, S. Adler, et al., GPT-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.