<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">DS@GT at Touché: Image Search and Ranking via CLIP and Image Generation Notebook for the Touché Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Benjamin</forename><surname>Ostrower</surname></persName>
							<email>bostrower3@gatech.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Georgia Institute of Technology</orgName>
								<address>
									<addrLine>225 North Avenue</addrLine>
									<postCode>30332</postCode>
									<settlement>Atlanta</settlement>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Patcharapong</forename><surname>Aphiwetsa</surname></persName>
							<email>paphiwetsa3@gatech.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Georgia Institute of Technology</orgName>
								<address>
									<addrLine>225 North Avenue</addrLine>
									<postCode>30332</postCode>
									<settlement>Atlanta</settlement>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">DS@GT at Touché: Image Search and Ranking via CLIP and Image Generation Notebook for the Touché Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">F62119ADBB3859E29538212EB00BCC12</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:52+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Image Generation</term>
					<term>CLIP</term>
					<term>Image Retrieval</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Our team made two submissions to the task "Image Retrieval for Arguments", focusing on retrieving images for a given argument. Our two runs used CLIP embeddings and a comparison against generated images.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The exponential growth of digital imagery has profoundly influenced fields ranging from social media and entertainment to scientific research and healthcare. As the phrase "a picture is worth a thousand words" suggests, visual media will only become more important as a form of efficient communication. Touché offers a shared task on selecting the most relevant images from a crawled corpus for a set of arguments, and we entered this task to improve on solutions for retrieving images related to arguments. We deliberately restricted our solutions to the images and their descriptions, avoiding any webpage text. Our first approach embeds each image together with its description as a single comprehensive unit; our second approach adds one further step on top of that, comparing retrieved images against images generated using the arguments themselves as prompts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Related work</head><p>The defining paper for retrieving images for arguments is by Kiesel et al. <ref type="bibr" target="#b0">[1]</ref>. They apply natural language processing techniques to the web text surrounding the images, creating expanded keyword searches that attempt to track the stance of an argument. At Touché 2023, team Jean-Luc Picard <ref type="bibr" target="#b1">[2]</ref> constructed a similar solution: one of their submissions used image generation with Stable Diffusion. The authors prompt the image generator with the competition arguments to create benchmark images, then use CLIP to find the competition images most similar to those benchmarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">CLIP</head><p>CLIP <ref type="bibr" target="#b2">[3]</ref> stands for Contrastive Language-Image Pretraining. It is a model developed by OpenAI that embeds images and texts into the same vector space, trained on images paired with their captions. This shared space reduces the dimensionality of text and images while preserving semantic similarity, so relevant results can be retrieved from one modality given a query in the other.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Stable Diffusion</head><p>Stable Diffusion <ref type="bibr" target="#b3">[4]</ref> is a neural network model that produces images from a text prompt. By decomposing the image formation process into a sequential application of denoising autoencoders, Stable Diffusion achieves state-of-the-art synthesis results on image data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">System Overview</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Embedding Pipeline</head><p>Both submissions used CLIP from OpenAI. The competition supplied the images along with image descriptions generated by LLaVA. Both modalities were embedded with CLIP and combined in a 70-30 image-to-text ratio. The combined embeddings were stored in a ChromaDB vector database for later retrieval.</p></div>
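The 70-30 combination above can be sketched as follows. The paper does not specify the exact blending method, so this assumes a weighted sum of L2-normalized CLIP vectors, renormalized so the result is still suitable for cosine-similarity search; the function name and toy 4-dimensional vectors are illustrative.

```python
import numpy as np

def combine_embeddings(image_emb, text_emb, image_weight=0.7):
    """Blend an image embedding with its description embedding.

    Assumption: the 70-30 image-to-text ratio is a weighted sum of
    L2-normalized vectors, renormalized for cosine-similarity search.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    combined = image_weight * img + (1.0 - image_weight) * txt
    return combined / np.linalg.norm(combined)

# Toy 4-dimensional stand-ins for CLIP embeddings.
image_emb = np.array([1.0, 0.0, 0.0, 0.0])
text_emb = np.array([0.0, 1.0, 0.0, 0.0])
combined = combine_embeddings(image_emb, text_emb)
```

Renormalizing after the weighted sum keeps all stored vectors at unit length, so the dot product used later is exactly cosine similarity.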
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Retrieval</head><p>The provided arguments (only the arguments; no premises or claims were used) served as queries against the vector database. Each argument was embedded with CLIP to match the dimensionality of the combined image-text embeddings. Each argument was then compared to every image in the database via cosine similarity, keeping the top 10 for our initial submission.</p></div>
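A minimal stand-in for the ChromaDB query described above, assuming the database rows are unit-normalized embeddings (as in the sketch of the previous section); the function name and toy data are illustrative.

```python
import numpy as np

def retrieve_top_k(query_emb, database, k=10):
    """Rank database embeddings by cosine similarity to the query.

    Assumes rows of `database` are unit-normalized, so the dot
    product equals cosine similarity.
    """
    query = query_emb / np.linalg.norm(query_emb)
    scores = database @ query           # cosine similarity per image
    top = np.argsort(scores)[::-1][:k]  # indices of the k best matches
    return top, scores[top]

# Toy database of 5 unit vectors in 3 dimensions.
rng = np.random.default_rng(0)
db = rng.normal(size=(5, 3))
db /= np.linalg.norm(db, axis=1, keepdims=True)
indices, scores = retrieve_top_k(db[2], db, k=3)
```

Querying with `db[2]` itself returns index 2 first with a score of 1.0, since a vector is maximally similar to itself.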
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Image generation</head><p>For the image generation submission, images for each topic were generated from a set of TinyLlama-generated supporting or detracting arguments, depending on the stance: if the stance was pro, the prompt instructed the model to provide supporting claims; if it was anti, it prompted for detracting claims. The number of arguments generated varied from 3 to 7. The prompt format for a supporting generation is shown in figure <ref type="figure" target="#fig_0">1</ref>.</p><p>{"role": "system", "content": "You are a student trained to think critically; for each claim break it down into several subclaims"}, {"role": "user", "content": f"Create some numbered prompts to give to a machine to create images that support the claim: '{prompt}'"}</p><p>These TinyLlama-generated supporting/detracting arguments were then fed into stable-diffusion-2-1-base for image generation. The generated images were again embedded with CLIP and compared to the top 40 images retrieved for a given argument by the method described in the prior section. Because the number of generated images varied per argument, we ranked the crawled images by their average similarity score across all generated images, taking the highest averages as the most relevant. </p></div>
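The average-score re-ranking over a variable-size set of generated images can be sketched as follows; the function name and toy embeddings are illustrative, and unit-normalized rows are assumed as in the earlier sketches.

```python
import numpy as np

def rerank_by_generated(candidates, generated, top_n=40):
    """Re-rank crawled candidates against generated images.

    Each candidate embedding (row, unit-normalized) is scored by its
    mean cosine similarity to the variable-size set of generated-image
    embeddings; candidates are reordered by that average.
    """
    sims = candidates @ generated.T        # (n_candidates, n_generated)
    avg = sims.mean(axis=1)                # average over generated images
    order = np.argsort(avg)[::-1][:top_n]  # best average first
    return order, avg[order]

# Toy data: 6 candidate embeddings, 2 generated-image embeddings.
rng = np.random.default_rng(1)
cands = rng.normal(size=(6, 4))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)
gen = np.vstack([cands[3], rng.normal(size=(1, 4))])
gen /= np.linalg.norm(gen, axis=1, keepdims=True)
order, avg = rerank_by_generated(cands, gen)
```

Averaging rather than summing keeps the metric comparable across arguments that yielded different numbers of generated images.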
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>Our approaches did not beat the BM25 and SBERT baselines. We do see, however, that the added filter of comparing the top results to images generated from the arguments increases the accuracy of the model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>The image generation approach worked best, although neither submission beat the baseline. Future directions include re-ranking with LLaVA visual question answering, i.e. asking LLaVA to describe each picture in relation to the argument in question, and utilizing BM25 on webpage text to identify keywords that might indicate an image's relevance.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Prompt for supporting argument image generation, using the Python transformers library pipeline method.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Approaches did not beat baseline</figDesc><graphic coords="3,72.00,65.61,451.28,224.34" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The Data Science at Georgia Tech Club.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Image Retrieval for Arguments Using Stance-Aware Query Expansion</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reichenbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.argmining-1.4</idno>
	</analytic>
	<monogr>
		<title level="m">8th Workshop on Argument Mining (ArgMining 2021) at EMNLP, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Al-Khatib</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Hou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Stede</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="36" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Moebius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Enderling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Bachinger</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09172</idno>
	<title level="m">Jean-Luc Picard at Touché 2023: Comparing image generation, stance detection and feature matching for image retrieval for arguments</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">High-resolution image synthesis with latent diffusion models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rombach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Blattmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lorenz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Esser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ommer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</title>
				<meeting>the IEEE/CVF conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="10684" to="10695" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
