<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CEDNAV-UTB: Efficient Image Retrieval for Arguments with CLIP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Alberto Guevara Amaya</string-name>
          <email>guevarad@utb.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jairo Enrique Serrano Castañeda</string-name>
          <email>jserrano@utb.edu.co</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan C. Martinez-Santos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Puertas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Naval Technological Development Center</institution>
          ,
          <addr-line>Colombian Navy, Cartagena</addr-line>
          ,
          <country country="CO">Colombia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Digital Transformation, Universidad Tecnológica de Bolívar</institution>
          ,
          <addr-line>Cartagena</addr-line>
          ,
          <country country="CO">Colombia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper introduces an efficient and reproducible system for argumentative image retrieval developed by the UTB-CEDNAV team for the 2025 edition of the Image Retrieval for Arguments challenge at Touché@CLEF. The system leverages the CLIP model (ViT-B/32) to represent textual arguments through images. Unlike previous approaches that rely heavily on complex text processing, image generation models, or multi-stage architectures, this solution focuses on computational simplicity. It significantly reduces energy consumption by reusing embeddings, enabling parallel processing, and eliminating redundant steps. According to measurements made using the CodeCarbon tool, this strategy resulted in an energy consumption reduction of over 85% in subsequent runs. The implementation is easy to deploy in environments like Google Colab and adheres to all Touché evaluation standards. This work provides a strong baseline for developing sustainable and scalable multimodal retrieval systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Sustainable AI</kwd>
        <kwd>Computational efficiency</kwd>
        <kwd>Image retrieval</kwd>
        <kwd>CLIP</kwd>
        <kwd>Multimodal modeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Image Retrieval for Arguments is a task that leverages natural language processing and computer vision
to enhance the analysis, generation, and presentation of complex ideas. This task is part of the Touché
Lab at CLEF 2025 challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], organized in collaboration with ImageCLEF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and evaluates systems
capable of retrieving or generating images relevant to a textual argument. Each image should help
convey the argument by illustrating it, providing examples, or evoking an emotional response, as shown
in Figure 1.
      </p>
      <p>
        To accomplish the task, we used the CLIP model (Contrastive Language–Image Pretraining) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
This model encodes both text and images into vector representations, enabling direct comparison and
measurement of their semantic relatedness. CLIP has demonstrated strong performance in multimodal
tasks and requires no additional training when used directly as a retrieval engine.
      </p>
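      <p>As a minimal illustration of this shared embedding space (a sketch, not the exact code of our pipeline), the following Python snippet encodes one claim and one image with CLIP (ViT-B/32) and compares them via cosine similarity; the claim text, the image path, and the use of OpenAI's clip package are assumptions made for the example.</p>
      <preformat>
# Minimal sketch: encode a claim and an image with CLIP (ViT-B/32) and compare them.
# The claim text and image path are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

claim = "We should abandon the use of school uniforms"            # illustrative claim
image = preprocess(Image.open("example_image.png")).unsqueeze(0).to(device)
tokens = clip.tokenize([claim]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(tokens)
    image_emb = model.encode_image(image)

# L2-normalize so the dot product equals cosine similarity.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print(f"cosine similarity: {(text_emb @ image_emb.T).item():.4f}")
      </preformat>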
      <p>The task of Image Retrieval for Arguments has practical applications in education, digital media, and
language assistance systems. Images reinforce textual content, enhance understanding of technical
or abstract concepts, and support visual assessment, thereby reducing bias and misinterpretation.
Moreover, integrating relevant images into automatic argument generation and analysis pipelines
contributes to the development of more valuable and accessible multimodal systems.</p>
      <p>
        Current systems often prioritize improving retrieval accuracy without considering the computational
efficiency of the process. Recent studies, such as Anthony et al. (2021) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], highlight the importance
of measuring and minimizing the carbon footprint during the training and execution of models,
encouraging the use of tools that effectively track and optimize energy consumption. However, moving
toward more sustainable artificial intelligence requires addressing the increasing energy demands of
modern models. Canales (2024) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] discusses several strategies to reduce the environmental impact
of AI systems. In response to this need, the present research proposes a solution that optimizes the
use of cloud infrastructure, executes processes in parallel, and reuses intermediate results—such as
embeddings and rankings—to reduce energy consumption without compromising task performance.
      </p>
      <p>This work presents a functional, easy-to-understand, and optimized baseline that solves the task
using CLIP without requiring additional training. The system delivers reproducible and reliable results
with minimal manual intervention. Key contributions include:
• A multimodal pipeline for image retrieval using CLIP.
• A computational efficiency strategy that minimizes unnecessary resource usage.</p>
      <p>• A validated baseline on the dataset provided by Touché 2025.</p>
      <p>We organized the remainder of the document as follows: Section 2 presents the previous approaches
used in earlier editions of the challenge and compares them with the proposed methodology. Section 3
describes the general architecture of the system and its workflow. Section 4 details the validation and
preliminary evaluation process. Section 5 offers a critical discussion of the results obtained. Finally,
Section 6 proposes possible lines of future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>This section provides context for the study by reviewing prior approaches, the CLIP model used, and
the criteria applied for data selection in the UTB–CEDNAV System. It addresses four key aspects: (i)
related work in previous argumentative image retrieval tasks, (ii) the text–image matching model that
serves as the core of the system, (iii) the data selection strategy designed to ensure both efficiency and
relevance, and (iv) a quantitative evaluation of the system’s environmental impact, offering insight into
its computational sustainability compared to more resource-intensive methods. This review situates the
proposed approach within the current state of the art and justifies the methodological decisions made.</p>
      <sec id="sec-2-1">
        <title>2.1. Related Work</title>
        <p>
The task of Image Retrieval for Arguments has been addressed in previous editions of the Touché challenge through
various methods. Brummerloh et al. (2022) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] employed sentiment analysis with BERT, optical character
recognition (OCR) with Tesseract, and image clustering, which improved stance classification but
relied heavily on text processing and manual validation. Elagina et al. (2023) [7] incorporated
ChatGPT-generated arguments and combined CLIP with IBM Debater as a re-ranker, which introduced generative
biases and reduced accuracy. Ostrower et al. (2024) [8] proposed generating reference images using
TinyLLaMA and Stable Diffusion to compare with the corpus via CLIP. Still, the high computational
cost prevented surpassing the traditional baseline. In contrast, the UTB-CEDNAV System avoids the
use of OCR, sentiment analysis, artificial generation, and external services. It relies solely on real data
(image captions), significantly reducing computational load, bias, and ambiguity.
        </p>
      </sec>
      <sec id="sec-2-1b">
        <title>2.2. CLIP (ViT-B/32)</title>
        <p>
          Developed by OpenAI and introduced by Radford et al. (2021) [9], CLIP is a multimodal learning
model designed to associate images and text within a shared vector space. Although primarily built for
image–text matching, its architecture supports comparisons across different modalities—text-to-text,
image-to-image, and text-to-image—while preserving semantic consistency. This flexibility makes it
especially effective for tasks such as argumentative image retrieval, where semantic similarity between
claims and captions is crucial. In the UTB–CEDNAV System, CLIP is used precisely for this purpose,
leveraging its ability to represent complex concepts in a unified space without requiring OCR or
sentiment analysis.
        </p>
        <p>Although CLIP was originally designed for multimodal tasks, its text encoder has proven effective
for measuring semantic similarity in scenarios where the goal is to align textual descriptions of visual
content. In the context of the Touché 2025 task, each image is accompanied by a human-written caption
that reflects its visual semantics. Using CLIP embeddings for both claims and captions ensures that both
vectors lie in the same multimodal space, preserving compatibility with future image-based extensions
without additional re-training.</p>
        <p>Moreover, adopting a text-only model like SBERT would require aligning two independently trained
encoders: one for textual claims and another for captions intended to describe visual content. Since
the captions are tightly coupled to the image semantics, we found that CLIP’s text encoder provides a
better inductive bias for the retrieval task.</p>
        <p>Finally, our approach avoids the computational cost of running the image encoder, while still
leveraging CLIP’s alignment between natural language and visual concepts. This makes it both efficient and
semantically coherent for the task, especially when prioritizing sustainability and simplicity.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Selected Dataset Features</title>
        <p>It is important to acknowledge that the image captions used in this work were generated using the
LLaVA model by the task organizers prior to system execution. While our approach avoids running
large-scale models during retrieval, the initial creation of captions involved significant computational
effort and energy consumption. This upstream cost, although external to our implementation, should be
considered when evaluating the total environmental footprint of the end-to-end pipeline. Nonetheless,
by focusing exclusively on reusing these pre-generated captions, our system minimizes additional
emissions and promotes sustainable downstream processing.</p>
        <p>Building on this foundation, and to further optimize task performance, this work draws on findings
by Theng and Bhoyar (2024) [10], who emphasize that the quality and relevance of data directly impact
model performance. Although the dataset was selected by the organizers and is considered fixed, our
system focuses on selecting the most informative elements within each data instance. Specifically, we
prioritize image captions, which in our view contain the most semantically relevant and computationally
efficient representation of the image content. This selection is guided by three criteria:
• Computational efficiency: Reducing unnecessary data lowers processing time and resource
usage.
• Direct semantic relevance: Prioritizing elements closely tied to the task objective enhances
model interoperability.
• Reduction of non-informative textual noise: Eliminating irrelevant or redundant content
prevents the model from learning spurious patterns, as described by Maheronnaghsh et al. (2024)
[11].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Environmental Impact Assessment</title>
        <p>To evaluate the environmental impact of the UTB–CEDNAV System, the team employed CodeCarbon
[12], an open-source library developed by MLCO2, to estimate the carbon footprint associated with
the computational load of running machine learning models. This tool tracks the energy consumption
of Python scripts and translates it into estimated CO2 emissions, taking into account factors such as
hardware type, geographical location, and runtime duration.</p>
        <p>The integration of CodeCarbon reflects a growing need to develop AI systems that are both sustainable
and transparent regarding their environmental cost. Unlike previous approaches to argumentative
image retrieval, this work not only avoids computationally intensive techniques like OCR or synthetic
image generation but also quantifies its efficiency using objective environmental metrics.</p>
        <p>The values obtained through CodeCarbon support the system’s minimalist design, demonstrating
that we achieved strong performance while maintaining low energy consumption, thereby reinforcing
the feasibility of sustainable solutions in real-world scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>Building on the outlined context, we designed the system to address the task of Image Retrieval for
Arguments by adhering to two core principles: processing efficiency and sustainable use of computational
resources.</p>
      <sec id="sec-3-1">
        <title>3.1. Selecting Dataset Elements</title>
        <p>The system begins with an analysis of the official Touché 2025 dataset, published on Zenodo [13], which
comprises 32,339 images associated with 128 claims across 27 argument topics.</p>
        <p>After reviewing the dataset’s structure and content, we selected the following components:
• arguments.xml: Contains the textual arguments, particularly the claims that define the core of
each argument (see Figure 2)
• touche25-image-retrieval-and-generation-main.zip: a file of approximately 20 GB that includes
images, HTML files, captions, and metadata. The key component is image-caption.txt, which
provides precise and efficient image descriptions (see Figure 3)</p>
        <p>After analyzing the metadata provided by the organizers, we observed that each image has a
corresponding caption, offering a precise and concise description. Given this consistency, and in line with
our objective of minimizing computational cost, we opted to compare textual embeddings between the
claims of the arguments and the captions of the images. This approach allowed us to avoid direct image
processing while preserving semantic alignment throughout the retrieval process.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Embeddings Pipeline</title>
        <p>Once we identified the relevant dataset elements, the system loads captions from image-caption.txt
files in parallel using ThreadPoolExecutor, as described by Sreedeep S. (2024) [14]. Each caption is
linked to its corresponding image_id and stored in a dictionary-like structure for efficient access.</p>
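      <p>A minimal sketch of this loading step is shown below. It assumes one image-caption.txt file per image directory named after its image_id; this layout and the worker count are assumptions made for illustration, not a description of the exact dataset structure.</p>
      <preformat>
# Sketch: read image captions in parallel with ThreadPoolExecutor.
# Assumes an images/IMAGE_ID/image-caption.txt layout (illustrative, not verified).
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

DATASET_ROOT = Path("touche25-image-retrieval-and-generation-main/images")

def read_caption(image_dir):
    """Return (image_id, caption) for a single image directory."""
    caption = (image_dir / "image-caption.txt").read_text(encoding="utf-8").strip()
    return image_dir.name, caption

image_dirs = [d for d in DATASET_ROOT.iterdir() if d.is_dir()]

# Reading captions is I/O-bound, so threads give a useful speed-up.
with ThreadPoolExecutor(max_workers=16) as executor:
    captions = dict(executor.map(read_caption, image_dirs))

print(f"Loaded {len(captions)} captions")
      </preformat>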
        <p>Claims and captions are then transformed into normalized vector representations using the CLIP
model (ViT-B/32). These embeddings capture semantic meaning in a shared multidimensional space
and are stored in organized, separate files for later use. Before processing new data, the system checks
for existing embeddings to avoid redundant computations, as illustrated in Figure 4. This caching
mechanism reduces execution time, promotes scalability, and supports experiment reproducibility. The
impact of this optimization is quantified in Section 3.4 through environmental metrics.</p>
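      <p>The caching behavior can be sketched as follows, reusing the captions dictionary from the previous snippet and assuming a claims list extracted from arguments.xml; the cache file names and batch size are illustrative choices rather than the exact values used in our runs.</p>
      <preformat>
# Sketch: encode texts with CLIP's text encoder, reusing cached embeddings when present.
import torch
import clip
from pathlib import Path

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_texts(texts, cache_file, batch_size=256):
    cache = Path(cache_file)
    if cache.exists():                       # reuse previously computed embeddings
        return torch.load(cache, map_location=device)
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            tokens = clip.tokenize(texts[i:i + batch_size], truncate=True).to(device)
            emb = model.encode_text(tokens)
            chunks.append(emb / emb.norm(dim=-1, keepdim=True))   # L2-normalize
    embeddings = torch.cat(chunks)
    torch.save(embeddings, cache)            # cache for subsequent runs
    return embeddings

caption_embs = encode_texts(list(captions.values()), "caption_embeddings.pt")
claim_embs = encode_texts(claims, "claim_embeddings.pt")   # claims parsed from arguments.xml
      </preformat>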
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Retrieval</title>
        <p>Given that the captions reliably describe the content of the images and are consistently available, the
system computes semantic similarity exclusively between claim and caption embeddings. This text–text
approach aligns with the task requirements while simplifying the retrieval process.</p>
        <p>To identify the most relevant images for each argument, the system calculates cosine similarity
between the embedding of each claim and those of all captions. Cosine similarity quantifies the angle
between vectors in a shared semantic space, with values closer to 1 indicating stronger alignment.</p>
        <p>Images whose captions rank among the TOP_K most similar are selected as final results. These are
formatted according to the Touché 2025 submission specifications and exported in submission.jsonl
format.</p>
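      <p>A simplified sketch of this ranking and export step follows. The TOP_K value, the claim_ids list, and the JSON field names are illustrative assumptions; the authoritative output schema is the one defined by the Touché 2025 submission guidelines.</p>
      <preformat>
# Sketch: rank images per claim by cosine similarity and write submission.jsonl.
# Field names in the output records are illustrative, not the official schema.
import json
import torch

TOP_K = 10                                   # illustrative value
image_ids = list(captions.keys())            # row order matches caption_embs

# Embeddings are L2-normalized, so a matrix product yields cosine similarities.
similarity = claim_embs @ caption_embs.T     # shape: (num_claims, num_captions)

with open("submission.jsonl", "w", encoding="utf-8") as out:
    for claim_idx, claim_id in enumerate(claim_ids):   # claim_ids from arguments.xml
        top = torch.topk(similarity[claim_idx], k=TOP_K)
        for rank, caption_idx in enumerate(top.indices.tolist(), start=1):
            record = {
                "argument_id": claim_id,
                "image_id": image_ids[caption_idx],
                "rank": rank,
                "tag": "UTB-CEDNAV clip-vit-b32",
            }
            out.write(json.dumps(record) + "\n")
      </preformat>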
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Implementation</title>
        <p>The system was implemented in Google Colab, offering a practical balance between performance,
simplicity, and scalability. The reuse of embeddings and conditional file downloading reduce memory
and storage overhead during execution.</p>
        <p>To assess the system’s environmental impact, we employed the CodeCarbon tool to estimate energy
consumption and associated CO2 emissions throughout the pipeline. Although the shared task requires
a single submission, development involves multiple validation and testing runs to ensure retrieval
quality, parameter tuning, and reproducibility.</p>
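      <p>The measurement can be wrapped around the pipeline roughly as shown below, where run_pipeline() is a placeholder for the embedding and retrieval steps; this is a sketch of CodeCarbon usage rather than the exact instrumentation of our experiments.</p>
      <preformat>
# Sketch: track energy use and emissions for one run with CodeCarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="utb-cednav-image-retrieval")
tracker.start()
try:
    run_pipeline()                   # placeholder for embedding + retrieval steps
finally:
    emissions_kg = tracker.stop()    # estimated emissions in kg CO2-eq
print(f"Estimated emissions: {emissions_kg:.5f} kg CO2-eq")
      </preformat>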
        <p>The initial run—generating all embeddings from captions and claims—consumed approximately
0.00349 kWh, resulting in 0.00093 kg of CO2 emissions. In contrast, subsequent executions that
reused precomputed embeddings averaged only 0.00013 kg of CO2 per run, as shown in Figure 5.
These results support the benefits of the reuse strategy discussed in Section 3.2.</p>
      <p>[Figure 5: Energy consumption and CO2 emissions per iteration, comparing the first run (with embedding generation) to subsequent runs that reuse embeddings.]</p>
        <p>
          Although a large language model (LLM) was not directly implemented as a baseline, recent studies
indicate that such models exhibit significantly higher energy consumption. For instance, Anthony
et al. (2021) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] report that a single inference with an LLM can consume between 0.01 and 0.3 kWh,
depending on the model size and the underlying infrastructure. This far exceeds the energy
consumption recorded by our CLIP-based approach. The difference highlights the efficiency of the proposed
solution, particularly during iterative validation phases, where the reuse of embeddings contributes to
a cumulative reduction in emissions.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>
        The UTB–CEDNAV system distinguishes itself from previous work by adopting a direct, reproducible,
and lightweight approach. Unlike earlier proposals that rely on intensive text processing—such as OCR
and sentiment analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
], re-ranking with external argumentation models [7], or synthetic visual
generation via diffusion models [8]—this system operates exclusively on real data provided in the official
dataset (claims and captions).
      </p>
      <p>The design avoids external dependencies such as re-rankers, additional classifiers, or APIs, which
simplifies the pipeline and facilitates deployment. While approaches based on OCR or image generation
may require significantly more operations per argument—due to reliance on resource-heavy models like
Tesseract, LLaMA, or Stable Diffusion—UTB–CEDNAV performs a direct embedding transformation
using CLIP followed by semantic similarity computation.</p>
      <p>An important optimization is the pre-check for existing embeddings, allowing the system to skip
redundant computations. This promotes reusability of intermediate results and contributes to efficient
runtime behavior, particularly during iterative development and testing.</p>
      <p>The deliberate exclusion of HTML parsing, generative components, and synthetic data reflects a clear
focus on algorithmic transparency and methodological traceability. Furthermore, the final submission
was successfully validated using the official Touché tool, ensuring strict compliance with the task
requirements.</p>
      <p>Although the UTB–CEDNAV system did not surpass the baseline performance (nDCG@5 = 0.2360), it
adheres to a clear design philosophy centered on simplicity and responsible resource use. The winning
team also leveraged CLIP embeddings, but used a larger and more compute-intensive model variant
(ViT-L/14-336), suggesting that performance gains come at the cost of higher complexity. These results
highlight an important trade-off between accuracy and sustainability in multimodal retrieval systems.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work presents a robust, reproducible, and environmentally responsible system for argumentative
image retrieval in the Touché 2025 challenge. Its design is grounded in three core principles:
• The exclusive use of the CLIP model (ViT-B/32) to transform text into embeddings within a shared
vector space.
• Efficient batch processing with reuse of previously generated resources.
• A mindful data selection strategy that avoids redundant operations and reduces computational
load.</p>
      <p>Unlike other approaches that integrate generative models, synthetic visual analysis, or additional
neural networks for classification or re-ranking, this system minimizes technical complexity and energy
consumption. This makes it particularly well-suited for resource-constrained environments or institutions
committed to digital sustainability.</p>
      <p>Additionally, the strategy of reusing previously stored representations proved highly effective: after
the initial run, which required full embedding generation, subsequent executions showed an energy
consumption reduction of over 85%, with average emissions as low as 0.00013 kg of CO2 per run.
This measurable difference highlights the positive impact of avoiding unnecessary recomputation. It
reinforces the importance of designing optimized pipelines that prioritize both computational efficiency
and environmental sustainability in resource-intensive AI tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>Future directions for the system include:
• Integrating lightweight models for semantic stance classification, enabling the system not only to
assess image-argument relevance but also to determine whether an image supports or opposes a
given argument
• Evaluating low-impact visual question answering techniques for re-ranking previously retrieved
results. This includes exploring lightweight methods to approximate queries such as “Does this
image support the argument?” with minimal computational cost, aiming to improve the semantic
alignment of retrieved images.
• Exploring hybrid embeddings that combine efficiency with lightweight generative capabilities,
blending CLIP with small models that better capture argumentative context without adding
latency or complexity</p>
      <p>Together, these enhancements position the UTB–CEDNAV System as a viable path toward more
sustainable multimodal artificial intelligence without compromising performance or coherence in the
task of argumentative retrieval.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank the Integral Naval Education Command of the Colombian Navy for providing the necessary
resources and the Naval Technological Development Center for offering a suitable environment to
conduct this research. We thank the team of the Artificial Intelligence Laboratory VerbaNex, affiliated
with the UTB, for their contributions to this project.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 to:
• Write and structure the scientific article with formal coherence.
• Synthesize and compare previous approaches accurately.</p>
      <p>• Improve argumentative clarity, grammatical consistency, and academic translation into English.</p>
      <p>After using this tool, the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>CRediT Author Statement</title>
      <p>Diego Alberto Guevara Amaya: Conceptualization, Methodology, Software, Formal Analysis, Writing
– Original Draft, Visualization.</p>
      <p>Jairo Enrique Serrano Castañeda: Supervision, Writing – Review &amp; Editing, Project
Administration.</p>
      <p>Juan C. Martinez-Santos: Resources, Validation, Writing – Review &amp; Editing.</p>
      <p>Edwin Alexander Puertas Del Castillo: Funding Acquisition, Institutional Support, Writing –
Review &amp; Editing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          , Ç. Çöltekin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gohsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heineking</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mirzakhmedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Morkevičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          , I. Zelch,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <source>Overview of Touché 2025: Argumentation Systems</source>
          ,
          <year>2025</year>
          . URL: https://link.springer.com/chapter/10.1007/978-3-031-88720-8_67. doi:10.1007/978-3-031-88720-8_67, accessed May 26, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <source>Touché-argument-images | ImageCLEF / LifeCLEF - multimedia retrieval in CLEF</source>
          ,
          <year>2025</year>
          . URL: https://www.imageclef.org/2025/argument-images, accessed May 26, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2103.00020. arXiv:2103.00020, accessed May 26, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. F. W.</given-names>
            <surname>Anthony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kanding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Selvan</surname>
          </string-name>
          ,
          <article-title>Carbontracker: Tracking and predicting the carbon footprint of training deep learning models</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2007.03051. arXiv:2007.03051, accessed May 26, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Luna</surname>
          </string-name>
          ,
          <article-title>Sustainable ai: How can ai reduce its environmental footprint</article-title>
          ?,
          <year>2024</year>
          . URL: https://www.datacamp.com/es/blog/sustainable-ai, accessed May 26, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brummerloh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Carnot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lange</surname>
          </string-name>
          , G. Pfänder, Boromir at Touché 2022:
          <article-title>Combining Natural Language Processing and Machine Learning Techniques for Image Retrieval for Arguments, 2022</article-title>
          . URL: http://ceur-ws.org, CLEF
          <year>2022</year>
          ,
          <article-title>September 5-8</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. Elagina, B.-A. Heizmann, M. Koch, G. Lahmann, C. Ortlepp, Neville Longbottom at Touché 2023: Image Retrieval for Arguments using ChatGPT, CLIP and IBM Debater, 2023. URL: http://ceur-ws.org, CLEF 2023, September 18–21.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] B. Ostrower, P. Aphiwetsa, DS@GT at Touché: Image Search and Ranking via CLIP and Image Generation, 2024. URL: http://ceur-ws.org, CLEF 2024, September 09–12.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. Accessed May 26, 2025.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] D. Theng, K. K. Bhoyar, Feature selection techniques for machine learning: a survey of more than two decades of research, 2025. URL: https://link.springer.com/10.1007/s10115-023-02010-5. doi:10.1007/s10115-023-02010-5, accessed May 26, 2025.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. J. Maheronnaghsh, T. Akbari Alvanagh, Robustness to spurious correlation: A comprehensive review, 2024. Accessed May 26, 2025.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Lacoste, S. Luccioni, V. Schmidt, T. Dandres, CodeCarbon: Estimate the carbon footprint of your compute usage, 2021. URL: https://github.com/mlco2/codecarbon. doi:10.5281/zenodo.5105071, accessed May 26, 2025.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Heinrich, J. Kiesel, M. Wolter, M. Potthast, B. Stein, Touché25-image-retrieval-and-generation-for-arguments, 2025. URL: https://doi.org/10.5281/zenodo.15123526. doi:10.5281/zenodo.15123526, accessed May 26, 2025.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Surendran, Parallel processing in Python with ThreadPoolExecutor, 2024. URL: https://www.linkedin.com/pulse/parallel-processing-python-threadpoolexecutor-sreedeep-surendran-hsbhc, accessed May 26, 2025.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>