<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Relation between Texts and Images in News: News Images in MediaEval 2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Lommatzsch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Kille</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Özlem Özgöbek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehdi Elahi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duc-Tien Dang-Nguyen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Norwegian University of Science and Technology</institution>
          ,
          <addr-line>Trondheim</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Bergen</institution>
          ,
          <addr-line>Bergen</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>News articles typically consist of text and images. Images play a crucial role in catching the user's attention and emphasizing the article's message. For each news text, the editor must select the best photo from the available set of recent photos, archived photos, or stock images, one that both attracts the user's attention and fits the news article text well. The NewsImages benchmark aims to shed light on this real-world relation between news texts and the accompanying images. The task provides datasets and evaluation components for studying this relation. The datasets include AI-generated images as an additional research challenge. This paper describes the NewsImages task in detail, explaining the dataset and the evaluation metrics. It also discusses the connections to existing research and the challenges addressed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the fast-paced world of digital journalism, news articles are inherently multi-modal, seamlessly
intertwining text and images to convey information. Among the various components of a news
article, images occupy a pivotal role. Not only do they serve as a visual aid; they also catch the readers’
interest, compelling them to delve into the text. Furthermore, images reinforce the central
message of the article, often providing context or offering a visual perspective that words alone
might fail to capture. With the rise of generative artificial intelligence, there has been a shift
towards automating news article creation. This automation includes the generation of text and
images that align perfectly with the content.</p>
      <p>The NewsImages task aims to support research in understanding the relationship between
news texts and their accompanying images on news portals. This relationship is full of challenges.
The vast expanse of news topics, the diversity in domains, the plethora of news portals, and the
myriad styles of news articles, all culminate in a complex web of considerations when matching
text with images. Delving deeper into the scenario, NewsImages is driven by several pertinent
questions: How can the connection between texts and images in news articles be re-established?
To what extent do generated images alter this re-establishment? Are there discernible patterns
or principles that guide editors when they select images for news pieces? And, in the grand
scheme of automated news generation, are there innovative methods to generate better-suited
images for given news texts?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        Deciphering the relationship between text and images in news articles is an important task
for understanding both the creation and the perception of content in the news sector. The
depiction gap between texts and images is a major problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Significant advances in image
comprehension have recently been made through deep neural networks, enabling systems not
only to detect intricate concepts within images but also to identify pertinent objects with high
precision. The embedding of concepts extracted from images and texts within a unified vector
space is central to this advancement, facilitating nuanced correlations. While there are multiple
datasets tailored for optimizing learning strategies in image labeling (e.g. MS COCO [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), a new
frontier lies in generative AI’s capability to produce high-resolution images from text descriptors.
The landscape of news imagery, dominated by stock photos, portraits, and loosely related
archival images, presents unique challenges, often accentuated by the absence of directly relevant
visuals. This inspires the pivotal research question: How are images and text interconnected
in the context of news? Furthermore, this opens a broader inquiry into AI’s potential role in
enhancing news article formulation, opening avenues for automated, contextually appropriate
visual representation.
      </p>
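      <p>To make the idea of a unified text-image vector space concrete, the following sketch scores a
few news texts against candidate images with a pretrained CLIP model. This is only an illustration
under assumptions: the Hugging Face transformers library, the openai/clip-vit-base-patch32
checkpoint, and the file names are choices made for the example, not part of the task definition.</p>
      <preformat>
# Minimal sketch: scoring news texts against candidate images in a shared
# embedding space with a pretrained CLIP model (assumption: Hugging Face
# transformers; checkpoint and file paths are illustrative only).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["Storm causes flooding in coastal towns", "Parliament debates new budget"]
images = [Image.open(p) for p in ["img_001.jpg", "img_002.jpg"]]  # hypothetical files

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_text[i, j] is the similarity of text i to image j; ranking the
# images by this score for each text yields a candidate ordering.
print(outputs.logits_per_text)
      </preformat>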
      <p>
        For the fifth time, the NewsImages challenge explores aspects of multimedia content in
news. The first editions (NewsREEL Multimedia [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]) focused on predicting the popularity
of news items based on multimedia content. In 2021, the focus shifted to understanding the
relationship between text and images [
        <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In 2023, we extend the task by adding AI-generated
images to further explore the relation between news and AI-generated images. The NewsImages
task is related to several research topics, such as multi-modal recommender systems [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ],
the detection of fake news [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and multi-modal embedding methods [
        <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The task supports
research toward multi-modality in different news-related domains.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>The NewsImages benchmark investigates the connection between textual news content and
associated imagery. This year’s task draws its data from two distinct news dissemination
channels: official publishers’ portals and RSS feeds. Participants are provided with a comprehensive
training dataset, encompassing linked text-image pairs, complemented by a test dataset with
disassociated pairs. The challenge mandates the development and critical evaluation of
innovative methodologies to accurately re-associate news articles with corresponding images. The
dataset is challenging because some images, such as conceptual stock photographs, may plausibly
align with multiple articles. Participants are required to submit a prioritized list
of plausible image matches, with the evaluation metric favoring early correct re-associations.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
      <p>NewsImages provides a dataset comprising three parts built on news from news portals and
an RSS news feed. As the source for the crawled web sites, we use the GDELT project
(https://www.gdeltproject.org/), which aggregates news from all over the world. For the RT part,
the RSS feed rtde has been used. The dataset has been created using the following four steps:
(1) Crawling: Crawl news items from the selected sources and eliminate news articles that do not
consist of an image and a suitable text. We use news items published in the period November
2022–August 2023. For the GDELT part, the news title and the entities (extracted by GDELT from
the news text for creating knowledge graphs, http://data.gdeltproject.org/gkg/index.html) are used.
For the RT part, the news title and the snippet (both the German originals and English machine
translations of these fields) are used. (2) Cleaning: To ensure the quality of the images, we use
different heuristics for removing duplicates, low-quality images, and logos. In addition, we remove
images that mainly consist of text. (3) Image generation: For studying the problem of matching
generated images, we use Stable Diffusion with the news article’s headline as the prompt. The
generated images replace some of the original images. The fraction of generated images differs
across the three parts of the dataset: GDELT-P1 does not contain any generated images, GDELT-P2
contains 80% generated images, and RT contains 50% generated images. (4) Splitting: Each part of
the dataset is split into a training and a test set, as Table 1 illustrates.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Dataset statistics. The dataset comes in six batches. The number of cases refers to the article-image pairs.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Batch</th><th>Source</th><th>Purpose</th><th>No. Cases</th></tr>
          </thead>
          <tbody>
            <tr><td>GDELT-P1-a</td><td>Web sites</td><td>Training</td><td>8500</td></tr>
            <tr><td>GDELT-P1-b</td><td>Web sites</td><td>Test</td><td>1500</td></tr>
            <tr><td>GDELT-P2-a</td><td>Web sites</td><td>Training</td><td>12 041</td></tr>
            <tr><td>GDELT-P2-b</td><td>Web sites</td><td>Test</td><td>1500</td></tr>
            <tr><td>RT-a</td><td>RSS Feed</td><td>Training</td><td>9755</td></tr>
            <tr><td>RT-b</td><td>RSS Feed</td><td>Test</td><td>3000</td></tr>
          </tbody>
        </table>
      </table-wrap>
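      <p>As an illustration of the image-generation step (3) above, the sketch below renders an image
from a headline with Stable Diffusion via the diffusers library. The checkpoint, the example headline,
and the assumption of a CUDA GPU are choices made for the sketch; the paper only states that the
article headline was used as the prompt.</p>
      <preformat>
# Minimal sketch of generating a news image from a headline with Stable
# Diffusion (assumption: the diffusers library and the listed checkpoint;
# the headline and output path are placeholders).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU

headline = "Storm causes flooding in coastal towns"  # hypothetical headline
image = pipe(prompt=headline).images[0]
image.save("generated_image.png")
      </preformat>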
      <p>The data set contains information related to articles and images. Articles’ metadata include
the URL, title, and a text snippet (RT batch) or the entities extracted from the news text (GDELT
batch). Image captions or image filenames must not be used in the task.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>The NewsImages benchmark is designed to analyze the relation between news texts and the
accompanying images. As a concrete task, the participants must assign a matching image to
each news text in the given test set. Concretely, for each news article, an ordered list of 100
images must be submitted. The participants provide a text file containing a tab-separated list
of 100 image IDs for each news article ID.</p>
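      <p>For illustration, a minimal sketch of writing such a prediction file is given below; the file name,
the ID formats, and the ranking dictionary are placeholders chosen for the example.</p>
      <preformat>
# Minimal sketch of a submission file: one line per news article, containing
# the article ID followed by 100 image IDs (best match first), tab separated.
# File name and ID formats are assumptions for the example.
def write_submission(ranked, path="run1.tsv"):
    # ranked: dict mapping article_id -> ordered list of 100 image IDs
    with open(path, "w", encoding="utf-8") as f:
        for article_id, image_ids in ranked.items():
            f.write("\t".join([article_id] + image_ids[:100]) + "\n")

example = {"article_0001": [f"image_{i:04d}" for i in range(100)]}
write_submission(example)
      </preformat>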
      <p>The participants’ submissions are evaluated against a ground truth defined by the originally
crawled connection between the images and the text. The ground truth ensures that a 1:1
relation between the images and the texts exists.</p>
      <sec id="sec-5-1">
        <title>5.1. Evaluation Metric</title>
        <p>
          The participants’ submissions are evaluated using the Mean Reciprocal Rank (MRR) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] as the
main evaluation criterion. MRR is defined as MRR = (1/N) · ∑_{i=1}^{N} 1/rank(i), where N is the
number of news articles in the test set and rank(i) is the rank at which the matching image for
article i was listed. The earlier the matching image appears on average, the higher the score. The
Mean Reciprocal Rank favors the top of the list and penalizes finding a match further down.
        </p>
        <p>In addition to MRR, we also compute the Average Recall (AR) at rank k for k ∈
{1, 5, 10, 20, 50, 100}. AR computes the average over the recall scores calculated for each
news article. The evaluation scores are computed separately for each batch.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Run Description</title>
        <p>Participants are encouraged to contribute working notes that elucidate their innovative concepts,
fostering an in-depth exploration of the intricate relationship between textual content and
images in news media. In this pursuit, participants have the opportunity to submit a maximum
of five runs for each of the three test datasets. Each run entails a set of predictions tailored
to these test datasets. We encourage participants to engage in a comprehensive comparative
analysis of their various runs, encompassing assessments of quality, computational complexity,
and resource utilization.</p>
        <p>Furthermore, the discussion of results should be characterized by a nuanced consideration
of the datasets’ idiosyncrasies, illuminating how the discoveries made can be extrapolated to
diverse scenarios. To culminate, participants are expected to articulate their insights and reflect
on their potential contributions towards advancing cutting-edge research in this field.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Linking news texts and images remains a complicated problem due to the news
domain’s diversity, editors’ habits, and readers’ expectations. The mixture of real photos, stock
images, archived photos, and AI-generated images makes it very challenging not only to extract
concepts from images but also to understand the principles applied when selecting the images.
The NewsImages challenge provides a medium-sized, real-world dataset for investigating the
existing principles for connecting images and texts. Participants can develop, optimize, and
evaluate innovative re-matching methods for news texts and images. With the growing popularity
and enhancement of AI methods for generating images, images that are more representative of
the text could replace partially matching images such as stock photos. These artificial images
could be used to reinforce the credibility of fake news, but they could also avoid misinterpretation
of news caused by ill-fitted stock images. Thus, understanding the relation between news texts
and images remains a highly relevant and challenging research topic. NewsImages provides the
foundation to foster the development and evaluation of innovative approaches.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We gratefully thank Marc Gallofré Ocaña and Sohail Ahmed Khan for supporting the dataset
creation. We acknowledge the contributions of the GDELT project (https://www.gdeltproject.org/)
for providing the data which made the dataset creation possible.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Özgöbek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bartolomeu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Semedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pivovarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liang</surname>
          </string-name>
          , M. Larson,
          <article-title>NewsImages: Addressing the Depiction Gap with an Online News Dataset for Text-Image Rematching</article-title>
          ,
          <source>in: Proceedings of the 13th ACM Multimedia Systems Conference, MMSys '22</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>227</fpage>
          -
          <lpage>233</lpage>
          . URL: https://doi.org/10.1145/3524273.3532891.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft COCO: Common Objects in Context</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          . doi:10.1007/978-3-319-10602-1_48.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          , L. Ramming,
          <article-title>MediaEval 2018 - Overview on NewsREEL Multimedia</article-title>
          ,
          <source>in: Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          <year>2018</year>
          , CEUR Workshop Proceedings,
          <year>2018</year>
          . URL: http://ceur-ws.org/Vol-2283/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>The 2019 Multimedia for Recommender System Task: MovieREC and NewsREEL at MediaEval</article-title>
          , in:
          <source>Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2019, CEUR Workshop Proceedings</source>
          ,
          <year>2019</year>
          . URL: http://ceur-ws.org/Vol-2670/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Özgöbek</surname>
          </string-name>
          ,
          <article-title>NewsImages: The Role of Images in Online News</article-title>
          ,
          <source>in: Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          <year>2020</year>
          , CEUR Workshop Proceedings,
          <year>2020</year>
          . URL: http://ceur-ws.org/Vol-2882/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          , Ö. Özgöbek,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-T.</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <article-title>News Images in MediaEval 2021</article-title>
          , in:
          <source>Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          <year>2021</year>
          , CEUR Workshop Proceedings,
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-3181/paper2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Salah</surname>
          </string-name>
          , Q.-T. Truong,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Lauw</surname>
          </string-name>
          ,
          <article-title>Cornac: A Comparative Framework for Multimodal Recommender Systems</article-title>
          .,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>95</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sordo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          ,
          <article-title>A Deep Multimodal Approach for Cold-start Music Recommendation</article-title>
          ,
          <source>in: Procs. of the WS on Deep Learning for Recommender Systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          , G. Pasi,
          <article-title>Recommender Systems Leveraging Multimedia Content</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>53</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zafarani</surname>
          </string-name>
          ,
          <article-title>A Survey of Fake News</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>53</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          . URL: http://dx.doi.org/10.1145/3395046. doi:10.1145/3395046.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>SAME: Sentiment-Aware Multi-Modal Embedding for Detecting Fake News</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Advances in Social Networks Analysis and Mining, ASONAM '19</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          . doi:10.1145/3341161.3342894.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          , et al.,
          <article-title>The TREC-8 Question Answering Track Report</article-title>
          , in:
          <source>TREC</source>
          , volume
          <volume>99</volume>
          ,
          <year>1999</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>