<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multimodal Fusion in NewsImages 2023: Evaluating Translators, Keyphrase Extraction, and CLIP Pre-Training</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quang-Vinh Dinh</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tien-Huy Nguyen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang-Long Nguyen-Huu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien-Doanh Le</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huu-Loc Tran</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Quoc-Khanh Le-Tran</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang-Bach Ngo</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Hung An</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FPT Telecom</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Vietnamese German University</institution>
          ,
          <addr-line>Binh Duong</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Matching the most appropriate image to its corresponding article poses a significant challenge in today’s news landscape. This paper explores the intricate challenge of matching headline images to news articles, utilizing the zero-shot capability of CLIP to address the complex relationship between text and both real and AI-generated images in the MediaEval 2023 NewsImages Challenge. Additionally, we analyze the ramifications of diverse translation methodologies on the efficacy of CLIP. Our approach of using keyphrase extraction for the CLIP input demonstrates competitive results across various benchmarks in information extraction and matching.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In today’s Internet age, online news articles play a crucial role as fundamental sources of
information on current events, employing compelling titles and content segments to engage
and inform readers effectively. Journalists strategically integrate images to enhance content
intuitiveness, enabling a comprehensive understanding of the presented information and
captivating the reader’s attention. The MediaEval Multimedia Evaluation benchmark, with a focus
on the NewsImages task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], explores the intricate relationship between textual narratives
and visual elements in news articles, contributing significantly to understanding collaborative
dynamics in news discourse. Recent advancements, exemplified by Contrastive Language-Image
Pre-training (CLIP) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], provide a robust foundation for research combining text and real images,
comprehensively exploring their relationship in news articles. We take advantage of CLIP’s
zero-shot capabilities to evaluate experiments on both real and AI-generated images.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related works</title>
      <p>
        Understanding the interaction between text and images in news is crucial for grasping news
content creation. Recent studies challenge the notion of a simple text-image connection,
highlighting the limitations of traditional image captioning models. New dynamic
attention-based models, like those by Messina et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
          ] and Zhang et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
          ], offer adaptability but
increase computational complexity. Nelleke Oostdijk’s analysis in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] highlighted the limitations
of a simplistic correlation between modalities, demonstrating that images possess the capability
to depict entities within text or unrelated visual elements. Research like Lidia Pivovarova’s
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in the MediaEval 2021 NewsImages task, which integrated knowledge distillation and a
visual topic model, shows that images can represent entities from text or unrelated visuals,
and alignment between text and visual topics is possible. HCMUS [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] achieved competitive
results through advanced text preprocessing and the utilization of the CLIP pre-trained model;
however, this approach also relied on a translator. Our method further investigates the effect
of different translators on performance, using CLIP and keyphrase extraction to predict relevant
text for images, thus deepening the understanding of the complex text-image relationship.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Proposed Methods</title>
      <p>The fundamental concept of this architecture is to integrate both text and image inputs
by embedding them into a shared space. For the image input, each image is embedded into a vector
by the CLIP image encoder. These embeddings are then indexed using the Faiss
library (see Figure 1). For the text input, we process the headline and snippet
of the news (including translating the text and optionally extracting keyphrases) before encoding
it into an embedding with the CLIP text encoder. Finally, we identify the 100 most relevant
images using K-nearest-neighbor search with cosine similarity.</p>
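      <p>As a concrete illustration of the retrieval step, the following minimal Python sketch indexes pre-computed CLIP image embeddings with Faiss and retrieves the 100 nearest images by cosine similarity; the function and variable names are illustrative assumptions rather than the exact code of our pipeline.</p>
      <preformat>
# Minimal sketch of the Faiss indexing and K-NN retrieval step.
# Assumes image and text embeddings from the CLIP encoders are already
# available as NumPy arrays; names here are illustrative only.
import numpy as np
import faiss

def build_index(image_embeddings):
    """Index L2-normalized embeddings so inner product equals cosine similarity."""
    embs = np.array(image_embeddings, dtype="float32")
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

def retrieve_top_k(index, text_embedding, k=100):
    """Return the ids and scores of the k most similar images for one text query."""
    query = np.array(text_embedding, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)
    return ids[0], scores[0]
      </preformat>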
      <sec id="sec-4-1">
        <title>3.1. Translator</title>
        <p>
          The dataset consists of three components, each derived from news content sourced from news
portals, including GDELT1 and GDELT2, and an RT news feed dataset. The articles from RT News
are written in German and paired with their corresponding English texts in the dataset. In
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], Google Translate was used as the translation tool; in addition, we use another translation tool,
mBART (multilingual Bidirectional and Auto-Regressive Transformers) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], to experiment with and
evaluate the impact of different translation methods on overall performance. Through this
experimentation, we gain better insight into each translation tool’s advantages and
disadvantages, allowing us to explore the relationship between the features of images and news.
        </p>
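        <p>For reference, the snippet below sketches how German headlines and snippets could be translated to English with a Hugging Face mBART checkpoint; the specific checkpoint name is an assumption made for illustration, since our setup only fixes the mBART architecture.</p>
        <preformat>
# Hedged sketch: translating German news text to English with an mBART model.
# The "facebook/mbart-large-50-many-to-many-mmt" checkpoint is an assumption
# for illustration; the paper only names the mBART architecture.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_NAME)
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)

def translate_de_to_en(text):
    """Translate a German headline or snippet into English."""
    tokenizer.src_lang = "de_DE"
    encoded = tokenizer(text, return_tensors="pt", truncation=True)
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
        </preformat>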
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Keyphrase</title>
        <p>
          This section aims to analyze and extract relevant keyphrases from the given inputs. CLIP without
the keyphrase approach shows suboptimal accuracy, attributed to lengthy and noisy headline and
snippet sentences. Since images closely match the content of the headline and snippet, the
keyphrase approach is needed to extract the most crucial entities and enrich the key information
for the image query. To address this problem, our approach uses the KBIR
(Keyphrase Boundary Infilling with Replacement) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] pre-trained model, designed for NLP tasks,
for effective keyphrase extraction and generation from text, which is crucial for the CLIP model.
        </p>
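        <p>The sketch below shows one plausible way to run KBIR-based keyphrase extraction over a headline and snippet via a token-classification pipeline; the fine-tuned checkpoint name is an assumption for illustration, as the paper only specifies the KBIR pre-trained model.</p>
        <preformat>
# Hedged sketch: extracting keyphrases with a KBIR-based token classifier.
# The "ml6team/keyphrase-extraction-kbir-inspec" checkpoint is an assumed
# fine-tuned variant used here purely for illustration.
from transformers import pipeline

extractor = pipeline(
    "token-classification",
    model="ml6team/keyphrase-extraction-kbir-inspec",
    aggregation_strategy="simple",
)

def extract_keyphrases(headline, snippet):
    """Return unique keyphrases found in the concatenated headline and snippet."""
    results = extractor(headline + ". " + snippet)
    return sorted({r["word"].strip() for r in results})

# The extracted keyphrases can then be appended to the headline and snippet
# text before it is encoded by the CLIP text encoder.
        </preformat>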
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Using CLIP as a Zero-shot retriever</title>
        <p>
          Automatic image captioning and text-image matching have advanced significantly, typically
requiring labelled data and specialized training. CLIP, a pre-trained neural network model,
takes a unique approach by learning joint image and text representations without task-specific
optimization. Its ability to transfer knowledge to other tasks without prior training, along with
a large and varied pre-training dataset, especially with news articles collected on the internet,
makes it a suitable and attractive option for tasks such as NewsImage. In addition, this allows
the model to achieve state-of-the-art performance on tasks it hasn’t been explicitly trained
on before, offering a promising baseline for further research [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10, 11, 12, 13</xref>
          ]. This research
investigates the zero-shot performance of CLIP; the advanced ViT-L/14@336px model, the
most potent CLIP variant, is employed for optimal results.
        </p>
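        <p>A minimal sketch of the zero-shot scoring step is given below, assuming the reference OpenAI CLIP package and the ViT-L/14@336px weights; the file path and example text are placeholders.</p>
        <preformat>
# Hedged sketch: scoring one article text against one candidate image with
# the ViT-L/14@336px CLIP model, used purely zero-shot (no fine-tuning).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

def clip_similarity(image_path, article_text):
    """Cosine similarity between a candidate image and the article text."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([article_text], truncate=True).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (text_feat @ image_feat.T).item()
        </preformat>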
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Results</title>
      <p>The competition task requires participants to predict a sequentially organized list of images
that closely aligns with the accompanying textual article. Evaluation employs the Mean
Reciprocal Rank (MRR) metric and MeanRecall@K scores (K in {5, 10, 50, 100}). Our research
undergoes assessment on three datasets provided by the competition organizers, leading to
distinct experimental methodologies and variations in textual input for CLIP due to inherent
dissimilarities in each dataset. Consequently, our innovative approaches differ for each dataset
under consideration.</p>
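      <p>For clarity, the sketch below computes the two evaluation metrics as we understand them, assuming one ground-truth image per article and a ranked prediction list; it is an illustrative reading of the metrics, not the organizers’ official scoring code.</p>
      <preformat>
# Illustrative implementations of MRR and MeanRecall@K, assuming each article
# has exactly one ground-truth image id and a ranked list of predicted ids.
def mean_reciprocal_rank(ranked_lists, ground_truth):
    total = 0.0
    for preds, gt in zip(ranked_lists, ground_truth):
        if gt in preds:
            total += 1.0 / (preds.index(gt) + 1)  # rank is 1-based
    return total / len(ground_truth)

def mean_recall_at_k(ranked_lists, ground_truth, k):
    hits = sum(1 for preds, gt in zip(ranked_lists, ground_truth) if gt in preds[:k])
    return hits / len(ground_truth)
      </preformat>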
      <p>The experimental findings on the various translation methodologies show relatively
consistent results compared to utilizing the organizers’ translated text. Comparisons between
translation models indicate that Googletrans performs better than mBART on the RT dataset, which
led to the decision to leverage Googletrans for ongoing enhancements. However, based on the
experimental reports across the translation methodologies, we conclude that translator
modules do not greatly affect CLIP’s performance, so future methods should consider
removing the translator module to reduce pipeline complexity.</p>
      <p>In the experiment, incorporating keyphrases into the textual content that previously
consisted of the headline and snippet as input for CLIP yields overall performance improvements,
particularly in the MeanRecall@5 and MeanRecall@10 metrics (increases of 0.00133 and 0.00734,
respectively), as shown in Table 2. The keyphrase’s ability to encapsulate main ideas helps the
model focus more on crucial information and clarify the image query, resulting in commendable
outcomes, particularly in retrieving images within the 5th and 10th ranks.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion and future work</title>
      <p>This study tackles the demanding text-image matching task in the MediaEval 2023 NewsImages
challenge, achieving notable success using the pre-trained CLIP model’s zero-shot capability.
Our experiments underscore the efficacy of the model architecture and the benefits of
employing a pre-trained model. We experimented with CLIP’s ability on both real and
synthetic images, yielding promising outcomes for real images and proficient performance on
AI-generated images. In addition, we showed that adding a translator did not improve
performance, so we may omit it from the pipeline in the future. Conversely, using keyphrases
showed positive signs of slightly increasing the accuracy of top-5 and top-10 image queries.</p>
      <p>
        Future efforts will concentrate on implementing a more extensive approach, exploring
additional techniques, such as re-ranking strategies [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or face recognition systems [
        <xref ref-type="bibr" rid="ref16">16</xref>
          ] to enrich crucial information for the image query and to further improve overall performance.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          , Ö. Özgöbek,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.-T.</surname>
          </string-name>
          Dang-Nguyen,
          <article-title>News images in mediaeval 2023 (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Messina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , G. Amato,
          <article-title>Transformer reasoning network for image-text matching and retrieval</article-title>
          , CoRR abs/
          <year>2004</year>
          .09144 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2004</year>
          .09144. arXiv:
          <year>2004</year>
          .09144.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Context-aware attention network for image-text retrieval</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3536</fpage>
          -
          <lpage>3545</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Oostdijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. v.</given-names>
            <surname>Halteren</surname>
          </string-name>
          , E. Basar,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <article-title>The connection between the text and images of news articles: New insights for multimedia analysis (</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pivovarova</surname>
          </string-name>
          , E. Zosa,
          <article-title>Visual topic modelling for newsimage task at mediaeval 2021</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2021 Workshop</source>
          , MediaEval,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ngô</surname>
          </string-name>
          , T.-D. Le,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huynh</surname>
          </string-name>
          , N.-T. Nguyen,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tran</surname>
          </string-name>
          , Hcmus at mediaeval 2021:
          <article-title>Fine-tuning clip for automatic news-images re-matching 3181 (</article-title>
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Multilingual denoising pre-training for neural machine translation</article-title>
          , CoRR abs/
          <year>2001</year>
          .08210 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2001</year>
          .08210. arXiv:
          <year>2001</year>
          .08210.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mahata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhowmik</surname>
          </string-name>
          ,
          <article-title>Learning rich representation of keyphrases from text</article-title>
          ,
          <source>CoRR abs/2112</source>
          .08547 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2112.08547. arXiv:
          <volume>2112</volume>
          .
          <fpage>08547</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.-N.</given-names>
            <surname>Vu</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-D. Nguyen</surname>
          </string-name>
          , M.-T. Tran,
          <article-title>Re-matching images and news using clip pretrained model (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wan,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Clip pre-trained models for cross-modal retrieval in newsimages 2022 (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <article-title>Cross-modal networks and dual softmax operation for mediaeval newsimages 2022 (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>M.-D.</surname>
            Le-Quynh,
            <given-names>A.-T.</given-names>
          </string-name>
          <string-name>
            <surname>Nguyen</surname>
          </string-name>
          , A.
          <string-name>
            <surname>-T.</surname>
          </string-name>
          Quang-Hoang, V.
          <string-name>
            <surname>-H. Dinh</surname>
            ,
            <given-names>T.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Nguyen</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-B. Ngo</surname>
            ,
            <given-names>M.-H.</given-names>
          </string-name>
          <string-name>
            <surname>An</surname>
          </string-name>
          ,
          <article-title>Enhancing video retrieval with robust clip-based multimodal system</article-title>
          ,
          <source>in: Proceedings of the 12th International Symposium on Information and Communication Technology, SOICT '23</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>972</fpage>
          -
          <lpage>979</lpage>
          . URL: https://doi.org/10.1145/3628797.3629011. doi:
          <volume>10</volume>
          .1145/3628797.3629011.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2112</volume>
          .
          <fpage>10752</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Neural re-ranking in multi-stage recommender systems: A review</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2202</volume>
          .
          <fpage>06602</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bafna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bagaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Virnodkar</surname>
          </string-name>
          ,
          <article-title>A survey on face recognition systems</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2201</volume>
          .
          <fpage>02991</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>