<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodality In Recommender Systems: Does It Help, and Should We Expect An Answer?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aixin Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nanyang Technological University</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Multimodal recommender systems integrate diverse information sources, such as text and visual data, into predictive models for personalization. While multimodality promises richer representations and potentially improved relevance, it remains an open question whether multimodal inputs genuinely enhance recommender systems. In this talk, I present findings from an evaluation of reproducible multimodal RecSys models, designed to address this question. The results show that multimodality does not consistently lead to better performance. I argue that while multimodal inputs can yield incremental gains, their effectiveness must be considered in relation to user interaction dynamics, task objectives, and system context: their value is inherently contextual.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodality Recommender System</kwd>
        <kwd>Evaluation</kwd>
        <kwd>User-Decision Process</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Does Multimodality Really Help?</title>
      <p>Recommender systems (RecSys) are foundational technologies for digital platforms ranging from
e-commerce to media streaming. Traditional systems primarily rely on user-item interactions, leveraging
collaborative filtering or content features extracted from text. However, the proliferation of rich
multimedia content has spurred a growing body of research on multimodal recommendation, where
heterogeneous signals (such as text descriptions, product images, and short videos) are fused into
the recommendation pipeline.</p>
      <p>
        As a trending topic in RecSys, multimodal recommendation has attracted significant attention. Yet
a central question remains: does multimodality truly improve recommender systems? Interestingly,
three groups conducted independent studies around the same time, each aiming to answer this
question from different perspectives and with different experimental setups [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. In this extended abstract, I briefly
review key findings from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The authors collected 41 papers published between 2019 and 2024 in
top-tier venues such as SIGIR, WWW, TOIS, and TKDE. A paper qualifies as a study on multimodal
RecSys if it introduces a novel technique and addresses issues specific to multimodal recommendation.
      </p>
      <p>
        While the community increasingly emphasizes reproducibility, not all papers release their source
code or datasets. Among the 41 papers, 12 were considered reproducible, meaning that
both the code (source code publicly available and functioning correctly) and the data
(publicly accessible datasets, or raw data with preprocessing scripts) were available. For benchmarking, the team
used three datasets: two e-commerce datasets (Amazon and Taobao) and one short-video dataset (DY).
Largely following the experimental settings of the reproducible papers, each dataset was randomly
partitioned into training, validation, and test sets with an 8:1:1 ratio. Although this random split can
introduce data leakage [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], it remains a common practice in academic research, as it avoids modifying
models to accommodate temporally aware splits. The evaluation metrics used were Recall and NDCG.
      </p>
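      <p>To make the protocol above concrete, the following is a minimal sketch of a random 8:1:1 split together with Recall@K and NDCG@K under binary relevance. It is an illustration only, not the evaluation code used in [3]: the function names (random_split, recall_at_k, ndcg_at_k), the global shuffle over all interactions, the fixed seed, and the cutoff K are assumptions introduced here.</p>
      <preformat>
```python
import math
import random


def random_split(interactions, ratios=(8, 1, 1), seed=0):
    """Shuffle interactions and cut them into train/validation/test parts.

    Assumption: one global shuffle over all interactions; a per-user
    split would be an equally plausible reading of "random 8:1:1".
    """
    shuffled = list(interactions)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_valid = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k ranking."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data: 10 hypothetical users, each interacting with 10 hypothetical items.
interactions = [(user, item) for user in range(10) for item in range(10)]
train, valid, test = random_split(interactions)
```
      </preformat>
      <p>Because the shuffle ignores timestamps, interactions that occur later than some test interactions can end up in the training set, which is the leakage concern raised in [4, 5].</p>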
      <p>The team conducted multiple evaluations focusing on the role of multimodality. To assess its benefits,
two classic baselines — ItemKNN and UserKNN — were employed, both relying solely on interaction
data. Experimental results show that these traditional KNN-based methods perform comparably to, or
even better than, several sophisticated multimodal recommendation models. Moreover, when comparing
results across the three datasets, no consistent advantage of multimodality emerges. In particular, very
few methods outperform the KNN baselines on the DY dataset.</p>
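      <p>For readers unfamiliar with these baselines, the sketch below shows one common form of ItemKNN: cosine similarity between items computed from binary interaction data alone, with each item a user has seen voting for its nearest neighbours. The configuration used in [3] (similarity measure, neighbourhood size, any shrinkage) is not stated here, so the function name item_knn_recommend, its parameters, and the toy data are all assumptions for illustration.</p>
      <preformat>
```python
import math
from collections import defaultdict


def item_knn_recommend(user_items, target_user, k_neighbors=2, top_n=3):
    """Rank unseen items for target_user by summed similarity to seen items."""
    # Invert the user -> items map into item -> users.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    def cosine(a, b):
        # Cosine similarity between two binary interaction vectors.
        shared = len(item_users[a].intersection(item_users[b]))
        return shared / math.sqrt(len(item_users[a]) * len(item_users[b]))

    seen = user_items[target_user]
    scores = defaultdict(float)
    for item in seen:
        # Keep only the k most similar neighbours of each seen item.
        neighbours = sorted(
            ((cosine(item, other), other) for other in item_users if other != item),
            reverse=True,
        )[:k_neighbors]
        for sim, other in neighbours:
            # Only unseen items with non-zero similarity receive a score.
            if other not in seen and sim > 0:
                scores[other] += sim
    # Highest score first; ties are broken by item id for determinism.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical toy interactions: no text, image, or video features are used.
user_items = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
    "u4": {"c", "d"},
}
recommended = item_knn_recommend(user_items, "u1")
```
      </preformat>
      <p>UserKNN follows the same pattern with the roles of users and items swapped. Neither variant sees any modality features, which is what makes them a useful reference point for multimodal models.</p>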
      <p>The team then conducted experiments comparing the recommendation accuracy of multimodal and
single-modality models. Interestingly, multimodal systems do not always achieve the best performance
compared to their single-modality counterparts. Furthermore, different types of modal information
contribute differently depending on the recommendation scenario. Specifically, in e-commerce settings,
textual features often play a more important role, whereas visual information tends to be marginally
more useful for short-video recommendations.</p>
      <p>
        While many more interesting findings are presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the results highlighted here do not
always provide a positive answer to the question posed in this section’s title. In fact, negative or
inconclusive results are not uncommon in RecSys research. For example, through a large-scale evaluation
comparing 18 algorithms across 85 datasets, McElfresh et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] observed that classic ItemKNN remains
a competitive method outperforming many others. More recently, two papers have discussed broader
issues in RecSys research, and their titles are themselves quite indicative [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. These negative findings
also relate to the second question in the title of this paper.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Should We Expect an Answer?</title>
      <p>Before addressing the question of multimodality specifically, it is helpful to view recommender systems
from a broader perspective. ACM RecSys is a conference under SIGCHI, emphasizing human-computer
interaction. Yet, many academic papers in the field focus primarily on algorithmic or methodological
advances, often overlooking the HCI dimension.</p>
      <p>
        Recently, we conducted a survey of real-world recommender systems, including only those studies
that reported online A/B testing results on production systems [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The survey revealed that real-world
recommendation scenarios are highly diverse and often differ substantially in their interaction settings,
which in turn shape distinct recommendation logics.
      </p>
      <p>In particular, we categorize commonly observed RecSys tasks into two broad types:
Transaction-Oriented RecSys and Content-Oriented RecSys. The goal of the former is to drive transactional actions,
optimizing for conversion rates, revenue, or purchase likelihood. E-commerce platforms are typical
examples. In contrast, the latter focuses on promoting user consumption and engagement, optimizing
for metrics such as dwell time, clicks, or user satisfaction to encourage continued interaction, such as
watching videos, listening to music, or reading news articles.</p>
      <p>Multimodal RecSys must be situated within the larger ecosystem of recommendation. Research often
narrows to algorithmic accuracy, but real-world systems involve broader considerations: interfaces,
feedback loops, optimization objectives, and operational constraints like cost and latency. Discrepancies
between offline metrics and online A/B testing further complicate the picture.</p>
      <p>
        The effectiveness of different modalities depends heavily on context, visibility, and the extent to
which these modalities influence user decision-making [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For instance, in e-commerce, product
images and textual descriptions can be critical factors affecting whether users click on a product among
many recommended items and ultimately make a purchase. In contrast, within short-video platforms,
users primarily engage with visual content, while textual cues are often ignored. In audio streaming,
songs are delivered continuously unless the user actively intervenes. In this case, visual or textual
attributes may influence only the selection of the first song that initiates playback, whereas subsequent
recommendations depend less on multimodal signals. A more detailed discussion on the different stages
of user-item interaction can be found in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Mixed findings suggest no universal superiority of multimodality. Domain specificity, task
dependence, evaluation design, and user visibility all shape outcomes. While multimodality enriches
representation, its value is inherently contextual.</p>
    </sec>
    <sec id="sec-3">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used ChatGPT-5 for spelling check and sentence
polishing. After using this tool, the author reviewed and edited the content as needed and takes full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Attimonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Danese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Narducci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Di Noia</surname>
          </string-name>
          ,
          <article-title>Do recommender systems really leverage multimodal content? a comprehensive analysis on multimodal representations for recommendation</article-title>
          ,
          <source>in: Proceedings of the 34th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '25, Association for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          , p.
          <fpage>2377</fpage>
          -
          <lpage>2387</lpage>
          . URL: https://doi.org/10.1145/3746252.3761398. doi:10.1145/3746252.3761398.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <article-title>Are multimodal embeddings truly beneficial for recommendation? a deep dive into whole vs. individual modalities</article-title>
          ,
          <source>ArXiv abs/2508.07399</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2508.07399, to appear in ECIR 2026.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Does multimodality improve recommender systems as expected? a critical analysis and future directions</article-title>
          ,
          <source>ArXiv abs/2508.05377</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2508.05377.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A critical study on data leakage in recommender system offline evaluation</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>41</volume>
          (
          <year>2023</year>
          ). doi:10.1145/3569930.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Take a fresh look at recommender systems from an evaluation standpoint</article-title>
          ,
          <source>in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23</source>
          ,
          ACM, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>2629</fpage>
          -
          <lpage>2638</lpage>
          . doi:10.1145/3539618.3591931.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>McElfresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khandagale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Valverde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Dickerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <article-title>On the generalizability and predictability of recommender systems</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Neural Information Processing Systems</source>
          , NeurIPS '22, Curran Associates Inc., Red Hook, NY, USA,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Said</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Pera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ekstrand</surname>
          </string-name>
          ,
          <article-title>We're still doing it (all) wrong: Recommender systems, fifteen years later</article-title>
          ,
          <source>in: Beyond Algorithms: Reclaiming the Interdisciplinary Roots of Recommender Systems Workshop (BEYOND 2025), co-located with the ACM RecSys 2025</source>
          , Prague, Czech Republic,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Higley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ekstrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Knijnenburg</surname>
          </string-name>
          ,
          <article-title>What news recommendation research did (but mostly didn't) teach us about building a news recommender</article-title>
          ,
          <source>in: Beyond Algorithms: Reclaiming the Interdisciplinary Roots of Recommender Systems Workshop (BEYOND 2025), co-located with the ACM RecSys 2025</source>
          , Prague, Czech Republic,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>A survey of real-world recommender systems: Challenges, constraints, and industrial perspectives</article-title>
          ,
          <source>ArXiv abs/2509.06002</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2509.06002.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>A task-centric perspective on recommendation tasks</article-title>
          ,
          <source>ArXiv abs/2503.21188</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2503.21188, to appear in Communications of the ACM (CACM).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>