<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparative Analysis of Fashion Captioning for Multimodal Fashion Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gwendolyn Rippberger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Neidhardt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CD Lab for Recommender Systems, TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Multimodal information provides new opportunities for recommender systems, especially in the fashion domain, where both visual and textual information can be utilized to provide a comprehensive understanding of the product. In this work, we focused on the task of fashion captioning, a specialized form of image captioning for fashion items. We fine-tuned pretrained vision-language models on two distinct fashion datasets to evaluate how effectively they capture dataset-specific ground truths. The fine-tuned models achieve results that are competitive with models trained specifically for this task. The resulting captioning models are applied in two key scenarios: (1) as components for generating richer multimodal embeddings in recommender systems, and (2) for modality imputation, where automatically generated descriptions are used to fill in missing textual data. We show that different modalities work better depending on the size of the dataset and the list length, but none outperforms the traditional item-based collaborative filtering technique on a real-life dataset with over 1M users and 31M transactions. Additionally, we present a detailed analysis of the two fashion datasets, highlighting critical aspects such as item presentation and textual style, which are often overlooked yet essential for effective modeling.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Recommendation</kwd>
        <kwd>Fashion Captioning</kwd>
        <kwd>NLP</kwd>
        <kwd>Generative AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>
        Sparse purchase data in the fashion domain poses challenges for traditional recommender systems
(RS) such as Collaborative Filtering (CF) [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ]. Various approaches try to leverage different
modalities to overcome this problem, e.g., using product images [
        <xref ref-type="bibr" rid="ref1">1, 3–8</xref>
        ], textual descriptions or customer
reviews [9–11] and even video [12].
      </p>
      <p>
        Fashion is inherently multimodal [13], combining visual information like product images with textual
descriptions. Previous work with multimodal recommender systems [14–16] shows that leveraging
high-level features from multimodal items outperforms CF recommendation [
        <xref ref-type="bibr" rid="ref1">1, 17–21</xref>
        ]. However,
real-world datasets often have missing modalities [22, 23]. Common solutions are dropping the affected
items or imputing the missing data [24], at the risk of discarding valuable items or introducing noise.
Previous work [23] shows that imputation using traditional methods, e.g., random, zeros, and global mean,
can preserve the performance gap between multimodal and pure collaborative recommender systems. Still, those
methods are rigid and might not capture the diversity of items. Another approach [22] uses feature
propagation on graphs and is therefore limited to graph-based models.
      </p>
      <p>We propose an alternative for the fashion domain: fine-tuning pretrained vision-language models
for fashion captioning. Fashion captioning is the domain-specific task of image captioning
(generating text based on images) for fashion items. It focuses on long captions with fine-grained
attributes, written in an enchanting expression style [25] or describing tailor details [26]. The fine-tuned
models serve two purposes: their generated item descriptions can fill in missing text descriptions, and the
models themselves can be used as feature extractors to experiment with multimodal embeddings. This method can
be used as a preprocessing step for different recommendation algorithms, and the resulting text
descriptions can be inspected because they are human-readable (unlike feature vectors). With the different
modalities fashion has to offer, the question arises: which feature representations are most effective for
recommending items? To address this, we analyze unimodal and multimodal feature spaces derived
from pretrained models, as well as features extracted from our fine-tuned fashion captioning models.</p>
      <p>
        We benchmark these feature sets using multiple recommendation algorithms, including content-based
methods (e.g., k-NN), hybrid models (e.g., VBPR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), and unpersonalized algorithms (e.g. most popular),
to understand how the choice of feature space impacts recommendation quality. By systematically
comparing these approaches, we aim to identify the representation that best captures item semantics
and user preferences, ultimately bridging the gap between sparse user interaction data and rich item
content.
      </p>
      <p>To summarize, the research questions and related contributions of this work are:</p>
      <p>RQ1: To what extent can fine-tuning improve the performance of off-the-shelf image
captioning models on domain-specific fashion datasets? First, we analyze whether
fine-tuning achieves results good enough to use the models for augmentation. We experiment with
four different models (Section 2.1) using a dataset specifically curated for creating item descriptions of
fashion items (FACAD) and a real-life dataset based on items from the fashion store H&amp;M (more details
in Section 3). This allows us to better understand the capabilities and limitations of fine-tuning.
The evaluation is both quantitative and qualitative to get a full picture (Section 2.2). Based on
seven different metrics, we show that fine-tuning achieves good results, especially when it comes to
identifying item-specific attributes. However, our qualitative analysis shows that the models struggle
with abstract concepts and hidden details.</p>
      <p>
        RQ2: Which feature embeddings (textual, visual, or multimodal) provide the best
recommendations? We compare two setups: (1) filtered items (items with missing modalities and sparsely
represented items are dropped), and (2) unfiltered items with missing text descriptions augmented. This is done with a
real-life dataset [26], and we use VBPR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a model that extends matrix factorization with feature
vectors (originally visual), to compare the different feature spaces. Our results show that, when
comparing the different feature spaces, textual features perform best except in terms of mean average
precision (MAP) on the augmented dataset. Still, ItemKNN outperforms all setups.
      </p>
      <p>Following our research questions, we contribute the following: (1) We conduct an in-depth analysis
of the effectiveness of fine-tuning pretrained models. (2) We explore the use of previously unutilized
query embeddings derived from image captioning models for fashion recommendation. (3) We present
results on a real-world, underexplored fashion recommendation dataset to evaluate which feature
spaces yield the best recommendation performance. Additionally, we make all models publicly available
via our Hugging Face Space at https://huggingface.co/CDL-RecSys, and we release the code at
https://github.com/omgwenxx/multimodal-fashion-analysis/.</p>
      <p>Finally, this work is structured as follows: we first present the experiment setup, including the models
(Section 2.1) used for fine-tuning, feature extraction (Section 2.3.1), and recommendation algorithms
(Section 2.3.2). We then compare the datasets in detail (Section 3), present the results (Section 4), and
conclude with discussion and future directions (Section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experiment Setup</title>
      <sec id="sec-2-1">
        <title>2.1. Models for Image Captioning</title>
        <p>For the image captioning task, we focused on open-source models to ensure transparency and
reproducibility. We selected models available via Hugging Face [27]. We evaluated BLIP-2 [28] variants
based on OPT [29] as well as LLaVA-1.5 [30], a multimodal conversational model, each with different numbers of
parameters. While BLIP-2 generates text from images alone, LLaVA relies on prompt-based inputs and
is optimized for instruction-following. As a baseline on the FACAD dataset, we used results from Yang
et al. [25], replicating their setup.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation</title>
        <p>We evaluated our results using well-established image captioning measures, namely BLEU [31],
ROUGE-L [32], CIDEr [33], METEOR [34], and SPICE [35]. As image captioning focuses on generating a
“correct” caption, most of these metrics measure word overlap based on precision (BLEU),
recall (ROUGE-L), or a combination of both (METEOR, which also takes word order into account). CIDEr incorporates
term frequency-inverse document frequency (TF-IDF) weighting to emphasize informative words, while
SPICE evaluates scene-graph-based semantic content.</p>
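        <p>To make the precision/recall distinction concrete, the following minimal sketch computes a clipped unigram precision (the building block of BLEU-1, without the brevity penalty) and an LCS-based recall (the recall component of ROUGE-L) for a single caption pair. The function names, the whitespace tokenization, and the example captions are illustrative assumptions; this is not the evaluation code used in the experiments.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped unigram precision: the core of BLEU-1, without brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    matched = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    return matched / len(cand)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """Share of the reference that is covered by the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


# Hypothetical caption pair (assumption, for illustration only).
reference = "short sleeveless dress in an airy cotton weave"
candidate = "short dress in a cotton weave with a tie"

print(unigram_precision(candidate, reference))  # 5 of 9 candidate words match
print(rouge_l_recall(candidate, reference))     # 5 of 8 reference words covered
```
]]></preformat>
        <p>In this toy example, the extra words in the candidate lower its precision more than its recall. The same effect is observed for models that generate overly long captions.</p>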
        <p>Additionally, we re-implemented the measures introduced by Yang et al. [25] for category accuracy
and “mean average precision”. What the authors reported as mean average precision is the average
precision over all captions. We keep the naming for comparison. Also, the authors originally pretrained
a 3-layer text CNN [36] for category classification which was not provided. As a result, we used a
pretrained BERT model [37] for text classification and fine-tuned it on each dataset, achieving a test
accuracy of 90.3% on the FACAD dataset (78 classes) and 94.7% on the H&amp;M dataset (89 classes).</p>
        <p>For the qualitative comparison of the captions, we implemented a web application (see Figure 1),
showing each product and its ground truth caption and attributes. We then proceeded to manually
check the first 20 samples in the H&amp;M test set and the first 3 distinct items in the FACAD test set
(because the dataset includes the same caption for multiple images containing the same item).</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Algorithms for Recommendations and Feature Extraction</title>
        <sec id="sec-2-3-1">
          <title>2.3.1. Feature Extraction</title>
          <p>We used the Ducho-meets-Elliot framework presented by Attimonelli et al. [14]. The framework
integrates Ducho [14] for feature extraction. For our experiments, we explored six different setups:
(1) visual features extracted from ResNet50 [39], (2) textual features from SentenceBERT [40], and (3)
multimodal features from CLIP [41].</p>
          <p>Additionally, we included (4) multimodal features obtained by concatenating ResNet50 and
SentenceBERT embeddings, (5) features extracted from the Q-Former component of the fine-tuned BLIP-2 model
(32×768 values), and (6) visual features extracted from the same fine-tuned BLIP-2 model (257×1408
values). We used global average pooling to reduce the size of the visual features to 1×1408.</p>
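          <p>The pooling and concatenation steps can be sketched as follows. This is a minimal, dependency-free illustration under the assumption that pooling averages over the token axis and that concatenation simply joins the two vectors. The helper names and the toy dimensions are hypothetical and do not reflect the Ducho implementation.</p>
          <preformat><![CDATA[
```python
def global_average_pool(tokens):
    """Average a (num_tokens x dim) list of vectors into one dim-sized vector."""
    num_tokens = len(tokens)
    return [sum(column) / num_tokens for column in zip(*tokens)]


def concatenate(visual, textual):
    """Join a visual and a textual embedding into one multimodal vector."""
    return list(visual) + list(textual)


# Toy stand-in for the 257 x 1408 visual output: 3 tokens with 4 dimensions.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
pooled = global_average_pool(tokens)
print(pooled)  # [2.0, 2.0, 2.0, 2.0]

# Hypothetical 2-dimensional text embedding appended to the pooled vector.
multimodal = concatenate(pooled, [0.5, -0.5])
print(len(multimodal))  # 6
```
]]></preformat>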
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Recommendation Algorithms</title>
          <p>
            We used algorithms supported by the Elliot pipeline [42]. We chose Visual Bayesian Personalized
Ranking (VBPR) [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] because it supports different item features as well as implicit feedback. As a
baseline, we used unpersonalized recommendation algorithms that suggest items without considering
individual user preferences, namely Most Popular (MostPop) and Random. For additional comparison,
we ran neighborhood-based approaches based on item similarity (Amazon’s item-to-item collaborative
filtering, ItemKNN) [43] and user similarity (algorithm used by GroupLens, UserKNN) [44].
          </p>
          <p>Data Split. We applied an 80/20 temporal split to divide the transactions into training and test
sets. For the first setup, we used the filtered articles (see Section 3.2) to inspect an isolated setup for
the comparison of the features. For the second setup, we compared a “real-world scenario” where
we augment the missing item descriptions using the generated captions from BLIP-2 (404 items, all
fashion-related) and keep all items with images (105,100). We give a summary of the numbers in Table 1.
9699 customers do not appear in the transactions.</p>
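          <p>A temporal split of this kind can be sketched as below. The sketch assumes a single global cut over all transactions sorted by date; whether the split is global or per user is not specified here, so this is one plausible reading. The field names and sample records are hypothetical.</p>
          <preformat><![CDATA[
```python
from datetime import date


def temporal_split(transactions, train_fraction=0.8):
    """Sort transactions by date; the earliest share trains, the rest tests."""
    ordered = sorted(transactions, key=lambda t: t["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical transaction records (assumption, for illustration only).
transactions = [
    {"customer": "c1", "article": "a1", "date": date(2020, 1, 5)},
    {"customer": "c2", "article": "a2", "date": date(2020, 3, 10)},
    {"customer": "c1", "article": "a3", "date": date(2020, 2, 1)},
    {"customer": "c3", "article": "a1", "date": date(2020, 9, 1)},
    {"customer": "c2", "article": "a4", "date": date(2020, 6, 15)},
]

train, test = temporal_split(transactions)
print(len(train), len(test))  # 4 1
```
]]></preformat>
          <p>Because the cut is made on time rather than at random, every training interaction precedes every test interaction, which avoids leaking future purchases into training.</p>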
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>We evaluated the fine-tuned models on FACAD (FAshion CAptioning Dataset) and the H&amp;M dataset [26].
FACAD enables comparison with Yang et al. [25] and is, to our knowledge, the only available dataset
focused on fashion captioning. H&amp;M includes user-item transactions, allowing us to test generated
captions and fine-tuned embeddings for recommendations.</p>
      <p>3.1. FACAD</p>
      <p>FACAD contains 993K images and 130K captions that were initially split into 794K (∼80%)
image-description pairs for training, 99K (∼10%) for validation, and the remaining 100K (∼10%) for testing.
However, the dataset that is currently provided by the authors has a different distribution, with 888,293
pairs designated for training, 19,915 for validation, and 101,225 for testing (a total of 1,009,463 samples).
According to the authors, this was done so that validation would not take as long.<sup>1</sup></p>
      <p>The authors extracted 990 attributes from item metadata using the Stanford Parser [45]. While
working with the dataset, we noticed that the test attribute file was incorrect (it did not match
the test items). We therefore manually created the correct attributes by checking caption overlap with
the ground truth data and then using the attributes provided in the metadata file.</p>
      <p><sup>1</sup>https://github.com/xuewyang/Fashion_Captioning/issues/5, last accessed 13.03.2025, 15:40</p>
      <p>3.2. H&amp;M</p>
      <p>Preprocessing. The primary focus was on preparing the article dataset and associated images for
creating an image captioning dataset. The first step was to clean the dataset by removing any articles
that did not have a detailed description or a corresponding image file. We then filtered based on
the number of articles per product type and kept only those categories with at least 7 items (because 25%
of the product categories have fewer than 7 items, keeping 75% of the initial items). Non-fashion-related
categories, e.g., Dog Wear and Sleeping Sack, were removed manually to retain only fashion and accessory
items. The resulting dataset contains 89 categories after filtering. The filtered dataset was split into
training, validation, and test sets with a distribution of 80% training, 10% validation, and 10%
test data. The final count of articles after preprocessing was 104,232, distributed as follows: 83,385
articles (train set), 10,423 articles (validation set), and 10,424 articles (test set).</p>
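      <p>The category filter and the 80/10/10 split can be sketched as follows. The sketch assumes a random shuffle with a fixed seed and truncating integer arithmetic for the split sizes; the key name <monospace>product_type</monospace> and both function names are hypothetical. Under these assumptions, 104,232 articles yield exactly the 83,385 / 10,423 / 10,424 counts reported above.</p>
      <preformat><![CDATA[
```python
import random
from collections import Counter


def filter_rare_categories(articles, min_items=7):
    """Drop articles whose product type has fewer than min_items articles."""
    counts = Counter(a["product_type"] for a in articles)
    return [a for a in articles if counts[a["product_type"]] >= min_items]


def split_80_10_10(items, seed=0):
    """Shuffle and split into train, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical toy catalogue: 8 dresses are kept, 2 dog-wear items are dropped.
articles = [{"id": i, "product_type": "Dress"} for i in range(8)]
articles += [{"id": 100 + i, "product_type": "Dog Wear"} for i in range(2)]
kept = filter_rare_categories(articles)
print(len(kept))  # 8

# Split sizes for the article count reported in the text.
train, val, test = split_80_10_10(range(104232))
print(len(train), len(val), len(test))  # 83385 10423 10424
```
]]></preformat>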
      <p>Attributes. To compute the average precision, attributes within the product descriptions need to
be extracted. The detail_desc column was used to extract nouns, adjectives, and proper nouns
based on Universal POS tagging definitions [46, 47]. The extraction process used the Stanza NLP
library [48] to identify and filter these attributes. Before tagging, the descriptions are lowercased, and
hyphen-connected words, e.g., t-shirt, are split. Then, we extract the lemmatized attributes and keep
only those appearing at least 10 times. This is done to ensure the significance and relevance
of the attributes kept. Using the train set items, we then collect all extracted attributes (1014 in total). These
are then used as a pool to select attributes from generated captions and to generate the ground truth
for the test set.</p>
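      <p>The pooling of attributes and their use for scoring a generated caption can be sketched as follows. The sketch starts from already lemmatized attribute lists, so the Stanza tagging step is omitted, and it assumes that the frequency threshold counts every occurrence in the train set. The function names and the toy data are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def build_attribute_pool(train_attribute_lists, min_count=10):
    """Keep attributes that occur at least min_count times in the train set."""
    counts = Counter(attr for attrs in train_attribute_lists for attr in attrs)
    return {attr for attr, n in counts.items() if n >= min_count}


def attribute_precision_recall(generated, ground_truth, pool):
    """Compare pool attributes found in a generated caption with the ground truth."""
    predicted = set(generated) & pool
    truth = set(ground_truth) & pool
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical lemmatized attributes per train item: "zip" is too rare to be kept.
train_lists = [["dress", "cotton", "strap"]] * 10 + [["zip"]] * 3
pool = build_attribute_pool(train_lists)
print(sorted(pool))  # ['cotton', 'dress', 'strap']

precision, recall = attribute_precision_recall(
    generated=["dress", "strap", "zip"],
    ground_truth=["dress", "cotton"],
    pool=pool,
)
print(precision, recall)  # 0.5 0.5
```
]]></preformat>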
      <sec id="sec-3-1">
        <title>3.3. Comparison of Datasets</title>
        <p>We use this section to emphasize the differences between the two datasets to provide a better understanding
of the results presented in Section 4.1. Both are domain-specific fashion datasets but differ in
many aspects (see Table 2). One of them is size: FACAD is almost 10 times larger than the
H&amp;M dataset. FACAD also provides a greater variety of item perspectives, including different angle shots, with and
without a model, and a material shot (see Figure 2a). The H&amp;M dataset only offers single-item images
without a model, sometimes showing only part of the item. Furthermore, the H&amp;M dataset has a
1-to-1 relationship between captions and images, whereas the FACAD dataset includes multiple items
with the same description but differing in color, as well as single items with many images.</p>
        <p>Comparing the captions, the H&amp;M dataset has more concise captions describing the item’s “tailor”
details, e.g., “frill-trimmed shoulder straps”, whereas FACAD captions are more “enchanting” (examples
can be seen in Figure 2). FACAD includes expressions made for selling, e.g., “this neutral hued cotton sweater
you’ll wear everywhere”. This difference is also reflected in the size of the vocabulary (see Table 2).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Fashion Captioning</title>
        <p>Quantitative Analysis. We present the results of the pretrained models and the fine-tuned models
side-by-side for the H&amp;M dataset in Table 3 and for the FACAD dataset in Table 4.</p>
        <p>H&amp;M. All models’ performance improved with fine-tuning. Among them, BLIP-2-6.7B achieves the
best performance. We observed, and investigated during our qualitative analysis, a trade-off between
the precision and recall of the attributes for the different models. Due to the longer captions produced
by the LLaVA models, they achieve higher recall scores but lack precision. The opposite holds for the
BLIP-2 models, which achieve better overall performance by balancing precision and
recall, owing to their ability to adapt to the ground-truth caption length.</p>
        <p>Figure 2: Example captions. (a) FACAD: “this shawl collar jersey blazer bloom
with black and white floral patterning inspired by the work of rising spanish
photographer coco capit n”. (b) H&amp;M: “Short, sleeveless dress in an airy cotton
weave that is open at the back with a tie. Adjustable frill-trimmed shoulder
straps, a concealed zip in the side, seam at the waist with elastication at the back
and a flared skirt. Unlined.”</p>
        <p>FACAD. Comparing against the category classification accuracy reported by Yang et al. [25], we
noticed that our classifier (a fine-tuned BERT model) worked better than the reported CNN classifier.
The accuracy metric describes how well the captioning models recognize the correct category
from the image: the classifier predicts the category from the generated caption alone (with
an accuracy of &gt;90% on the test set). The zero-shot setup (using BLIP-2
results) achieves nearly 40% better accuracy than the results reported by Yang et al. We hypothesize that
this improvement is due either to the BERT model outperforming the originally used CNN-based model
or to the model by Yang et al. (Semantic Rewards guided Fashion Captioning, SRFC) struggling
to accurately classify clothing items. The results show that the fine-tuned models do not achieve the
same performance as SRFC in terms of image captioning metrics. However, the results are competitive, and
our models outperform SRFC in attribute precision and recall as well as in recognizing the product
category. We see the same behavior for BLIP-2 and LLaVA in terms of precision and recall as on the H&amp;M dataset.</p>
        <p>Qualitative Analysis. H&amp;M. Based on the setup explained in Section 2.2, we find that the fine-tuned
LLaVA models retain the core content of the original H&amp;M captions but tend to hallucinate details,
such as materials or funding sources, not present in the training data, likely influenced by frequent
patterns e.g. polyester. Due to their fixed setting, they often overgenerate beyond the caption’s natural
end, unlike BLIP-2 models, which better learn caption lengths post fine-tuning. This overgeneration
leads to higher attribute recall but lower precision, with the reverse observed for BLIP-2, especially in
shorter captions. All models struggle with nuances in material (satin vs. velvet), product size (e.g. 33
cm), and disambiguation of visually similar items (crop top vs. sports bra).</p>
        <p>FACAD. For the FACAD dataset sample in Figure 2a, we observed that certain caption parts, such
as references to designers or inspirations, lack visual grounding, adding noise for models learning
image-text alignment. The fine-tuned LLaVA models often adopt a “sales-like” tone, which works well
for marketing language but tends to overgenerate for straightforward item descriptions (FACAD vs.
H&amp;M). LLaVA models are prone to overgeneration, even reproducing prompt templates (e.g., starting
responses with ASSISTANT:). Category predictions are sometimes off, e.g., mislabeling a blazer as a
jacket or a tee/top, and misclassifications increase for partial views or when items are not fully visible
(e.g., back details or side views). Some original attributes (e.g., “chrissy”, “teigen”) do not correspond to
visual features, complicating evaluation. Overall, both datasets reveal that the models struggle with
fine-grained visual distinctions in material, color (dark blue vs. black), or category (jacket vs. blazer),
likely due to image resizing during preprocessing and the inherent visual limitations in the dataset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Recommendations</title>
        <p>With the results presented in Table 5, we can answer which feature embeddings provide the
best recommendations (RQ2). On the H&amp;M dataset, the answer is textual embeddings, except for
MAP on the augmented dataset. However, all feature spaces
perform similarly, and none performs as well as the best overall result, which is achieved by the ItemKNN algorithm. This may
also indicate that VBPR is not the right algorithm for this dataset in general.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>To evaluate the impact of fine-tuning, we analyze how much it improved our pretrained models. We
report the average percentage improvement in performance across six metrics, computed by first
calculating the percentage change for each metric individually and then averaging these values per
model, presented in Table 3 and Table 4. We observe that fine-tuning was more effective for the H&amp;M
dataset, which we attribute to the one-to-one relationship between images and captions as
well as the smaller size. To further investigate this hypothesis, one could subsample the FACAD dataset
to include only one-to-one samples or a more refined selection of images [50]. Compared to SRFC by
Yang et al., our fine-tuned models, despite showing slightly lower image captioning metrics, generate
more accurate captions in terms of attributes and categories. This demonstrates the strength of our
approach, which is a simpler pipeline that still maintains competitive performance. Our qualitative
analysis (Section 4.1) further reveals model limitations, especially when captions reference non-visible
or abstract information, as well as the trade-off between precision and recall for the LLaVA models. For future
work, we recommend using captions that focus on visible item features, avoiding repeated captions
across multiple images, and experimenting with image quality [51] to improve performance. Interestingly,
our recommendation results (Section 4.2) show that (1) with increasing list length our NDCG increases
but MAP decreases, meaning that the algorithms retrieve relevant items but do not rank them optimally, (2)
textual features (with our dataset) returned the best results (except for MAP on the augmented dataset),
and (3) ItemKNN outperforms all multimodal approaches. ItemKNN likely performs well because it
leverages repetitive user behavior and trends, which are common in the fashion domain. The gap between
VBPR and ItemKNN leads us to believe that either our embeddings are not representative enough to
learn user preferences or VBPR is not well-suited for this dataset. Future work could explore
alternative fusion strategies and evaluate more multimodal recommendation algorithms, e.g., FREEDOM
[52] or BM3 [53]. Nevertheless, we show that despite fashion being a visually dominant domain, textual
descriptions yielded the best results, highlighting their importance for future research.</p>
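      <p>The averaging described at the start of this section can be written out as a short sketch. The function name is hypothetical, the scores are made-up values rather than results from Table 3 or Table 4, and the sketch assumes every pretrained score is non-zero.</p>
      <preformat><![CDATA[
```python
def average_percentage_improvement(pretrained, fine_tuned):
    """Mean of the per-metric percentage changes from pretrained to fine-tuned."""
    changes = [
        100.0 * (fine_tuned[metric] - pretrained[metric]) / pretrained[metric]
        for metric in pretrained
    ]
    return sum(changes) / len(changes)


# Made-up scores for illustration only; they are not results from the tables.
pretrained = {"BLEU": 2.0, "CIDEr": 10.0}
fine_tuned = {"BLEU": 3.0, "CIDEr": 25.0}

# BLEU improves by 50% and CIDEr by 150%, so the average is 100%.
print(average_percentage_improvement(pretrained, fine_tuned))  # 100.0
```
]]></preformat>
      <p>Averaging relative changes weights every metric equally regardless of its scale, so a metric with a very low pretrained score can dominate the average.</p>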
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>The financial support by the Austrian Federal Ministry of Labour and Economy, the National Foundation
for Research, Technology and Development and the Christian Doppler Research Association is gratefully
acknowledged.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>
[2] T. M. A. U. Gunathilaka, P. D. Manage, J. Zhang, Y. Li, W. Kelly, Addressing sparse data challenges in recommendation systems: A systematic review of rating estimation using sparse rating data and profile enrichment techniques, Intelligent Systems with Applications 25 (2025) 200474. URL: https://www.sciencedirect.com/science/article/pii/S2667305324001480. doi:10.1016/j.iswa.2024.200474.
[3] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, N. Sundaresan, Large Scale Visual Recommendations From Street Fashion Images, 2014. URL: http://arxiv.org/abs/1401.1778. doi:10.48550/arXiv.1401.1778, arXiv:1401.1778 [cs].
[4] Y. Deldjoo, T. Di Noia, D. Malitesta, F. A. Merra, Leveraging content-style item representation for visual recommendation, in: Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, Springer-Verlag, Berlin, Heidelberg, 2022, pp. 84–92. URL: https://doi.org/10.1007/978-3-030-99739-7_10. doi:10.1007/978-3-030-99739-7_10.
[5] W. Chen, P. Huang, J. Xu, X. Guo, C. Guo, F. Sun, C. Li, A. Pfadler, H. Zhao, B. Zhao, POG: Personalized Outfit Generation for Fashion Recommendation at Alibaba iFashion, 2019. URL: https://arxiv.org/abs/1905.01866v3.
[6] W.-C. Kang, C. Fang, Z. Wang, J. McAuley, Visually-Aware Fashion Recommendation and Design with Generative Image Models, 2017. URL: https://arxiv.org/abs/1711.02231v1.
[7] R. He, J. McAuley, Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering, 2016. URL: http://arxiv.org/abs/1602.01585. doi:10.1145/2872427.2883037, arXiv:1602.01585 [cs].
[8] Q. Liu, S. Wu, L. Wang, DeepStyle: Learning User Preferences for Visual Recommendation, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 841–844. URL: https://doi.org/10.1145/3077136.3080658. doi:10.1145/3077136.3080658.
[9] X. Chen, H. Chen, H. Xu, Y. Zhang, Y. Cao, Z. Qin, H. Zha, Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: Towards Visually Explainable Recommendation, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 765–774. URL: https://dl.acm.org/doi/10.1145/3331184.3331254. doi:10.1145/3331184.3331254.
[10] K. Zhao, X. Hu, J. Bu, C. Wang, Deep Style Match for Complementary Recommendation, 2017. URL: https://arxiv.org/abs/1708.07938v1.
[11] C. Bracher, S. Heinz, R. Vollgraf, Fashion DNA: Merging Content and Sales Data for Recommendation and Article Mapping, 2016. URL: http://arxiv.org/abs/1609.02489. doi:10.48550/arXiv.1609.02489, arXiv:1609.02489 [cs].
[12] M. Yang, K. Yu, Real-time clothing recognition in surveillance videos, in: 2011 18th IEEE International Conference on Image Processing, 2011, pp. 2937–2940. doi:10.1109/ICIP.2011.6116276.
[13] Y. Deldjoo, M. Schedl, B. Hidasi, Y. Wei, X. He, Multimedia recommender systems: Algorithms and challenges, 2022, pp. 973–. doi:10.1007/978-1-0716-2197-4_25.
[14] M. Attimonelli, D. Danese, A. D. Fazio, D. Malitesta, C. Pomo, T. D. Noia, Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation, 2024. URL: http://arxiv.org/abs/2409.15857. doi:10.48550/arXiv.2409.15857, arXiv:2409.15857 [cs].
[15] K. Laenen, M.-F. Moens, Attention-based Fusion for Outfit Recommendation, 2019. URL: http://arxiv.org/abs/1908.10585. doi:10.48550/arXiv.1908.10585, arXiv:1908.10585 [cs].
[16] X. Song, C. Wang, C. Sun, S. Feng, M. Zhou, L. Nie, MM-frec: Multi-modal enhanced fashion item recommendation, IEEE Transactions on Knowledge and Data Engineering 35 (2023) 10072–10084. doi:10.1109/TKDE.2023.3266423.
[17] Y. Wei, X. Wang, L. Nie, X. He, T.-S. Chua, GRCN: Graph-refined convolutional network for multimedia recommendation with implicit feedback, 2021. URL: https://arxiv.org/abs/2111.02036, arXiv:2111.02036 [cs.IR].
[18] Y. Wei, X. Wang, L. Nie, X. He, R. Hong, T.-S. Chua, MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video, in: Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1437–1445. URL: https://doi.org/10.1145/3343031.3351034. doi:10.1145/3343031.3351034.
[19] J. Zhang, Y. Zhu, Q. Liu, S. Wu, S. Wang, L. Wang, Mining latent structures for multimedia recommendation, in: Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, ACM, 2021, pp. 3872–3880. URL: http://dx.doi.org/10.1145/3474085.3475259. doi:10.1145/3474085.3475259.
[20] X. Zhou, H. Zhou, Y. Liu, Z. Zeng, C. Miao, P. Wang, Y. You, F. Jiang, Bootstrap latent representations
for multi-modal recommendation, in: Proceedings of the ACM web conference 2023, Www ’23,
ACM, 2023, pp. 845–854. URL: http://dx.doi.org/10.1145/3543507.3583251. doi:10.1145/3543507.
3583251.
[21] X. Zhou, Z. Shen, A tale of two graphs: Freezing and denoising graph structures for multimodal
recommendation, in: Proceedings of the 31st ACM international conference on multimedia,
Mm ’23, ACM, 2023, pp. 935–943. URL: http://dx.doi.org/10.1145/3581783.3611943. doi:10.1145/
3581783.3611943.
[22] D. Malitesta, E. Rossi, C. Pomo, F. D. Malliaros, T. Di Noia, Dealing with Missing Modalities in
Multimodal Recommendation: a Feature Propagation-based Approach, 2024.
URL: http://arxiv.org/abs/2403.19841. doi:10.48550/arXiv.2403.19841, arXiv:2403.19841 [cs].
[23] D. Malitesta, E. Rossi, C. Pomo, T. Di Noia, F. D. Malliaros, Do We Really Need to Drop Items
with Missing Modalities in Multimodal Recommendation?, in: Proceedings of the 33rd ACM
International Conference on Information and Knowledge Management, CIKM ’24, Association for
Computing Machinery, New York, NY, USA, 2024, pp. 3943–3948.
URL: https://dl.acm.org/doi/10.1145/3627673.3679898. doi:10.1145/3627673.3679898.
[24] C. Wang, M. Niepert, H. Li, LRMM: Learning to Recommend with Missing Modalities, 2018. URL:
http://arxiv.org/abs/1808.06791. doi:10.48550/arXiv.1808.06791, arXiv:1808.06791 [cs].
[25] X. Yang, H. Zhang, D. Jin, Y. Liu, C.-H. Wu, J. Tan, D. Xie, J. Wang, X. Wang, Fashion Captioning:
Towards Generating Accurate Descriptions with Semantic Rewards, 2022.
URL: http://arxiv.org/abs/2008.02693. doi:10.48550/arXiv.2008.02693, arXiv:2008.02693 [cs].
[26] C. G. Ling, H&amp;M Personalized Fashion Recommendations, 2022.
URL: https://kaggle.com/competitions/h-and-m-personalized-fashion-recommendations.
[27] Hugging Face – The AI community building the future., 2025. URL: https://huggingface.co/.
[28] J. Li, D. Li, S. Savarese, S. Hoi, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen
Image Encoders and Large Language Models, 2023. URL: http://arxiv.org/abs/2301.12597.
doi:10.48550/arXiv.2301.12597, arXiv:2301.12597 [cs].
[29] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin,
T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, L. Zettlemoyer,
OPT: Open Pre-trained Transformer Language Models, 2022. URL: http://arxiv.org/abs/2205.01068.
doi:10.48550/arXiv.2205.01068, arXiv:2205.01068 [cs].
[30] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual Instruction Tuning, 2023. URL: http://arxiv.org/abs/2304.08485.</p>
      <p>doi:10.48550/arXiv.2304.08485, arXiv:2304.08485 [cs].
[31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th Annual Meeting on Association for Computational
Linguistics, ACL ’02, Association for Computational Linguistics, USA, 2002, pp. 311–318. URL:
https://dl.acm.org/doi/10.3115/1073083.1073135. doi:10.3115/1073083.1073135.
[32] C.-Y. Lin, ROUGE: A Package for Automatic Evaluation of summaries, 2004, p. 10.
[33] R. Vedantam, C. L. Zitnick, D. Parikh, CIDEr: Consensus-based Image Description Evaluation,
2015. URL: http://arxiv.org/abs/1411.5726. doi:10.48550/arXiv.1411.5726, arXiv:1411.5726
[cs].
[34] M. Denkowski, A. Lavie, Meteor Universal: Language Specific Translation Evaluation for Any</p>
      <p>Target Language, volume 6, 2014, pp. 376–380. doi:10.3115/v1/W14-3348.
[35] P. Anderson, B. Fernando, M. Johnson, S. Gould, SPICE: Semantic Propositional Image
Caption Evaluation, 2016. URL: http://arxiv.org/abs/1607.08822. doi:10.48550/arXiv.1607.08822,
arXiv:1607.08822 [cs].
[36] Y. Kim, Convolutional Neural Networks for Sentence Classification, in: A. Moschitti, B. Pang,
W. Daelemans (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp.
1746–1751. URL: https://aclanthology.org/D14-1181. doi:10.3115/v1/D14-1181.
[37] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
transformers for language understanding, CoRR abs/1810.04805 (2018).
URL: http://arxiv.org/abs/1810.04805, arXiv:1810.04805.
[38] Streamlit • A faster way to build and share data apps, 2021. URL: https://streamlit.io/.
[39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR abs/1512.03385
(2015). URL: http://arxiv.org/abs/1512.03385, arXiv:1512.03385.
[40] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks,
in: Proceedings of the 2019 conference on empirical methods in natural language processing,
Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[41] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language
Supervision, 2021. URL: http://arxiv.org/abs/2103.00020. doi:10.48550/arXiv.2103.00020,
arXiv:2103.00020 [cs].
[42] V. W. Anelli, A. Bellogin, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo, F. M. Donini, T. Di Noia,
Elliot: A Comprehensive and Rigorous Framework for Reproducible Recommender Systems
Evaluation, in: Proceedings of the 44th International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’21, Association for Computing Machinery, New
York, NY, USA, 2021, pp. 2405–2414. URL: https://dl.acm.org/doi/10.1145/3404835.3463245.
doi:10.1145/3404835.3463245.
[43] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item collaborative filtering,
IEEE Internet Computing 7 (2003) 76–80. URL: https://ieeexplore.ieee.org/document/1167344.
doi:10.1109/MIC.2003.1167344.
[44] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: an open architecture for
collaborative filtering of netnews, in: Proceedings of the 1994 ACM conference on Computer
supported cooperative work, CSCW ’94, Association for Computing Machinery, New York, NY,
USA, 1994, pp. 175–186. URL: https://dl.acm.org/doi/10.1145/192844.192905.
doi:10.1145/192844.192905.
[45] R. Socher, J. Bauer, C. D. Manning, A. Y. Ng, Parsing with Compositional Vector Grammars, in:
H. Schuetze, P. Fung, M. Poesio (Eds.), Proceedings of the 51st Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics,
Sofia, Bulgaria, 2013, pp. 455–465. URL: https://aclanthology.org/P13-1045.
[46] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal dependencies, Computational
Linguistics 47 (2021) 255–308. URL: https://doi.org/10.1162/coli_a_00402.
doi:10.1162/coli_a_00402.
[47] J. Nivre, M.-C. de Marneffe, F. Ginter, J. Hajič, C. D. Manning, S. Pyysalo, S. Schuster, F. Tyers,
D. Zeman, Universal Dependencies v2: An evergrowing multilingual treebank collection, in:
N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard,
J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the twelfth language
resources and evaluation conference, European Language Resources Association, Marseille, France,
2020, pp. 4034–4043. URL: https://aclanthology.org/2020.lrec-1.497/.
[48] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python Natural Language Processing
Toolkit for Many Human Languages, in: Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics: System Demonstrations, 2020.
[49] J. Aneja, A. Deshpande, A. Schwing, Convolutional Image Captioning, 2017.
URL: http://arxiv.org/abs/1711.09151. doi:10.48550/arXiv.1711.09151, arXiv:1711.09151 [cs].
[50] C. Cai, K.-H. Yap, S. Wang, Attribute Conditioned Fashion Image Captioning, in: 2022 IEEE
International Conference on Image Processing (ICIP), 2022, pp. 1921–1925.
doi:10.1109/ICIP46576.2022.9897417, ISSN: 2381-8549.
[51] B. McKinzie, Z. Gan, J.-P. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng,
F. Weers, A. Belyi, H. Zhang, K. Singh, D. Kang, A. Jain, H. Hè, M. Schwarzer, T. Gunter, X. Kong,
A. Zhang, J. Wang, C. Wang, N. Du, T. Lei, S. Wiseman, M. Lee, Z. Wang, R. Pang, P. Grasch,
A. Toshev, Y. Yang, MM1: Methods, Analysis &amp; Insights from Multimodal LLM Pre-training, 2024.</p>
      <p>URL: http://arxiv.org/abs/2403.09611. doi:10.48550/arXiv.2403.09611, arXiv:2403.09611 [cs].
[52] X. Zhou, Z. Shen, A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal
Recommendation, in: Proceedings of the 31st ACM International Conference on Multimedia,
MM ’23, ACM, 2023, pp. 935–943. URL: http://dx.doi.org/10.1145/3581783.3611943.
doi:10.1145/3581783.3611943.
[53] X. Zhou, H. Zhou, Y. Liu, Z. Zeng, C. Miao, P. Wang, Y. You, F. Jiang, Bootstrap Latent
Representations for Multi-modal Recommendation, in: Proceedings of the ACM Web
Conference 2023, WWW ’23, ACM, 2023, pp. 845–854. URL: http://dx.doi.org/10.1145/3543507.3583251.
doi:10.1145/3543507.3583251.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>VBPR: visual Bayesian Personalized Ranking from implicit feedback</article-title>
          ,
          <source>in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence</source>
          , AAAI'16, AAAI Press, Phoenix, Arizona,
          <year>2016</year>
          , pp.
          <fpage>144</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>