<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Audio, Lyrics, Videoclips, Interactions? An Analysis of Uni- and Multi-modal Music Retrieval Systems in Terms of Accuracy and Beyond-accuracy Aspects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marta Moscati</string-name>
          <email>marta.moscati@jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gustavo Escobedo</string-name>
          <email>gustavo.escobedo@jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduardo Hernandez Almanza</string-name>
          <email>eduardohdz0697@hotmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas Peché</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Linz Institute of Technology</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Music similarity</institution>
          ,
          <addr-line>Music information retrieval, Accuracy, Coverage, Popularity bias, Hubness, Robustness, Content</addr-line>
        </aff>
      </contrib-group>
      <fpage>95</fpage>
      <lpage>116</lpage>
      <abstract>
        <p>Representations of the audio content of music tracks and of related data (such as lyrics, user-generated tags, or interaction data) are often leveraged in music retrieval and recommendation systems. It is therefore important to know how the choice of representation afects the results of similarity-based music retrieval systems. In this work, we address this question under several aspects. We analyze the accuracy, coverage, hubness, popularity bias, and robustness of retrieval systems based on diferent content modalities (text, audio, video) and on user-item interactions, and analyze the impact of corresponding features on multimodal retrieval systems. The paper gives useful insight into which modality to leverage depending on the aspects of retrieval results to prioritize and hence provides guidelines for practical real-world scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>features</kwd>
        <kwd>Collaborative data</kwd>
        <kwd>Evaluation study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The way music listeners access music tracks is diverse. Some listeners prefer the use of video or music
streaming platforms, while others prefer purchasing albums. This is reflective of the fact that, although
the production of music is most naturally related to the audio signal, music producers also devote
significant eforts in designing additional content of the music tracks, such as album covers or videoclips.
Correspondingly, music listeners also select which music to listen to based on several modalities. This
renders the way the similarity between music tracks is perceived intrinsically multimodal. Additionally,
the amount of music available is vast and ever-increasing, which renders music retrieval systems
essential for supporting listeners in selecting relevant music tracks.</p>
      <p>
        In this work, we analyze the performance of diferent representations of music in the task of retrieving
music tracks that are similar to a query track. We consider retrieval systems based on the audio signal,
the lyrics, or the videoclips of the tracks, as well as on user–item interactions from music streaming
platforms collected through the music website Last.fm.1 Additionally, we include multimodal systems
based on early- or late-fusion. We analyze the performance of retrieval systems in terms of accuracy and
beyond-accuracy aspects. In particular, we measure the ability of retrieval systems to capture aspects
that define music similarity in terms of music genres, as commonly done in music information retrieval
(MIR) research [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since genres are not mutually exclusive, to balance the skewness in the distribution
of genres over tracks, we include definitions of relevance that are binary or continuous-valued, based on
measures for the similarity of sets. We also include in our analysis catalog coverage, popularity bias, and
      </p>
      <p>CEUR</p>
      <p>ceur-ws.org
hubness, i.e., the tendency of retrieving a small number of the same tracks over and over again, since
these have often been considered particularly relevant to MIR applications [ 2, 3, 4, 5, 6]. To analyze the
robustness of modalities and the impact of individual modalities on multimodal systems, we quantify the
coherence of the retrieval results within and across diferent modalities. This information gives insight
into the amount of change of the retrieval results when replacing one (feature from one) modality with
another, and hence helps in providing guidance for real-world scenarios where, e.g., one feature or
one modality is not available. We create an interactive dashboard2 to allow deeper explorations of the
results of our analysis and provide the code to reproduce our experiments.3</p>
      <p>The remainder of the paper is organized as follows: In Section 2 we discuss previous work related to
ours, in particular regarding similarity-based music retrieval methods and their evaluation (Section 2.1),
and regarding beyond-accuracy metrics in MIR domains (Section 2.2). In Section 3 we provide the
mathematical formulation of the retrieval task, describe the methodology underlying the retrieval
systems, and the metrics used to evaluate their performance in terms of accuracy- and beyond-accuracy
aspects. In Section 4 we describe our experiment setup, namely the dataset, the features used for the
retrieval systems, and the approach adopted to create the collaborative filtering (CF) representations.
We report the results of our experiments and discuss our observations in Section 5. Finally, we discuss
the limitations and possible extensions of our work in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section, we briefly present work on similarity-based music retrieval systems and in particular on
the comparison of diferent features in MIR tasks, as well as on beyond-accuracy metrics in MIR.</p>
      <sec id="sec-2-1">
        <title>2.1. Similarity-based music retrieval</title>
        <p>Similarity-based music retrieval, i.e., the task of ranking the tracks of a music catalog based on the
similarity to the query track [7], is the basis of many music delivery applications [7, 8, 9]. Standard
techniques for similarity-based music retrieval rely on unsupervised approaches [10, 11, 12] or
supervised approaches [8, 13] that use user-generated data such as tags as learning signals [14]. Recent work
also leverages self-supervised [7, 15] and unsupervised learning based on contrastive losses [16, 17].
Su et al. [18] systematically evaluate the impact of the parameters of bag-of-frames representations of
the audio signal on several MIR tasks, such as genre classification and pitched instrument recognition.
More recently, Plachouras et al. [19] introduce a framework to evaluate representations of the audio
signal on several MIR tasks and datasets, including robustness to perturbations in their analysis. Our
analysis difers from previous studies in that we focus on the task of music retrieval, which has not
been considered by Su et al. [18] nor Plachouras et al. [19], and which is closely connected to industry
domains such as that of music recommendation. Furthermore, we consider a multimodal scenario and
in addition to representations of the audio we also include in our analysis representations of the lyrics,
of the videoclips, and of user–item interaction data. Finally, our evaluation extends to aspects beyond
accuracy.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Beyond-Accuracy Evaluation</title>
        <p>One of the areas of MIR in which beyond-accuracy aspects are gaining increasing attention is that of
music recommendation. Music recommender systems (MRSs) [2, 3, 9] are one of the main applications
of similarity-based music retrieval, and several works [20, 21, 22] highlight the importance of measuring
aspects of the quality of recommendation that go beyond accuracy. Among those, catalog coverage [23],
hubness [24, 5, 6], and popularity bias [4] are of particular relevance to MRSs. However, these aspects are
typically neglected in other MIR tasks such as similarity-based music retrieval. The work at hand difers
2Dashboard: tinyurl.com/cmrs2024
3Code:https://github.com/hcai-mms/multimodal_mir
from previous work on beyond-accuracy evaluation of MIR systems. In fact, we analyze similarity-based
music retrieval systems under aspects typically not jointly considered in their evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>evaluation (Section 3.2).</p>
      <sec id="sec-3-1">
        <title>3.1. Retrieval Systems</title>
        <p>In this section, we provide a mathematical definition of the retrieval systems (Section 3.1) and of their
Given the catalog  , a music track  ∈ 
is an index labelling the feature and   is the dimensionality of the corresponding vector. A retrieval
system  (fk,dis) is defined by the combination of feature fk and distance in feature space (dis). Given
is represented by several feature vectors f(m)</p>
        <p>∈ ℝ  , where 
k
(f(kq),dis) returns the  retrieved tracks [ 1 , … ,  
()
() ] ∈  
a query track  ∈  , the retrieval system  
that have the lowest distance with  , breaking ties randomly. We refer to the audio signal, the lyrics,
and the videoclip of the track as modalities that represent content information. We consider unimodal
retrieval systems based on a feature fk representing a single modality. We also include multimodal
retrieval systems that simultaneously leverage one feature from the audio, one from the lyrics, and
one from the videoclip modality using early- or late-fusion.4 In addition to content representations,
we consider three representations of user–item interaction data created using CF algorithms. Two
are based on traditional recommendation algorithms, item  -nearest-neighbors (ItemkNN) and matrix
factorization with truncated singular value decomposition (SVD). One is based on a well-established
neural network (NN) architecture for recommendation, multinomial variational autoencoder (MultVAE),
selected for its accuracy in the task of music recommendation [25]. For retrieval systems based on
ItemkNN, we represent each track as the corresponding item vector in the user–item interaction matrix.
For SVD we represent tracks with the embeddings multiplied by the square root of the singular values.
MultVAE is usually trained to encode and reconstruct the user profiles. Since we are interested in the
track representations, we use the same architecture to reconstruct the track profiles. Therefore, we
train an instance of MultVAE on the transposed user–item interaction matrix and use the latent vectors
of the tracks as features in the retrieval system. In initial experiments, we considered either inverted
cosine or Tanimoto similarity as distances. Since all retrieval systems reached higher accuracy with
cosine similarity, we restrict our discussion at hand to retrieval systems based on cosine.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation</title>
        <p>()
,  
track  
We measure the accuracy of a retrieval system in terms of normalized discounted cumulative gain
(NDCG) with gain based on the genres of the query  and the retrieved tracks [ 1 , … ,  
()
() ]. Each track
∈  is labeled with a subset   of the set of all genres,   ⊂  . We consider the  -th retrieved
() to be relevant if it shares at least one genre with the query track  , and include four definitions
of NDCG based on diferent values of the gain. In the simplest binary case of NDCG  , we assign  
gain of one if it shares at least one genre with  , and zero otherwise. For NDCG we assign a gain given
() a
by the Szymkiewicz-Simpson coeficient |  ∩   () |/ min(|  |, |



() |). For NDCG we assign a gain given
by the Jaccard coeficient |  ∩   () |/|  ∪   () |. For NDCG we assign a gain given by Sørensen–Dice</p>
        <p>coeficient 2|  ∩   () |/(|  | + |



genre overlap with  leads to a higher NDCG if it is ranked at the top of the list, compared to another
with a smaller genre overlap. NDCG,,,</p>
        <p>are aggregated with mean over all retrieval lists.</p>
        <p>() |). By extending the binary gain, we enforce that a track with a large
4Although we consider 12 diferent feature vectors, as listed in Table 1, we report the results of the two unimodal retrieval
systems that reached the highest accuracy within each modality, and of multimodal retrieval systems obtained by their
combination. We refer the reader to the dashboard for the full results.</p>
        <p>As for the studied beyond-accuracy metrics, we define the popularity   of track  as the sum of its
interactions over all users [4] and the popularity bias  , i.e., the tendency to retrieve tracks that are more
popular than the query track, adapting the method from Lesota et al. [4]:  = Median (   () −  ), where
∈  
  () denotes the average popularity of all tracks retrieved for  . A positive  indicates that retrieved
tracks are overall more popular than queries.5 We define coverage  as the percentage of all tracks in 
that occur in at least one result list for any query [23]. We define hubness  as the tendency to often
retrieve the same tracks for diferent queries, leading to non-symmetric results [ 26, 5, 27, 6]. We measure
 in terms of the skewness of the distribution of  -occurrences [5, 6]. We also analyze the robustness of
unimodal systems, i.e., the extent to which systems based on the same modality (e.g., lyrics) but diferent
representations (e.g., TF-IDF vs. BERT) create similar rankings for the same query, and the influence of
each modality in case of multimodal systems, i.e., the coherence between results retrieved with unimodal
and multimodal systems. We quantify both in terms of Kendall’s rank correlation between the lists
created by the two retrieval systems to compare, i.e.,  (fk1,dis1)and  (fk2,dis2).</p>
        <p>In the evaluation, NDCG,  ,  , and  are computed over all queries  ∈  and for  = 10 top
retrieved tracks.6 The rank correlations are computed over all queries  ∈  and for lists of | | − 1
retrieved tracks, i.e., ranking all tracks apart from the query, since restricting to a shorter list often
results in disjoint lists of retrieved tracks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this section, we provide the details on our experimental setup. More specifically, Section 4.1 describes
the dataset and the features f(km) representing the content of the music tracks, while Section 4.2 provides
details on the setup used to extract the CF representations of the tracks.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>Our experiments are based on the Music4All-Onion dataset [28] and its extension released by Peintner
et al. [29]. Music4All-Onion is a large-scale multimodal dataset for MIR. We select the tracks for which
all the content features are available and that have at least one genre. This results in | | = 68,641 tracks.
We perform our experiments with nine features for the audio, three for the lyrics, and three for the
video modalities, as described in Table 1. For the audio signal, in order to capture short- and long-time
dependencies, we consider both the Mel Frequency Cepstral Coeficients (MFCCs), aggregated either
with statistical descriptors or as bag-of-audio-words (BoW) computed with openXBOW [30], and all the
block-level features (BLFs) [11]. We also include the features extracted with Essentia [31] in our analysis,
since these include information such as the zero-crossing rate and the attack time, that is complementary
to the MFCCs and BLFs. For the lyrics, we consider both statistical representations of word occurrences
in terms of TF-IDF, representations obtained with pre-trained instances of word2vec [32], and
representations obtained with the all-mpnet-v27 pre-trained instance of the SentenceTransformer model [33]
provided by Hugging Face [34]. We refer to this latter representation of the lyrics as BERT. For the video
modality, we consider the visual representations of the YouTube videoclips of the music tracks. These
visual representations are obtained by first sampling videoclip frames at 1 fps. The frames are then
encoded with pre-trained instances of VGG19 [35], Inception v3 [36], and ResNet [37] and their vector
representations are aggregated using max and mean pooling over all frames and for each dimension of
the encoding vector, for each track. Finally, the max and mean vectors are concatenated, resulting for
each architecture (VGG19, Inception v3, ResNet) in a video feature vector fk of double dimensionality
with respect to the dimensionality of the visual representation of the pre-trained encoding architecture.
We report the results of the two retrieval systems that reached the highest NDCG within each modality
5Common measures of popularity bias in RSs use the median instead of the mean since it is more robust to outliers.
6We refer the reader to the dashboard for the evaluation of retrieval systems on lists of  = 100 top retrieved tracks.
7https://huggingface.co/sentence-transformers/all-mpnet-base-v2
and refer the reader to the dashboard for the full results. Table 1 summarizes the features fk used for
each modality, and their corresponding dimensionality   .</p>
        <p>Modality
paper we report the results of the two features that reached the highest NDCG within each modality and refer
the reader to the dashboard for the others.</p>
        <p>For multimodal systems we select the two features that reached the best performance in terms of
NDCG within each modality (resulting in six features), and consider all possible combinations (resulting
in eight combinations for each fusion technique). For early fusion, we first normalize the feature vectors
to 1 with  2-norm, and then concatenate them. For late fusion, we apply Z-score normalization to the
distribution of scores of the individual retrieval systems and then average the normalized scores with
weights proportional to NDCG .</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Collaborative Filtering Representations</title>
        <p>To obtain a representation of the user–item interaction data of each track with each recommendation
algorithm (ItemkNN, SVD, MultVAE), we use the set of user–item interactions available in the Music4All
dataset [38].89 The characteristics of this set of user-item interactions are summarized in Table 2.
 inter</p>
        <p>u
4,106,678
14,127</p>
        <p>tinter
67,055
 tw/o inter
1,586
tracks with at least one interaction, and  tw/o inter the number of tracks without interactions.
Characteristics of the set of user-item interactions used to obtain the CF item representations with ItemkNN, SVD,
and MultVAE.  inter represents the number of user–item interactions,  u the number of users,  tinter the number of</p>
        <p>We set the number of factors in SVD to  = 200 and the dimension of the latent representation in
MultVAE to  = 500 . We fix the batch size to 512, the maximum number of epochs to 300 and apply
early stopping with a patience of 10. We set the initial learning rate to 0.003 and reduce the learning
867,055 out of the 68,641 (∼ 98%) relevant tracks from Music4All have been listened to at least once, i.e., correspond to at
least one user–item interaction. Query tracks without user–item interactions lead to a vector of zeros (for ItemkNN) or a
randomly initialized one (for SVD and MultVAE), yielding results that are comparable to those of a random retrieval system.
9We refer the reader to the dashboard for the results obtained with CF representations based on the set of user–item interactions
from the Music4All-Onion dataset, for which 35,702 out of the || = 68,641
(∼ 52%) tracks have been listened to at least
once, i.e., correspond to at least one user–item interaction.
rate by a factor of 0.5 if an epoch shows no reduction in the training loss.10 Following common practice
for MRSs [4, 39, 40], we convert the user–item interactions to implicit feedback, binarizing them by
setting the entry in the interaction matrix to 1 (positive feedback) if the user listened to the track at
least two times, and to 0 otherwise.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this section, we analyze the results of our experiments on music retrieval systems. Section 5.1
compares the performance of uni- and multi-modal retrieval systems in terms of accuracy, coverage  ,
hubness  and popularity bias  defined as described in Section 3.2. Section 5.2 analyzes the robustness
of each content modality (audio, lyrics, videoclips) and the impact of each modality on multi-modal
systems. In the evaluation, NDCG,  ,  , and  are computed over 68,641 queries and for top 10 retrieved
tracks. The rank correlations are computed over 68,641 queries and for lists of 68,640 retrieved tracks,
i.e., ranking all tracks, since restricting to a shorter list often results in disjoint lists of retrieved tracks.
Since all retrieval systems reached higher NDCG,,,  with cosine similarity, we restrict our discussion
to retrieval systems based on cosine.</p>
      <sec id="sec-5-1">
        <title>5.1. Accuracy, Coverage, Hubness, and Popularity Bias</title>
        <p>Table 3 shows the NDCG,,,  , hubness  , coverage  and popularity bias  of the retrieval systems.
As baseline for comparison, the first block of the table shows the results of a system retrieving tracks at
random for each query. The following block refers to retrieval systems based on one feature, either
from one content modality (lyrics, audio, or video)11 or from CF representations. The last block refers to
multimodal retrieval systems based on all content modalities, either with early- or with late-fusion. As
described and motivated in Section 3.2, NDCG,,,  show the mean and  the median over all queries.
For these metrics, all diferences between the best performing system in each sub-block (in bold) and
the remaining ones are statistically significant (  &lt; 0.05 for paired  -tests using Bonferroni correction
to account for multiple comparisons), aside from those between BLF and ResNet. We first observe
that within content-based retrieval systems, video features lead to the highest accuracy, especially
when measured with NDCG,, . The fact that audio features are competitive in terms of NDCG
but reach a worse performance in terms of NDCG,, indicates that both audio and videoclips are
comparable in retrieving tracks that share at least one genre, but videoclips lead to results that share
more genres with the query tracks. Among content-based retrieval systems, fusion techniques generally
tend to reach higher accuracies than systems based on individual modalities, with early-fusion leading
to higher NDCG,,,  compared to late-fusion. ItemkNN reaches the highest NDCG,,,  and all
content-based retrieval systems are outperformed by all CF systems. This shows that collaborative data,
which do not include any explicit information on the track content, are also useful for MIR tasks beyond
recommendation. This higher accuracy, however, comes at the cost of a higher hubness and an overall
tendency to a higher popularity bias (aside from MultVAE). Surprisingly, however, CF systems also
outperform content-based ones in terms of coverage. This indicates their tendency to retrieve diferent,
but more popular, tracks. We hence conclude that if accuracy and coverage are to be prioritized when
retrieving music, it is in the interest of the MIR system provider to select CF representations. However,
these are not always available, e.g., on platforms where interaction data are not collected. In that case,
multimodal systems should be preferred.
10We use default hyperparameters since any data split leading to a reasonable optimization of the MRSs would not be
meaningful for the retrieval system. For instance leaving out a set of tracks for validation would lead to an embedding
dimensionality that is not optimal when all tracks are considered, while a split at the interaction or user level would be
prone to information leakage, since the same tracks would be selected for the hyperparameter optimization and evaluation.
11For unimodal content-based retrieval systems, we report the results of the two features that reached the highest NDCG
within each modality and refer the reader to the dashboard for the others.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Robustness and Feature Impact</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This work compared the accuracy, coverage, hubness, popularity bias, and robustness of
similaritybased music retrieval systems based on content or collaborative data, as well as the coherence between
unimodal and multimodal systems. The results provide useful information to platform providers,
especially in cases where the choice of modality or fusion technique has to consider aspects beyond
accuracy, or in which one or more representations of the music tracks are missing. One noteworthy
ifnding is the very good accuracy of ResNet features from videoclips, considering they are computed
from the image content only, and disregarding the actual music audio content. This surprising result
might be originating from the genre-based evaluation setting, and could indicate that music tracks of a
same genre share distinctive visual characteristics (e.g., , videoclips for emo rock songs are often filmed
in black and white). Our definition of relevance is framed as finding tracks of the same music genres
of a query track; this constitutes one limitation of the current work. Future work could extend the
evaluation to other evaluation settings, e.g., framing the evaluation as playlist completion given a seed
track. These evaluations, taken together with the current one, would provide a more comprehensive
view on the impact of content features on MIR tasks. Another limitation of our work is that although
we included representations of lyrics, videoclips, and collaborative data based on a NN, we only used
hand-crafted features for the audio signal. The reason is that many (deep) NNs for music are pre-trained
on tags or genres. The learned models would therefore be prone to information leakage, considering our
relevance definition. Additionally, it would be interesting to compare the accuracy and beyond-accuracy
metrics reported in this work with those actually perceived by users, e.g., via user studies. We leave
these analyses for future work.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <p>No generative AI tool was used during the preparation of this work.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgments</title>
      <p>This research was funded in whole or in part by the Austrian Science Fund (FWF) https://doi.org/</p>
      <p>2022, pp. 927–971.
[2] M. Schedl, H. Zamani, C. Chen, Y. Deldjoo, M. Elahi, Current challenges and visions in music
recommender systems research, International Journal of Multimedia Information Retrieval 7 (2018)
[3] M. Schedl, P. Knees, B. McFee, D. Bogdanov, Music recommendation systems: Techniques, use
cases, and challenges, in: F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems Handbook,
[4] O. Lesota, A. Melchiorre, N. Rekabsaz, S. Brandl, D. Kowald, E. Lex, M. Schedl, Analyzing item
popularity bias of music recommender systems: Are diferent genders equally afected?, in: Proc.
of ACM RecSys, 2021, pp. 601–606.
[5] D. Schnitzer, A. Flexer, M. Schedl, G. Widmer, Local and global scaling reduce hubs in space,</p>
      <p>Journal of Machine Learning Research 13 (2012) 2871–2902.
[6] K. Seyerlehner, A. Flexer, G. Widmer, On the limitations of browsing top-n recommender systems,
in: Proc. of ACM RecSys, 2009, p. 321–324.
[7] T. Akama, H. Kitano, K. Takematsu, Y. Miyajima, N. Polouliakh, Auxiliary self-supervision to
metric learning for music similarity-based retrieval and auto-tagging, PLOS ONE 18 (2023) 1–20.
[8] A. C. M. da Silva, D. F. Silva, R. M. Marcacini, Multimodal representation learning over
heterogeneous networks for tag-based music retrieval, Expert Systems with Applications 207 (2022)
117969.
[9] Y. Deldjoo, M. Schedl, P. Knees, Content-driven music recommendation: Evolution, state of the
art, and challenges, Computer Science Review 51 (2024) 100618.
[10] H. Eghbal-Zadeh, B. Lehner, M. Schedl, G. Widmer, I-vectors for timbre-based music similarity
and music artist classification, in: Proc. of ISMIR, 2015, pp. 554–560.
[11] K. Seyerlehner, G. Widmer, M. Schedl, P. Knees, Automatic music tag classification based on
block-level features, in: Proc. of SMC, 2010.
[12] P. Knees, M. Schedl, A survey of music similarity and recommendation from music context data,</p>
      <p>ACM Trans. Multimedia Comput. Commun. Appl. 10 (2013).
[13] M. Won, S. Oramas, O. Nieto, F. Gouyon, X. S. Serra, Multimodal metric learning for tag-based
music retrieval, in: Proc. of IEEE ICASSP, 2021, p. 591–595.
[14] J. Lee, N. Bryan, J. Salamon, Z. Jin, J. Nam, Metric learning vs classification for disentangled music
representation learning, in: Proc. of ISMIR, 2020, pp. 439–445.
[15] C. Thomé, S. Piwell, O. Utterbäck, Musical audio similarity with self-supervised convolutional
neural networks, in: Proc. of ISMIR, 2022, p. LBR &amp; Demo Papers.
[16] P. Manocha, Z. Jin, R. Zhang, A. Finkelstein, Cdpam: Contrastive learning for perceptual audio
similarity, in: Proc. of IEEE ICASSP, 2021, pp. 196–200.
[17] A. Ferraro, J. Kim, S. Oramas, A. Ehmann, F. Gouyon, Contrastive learning for cross-modal artist
retrieval, in: Proc. of ISMIR, 2023.
[18] L. Su, C.-C. M. Yeh, J.-Y. Liu, J.-C. Wang, Y.-H. Yang, A systematic evaluation of the
bag-offrames representation for music information retrieval, IEEE Transactions on Multimedia 16 (2014)
1188–1200.
[19] C. Plachouras, P. Alonso-Jiménez, D. Bogdanov, mir_ref: A representation evaluation framework
for music information retrieval tasks, in: Proc. of Machine Learning for Audio Workshop co-located
with NeurIPS, New Orleans, LA, USA, 2023.
[20] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: Evaluating recommender systems by
coverage and serendipity, in: Proc. of ACM RecSys, 2010, pp. 257–260.
[21] M. Kaminskas, D. Bridge, Diversity, serendipity, novelty, and coverage: A survey and empirical
analysis of beyond-accuracy objectives in recommender systems, ACM Transactions on Interactive
Intelligent Systems 7 (2016).
[22] V. W. Anelli, A. Bellogín, T. Di Noia, C. Pomo, Reenvisioning the comparison between neural
collaborative filtering and matrix factorization, in: Proc. of ACM RecSys, 2021, pp. 521–529.
[23] A. Gunawardana, G. Shani, S. Yogev, Evaluating recommender systems, in: F. Ricci, L. Rokach,</p>
      <p>B. Shapira (Eds.), Recommender Systems Handbook, 2022, pp. 547–602.
[24] M. Radovanović;, A. Nanopoulos, M. Ivanović;, Hubs in space: Popular nearest neighbors in
high-dimensional data, Journal of Machine Learning Research 11 (2010) 2487–2531.
[25] D. Liang, R. G. Krishnan, M. D. Hofman, T. Jebara, Variational autoencoders for collaborative
ifltering, in: Proc. of ACM WWW, 2018, pp. 689–698.
[26] A. Flexer, D. Schnitzer, J. Schlüter, A mirex meta-analysis of hubness in audio music similarity, in:</p>
      <p>Proc. of ISMIR, 2012, pp. 175–180.
[27] A. Flexer, M. Dörfler, J. Schlüter, T. Grill, Hubness as a case of technical algorithmic bias in music
recommendation, in: Proc. of ICDMW, 2018, pp. 1062–1069.
[28] M. Moscati, E. Parada-Cabaleiro, Y. Deldjoo, E. Zangerle, M. Schedl, Music4all-onion - A large-scale
multi-faceted content-centric music recommendation dataset, in: Proc. of ACM CIKM, 2022, pp.
4339–4343.
[29] A. Peintner, M. Moscati, Y. Kinoshita, R. Vogl, P. Knees, M. Schedl, H. Strauss, M. Zentner,
E. Zangerle, Nuanced music emotion recognition via a semi‑supervised multi‑relational graph
neural network, Transactions of the International Society for Music Information Retrieval (2025).
[30] M. Schmitt, B. Schuller, Openxbow: Introducing the passau open-source crossmodal bag-of-words
toolkit, Journal of Machine Learning Research 18 (2017) 3370–3374.
[31] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata,
X. Serra, Essentia: An audio analysis library for music information retrieval, in: Proc. of ISMIR,
2013, pp. 493–498.
[32] T. Mikolov, K. Chen, G. Corrado, J. Dean, Eficient estimation of word representations in vector
space, in: Proc. of ICLR, 2013.
[33] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in:</p>
      <p>Proc. of EMNLP, 2019, pp. 3973–3983.
[34] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M.
Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger,
M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in:
Proc. of EMNLP System Demos, 2020, pp. 38–45.
[35] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition,
in: Proc. of ICLR, 2015.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich,</p>
      <p>Going deeper with convolutions, in: Proc. of IEEE CVPR, 2015, pp. 1–9.
[37] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. of IEEE</p>
      <p>CVPR, 2016, pp. 770–778.
[38] I. A. P. Santana, F. Pinhelli, J. Donini, L. Catharin, R. B. Mangolin, V. D. Feltrim, M. A. Domingues,
et al., Music4all: A new music database and its applications, in: Proc. of IEEE IWSSIP, 2020, pp.
399–404.
[39] A. B. Melchiorre, N. Rekabsaz, C. Ganhör, M. Schedl, Protomf: Prototype-based matrix factorization
for efective and explainable recommendations, in: Proc. of RecSys, 2022, p. 246–256.
[40] A. B. Melchiorre, N. Rekabsaz, E. Parada-Cabaleiro, S. Brandl, O. Lesota, M. Schedl, Investigating
gender fairness of recommendation algorithms in the music domain, Information Processing &amp;
Management 58 (2021) 102666.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Flexer</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Urbano,</surname>
          </string-name>
          <article-title>The neglected user in music information retrieval research</article-title>
          ,
          <source>J. Intell. Inf. Syst</source>
          .
          <volume>41</volume>
          (
          <year>2013</year>
          )
          <fpage>523</fpage>
          -
          <lpage>539</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>