<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Playlist Continuation Through Large-Scale Context and Audio-Based Music Representations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shahrzad Shashaani</string-name>
          <email>shahrzad.shashaani@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Knees</string-name>
          <email>peter.knees@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Wien, Faculty of Informatics</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Music recommendation research faces several challenges when modeling the complex relationships between users, items, and the circumstances under which they interact. Lacking access to commercial catalogs and large customer bases, academic research builds its findings on publicly shared datasets. However, these often only contain selected data modalities, limited catalogs, or temporally restricted snapshots of interaction data. Moreover, they might eventually vanish due to licensing issues. A strategy to overcome some of these limitations could consist in multimodal representation learning for playlist continuation. For instance, while metadata and interaction data can be used to learn item representations, content-based data can be used to predict representations for tracks where audio is available but interaction data is lacking.</p>
      </abstract>
      <kwd-group>
        <kwd>Music Recommendation</kwd>
        <kwd>Representation Learning</kwd>
        <kwd>Item Embeddings</kwd>
        <kwd>Playlist Completion</kwd>
        <kwd>Music Representations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Music recommenders have become an inseparable feature of modern streaming platforms, aiming to
provide personalized suggestions to enhance users’ listening experience and engagement.</p>
      <p>
        Traditional approaches are mainly based on session-based models, collaborative filtering, or
content-based filtering. While collaborative filtering leverages user-item interactions, it often suffers from
cold-start problems where new items or users lack sufficient historical data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Session-based recommenders
treat song sequences as item indexes, modeling sequential dependencies without capturing actual audio
characteristics [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Content-based recommenders, on the other hand, rely on metadata such as genre,
artist, and lyrics, which may overlook the rich audio features embedded within the music itself [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Beyond this, (especially academic) music recommendation research faces several challenges by
building its findings primarily on publicly shared datasets due to a lack of access to commercial catalogs
and large customer bases. Research datasets often only contain selected data modalities, limited
item coverage, or temporal restrictions both in their availability and in the data they cover. A strategy to
overcome some of these limitations could consist in learning dataset-transcending and cross-modal
music representations. For instance—similar to strategies addressing various cold-start problems—
metadata and interaction data can be used to learn item representations, while content-based data can
be used to predict representations for tracks where audio is available but interaction data is lacking.</p>
      <p>
        Following this idea, in this work, we apply Word2Vec to the Spotify Million Playlist Dataset (MPSD)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to obtain track embeddings, which are then used as labels to train a CNN that learns audio-based
representations from MP3 files. These audio and co-occurrence-based representations offer a promising
foundation for more contextually-aware recommendation tasks, such as playlist continuation, where
understanding the relationships between songs is essential. This relates to an early approach for
addressing cold-start in collaborative filtering by van den Oord et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this work, as one of several
steps, we explore the method proposed by van den Oord et al. to bridge contextual
representation learning and audio-based prediction, and evaluate it in the context of playlist continuation.
Note that our primary goal at this point is not to outperform the current state of the art
in music recommendation, but rather to investigate the potential of training content-based embeddings
using a large-scale dataset, while also enabling generalization to previously unseen items. In addition,
we conduct a quantitative evaluation, which is missing in the original work by van den Oord et al.
Our study aims to fill this gap and serve as a reproducible reference for this fundamental approach.
Specifically, our contributions are as follows:
1. we provide a comprehensive quantitative evaluation of the method by van den Oord et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ];
2. we extend the model to a cold-start setting via CNN-based content prediction that is applicable
to real-world scenarios; and
3. we publicly release the learned embeddings and prediction models to support reproducibility and
future research. This enables representation learning even in cases where access to certain music
datasets is limited or revoked.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent advances in deep learning have significantly influenced music recommendation. Convolutional
Neural Networks (CNNs) are effective at learning audio representations from spectrograms, enabling
improved content-based recommendations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The work of van den Oord et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced a deep
content-based system that extracts audio embeddings via a CNN, which inspired our approach. While
incorporating negative signals has been shown to enhance sequential recommendation by distinguishing
relevant from irrelevant items [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we remain curious about the potential of CNNs in leveraging
user-collected playlists for music recommendation, where explicit negative signals are not available. Since
CNNs can efficiently extract meaningful audio features directly from raw audio, we believe they may
offer valuable insights, especially in cold-start scenarios where new or less popular items are involved.
Another approach is to use word embedding techniques like Word2Vec to learn meaningful song
representations from playlist co-occurrence patterns [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These methods treat playlists as sentences
and songs as words, generating vector embeddings that capture contextual relationships.
      </p>
      <p>
        Predicting and recommending tracks that appropriately fit into an existing playlist, or playlist
completion, is a key task in music recommendation. The RecSys 2018 challenge focused on this,
aiming to build systems that recommend missing songs for incomplete playlists [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Monti et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
combine Recurrent Neural Networks (RNNs) with pre-trained embeddings to model playlist dynamics
and semantic relationships. Volkovs et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] present a scalable two-stage pipeline that retrieves
and re-ranks candidate songs. Gatzioura et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] propose a hybrid system that combines Latent
Dirichlet Allocation with Case-Based Reasoning to recommend songs based on past playlist similarity.
Bendada et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduce a flexible, scalable represent-then-aggregate strategy that can incorporate
Transformers and other techniques. Yang et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] address cold-start and popularity bias with a
two-part model combining autoencoders and character-level CNNs using playlist titles.
      </p>
      <p>
        Although various approaches have addressed playlist continuation, the cold-start problem remains
an ongoing challenge. For playlist recommendation in such settings, Chen et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed a
multitask learning framework to improve generalization on new playlists and items. More recently,
Yurekli et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] introduced a multistage retrieval system that clusters playlists using user-generated
titles and applies latent semantic indexing to uncover relationships between tracks and titles. For
a new playlist, their model retrieves similar clusters and ranks tracks accordingly. In our work, we
extend this line of research by leveraging embeddings learned directly from audio to improve playlist
continuation. By integrating deep content-based representations into sequential models, we explore
whether audio-derived embeddings provide a more meaningful structure for generating contextually
relevant recommendations.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section, we describe the methodology used to integrate deep audio embeddings into sequential
recommender models for playlist completion. Our approach consists of several key components: data
preprocessing, Word2Vec embeddings generation, audio embedding extraction using CNNs, and the
integration of these embeddings into sequential recommenders for playlist continuation. The complete
pipeline is shown in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Preprocessing</title>
        <p>We used MPSD, which includes track metadata and playlists of varying lengths. Since MPSD does not
provide audio information, we utilize a subset of MP3 files collected using the Spotify API in Fall 2022.
More details about the datasets can be found in Section 4.1 (Data Preparation and Statistics). The preprocessing
pipeline is as follows:
• Using MPSD playlists as input to train a Word2Vec model, which provides semantic track
embeddings for use in both CNN (as labels) and recommender models (as item vectors).
• Identifying overlapping tracks between MPSD and our collected audio dataset. These tracks are
then split into training and testing sets, with 1,200,455 samples used for training and 187,507
samples reserved for testing.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Word2Vec: Create Track Embeddings</title>
        <p>In this step, we apply the Word2Vec algorithm to the MPSD dataset to generate track embeddings. The
dataset consists of numerous playlists, each containing a sequence of tracks. We treat each playlist as a
“sentence” and each track index (unique track ID) as a “word”. Given this input, Word2Vec learns an
embedding representation for each track based on its co-occurrence with other tracks in playlists.</p>
        <p>Word2Vec captures meaningful relationships between tracks, as songs frequently appearing together
in playlists are mapped to similar embedding spaces. This property makes the learned embeddings
useful as labels for training our CNN model, enabling the CNN to predict track representations based
on their audio features. Also, these embeddings have meaningful information to be used as item
representations in recommender models.</p>
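        <p>
          The playlist-as-sentence setup can be sketched with a toy skip-gram-with-negative-sampling loop. This is a simplified numpy-only stand-in for a full Word2Vec implementation; the playlists, embedding size, and hyperparameters below are illustrative only, not those used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "playlists": each playlist is a sequence of track IDs, treated as a
# sentence of words (illustrative data, not from MPSD).
playlists = [["t1", "t2", "t3"], ["t1", "t2", "t4"],
             ["t5", "t6", "t7"], ["t5", "t6", "t8"]]
vocab = sorted({t for pl in playlists for t in pl})
idx = {t: i for i, t in enumerate(vocab)}

dim, lr, window, n_neg = 16, 0.05, 2, 3
W_in = rng.normal(0.0, 0.1, (len(vocab), dim))   # target (track) embeddings
W_out = rng.normal(0.0, 0.1, (len(vocab), dim))  # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    for pl in playlists:
        for i, t in enumerate(pl):
            left, right = max(0, i - window), min(len(pl), i + window + 1)
            for j in range(left, right):
                if j == i:
                    continue
                w, c = idx[t], idx[pl[j]]
                v, u = W_in[w].copy(), W_out[c].copy()
                g = sigmoid(v @ u) - 1.0          # positive-pair gradient
                W_in[w] -= lr * g * u
                W_out[c] -= lr * g * v
                for n in rng.integers(0, len(vocab), n_neg):
                    if n == c:
                        continue
                    v, u = W_in[w].copy(), W_out[n].copy()
                    s = sigmoid(v @ u)            # negative-sample gradient
                    W_in[w] -= lr * s * u
                    W_out[n] -= lr * s * v

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

          Tracks that co-occur in playlists (here t1 and t2) should end up with a higher cosine similarity than tracks that never co-occur (here t1 and t6), which is the property exploited in the following steps.
        </p>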
      </sec>
      <sec id="sec-3-3">
        <title>3.3. CNN: Predicting Item Representations for the Recommender System</title>
        <p>
          Inspired by van den Oord et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we employ a CNN to predict audio-based track embeddings. The
overlapping tracks between MPSD and our collected audio dataset serve as input data, and by applying
Word2Vec on MPSD, we have embeddings for all the corresponding audio data. Rather than using
raw audio directly, we first convert MP3 files into spectrograms, which represent the time-frequency
distribution of audio signals.
        </p>
        <p>
          Unlike [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which randomly sampled 3-second windows from tracks, we use the full 30-second audio
clips, ensuring that the model captures a broader range of musical characteristics. Additionally, we
experiment with dilated CNNs, which expand the receptive field without increasing the number of
parameters. Dilated CNNs allow the network to capture both local and long-range dependencies in
audio signals, potentially improving representation learning by preserving hierarchical structures.
        </p>
        <p>The network consists of multiple convolutional layers followed by fully connected layers, with
the final output being a low-dimensional audio embedding. The architecture is designed to learn
representations that capture meaningful audio patterns. The model is trained using the pre-learned
Word2Vec embeddings as labels. This ensures that the learned audio-based embeddings align with the
semantic information captured in playlist co-occurrence data.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Sequential Recommendation: Playlist Completion</title>
        <p>
          In the final step, we incorporate deep audio embeddings into sequential recommender models to predict
the next track in a playlist. Our primary goal is to evaluate whether content-based embeddings improve
recommendations in cold-start scenarios, where tracks lack interaction history. To achieve this, we
initialize item embeddings in the recommender system using the Word2Vec representations, except
for the last track in each playlist. The last track’s item embedding is replaced with the CNN-predicted
representation. To prevent data leakage, we ensure that all last tracks belong to the test set and have
never been used in CNN training. For comparison, we also evaluate the performance of models when
their item vector is initialized randomly (similar to the original models’ training method), and when it
is completely replaced with Word2Vec representations. We evaluate two sequential recommenders:
1. Self-Attentive Sequential Recommendation (SASRec) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]: a self-attentive model capturing
sequence dependencies. We first train the model from scratch as the baseline and then replace item
embeddings with Word2Vec embeddings and also with CNN-predicted deep audio embeddings,
only for the last tracks, to examine their impact.
2. Bidirectional Encoder Representations from Transformers (BERT4Rec) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]: a
transformer-based sequential recommender that treats playlists as sequences and predicts missing
tracks. We use a similar training process as SASRec.
        </p>
        <p>
          We used the basic implementations of the recommender models provided by [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], and made
modifications based on our proposed algorithm.
        </p>
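        <p>
          The initialization scheme described above amounts to overriding single rows of the item-embedding matrix. A minimal numpy sketch (random stand-in vectors; in our pipeline the rows come from Word2Vec and the CNN, respectively):

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, dim = 10, 4                         # toy sizes
w2v = rng.normal(size=(n_items, dim))        # Word2Vec item embeddings
cnn_pred = rng.normal(size=(n_items, dim))   # CNN-predicted audio embeddings

playlist = [3, 7, 1, 9]                      # last track is the cold-start item
emb = w2v.copy()                             # initialize all items from Word2Vec
emb[playlist[-1]] = cnn_pred[playlist[-1]]   # replace only the last track's row
```

        </p>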
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Data Preparation and Statistics</title>
        <p>MPSD is a large-scale collection of user-generated playlists from 2010 to 2017, comprising 2,262,292
tracks and 1,000,000 playlists, providing valuable sequence data for playlist modeling, but lacking
audio features. To bridge this gap, we built an audio dataset using a previously collected set of MP3
files using the Spotify API. Our audio subset represents about 61% overlap (1,375,634 tracks and 988,585
playlists) with MPSD, which is a significant portion of the MPSD track list. This subset enables us to
learn deep content-based embeddings from raw audio and integrate them into recommender models. To
assess the coverage and distribution of our audio subset, Figure 2 presents a t-SNE plot of the Word2Vec
embeddings trained on MPSD, showing that the audio subset is well-distributed across the embedding
space. This indicates that our subset is representative of the broader playlist structure.</p>
        <p>
          To extract meaningful representations from raw audio, we converted MP3 files into log-compressed
mel-spectrograms using the Librosa library [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The spectrograms were generated with a sampling
rate of 22,050 Hz, 128 mel components, an FFT window size of 1,024, and a hop length of 512. These
parameters were chosen to balance computational efficiency with sufficient frequency resolution,
ensuring the spectrogram captures both low- and high-frequency musical characteristics.
        </p>
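        <p>
          With the parameters above (22,050 Hz sampling rate, 128 mel bands, FFT window 1,024, hop 512), a 30-second clip yields a fixed spectrogram shape; the arithmetic below reproduces it, assuming centered framing (Librosa's default):

```python
sr, n_mels, n_fft, hop = 22050, 128, 1024, 512   # parameters from the text
clip_seconds = 30
n_samples = sr * clip_seconds                     # 661,500 samples per clip

# With centered framing, the number of STFT frames is 1 + floor(n_samples / hop),
# so every 30-second clip maps to the same (mel bands, frames) input size.
n_frames = 1 + n_samples // hop
spectrogram_shape = (n_mels, n_frames)
```

          A fixed input size like this is what allows all clips to be fed to the CNN without cropping or padding heuristics.
        </p>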
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Models and Training</title>
        <p>4.2.1. Word2Vec
To learn track embeddings from playlists, we trained a Word2Vec model on the full MPSD dataset. We
tested several embedding sizes (40, 50, 100, 400), observing that the overall embedding space structure
remained almost stable in t-SNE visualization. However, we selected 400 for the final experiments,
since larger embedding dimensions allow the model to capture more nuanced relationships between
tracks, especially in high-dimensional feature spaces like music. Although smaller dimensions can still
represent Word2Vec track similarities effectively, they may lose the important details necessary for
CNNs and recommenders.</p>
        <p>
          Figure 3 shows the t-SNE result after applying Word2Vec with an embedding size of 400. With
a closer look, we can identify different possible clusters and specify frequent artists in each of the
highlighted regions. We only plot track embeddings of the most frequent artist names within each
cluster in the smaller plots. Interestingly, some regions demonstrate natural closeness: for example,
the Green (Reggae) and Red (Latin, Reggae) regions are adjacent, reflecting the stylistic and cultural
influences between these genres. Similarly, Yellow (Jazz, Musicals) and Purple (Classical) are positioned
closely, emphasizing their shared instrumental and compositional characteristics. This visualization
also captures meaningful relationships in the data: since Word2Vec was trained on track IDs, the
resulting clusters reflect artist associations. As a result, these regions may reveal possible dominant
genre tendencies, offering insight into how Word2Vec embeddings encode playlist information—useful
for recommendation explainability.
4.2.2. CNN
Our CNN model is designed to learn deep content-based track representations from mel-spectrograms.
The architecture consists of four convolutional layers (128-&gt;256-&gt;256-&gt;512-&gt;512 channels), each
followed by non-overlapping max pooling layers to progressively reduce temporal resolution while
preserving essential features. All convolutions use a stride of 1 to capture fine-grained patterns, and in
the dilated variant, the last two convolutional layers employ dilation rates of 2 and 4, respectively, to
expand the receptive field without increasing kernel size. Inspired by [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we apply a global temporal
pooling layer after the final convolution to aggregate temporal information using mean, max, and
L2-norm pooling functions. This pooled representation is passed through three fully connected layers,
reducing the dimensionality to a 400-dimensional track embedding.
        </p>
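        <p>
          The global temporal pooling layer described above (mean, max, and L2-norm pooling over time, following van den Oord et al.) can be sketched as:

```python
import numpy as np

def global_temporal_pool(feature_map):
    """Aggregate a (channels, time) feature map over the time axis with
    mean, max, and L2-norm pooling, concatenating the three summaries."""
    mean = feature_map.mean(axis=1)
    mx = feature_map.max(axis=1)
    l2 = np.sqrt((feature_map ** 2).sum(axis=1))
    return np.concatenate([mean, mx, l2])

# A 2-channel, 2-frame toy feature map; the pooled vector has 3 * channels entries.
fm = np.array([[1.0, 3.0],
               [0.0, 4.0]])
pooled = global_temporal_pool(fm)
```

          Pooling over the full time axis is what makes the embedding independent of clip length before the fully connected layers.
        </p>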
        <sec id="sec-4-2-1">
          <title>4.2.3. Recommender Systems</title>
          <p>For the SASRec model, we set the hidden units to 400 to match the dimension of the learned embeddings.
The model consists of two transformer blocks and uses one attention head. A dropout rate of 0.1 was
applied to regularize the model. In the BERT4Rec model, the model’s hidden size was also set to 400
to match the learned embeddings, with two hidden layers and two attention heads, due to resource
limitations. Both models were trained using the Adam optimizer, with these hyperparameters chosen
to balance model capacity and training efficiency.</p>
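          <p>
            The hyperparameter choices above can be summarized as a small configuration sketch (the key names are our own shorthand, not those of the underlying implementation):

```python
# Hyperparameters as stated in the text; dictionary keys are illustrative.
sasrec_cfg = {"hidden_units": 400, "num_blocks": 2, "num_heads": 1,
              "dropout": 0.1, "optimizer": "Adam"}
bert4rec_cfg = {"hidden_size": 400, "num_hidden_layers": 2, "num_heads": 2,
                "optimizer": "Adam"}
```

            Both hidden sizes are pinned to 400 so that Word2Vec and CNN embeddings can be dropped into the item-embedding tables without projection.
          </p>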
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>We evaluate our playlist continuation approach using three standard metrics: NDCG@K, which
measures ranking quality by considering the position of relevant tracks; HR@K, which checks if at least
one ground truth track appears in the top-K recommendations; and MAP@K, which computes the
mean precision of relevant tracks at different cut-offs. Table 1 presents the performance of SASRec and
BERT4Rec models under different initialization strategies. Since no prior work has evaluated SASRec or
BERT4Rec on MPSD, we trained both models from scratch as baselines for fair comparison. Additionally,
to ensure a valid evaluation, we cut the playlists—only when necessary—so that the final track in each
sequence belongs to the CNN test set. This ensures that the CNN never sees these track embeddings
during training and prevents data leakage. This adjustment may also lead to slightly different results
compared to using the original, unmodified dataset.</p>
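      <p>
        For next-track prediction with a single held-out ground-truth item, all three metrics reduce to simple positional formulas; the sketch below is our own formulation of the standard definitions for this single-target case:

```python
import math

def eval_at_k(ranked_ids, target_id, k):
    """HR@K, NDCG@K, and MAP@K for a single ground-truth item: all three
    depend only on the 1-based rank of the target within the top-K list."""
    topk = list(ranked_ids)[:k]
    if target_id not in topk:
        return {"HR": 0.0, "NDCG": 0.0, "MAP": 0.0}
    rank = topk.index(target_id) + 1
    return {"HR": 1.0,
            "NDCG": 1.0 / math.log2(rank + 1),  # DCG of one hit / ideal DCG
            "MAP": 1.0 / rank}                  # precision at the hit position
```

      </p>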
      <sec id="sec-5-1">
        <title>Table 1: Model Comparison</title>
        <p>Table 1 reports NDCG, HR, and MAP at several cut-offs for eight configurations:
SASRec and BERT4Rec, each evaluated with training from scratch, Word2Vec initialization,
Word2Vec + CNN, and Word2Vec + Dilated-CNN.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Discussion</title>
        <p>For SASRec, incorporating Word2Vec embeddings significantly improves performance over the
baseline. For example, SASRec (Word2Vec) achieves the highest NDCG@1, a score the baseline only
reaches at NDCG@100. While augmenting Word2Vec with CNN or Dilated-CNN features
yields slightly lower performance than Word2Vec alone, these models still outperform the baseline
across all metrics, confirming that content-aware representations enhance sequential recommendation
quality. In contrast, BERT4Rec shows more mixed results. The baseline generally outperforms the
others, except at @1, suggesting that BERT4Rec benefits less from external item initializations. This may
be because the pre-trained Word2Vec/CNN embeddings may not align well with the transformer-based
architecture of BERT4Rec. Moreover, increasing the number of hidden layers and attention heads might
improve its performance.</p>
        <p>Importantly, initializing item vectors with Word2Vec or CNN-based features not only improves
performance in most cases, but also accelerates training convergence. This highlights a practical benefit
of leveraging pre-trained representations in deep sequential recommender systems.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this study, we experimentally explored using content-based embeddings, extracted either from
Word2Vec or CNNs, to enhance sequential playlist recommendation. Our results show that SASRec
significantly benefits from these embeddings, with all proposed variants outperforming the baseline
model. While Word2Vec embeddings yield the best overall performance, CNN-based features remain
essential in cold-start settings, where new tracks lack listening history but have available audio content.
In contrast, BERT4Rec showed limited gains from content-based initialization, highlighting that its
self-attention mechanisms may already capture sufficient contextual information without external
embedding initialization. Nevertheless, its performance could potentially be improved by tuning
architectural parameters, such as the number of hidden layers and attention heads.</p>
      <p>
        Importantly, our work revisits and builds upon the approach introduced by van den Oord et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
applying their method to a new, large-scale playlist dataset and extending it with a quantitative
evaluation, which was missing in the original paper. Our goal is not to introduce a novel task, but to assess
the potential of deep content-based embeddings in different recommendation scenarios—particularly
in addressing the cold-start problem—and to provide learned item representations that can serve as a
reference for future work.
      </p>
      <p>A key strength of our approach lies in its ability to address the cold-start problem. By representing
unseen tracks using CNN-predicted embeddings, without requiring interaction data, we can generate
recommendations even for new or unpopular songs. This highlights the value of content-aware
embeddings in playlist completion, offering not only a practical solution for cold-start scenarios but also a
strong baseline for future work on explainability.</p>
      <p>
        For future work, several directions can be explored to further enhance performance and
generalizability. First, experimenting with alternative sequential models such as GRU4Rec may yield deeper
insight into how different architectures interact with content-based embeddings. Second, improving the
CNN architecture could enhance embedding quality. Third, comparing our approach with other recent
baselines, such as [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], would help measure its effectiveness better. Finally, leveraging the learned
embeddings in new recommendation contexts may reveal their potential for explainability in music
recommendation by linking predictions to interpretable audio features. Beyond these directions, future
work could also investigate dataset-transcending music representations to resolve misalignments across
entities, modalities, and semantic concepts. As this paper indicates, multi-modal and multi-source
representation learning can serve as a viable vehicle for overcoming some of the research challenges
resulting from a fragmented dataset landscape.
      </p>
      <p>In the bigger picture, the goal of this and the intended follow-up work is primarily to establish
baseline results of existing and often referenced pipelines and variations thereof in specific application
tasks and evaluation settings. A central goal is to leverage the (unfortunately vanishing) resources
available to music recommendation research and to identify—ideally—general representations that can
be reused and built upon in future recommendation tasks by publicly sharing them with the research
community. Learned representations might be a way to overcome the limitations the community is
facing and help to sustain the research area of music RecSys.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was funded in whole or in part by the Vienna Science and Technology Fund (WWTF)
[Grant ID: 10.47379/DCDH001]. For open access purposes, the author has applied a CC BY public
copyright license to any author-accepted manuscript version arising from this submission.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          ,
          <article-title>Matrix factorization techniques for recommender systems</article-title>
          ,
          <source>Computer</source>
          <volume>42</volume>
          (
          <year>2009</year>
          )
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hidasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <article-title>Session-based recommendations with recurrent neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1511.06939</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <article-title>Current challenges and visions in music recommender systems research</article-title>
          ,
          <source>International Journal of Multimedia Information Retrieval</source>
          <volume>7</volume>
          (
          <year>2018</year>
          )
          <fpage>95</fpage>
          -
          <lpage>116</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lamere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <article-title>RecSys Challenge 2018: Automatic music playlist continuation</article-title>
          ,
          <source>in: Proceedings of the 12th ACM Conference on Recommender Systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>527</fpage>
          -
          <lpage>528</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Van den Oord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dieleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schrauwen</surname>
          </string-name>
          ,
          <article-title>Deep content-based music recommendation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fazekas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Convolutional recurrent neural networks for music classification</article-title>
          ,
          <source>in: 2017 IEEE International conference on acoustics, speech and signal processing (ICASSP)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>2392</fpage>
          -
          <lpage>2396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Seshadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shashaani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knees</surname>
          </string-name>
          ,
          <article-title>Enhancing sequential music recommendation with negative feedback-informed contrastive learning</article-title>
          ,
          <source>in: Proceedings of the 18th ACM Conference on Recommender Systems</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1028</fpage>
          -
          <lpage>1032</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Barkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koenigstein</surname>
          </string-name>
          ,
          <article-title>Item2vec: neural item embedding for collaborative filtering</article-title>
          ,
          <source>in: 2016 IEEE 26th international workshop on machine learning for signal processing (MLSP)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Monti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Palumbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cabrio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morisio</surname>
          </string-name>
          ,
          <article-title>An ensemble approach of recurrent neural networks using pre-trained embeddings for playlist completion</article-title>
          ,
          <source>in: Proceedings of the ACM Recommender Systems Challenge 2018</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Volkovs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          ,
          <article-title>Two-stage model for automatic playlist continuation at scale</article-title>
          ,
          <source>in: Proceedings of the ACM Recommender Systems Challenge 2018</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatzioura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vinagre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Jorge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanchez-Marre</surname>
          </string-name>
          ,
          <article-title>A hybrid recommender system for improving automatic playlist continuation</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>33</volume>
          (
          <year>2019</year>
          )
          <fpage>1819</fpage>
          -
          <lpage>1830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Bendada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Salha-Galvan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bouabça</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cazenave</surname>
          </string-name>
          ,
          <article-title>A scalable framework for automatic playlist continuation on music streaming services</article-title>
          ,
          <source>in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>464</fpage>
          -
          <lpage>474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>MMCF: Multimodal collaborative filtering for automatic playlist continuation</article-title>
          ,
          <source>in: Proceedings of the ACM Recommender Systems Challenge 2018</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <article-title>Cold-start playlist recommendation with multitask learning</article-title>
          ,
          <source>arXiv preprint arXiv:1901.06125</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yürekli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kaleli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bilge</surname>
          </string-name>
          ,
          <article-title>Alleviating the cold-start playlist continuation in music recommendation using latent semantic indexing</article-title>
          ,
          <source>International Journal of Multimedia Information Retrieval</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>185</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Self-attentive sequential recommendation</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Data Mining (ICDM)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer</article-title>
          ,
          <source>in: Proceedings of the 28th ACM international conference on information and knowledge management</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1441</fpage>
          -
          <lpage>1450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Klenitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vasilev</surname>
          </string-name>
          ,
          <article-title>Turning dross into gold loss: Is BERT4Rec really better than SASRec?</article-title>
          ,
          <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1120</fpage>
          -
          <lpage>1125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McVicar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Battenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <article-title>librosa: Audio and music signal analysis in Python</article-title>
          ,
          <source>SciPy 2015</source>
          (
          <year>2015</year>
          )
          <fpage>18</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Korzeniowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gouyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Ehmann</surname>
          </string-name>
          ,
          <article-title>Supervised and unsupervised learning of audio representations for music understanding</article-title>
          ,
          <source>in: Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>