Autoencoders for Next-Track-Recommendation

Michael Vötter, Eva Zangerle, Maximilian Mayerl, Günther Specht
Databases and Information Systems, Department of Computer Science, University of Innsbruck, Austria
{firstname.lastname}@uibk.ac.at

31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019 - 14.06.2019, Saarburg, Germany. Copyright is held by the author/owner(s).

ABSTRACT

In music recommender systems, playlist continuation is the task of continuing a user's playlist with a fitting next track, often also referred to as next-track or sequential recommendation. This work investigates the suitability and applicability of autoencoders for the task of playlist continuation. We utilize autoencoders, and hence representation learning, to continue playlists. Our approach is inspired by the use of autoencoders to denoise images: we consider the playlist without the missing next track as a noisy input. In particular, we design different autoencoders for this specific task and investigate the effects of different designs on the overall suitability of the recommendations produced by the resulting recommender systems. To evaluate the suitability of the recommendations produced by the proposed approach, we use the AotM-2011 and LFM-1b datasets. Based on those datasets, we show that n-grams are a well-performing alternative baseline to kNN. Further, we show that it is possible to outperform a kNN as well as an n-gram baseline with our autoencoder approach.

1. INTRODUCTION

Recommender systems are applicable to a broad spectrum of domains. The music domain is one such application area, where one specific task is next-track music recommendation. Next-track music recommender systems are recommender systems that aim to find a fitting continuation (next track) for a given playlist.
In general, a playlist is an ordered list of music tracks, where the order is based on time: the first track in the list is expected to be listened to first, followed by the second track, and so on. In other words, a playlist is a time series of tracks.

Multiple different approaches have been proposed for the playlist continuation task. As mentioned in [7], these approaches are based on a broad spectrum of techniques such as Markov models, collaborative filtering, content similarity, as well as hybrids of them. Another traditional approach to compute next-track recommendations is a nearest neighbor search, as used in multiple other papers [5, 8, 10].

In the field of music recommendation, deep learning approaches are usually used to include additional features such as textual information [1] or content features [18] in the recommendation process. In contrast, only a few approaches such as [9] directly apply neural network approaches to the playlist continuation task. To fill this gap, we propose a novel autoencoder-based approach that directly applies neural network-based representation learning to the playlist continuation task. The simplest form of an autoencoder is a neural network with a dense input layer and a dense output layer which is trained in an unsupervised manner. Our approach is inspired by the successful application of autoencoders for image denoising [19]. We consider the input playlist (that has to be continued) as a noisy version of the resulting playlist, where a next track is added as a continuation. We argue that representation learning methods are more suitable to take advantage of the features contained in the playlist structure than, e.g., kNN, because representation learning methods are specifically designed to learn an effective representation and hence features [2]. To the best of our knowledge, this is the first time that autoencoders are utilized for the playlist continuation task, while the closest related tasks in which autoencoders have been used successfully are collaborative filtering tasks [15, 16, 21].

With this work, we investigate the general applicability of autoencoders for the playlist continuation task. We report the effects of different parameter settings on the overall suitability of recommendations produced by the system and answer the following research questions:

• RQ1: How can playlists be vectorized for an autoencoder?

• RQ2: Is there an alternative baseline to kNN that better utilizes the order of tracks in a playlist?

• RQ3: Which autoencoder design produces competitive results for the next-track music recommendation task?

In [10], Kamehkhosh and Jannach propose to use hand-crafted playlists to evaluate next-track music recommender systems. Inspired by that, we use the AotM-2011 dataset as well as the LFM-1b dataset to evaluate the suitability of recommendations produced by our approach. Our experiments show that the resulting autoencoder approach produces competitive recommendations compared to the kNN baseline.

The remaining sections of this paper are organized as follows. First, related work is discussed in Section 2. Afterwards, in Section 3, we describe the algorithm to convert a playlist into a corresponding vector and present our autoencoder approach. Following that, the experimental setup to evaluate our approach by comparing it with a kNN baseline is described in Section 4. Thereafter, we present the results in Section 5 and finally draw a conclusion in Section 6.

2. RELATED WORK

This section gives an overview of next-track music recommendation approaches.

We consider the playlist continuation task to be a special case of the more general task of playlist generation. According to Bonnin and Jannach [5], the preconditions for a playlist generation task are a background knowledge base and target characteristics for the resulting playlist. Based on that, a sequence of tracks (playlist) best fitting the characteristics needs to be found. The playlist generation problem may be converted to the playlist continuation problem by considering all playlists/sessions as the background database and using a target characteristic that describes a fitting track given a playlist to be continued.

Sedhain et al. [15] introduced an autoencoder approach for collaborative filtering. They report that their approach outperforms state-of-the-art methods such as matrix factorization and neighborhood methods. Zhang et al. [21] use an autoencoder as part of a hybrid collaborative filtering framework able to produce personalized top-n recommendations and rating predictions. For their proposed Semi-AutoEncoder approach, they removed the restriction that the input and output layer must be of the same dimensionality and chose to make the input layer wider than the output layer. This allows feeding the autoencoder with additional feature vectors. In [9], Jannach and Ludewig compare a recurrent neural network (RNN) approach with a kNN approach for the task of session-based recommendations. Their findings show that the RNN approach is inferior, but they believe that further research will probably lead to better RNN configurations that are able to outperform the kNN approach. Nevertheless, this shows that kNN is a strong baseline to compare against.

Jannach et al. [8] present multiple extensions to the kNN approach, which they compare to a kNN baseline. They propose to take additional measures into account with a weighted sum. By using the social tags that are assigned to tracks by Last.fm (https://www.last.fm/) users, they take content similarities into account. Further, they suggest using numerical features such as tempo, loudness and release year. Additionally, they state that it is possible to take long-term content-based user profiles into account.

The evaluation approach presented by McFee and Lanckriet [12] supports our assumption that playlists and their tracks contain enough information to find fitting next tracks for a given playlist. They come up with the idea to consider playlist generation as a natural language modeling problem instead of an information retrieval problem. Therefore, they consider a playlist to be equivalent to a sentence in a natural language and tracks to be equivalent to words. Further, they show how techniques known from natural language processing can be used to evaluate playlist generation algorithms.
3. METHODS

In this section, the proposed recommendation approach based on an autoencoder is presented. First, we explain how playlists are converted into a vector, which is necessary to use them as an input for the autoencoder. Afterwards, the structure and implementation details of the autoencoder are presented in Section 3.2. This includes the general training procedure used and a modified autoencoder layout to overcome overfitting by simulating the continuation task during training.

3.1 Vector Encoding of Playlists

Playlists are usually represented as an ordered list of tracks. In the special case of the playlist continuation task, the playlist used as input for the algorithms is often referred to as "history". This history can be considered a list of past listening events.

To use playlists as an input for autoencoders, it is necessary to convert the ordered list into a vector representation. We propose three different ways to determine the value of each dimension of the generated playlist vector, as presented in the following. An example of all three encodings is shown in Figure 1.

Binary Encoding is the simplest encoding and is inspired by the (one-hot) vector encoding used in [9]. Each track t in the playlist is converted to the corresponding one-hot encoded track vector v_t, where all dimensions except the one assigned to the track (index t) are set to 0, while the dimension with index t is set to 1. After that, the playlist vector p is computed as the sum of all track vectors, p = Σ_i v_{t_i}. Note that the ordering information of the playlist is lost.

Order Encoding is a modified version of the Binary Encoding and includes ordering information. We propose to use the track's index i in the playlist as the value of the dimension t assigned to the track. Therefore, the track vector encoding contains 0 for all dimensions except the dimension with index t, to which the value i is assigned. To obtain a playlist vector p, all track vectors are summed up.

Normalized-Order Encoding is an extension of Order Encoding and takes the length of a playlist into account. The playlist vector p is normalized by the number of tracks contained in the corresponding playlist, which reduces the effects of the playlist length on the encoding.

                       (1, 0, 1, 1, 1, 0, 1)                   binary
    [5, 1, 3, 7, 4] →  (2, 0, 3, 5, 1, 0, 4)                   order
                       (2/5, 0, 3/5, 5/5, 1/5, 0, 4/5)         normalized-order

Figure 1: Playlist to encoding transformation.
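To make the three encodings concrete, below is a minimal NumPy sketch of how a playlist (given as a list of 0-based track indices) could be turned into the binary, order, and normalized-order vectors described above. The function names and the num_tracks parameter are our own illustrative choices and not part of the original implementation.

```python
import numpy as np

def binary_encoding(playlist, num_tracks):
    """Binary encoding: dimension t is 1 if track t occurs in the playlist."""
    vec = np.zeros(num_tracks)
    for track in playlist:
        vec[track] = 1.0
    return vec

def order_encoding(playlist, num_tracks):
    """Order encoding: dimension t holds the (1-based) position of track t."""
    vec = np.zeros(num_tracks)
    for position, track in enumerate(playlist, start=1):
        vec[track] = float(position)
    return vec

def normalized_order_encoding(playlist, num_tracks):
    """Normalized-order encoding: order encoding divided by the playlist length."""
    return order_encoding(playlist, num_tracks) / len(playlist)

# The example from Figure 1, with track ids 1..7 mapped to dimensions 0..6:
playlist = [4, 0, 2, 6, 3]                     # playlist [5, 1, 3, 7, 4] as 0-based ids
print(binary_encoding(playlist, 7))            # [1. 0. 1. 1. 1. 0. 1.]
print(order_encoding(playlist, 7))             # [2. 0. 3. 5. 1. 0. 4.]
print(normalized_order_encoding(playlist, 7))  # [0.4 0.  0.6 1.  0.2 0.  0.8]
```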
3.2 Autoencoders for Playlist Continuation

Autoencoders are an unsupervised learning method used for representation learning [19]. In its simplest form, an autoencoder is a neural network with one input layer, one hidden layer and one output layer, where the input layer is fully connected to the hidden code layer, which in turn is fully connected to the output layer.

In contrast to a representation learning task, where the decoder is removed after the training phase to obtain the code as the output of the network, we use the whole network to compute recommendations. Recommendations are computed based on the following procedure: Given a playlist that should be continued with a fitting next track as input, it is first necessary to convert this playlist to a vector. Afterwards, this vector-encoded playlist is used as input for the trained network, which produces an output vector represented by the output layer. The output vector holds rating values for all tracks contained in the dataset used to train the network (one dimension per track). This output vector is then converted into a prioritized list of tracks by creating a list of indexes (track ids) ordered by their corresponding value in the vector. The index with the highest value is first in the list, while the index with the lowest value is last. Further, all tracks contained in the input playlist are removed from this prioritized list to get next-track recommendations differing from the tracks contained in the input playlist. Mostly, it is necessary to compute a given number of possible continuations. This is achieved by chopping off the list after the given number of tracks.
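As an illustration of this procedure, the following sketch turns the network's output vector into a ranked list of next-track candidates and removes the tracks already contained in the input playlist. The function and variable names are ours and merely mirror the steps described above.

```python
import numpy as np

def recommend(model, playlist_vector, playlist_tracks, n=20):
    """Rank all tracks by the autoencoder's output scores and return the
    top-n track ids that are not already part of the input playlist."""
    scores = model.predict(playlist_vector.reshape(1, -1))[0]  # one score per track
    ranked = np.argsort(scores)[::-1]                          # highest score first
    known = set(playlist_tracks)
    ranked = [int(t) for t in ranked if int(t) not in known]
    return ranked[:n]                                          # chop off after n tracks
```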
Using Keras (https://keras.io/), we implemented an autoencoder in Python. Our implementation allows setting the number of epochs, the hidden code layer activation function, the output layer activation function and the loss function. The input layer size equals the output layer size and is determined automatically based on the dataset. Further, we decided to automatically set the code layer size to the input size divided by 40, which is a result of preliminary experiments. Keep in mind that dense layers are used to build the network, which means that all nodes of one layer are connected to each node of the neighboring layers. Our network consists of one input layer followed by a filtering layer that removes the last track of the input during training. This filtering layer is followed by an optional dropout layer with a 0.5 dropout rate that can be disabled. This layer is then followed by a hidden code layer and an output layer.

To train the neural network, a training set is used, which consists of an input and the expected output, which are both the same for autoencoders. The autoencoder learning method is depicted in Figure 2. To train an autoencoder based on the previously introduced playlist vectors, it is first necessary to create an autoencoder with an input/output layer size fitting the dimensionality of the playlist vectors in the dataset. The training process is further configured to use 1/64 of the total number of playlists in the training set as its batch size and uses Adam as the optimizer, because preliminary experiments showed that these settings work well and show similar performance compared to other parameter choices.

This setup allows comparing different parameter configurations of an autoencoder where the basic structure of the used neural network, which can be seen in the center part of Figure 2, remains the same.

Figure 2: Training workflow for the autoencoders.
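The following is a minimal Keras sketch of the described layout, assuming normalized-order playlist vectors as input. To keep the example self-contained, the "filtering" step that hides the last track during training is implemented as a preprocessing function on the input vectors rather than as a network layer, and the chosen activations and loss are only one possible configuration; this is not the authors' exact implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(num_tracks, use_dropout=True,
                      code_activation="tanh", output_activation="sigmoid"):
    """Dense autoencoder: input -> (optional dropout) -> code (input/40) -> output."""
    code_size = max(1, num_tracks // 40)
    inputs = keras.Input(shape=(num_tracks,))
    x = layers.Dropout(0.5)(inputs) if use_dropout else inputs
    code = layers.Dense(code_size, activation=code_activation)(x)
    outputs = layers.Dense(num_tracks, activation=output_activation)(code)
    model = keras.Model(inputs, outputs)
    # CosineSimilarity corresponds to the "cosine proximity" loss mentioned in the text.
    model.compile(optimizer="adam", loss=keras.losses.CosineSimilarity())
    return model

def filter_last_track(playlists, vectors):
    """Simulate the continuation task: zero out the last track of each playlist
    in the input vectors while keeping the full playlist vector as the target."""
    filtered = vectors.copy()
    for row, playlist in enumerate(playlists):
        filtered[row, playlist[-1]] = 0.0
    return filtered

# Training: input = filtered vectors, target = complete playlist vectors.
# model = build_autoencoder(num_tracks)
# model.fit(filter_last_track(playlists, vectors), vectors,
#           batch_size=max(1, len(vectors) // 64), epochs=40)
```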
4. EXPERIMENTS

In this section, we present the setup used for the conducted experiments to evaluate the suitability of recommendations produced by the autoencoder-based approach in comparison to a kNN baseline. In Section 4.1, we introduce the used datasets. Thereafter, in Section 4.2, the kNN baseline recommender is introduced, followed by an explanation of the n-gram baseline in Section 4.3 and the overall experimental setup in Section 4.4.

4.1 Datasets

To evaluate the recommender systems, we aim for datasets that are based on user interaction such as listening logs and playlists, because next-track music recommender systems must satisfy the needs of users. Along the lines of previous work [12, 13, 4, 5, 8, 7, 9], we use datasets based on data gathered from the two music platforms Last.fm (http://www.last.fm) and Art of the Mix (http://www.artofthemix.org).

Based on the LFM-1b dataset [14] gathered from Last.fm, listening sessions were extracted by Jacob Winder in [20]. These sessions are created by assuming that two listening events of a single user belong to the same listening session if there are no more than 30 minutes between them. The resulting sessions are further filtered: all successive occurrences of the same track are merged into one occurrence, and after doing so, all sessions of length one are dropped. This results in approximately 62 million sessions, which we filtered further. In a first step, a session chunk containing the first 3 million sessions with a minimum length of three was created. This chunk shows a high number of different tracks. Reducing the number of tracks contained in a dataset is an important step to keep the size of the resulting neural network low enough to train it in reasonable time. This was achieved by dropping all playlists that contain rarely occurring tracks; we dropped all playlists containing tracks with 840 or fewer occurrences.

Further, we use the AotM-2011 dataset [13], provided as a Python pickle export (https://git.io/fhNfZ) by Vall et al. [17], as the AotM dataset is often used in the literature. Instead of using their split, we merged the training and test set and used five-fold cross-validation, as we also applied five-fold cross-validation on the LFM-1b based dataset. The AotM-2011 dataset contains far fewer playlists and far more tracks than the LFM-1b based datasets (see Table 1). This leads to a playlists/track ratio of 0.220, which is substantially smaller than the ratio of the LFM-1b based datasets.

Table 1: Detailed dataset description.

    Dataset    Playlists    Tracks    Playlists/Track
    LFM-1b        20,824     2,673    7.790
    AotM           2,715    12,355    0.220

4.2 Baseline kNN Recommender Systems

A k-Nearest Neighbors (kNN) approach [5] is used as one of the baselines for our experiments. We have chosen kNN as it has been used for the playlist continuation task in multiple other papers such as [5, 8, 7, 10].

The basic idea behind kNN is to find the k items that are the nearest neighbors of a given item. The neighborhood of items in general is defined based on a distance function, which allows taking different properties into account. Further, it is necessary to derive a conclusion from those k neighbors, which can, for example, be done with a majority vote [11].

In [10], a binary cosine similarity is used as the distance function. We run a grid search with k set to 10, 20, 50, 100, 200 and 300. We further include the three different ranking functions cosine similarity, item-item similarity and tf-idf similarity defined by the kNN implementation of the implicit Python package (version 0.3.8, https://pypi.org/project/implicit/0.3.8/) in the grid search.
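One common way to realize such a session-based kNN over the playlist vectors from Section 3.1 is to score candidate tracks by a similarity-weighted sum over the k most similar training playlists. The sketch below uses plain cosine similarity and NumPy instead of the implicit package and is only meant to illustrate the idea, not to reproduce the exact baseline used in the experiments.

```python
import numpy as np

def knn_recommend(history_vec, train_vectors, k=20, n=20, exclude=()):
    """Score tracks by the similarity-weighted sum of the k nearest playlists."""
    # Cosine similarity between the history vector and every training playlist.
    norms = np.linalg.norm(train_vectors, axis=1) * np.linalg.norm(history_vec)
    sims = train_vectors @ history_vec / np.maximum(norms, 1e-12)
    neighbors = np.argsort(sims)[::-1][:k]                 # k most similar playlists
    scores = sims[neighbors] @ train_vectors[neighbors]    # weighted track scores
    known = set(exclude)
    ranked = [int(t) for t in np.argsort(scores)[::-1] if int(t) not in known]
    return ranked[:n]
```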
4.3 Baseline N-Gram Recommender System

N-grams are a common statistical model in natural language processing; this technique is, for example, used in [3] for word prediction. Instead of using sequences of words in a sentence, we use sequences of tracks in a playlist. The parameter n specifies the number of successive tracks that are taken into account by the model. Therefore, the simplest model is a unigram model (n = 1), which only counts track occurrences in playlists.

Increasing the number n of the n-gram model makes it possible to consider the previous n − 1 tracks for the prediction by calculating probabilities as given in the following equation:

    P_n-gram(t_i | t_{i-(n-1)}, ..., t_{i-1}) = F(t_{i-(n-1)}, ..., t_i) / F(t_{i-(n-1)}, ..., t_{i-1})    (1)

where t_i is the i-th track in a sequence of tracks t_1, t_2, ..., t_n, and F(seq) is the frequency of occurrences of a given sequence of tracks seq in the training set. Ranking the tracks by their probabilities (highest first) allows making predictions based on the probabilities learned by an n-gram model.
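As a concrete illustration of Equation 1 for n = 2, the following sketch counts track bigrams over the training playlists and ranks candidate next tracks by their estimated probability given the last track of the history. It is a simplified illustration (no smoothing or back-off), not the authors' implementation.

```python
from collections import Counter, defaultdict

def train_bigram(playlists):
    """Count F(t_prev, t) for every pair of successive tracks in the training playlists."""
    pair_counts = defaultdict(Counter)
    for playlist in playlists:
        for prev, cur in zip(playlist, playlist[1:]):
            pair_counts[prev][cur] += 1
    return pair_counts

def bigram_recommend(pair_counts, history, n=20):
    """Rank candidates by P(t | last track) = F(last, t) / F(last)."""
    followers = pair_counts.get(history[-1], Counter())
    total = sum(followers.values())
    probs = {t: c / total for t, c in followers.items()}
    known = set(history)
    ranked = sorted((t for t in probs if t not in known), key=probs.get, reverse=True)
    return ranked[:n]
```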
4.4 Experimental Setup

We used scikit-learn (https://scikit-learn.org/) to run a grid search on the parameters of the approaches. To ensure that the reported results are not bound to a specific train-test split of the datasets, we run a five-fold cross-validation. The k-fold splitting procedure of scikit-learn was used with a fixed random state (seed) to ensure the reproducibility and comparability of the results. The metrics used for the evaluation are recall (r) and mean reciprocal rank (mrr).

For the evaluation of all recommender systems, a two-step process is used for each of the two presented datasets. In the first step, each recommender system is trained using the training data. For this purpose, a new instance of the recommender system is created for each run and then trained using the training procedure. The second step of the evaluation process computes recommendations with the previously trained recommender. To this end, each playlist in the training set is decomposed into a history (all tracks except the last one) and the last track [6, 10]. The history is used as input for the recommender system, while the last track is the expected recommendation, based on which the metrics are computed. Recommendation lists of different lengths are considered to get an impression of the effects of the recommendation list length.

The autoencoder implementation presented in Section 3.2 has a high degree of freedom in terms of modifiable parameters. Therefore, we decided to fix the reduction factor (code layer size), the batch size and the optimizer. The used values were obtained from preliminary experiments. Based on the knowledge gained from those preliminary experiments, we determined value ranges for the other parameters used in the grid search.
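To make the two-step evaluation concrete, the sketch below decomposes each playlist into a history and its held-out last track, obtains a ranked recommendation list for the history, and derives recall@n and mrr@n from the rank of the held-out track. The recommend_fn callable stands for any of the recommenders discussed above; the code is our own illustrative reading of the protocol.

```python
def evaluate(playlists, recommend_fn, n=20):
    """Mean recall@n and MRR@n over the held-out last track of each playlist."""
    hits, reciprocal_ranks = 0, 0.0
    for playlist in playlists:
        history, target = playlist[:-1], playlist[-1]   # history + expected next track
        ranked = recommend_fn(history)[:n]              # ranked track ids, best first
        if target in ranked:
            hits += 1
            reciprocal_ranks += 1.0 / (ranked.index(target) + 1)
    return hits / len(playlists), reciprocal_ranks / len(playlists)
```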
5. RESULTS

In the following, we first report the recommendation suitability of the different encoding types in Section 5.1. After that, the results achieved by the n-gram baseline are shown in Section 5.2. Finally, we compare the suitability of recommendations produced by our approach on the different datasets in Section 5.3.

5.1 Encoding Type

In a first evaluation step, the recommendation suitability of the different encoding types introduced in Section 3.1 was compared using the kNN baseline, based on the AotM and LFM-1b datasets. Table 2 shows the results for kNN using the item-item distance metric. We observe that the normalized-order encoding outperforms both other encodings on both datasets and for different values of k. Interestingly, order encoding without normalization has a negative effect on the performance of the kNN implementation. This can be explained by the fact that the length of a playlist is encoded as well. In addition, order encoding has a larger vector space than binary and normalized-order encoding, as the values in each dimension have a bigger range. For space reasons, we do not include results for the cosine and tf-idf distance metrics and other values of k, as these also show that normalized-order encoding works best. Further experiments showed that the autoencoders show similar behavior when the encoding type is changed.

Based on these findings, we argue that the normalized-order encoding should be used among the three encodings introduced in Section 3.1, which also answers RQ1. Therefore, we use normalized-order encoding for all further evaluations.

Table 2: Impact of encoding types measured as recall (r) and mean reciprocal rank (mrr) of kNN.

    Dataset   Encoding      k      r@1     r@20    mrr@20
    AotM      binary        20     0.022   0.056   0.029
    AotM      binary        200    0.022   0.062   0.031
    AotM      order         20     0.025   0.057   0.031
    AotM      order         200    0.025   0.060   0.032
    AotM      norm.-order   20     0.027   0.062   0.035
    AotM      norm.-order   200    0.027   0.065   0.035
    LFM-1b    binary        20     0.471   0.886   0.593
    LFM-1b    binary        200    0.471   0.864   0.590
    LFM-1b    order         20     0.435   0.885   0.556
    LFM-1b    order         200    0.437   0.856   0.556
    LFM-1b    norm.-order   20     0.618   0.883   0.695
    LFM-1b    norm.-order   200    0.619   0.866   0.694

5.2 N-Gram Baseline

To evaluate the suitability of an n-gram model, we compare a bigram (n = 2) and a trigram (n = 3) model with the best performing kNN configuration (found using a grid search) on each dataset. The kNN baseline using item-item as the distance metric with 20 neighbors (kNNi20) works best on the AotM dataset, while the cosine distance with 50 neighbors (kNNc50) works best on the LFM-1b dataset. In addition, we give the results of a unigram (n = 1) model that always recommends the most popular tracks.

Table 3 shows that the bi- and trigram models are able to outperform a unigram model. Further, it can be seen that those n-gram models work much better on our LFM-1b dataset variation than on the AotM dataset. This can be traced back to the fact that each track on average occurs in 0.22 playlists in the AotM dataset compared to 7.79 occurrences in the LFM-1b dataset, as stated in Table 1. Therefore, common sequences of tracks among playlists are more likely in the LFM-1b dataset than in the AotM dataset. We argue that this is also the reason why the absolute values of all metrics differ that much when comparing the results on both datasets.

Compared to both kNN baselines, it can be seen that the n-gram models work better on the LFM-1b dataset, especially for short recommendation lengths. In contrast, they are less effective than the kNN models on the AotM dataset.

It can be seen that n-gram models form a strong baseline for the LFM-1b dataset. Note that the kNN models operate on the normalized-order encoding, while the n-gram models utilize the track sequences directly without any encoding, which answers RQ2.

Table 3: Performance of different recommender systems (RecSys) measured as recall (r) and mean reciprocal rank (mrr).

    Dataset   RecSys    r@1     r@5     r@20    mrr@5   mrr@20
    AotM      unigram   0.001   0.004   0.016   0.002   0.003
    AotM      bigram    0.015   0.027   0.031   0.020   0.020
    AotM      trigram   0.015   0.015   0.015   0.015   0.015
    AotM      kNNi20    0.027   0.044   0.062   0.033   0.035
    AotM      kNNc50    0.018   0.024   0.036   0.020   0.022
    AotM      AE1       0.028   0.048   0.076   0.035   0.038
    AotM      AE2       0.028   0.045   0.073   0.035   0.039
    AotM      AE3       0.027   0.045   0.080   0.034   0.037
    AotM      AE4       0.027   0.045   0.069   0.034   0.036
    LFM-1b    unigram   0.004   0.014   0.046   0.008   0.011
    LFM-1b    bigram    0.748   0.841   0.881   0.784   0.788
    LFM-1b    trigram   0.727   0.771   0.775   0.746   0.746
    LFM-1b    kNNi20    0.619   0.786   0.883   0.619   0.695
    LFM-1b    kNNc50    0.635   0.799   0.869   0.635   0.708
    LFM-1b    AE1       0.391   0.684   0.806   0.500   0.515
    LFM-1b    AE2       0.619   0.788   0.848   0.686   0.693
    LFM-1b    AE3       0.650   0.795   0.849   0.708   0.714
    LFM-1b    AE4       0.647   0.806   0.855   0.711   0.717

5.3 Autoencoder Approach

In Table 3, we depict the results of our autoencoder approach in comparison to the kNN and n-gram baselines. The results show that it is possible to outperform a kNN baseline on both datasets using the proposed autoencoder approach. This is especially true for longer recommendation lengths. To give a better overview of the capabilities of our autoencoder approach, we give the results of four autoencoder configurations. To distinguish the different parameter configurations of our autoencoder approach, we name each configuration (AE1–AE4) for which we report results. For each dataset, we include results for one autoencoder including the dropout layer (see Section 3.2) and one without dropout. AE1 (without the dropout layer) and AE2 (with the dropout layer) are the best performing autoencoder configurations for the AotM dataset, while AE3 (without dropout) and AE4 (with dropout) perform best on LFM-1b according to the mrr@1. AE1 uses tanh as the code layer and output layer activation function with cosine proximity as the loss function and is trained over 5 epochs. AE2 utilizes relu as the code layer activation and softmax as the output layer activation with categorical crossentropy loss and was trained over 40 epochs. AE3 and AE4 are both trained over 40 epochs and use tanh as the code layer activation with cosine proximity as loss. While AE3 is configured with a softmax output activation, AE4 uses sigmoid as the output activation.

It can be seen in the results that the autoencoders outperform both the kNN and n-gram baselines on the AotM dataset, while they are not able to outperform the n-gram models on the LFM-1b dataset. Surprisingly, AE1 works best on AotM when trained for only 5 epochs. While AE2, AE3 and AE4 reveal similar results per dataset, AE1 only produces comparable results on the AotM dataset, which answers RQ3. One possible explanation is that it overfits on the particular training set, which we try to prevent using a dropout layer.

In the above sections, the recommendation suitability impact of multiple configurations of an autoencoder was presented. Additionally, results of the best performing configurations on the different datasets were given. The results show that autoencoders can be used for the playlist continuation task when configured correctly.
6. CONCLUSIONS

In this work, we proposed a novel autoencoder approach for the playlist continuation task. To use playlists as an input for autoencoders, we introduced a procedure to encode playlists as vectors. The evaluation shows that the proposed autoencoder approach outperforms a basic kNN approach. Particularly, the results show that this is the case regardless of the playlists/track ratio of the used dataset.

This work solely focuses on determining whether an autoencoder approach can be used for the playlist continuation task. We showed that outperforming basic kNN is possible for datasets that we consider small in comparison to the amount of data given in a real-world scenario. One possible source of improvement is the training procedure. Autoencoders are usually trained to reconstruct the input, which we modified slightly: we introduced a filtering layer in the training phase that removes the last track of the input. This trains the autoencoder to "reconstruct" playlists including the last track (next track) that was filtered from the input. Additionally, it would be possible to specifically design a loss function for the continuation task. Strub et al. [16] present a loss function that disregards unknown values to train autoencoders as a collaborative filtering method. Applying a similar loss function to our training procedure is part of future work.

Additionally, a more advanced training procedure could lead to a well performing deep autoencoder. One way of creating a deep autoencoder would be to first train an autoencoder with one input and one output layer (as the ones proposed in this work) and then use the learned code as an input to train another autoencoder. After that, it is possible to split the first autoencoder into its encoder and decoder parts and insert the second autoencoder in between. The resulting autoencoder can then be fine-tuned and extended in the same way. This process resembles the one proposed by Vincent et al. [19], where they stack denoising autoencoders. Using such advanced training procedures for deep autoencoders is part of future work.

In addition, a user study to evaluate the approaches should be conducted in future work. This is important to get an impression of the user-perceived quality of the approaches.
7. REFERENCES

[1] T. Bansal, D. Belanger, and A. McCallum. Ask the GRU: Multi-task Learning for Deep Text Recommendations. In 10th ACM Conf. on Rec. Sys., RecSys, pages 107–114, 2016.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[3] S. Bickel, P. Haider, and T. Scheffer. Predicting Sentences using N-Gram Language Models. In Empirical Methods in NLP, 2005.
[4] G. Bonnin and D. Jannach. Evaluating the Quality of Playlists Based on Hand-Crafted Samples. In 14th Conf. of the Intl. Society for Music Information Retrieval, ISMIR, pages 263–268, 2013.
[5] G. Bonnin and D. Jannach. Automated Generation of Music Playlists: Survey and Experiments. ACM Comput. Surv., 47(2):26:1–26:35, 2014.
[6] S. Craw, B. Horsburgh, and S. Massie. Music Recommenders: User Evaluation Without Real Users? In 24th Intl. Joint Conf. on Artificial Intelligence, IJCAI. AAAI, 2015.
[7] D. Jannach, I. Kamehkhosh, and G. Bonnin. Biases in Automated Music Playlist Generation: A Comparison of Next-Track Recommending Techniques. In 24th Conf. on User Modeling, Adaptation and Personalization, UMAP, pages 281–285. ACM, 2016.
[8] D. Jannach, L. Lerche, and I. Kamehkhosh. Beyond "Hitting the Hits": Generating Coherent Music Playlist Continuations with the Right Tracks. In 9th ACM Conf. on Rec. Sys., RecSys, pages 187–194. ACM, 2015.
[9] D. Jannach and M. Ludewig. When Recurrent Neural Networks Meet the Neighborhood for Session-Based Recommendation. In 11th ACM Conf. on Rec. Sys., RecSys, pages 306–310. ACM, 2017.
[10] I. Kamehkhosh and D. Jannach. User Perception of Next-Track Music Recommendations. In 25th Conf. on User Modeling, Adaptation and Personalization, UMAP, pages 113–121. ACM, 2017.
[11] J. M. Keller, M. R. Gray, and J. A. Givens. A Fuzzy K-Nearest Neighbor Algorithm. IEEE Transactions on Sys., Man, and Cybernetics, SMC-15(4):580–585, 1985.
[12] B. McFee and G. Lanckriet. The Natural Language of Playlists. In 12th Intl. Society for Music Information Retrieval Conf., ISMIR, 2011.
[13] B. McFee and G. R. Lanckriet. Hypergraph Models of Playlist Dialects. In 13th Intl. Society for Music Information Retrieval Conf., ISMIR, pages 343–348, 2012.
[14] M. Schedl. The LFM-1b Dataset for Music Retrieval and Recommendation. In 2016 ACM Intl. Conf. on Multimedia Retrieval, ICMR, pages 103–110. ACM, 2016.
[15] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. AutoRec: Autoencoders Meet Collaborative Filtering. In 24th Intl. Conf. on World Wide Web, WWW, pages 111–112. ACM, 2015.
[16] F. Strub, R. Gaudel, and J. Mary. Hybrid Recommender System Based on Autoencoders. In 1st Workshop on Deep Learning for Rec. Sys., DLRS, pages 11–16. ACM, 2016.
[17] A. Vall, H. Eghbal-zadeh, M. Dorfer, M. Schedl, and G. Widmer. Music Playlist Continuation by Learning from Hand-Curated Examples and Song Features: Alleviating the Cold-Start Problem for Rare and Out-of-Set Songs. In 2nd Workshop on Deep Learning for Rec. Sys., DLRS, pages 46–54. ACM, 2017.
[18] A. van den Oord, S. Dieleman, and B. Schrauwen. Deep Content-based Music Recommendation. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Sys. 26, NIPS, pages 2643–2651. Curran Associates, Inc., 2013.
[19] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In 25th Intl. Conf. on Machine Learning, ICML, pages 1096–1103. ACM, 2008.
[20] J. Winder. Session-Based Track Embedding for Context-Aware Music Recommendation. Master's thesis, University of Innsbruck, 2018.
[21] S. Zhang, L. Yao, X. Xu, S. Wang, and L. Zhu. Hybrid Collaborative Recommendation via Semi-AutoEncoder. In Intl. Conf. on Neural Information Processing, ICONIP, pages 185–193. Springer, 2017.