<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Autoencoders for Next-Track-Recommendation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Michael Vötter, Eva Zangerle, Maximilian Mayerl, Günther Specht Databases and Information Systems Department of Computer Science University of Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
<p>In music recommender systems, playlist continuation is the task of continuing a user's playlist with a fitting next track, often also referred to as next-track or sequential recommendation. This work investigates the suitability and applicability of autoencoders for the task of playlist continuation. We utilize autoencoders and hence, representation learning to continue playlists. Our approach is inspired by the usage of autoencoders to denoise images, and we consider the playlist without the missing next track as a noisy input. Particularly, we design different autoencoders for this specific task and investigate the effects of different designs on the overall suitability of recommendations produced by the resulting recommender systems. To evaluate the suitability of recommendations produced by the proposed approach, we utilize the AotM-2011 and LFM-1b datasets. Based on those datasets, we show that n-grams are a well-performing alternative baseline to kNN. Further, we show that it is possible to outperform a kNN as well as an n-gram baseline with our autoencoder approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
<p>Recommender systems are applicable to a broad spectrum
of domains. The music domain is one such application area,
where one specific task is next-track music recommendation.
Next-track music recommender systems are recommender
systems that aim to find a fitting continuation (next track)
for a given playlist. In general, a playlist is an ordered list of
music tracks, where the order is based on time, which means
that the first track in the list is expected to be listened to first,
followed by the second track and so on. In other words, a
playlist is a time series of tracks.</p>
      <p>
Multiple different approaches have been proposed for the
playlist continuation task. As mentioned in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], these
approaches are based on a broad spectrum of techniques such
as Markov models, collaborative filtering, content similarity
as well as hybrids of them. Another traditional approach to
compute next-track recommendations is a nearest neighbor
search as used in multiple other papers [
        <xref ref-type="bibr" rid="ref10 ref5 ref8">5, 8, 10</xref>
        ].
      </p>
      <p>
In the field of music recommendation, deep learning
approaches are usually used to include additional features such
as textual information [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or content features [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] in the
recommendation process. In contrast, only a few approaches
such as [
        <xref ref-type="bibr" rid="ref9">9</xref>
] directly apply neural network approaches to the
playlist continuation task. To fill this gap, we propose a
novel autoencoder-based approach that directly applies
neural network-based representation learning to the playlist
continuation task. The simplest form of an autoencoder
is a neural network with a dense input layer and a dense
output layer which is trained in an unsupervised manner.
Our approach is inspired by the successful application of
autoencoders for image denoising [
        <xref ref-type="bibr" rid="ref19">19</xref>
]. We consider the input
playlist (that has to be continued) as a noisy version of the
resulting playlist, where a next track is added as a
continuation. We argue that representation learning methods are
more suitable to take advantage of the features contained in
the playlist structure than, e.g., kNN, because representation
learning methods are specifically designed to learn an
effective representation and hence features [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. To the best of
our knowledge, this is the first time that autoencoders are
utilized for the playlist continuation task, while the closest
related tasks, where autoencoders were used successfully, are
collaborative filtering tasks [
        <xref ref-type="bibr" rid="ref15 ref16 ref21">15, 16, 21</xref>
        ].
      </p>
<p>With this work, we investigate the general applicability
of autoencoders for the playlist continuation task. We
report the effects of different parameter settings on the overall
suitability of recommendations produced by the system and
answer the following research questions:</p>
<p>RQ1: How can playlists be vectorized for an autoencoder?
RQ2: Is there an alternative baseline to kNN that better utilizes the order of tracks in a playlist?
RQ3: Which autoencoder design produces competitive results for the next-track music recommendation task?</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Kamehkhosh and Jannach propose to use
handcrafted playlists to evaluate next-track music recommender
systems. Inspired by that, we use the AotM-2011 dataset
as well as the LFM-1b dataset to evaluate the suitability of
recommendations produced by our approach. Our
experiments show that the resulting autoencoder approach
produces competitive recommendations compared to the kNN
baseline.
      </p>
<p>The remaining sections of this paper are organized as
follows. First, related work is discussed in Section 2.
Afterwards, in Section 3, we describe the algorithm to convert a
playlist into a corresponding vector and present our
autoencoder approach. Following that, the experimental setup to
evaluate our approach by comparing it with a kNN baseline
is described in Section 4. Thereafter, we present the results
in Section 5 and finally draw a conclusion in Section 6.</p>
    </sec>
    <sec id="sec-2">
<title>2. RELATED WORK</title>
      <p>This section gives an overview of next-track music
recommendation approaches.</p>
      <p>
        We consider the playlist continuation task to be a special
case of the more general task of playlist generation.
According to Bonnin and Jannach [
        <xref ref-type="bibr" rid="ref5">5</xref>
], the preconditions for
a playlist generation task are a background knowledge base
and target characteristics for the resulting playlist. Based on
that, a sequence of tracks (playlist) best fitting the
characteristics needs to be found. The playlist generation problem
may be converted to the playlist continuation problem by
considering all playlists/sessions as the background database
and using a target characteristic that describes a fitting track
given a playlist to be continued.
      </p>
      <p>
        Sedhain et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
] introduced an autoencoder approach
for collaborative filtering. They report that their approach
outperforms state-of-the-art methods such as matrix
factorization and neighborhood methods. Zhang et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
]
use an autoencoder as part of a hybrid collaborative
filtering framework able to produce personalized top-n
recommendations and rating predictions. For their proposed
Semi-AutoEncoder approach, they removed the restriction
that the input and output layer must be of the same
dimensionality and chose to make the input layer wider than the
output layer. This allows feeding the autoencoder with
additional feature vectors. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
], Jannach and Ludewig compare
a recurrent neural network (RNN) approach with a kNN
approach for the task of session-based recommendations. Their
findings show that the RNN approach is inferior, but they
believe that further research will probably lead to better
RNN configurations that are able to outperform the kNN
approach. Nevertheless, this shows that kNN is a strong
baseline to compare against.
      </p>
      <p>
        Jannach et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
] present multiple extensions to the kNN
approach, which they compare to a kNN baseline. They
propose to take additional measures into account with a
weighted sum. By using the social tags that are assigned
to tracks by Last.fm1 users, they take content similarities
into account. Further, they suggest using numerical features
such as tempo, loudness and release year. Additionally, they
state that it is possible to take long-term content-based user
profiles into account.
      </p>
      <p>
        The evaluation approach presented by McFee and
Lanckriet [
        <xref ref-type="bibr" rid="ref12">12</xref>
] supports our assumption that playlists and their
tracks contain enough information to find fitting next tracks
for a given playlist. They come up with the idea to consider
playlist generation as a natural language modeling problem
instead of an information retrieval problem. Therefore, they
consider a playlist to be equivalent to a sentence in a natural
language and tracks to be equivalent to words. Further, they
show how techniques known from natural language
processing can be used to evaluate playlist generation algorithms.
1https://www.last.fm/
[Figure 1: example of the three encodings of the playlist [5, 1, 3, 7, 4]; the binary encoding yields (1, 0, 1, 1, 1, 0, 1), shown next to the order and normalized-order variants.]
      </p>
<p>In this section, the proposed recommendation approach
based on an autoencoder is presented. First, we explain
how playlists are converted into a vector, which is necessary
to use them as an input for the autoencoder. Afterwards, the
structure and implementation details of the autoencoder are
presented in Section 3.2. This includes the general training
procedure used and a modified autoencoder layout to
overcome overfitting by simulating the continuation task during
training.</p>
<p>Playlists are usually represented as ordered lists of
tracks. In the special case of the playlist continuation task,
the playlist used as input for the algorithms is often referred
to as "history". This history can be considered a list of past
listening events.</p>
<p>To use playlists as an input for autoencoders, it is necessary
to convert the ordered list into a vector representation. We
propose three different ways to determine the value of each
dimension of the generated playlist vector, as presented in
the following. An example of all three encodings is shown
in Figure 1.</p>
      <p>
        Binary Encoding is the simplest encoding and it is inspired
by the (one-hot) vector encoding used in [
        <xref ref-type="bibr" rid="ref9">9</xref>
]. Each track
t in the playlist is converted to the corresponding one-hot
encoded track vector v_t, where all dimensions, except the
one assigned to the track (index t), are set to 0 while the
dimension with index t is set to 1. After that, the playlist
vector p is computed as the sum of the track vectors, p = Σ_i v_{t_i}. Note that the ordering
information of the playlist is lost.
      </p>
<p>Order Encoding is a modified version of the Binary
Encoding and includes ordering information. We propose to
use the track's index i in the playlist as the value of the
dimension t assigned to the track. Therefore, the track vector
encoding contains 0 for all dimensions except the
dimension with index t, to which the value i is assigned. To obtain
a playlist vector p, all track vectors are summed up.</p>
      <p>
Normalized-Order Encoding is an extension to Order
Encoding and takes the length of a playlist into account. The
playlist vector p is normalized by the number of tracks
contained in the corresponding playlist, which reduces the
effects of the playlist length on the encoding.</p>
      <p>
        Autoencoders are an unsupervised learning method used
for representation learning [
        <xref ref-type="bibr" rid="ref19">19</xref>
]. In its simplest form, an
autoencoder is a neural network with one input layer, one
hidden code layer and one output layer, where the input layer is
fully connected to the hidden code layer, which in turn is fully
connected to the output layer.
      </p>
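To make the three playlist encodings described above concrete, the following sketch shows one possible implementation. The function name and the zero-based track indexing are our assumptions; repeated tracks are assumed not to occur, since successive duplicates are merged during preprocessing.

```python
import numpy as np

def encode_playlist(track_ids, num_tracks, mode="normalized-order"):
    """Encode a playlist (ordered list of track indices) as a vector.
    Illustrative sketch of the binary, order and normalized-order
    encodings; zero-based track ids are assumed."""
    v = np.zeros(num_tracks)
    for i, t in enumerate(track_ids, start=1):
        # binary: mark presence with 1; order: store the 1-based position
        v[t] = 1.0 if mode == "binary" else float(i)
    if mode == "normalized-order":
        v /= len(track_ids)  # reduce the effect of playlist length
    return v

# the playlist [5, 1, 3, 7, 4] over a catalogue of 8 tracks
print(encode_playlist([5, 1, 3, 7, 4], 8, mode="binary"))
```

In the binary case every contained track collapses to a single 1, so the ordering information is lost, exactly as stated above.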
<p>In contrast to a representation learning task, where the
decoder is removed after the training phase to obtain the code
as an output of the network, we use the whole network
to compute recommendations. Recommendations are
computed with the following procedure: Given a playlist
that should be continued with a fitting next track, it is first
necessary to convert this playlist to a vector. Afterwards,
this vector-encoded playlist is used as an input for
the trained network, which produces an output vector
represented by the output layer. The output vector holds rating
values for all tracks contained in the dataset used to train the
network (one dimension per track). This output vector is then converted to a prioritized
list of tracks by creating a list of indexes (track ids)
ordered by their corresponding value in the vector. The index
with the highest value is first in the list, while the index
with the lowest value is last. Further, all tracks contained
in the input playlist are removed from this prioritized list
so that the next-track recommendations differ from the tracks
contained in the input playlist. Mostly, it is necessary to
compute a given number of possible continuations. This is
achieved by chopping off the list after the given number of
tracks.</p>
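The procedure above can be sketched as follows; the function name and the array-based interface are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def next_track_recommendations(output_vector, input_track_ids, n=10):
    """Convert the network's output vector of per-track ratings into a
    prioritized list of next-track recommendations."""
    ranked = np.argsort(output_vector)[::-1]           # highest rating first
    seen = set(input_track_ids)
    recs = [int(t) for t in ranked if t not in seen]   # drop input tracks
    return recs[:n]                                    # chop off after n tracks

# four-track toy catalogue; track 1 is already in the input playlist
print(next_track_recommendations(np.array([0.1, 0.9, 0.3, 0.7]), [1], n=2))  # [3, 2]
```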
<p>Using Keras2, we implemented an autoencoder in Python.
Our implementation allows setting the number of epochs, the
hidden code layer activation function, the output layer
activation function and the loss function. The input layer
size equals the output layer size and is determined
automatically based on the dataset. Further, we decided to
automatically adapt the code layer size based on the input size
divided by 40, which is a result of preliminary
experiments. Keep in mind that dense layers are used to build the
network, which means that all nodes of one layer are
connected to each node of the neighboring layers. Our network
consists of one input layer followed by a filtering layer that
removes the last track of the input during training. This
filtering layer is followed by an optional dropout layer with
a 0.5 dropout rate that can be disabled. This layer is then
followed by a hidden code layer and an output layer.</p>
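A minimal Keras sketch of such a layout might look as follows. The catalogue size, the tanh/softmax activation choice, and realizing the last-track filtering by masking the input vector before training (rather than as a dedicated layer) are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_tracks = 4000                      # input size = output size = #tracks (assumed)
code_size = num_tracks // 40           # code layer: input size divided by 40

inputs = keras.Input(shape=(num_tracks,))
x = layers.Dropout(0.5)(inputs)        # optional dropout layer (rate 0.5)
code = layers.Dense(code_size, activation="tanh")(x)            # hidden code layer
outputs = layers.Dense(num_tracks, activation="softmax")(code)  # output layer

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
# During training, the input would be the playlist vector with the last
# track removed (the "filtered" history) and the target the full vector.
```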
<p>To train the neural network, a training set is used, which
consists of an input and the expected output, which are
both the same for autoencoders. The autoencoder learning
method is depicted in Figure 2. To train an
autoencoder based on the previously introduced playlist vectors, it first
is necessary to create an autoencoder with an input/output
layer size fitting the dimensionality of the playlist vectors in
the dataset. The training process is further configured to
use 614 of the total number of playlists in the training set as</p>
      <sec id="sec-2-1">
        <title>2https://keras.io/</title>
        <p>its batch size and uses Adam as an optimizer because
preliminary experiments showed that these settings work well
and that they show similar performance compared to other
parameter choices.</p>
<p>This setup allows comparing different parameter
configurations of an autoencoder where the basic structure of the
used neural network, which can be seen in the center part
of Figure 2, remains the same.</p>
      </sec>
    </sec>
    <sec id="sec-3">
<title>4. EXPERIMENTS</title>
<p>In this section, we present the setup used for the
experiments conducted to evaluate the suitability of
recommendations produced by the autoencoder-based approach
in comparison to a kNN baseline. In Section 4.1, we
introduce the used datasets. Thereafter, in Section 4.2, the kNN
baseline recommender is introduced, followed by an
explanation of the n-gram baseline in Section 4.3 and
the overall experimental setup in Section 4.4.</p>
    </sec>
    <sec id="sec-4">
<title>4.1 Datasets</title>
      <p>
        To evaluate the recommender systems, we aim for datasets
that are based on user interaction such as listening logs
and playlists because next-track music recommender
systems must satisfy the needs of users. Along the lines of
previous work [
        <xref ref-type="bibr" rid="ref12 ref13 ref4 ref5 ref7 ref8 ref9">12, 13, 4, 5, 8, 7, 9</xref>
        ], we use datasets based
on the data gathered from the two music platforms Last.fm3
and Art of the Mix4.
      </p>
      <p>
        Based on the LFM-1b dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] gathered from Last.fm,
listening sessions were extracted by Jacob Winder in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
These sessions are created by assuming that two listening
events of a single user belong to the same listening session
if there are no more than 30 minutes between them. The
resulting sessions are further filtered: all successive
occurrences of the same track are merged into one occurrence
and, after doing so, all sessions of length one are dropped.
      </p>
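The session extraction and filtering steps described above can be sketched as follows; the function name and the (track_id, timestamp) event representation are assumptions for illustration.

```python
def extract_sessions(events, gap_seconds=30 * 60):
    """Split one user's listening events (track_id, unix_timestamp) into
    sessions at gaps of more than 30 minutes, merge successive repeats
    of the same track, and drop sessions of length one."""
    sessions, current = [], []
    for track, ts in sorted(events, key=lambda e: e[1]):
        if current and ts - current[-1][1] > gap_seconds:
            sessions.append(current)   # gap too large: close the session
            current = []
        current.append((track, ts))
    if current:
        sessions.append(current)
    cleaned = []
    for s in sessions:
        tracks = [t for t, _ in s]
        # merge successive occurrences of the same track
        merged = [t for i, t in enumerate(tracks) if i == 0 or t != tracks[i - 1]]
        if len(merged) > 1:            # drop sessions of length one
            cleaned.append(merged)
    return cleaned

events = [(1, 0), (1, 60), (2, 120), (3, 5000), (4, 5100)]
print(extract_sessions(events))  # [[1, 2], [3, 4]]
```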
<p>This results in approximately 62 million sessions, which we
filtered further. In a first step, a session chunk containing
the first 3 million sessions with a minimum length of three
was created. This chunk shows a high number of different
tracks. Reducing the number of tracks contained in a dataset
is an important step to keep the size of the resulting neural
network low enough to train it in reasonable time. This
was achieved by dropping all playlists that contain rarely
occurring tracks; we dropped all playlists containing tracks
with 840 or fewer occurrences.</p>
      <p>
        Further, we use the AotM-2011 dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] provided as a
Python pickle export5 by Vall et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
], as the AotM dataset
is often used in the literature. Instead of using their split, we
merged the training and test set and used five-fold
cross-validation, as we also applied five-fold cross-validation on the
LFM-1b based dataset. The AotM-2011 dataset contains far
fewer playlists and far more tracks than the LFM-1b based
datasets (see Table 1). This leads to a playlists/track ratio
of 0.220, which is substantially smaller than the ratio of the
LFM-1b based datasets.
      </p>
    </sec>
    <sec id="sec-5">
<title>4.2 Baseline kNN Recommender Systems</title>
      <p>
        A k-Nearest Neighbors (kNN) approach [
        <xref ref-type="bibr" rid="ref5">5</xref>
] is used as one of
the baselines for our experiments. We have chosen kNN as it
      </p>
      <sec id="sec-5-1">
        <title>3http://www.last.fm 4http://www.artofthemix.org 5https://git.io/fhNfZ</title>
        <p>
          has been used for the playlist continuation task in multiple
other papers such as [
          <xref ref-type="bibr" rid="ref10 ref5 ref7 ref8">5, 8, 7, 10</xref>
          ].
        </p>
        <p>
The basic idea behind kNN is to find the k different items that
are the nearest neighbors of a given item. The neighborhood
of items in general is defined based on a distance function,
which allows taking different properties into account.
Further, it is necessary to derive a conclusion from those k
neighbors, which can, for example, be done with a majority
vote [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref10">10</xref>
], a binary cosine similarity is used as the distance
function. We run a grid search with k set to 10, 20, 50, 100,
200 and 300. We further include the three different ranking
functions cosine similarity, item-item similarity and tf-idf
similarity defined by the kNN implementation of the implicit
Python package6 (version 0.3.8) in the grid search.
        </p>
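A bare-bones version of such a neighborhood recommender might look as follows. This only illustrates the idea — the experiments use the kNN implementation of the implicit package — and scoring candidates by summing the k nearest playlist vectors is our assumption.

```python
import numpy as np

def knn_next_tracks(history_vec, train_vecs, k=20, n=10):
    """Score candidate tracks by summing the playlist vectors of the k
    training playlists most cosine-similar to the query history."""
    norms = np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(history_vec)
    sims = train_vecs @ history_vec / np.maximum(norms, 1e-12)
    neighbours = np.argsort(sims)[::-1][:k]            # k nearest playlists
    scores = train_vecs[neighbours].sum(axis=0)
    scores[history_vec > 0] = -np.inf                  # exclude input tracks
    return np.argsort(scores)[::-1][:n]                # top-n track indices
```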
      </sec>
    </sec>
    <sec id="sec-6">
<title>4.3 Baseline N-Gram Recommender System</title>
      <p>
        N-grams are a common statistical model in natural
language processing. This technique is for example used in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
for word prediction. Instead of using sequences of words
in a sentence, we use sequences of tracks in a playlist. The n
parameter specifies the number of successive tracks that are
taken into account by the model.
      </p>
      <p>Therefore, the simplest model is a unigram model (n = 1)
which only counts track occurrences in playlists.</p>
<p>Increasing the number n of the n-gram model makes it
possible to consider the previous n − 1 tracks for the
prediction by calculating probabilities as given in the following
equation:</p>
      <p>P_n-gram(t_i | t_{i−(n−1)}, …, t_{i−1}) = F(t_{i−(n−1)}, …, t_i) / F(t_{i−(n−1)}, …, t_{i−1})   (1)</p>
      <p>where t_i is the i-th track in a sequence of tracks t_1, t_2, …, t_n and
F(seq) is the frequency of occurrences of a given sequence
of tracks seq in the training set. Ranking the tracks by their
probabilities (highest first) allows making predictions based
on the probabilities learned by an n-gram model.</p>
    </sec>
    <sec id="sec-7">
      <title>Experimental Setup</title>
<p>We used scikit-learn7 to run a grid search on the
parameters of the approaches. To ensure that the reported results
are not bound to a specific train-test split of the datasets, we
run a five-fold cross-validation. The k-fold splitting
procedure of scikit-learn was used with a fixed random state (seed)
to ensure the reproducibility and comparability of the
results. The metrics used for the evaluation are recall (r) and
mean reciprocal rank (mrr).</p>
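Per evaluated playlist, the two metrics can be computed as follows; the function names are our own.

```python
def recall_at_n(recommendations, expected_track, n):
    """1 if the expected next track appears in the top-n recommendations."""
    return 1.0 if expected_track in recommendations[:n] else 0.0

def mrr_at_n(recommendations, expected_track, n):
    """Reciprocal rank of the expected track within the top n, else 0."""
    top = recommendations[:n]
    return 1.0 / (top.index(expected_track) + 1) if expected_track in top else 0.0

# expected track 7 is ranked second in a length-2 recommendation list
print(recall_at_n([3, 7, 5], 7, 2), mrr_at_n([3, 7, 5], 7, 2))  # 1.0 0.5
```

The reported r and mrr values are the means of these per-playlist scores over the test playlists.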
      <p>
For the evaluation of all recommender systems, a two-step
process is used for each of the two presented datasets. In
the first step, each recommender system is trained using the
training data. For this purpose, a new instance of the
recommender system is created for each run and then trained
using the training procedure.
6https://pypi.org/project/implicit/0.3.8/
7https://scikit-learn.org/
The second step of the
evaluation process computes recommendations with the previously
trained recommender. Therefore, each playlist in the
test set is decomposed into a history (all tracks except the
last one) and the last track [
        <xref ref-type="bibr" rid="ref10 ref6">6, 10</xref>
]. The history is used as
input for the recommender system, while the last track is
the expected recommendation, based on which the metrics
are computed. Recommendation lists of different lengths are
considered to get an impression of the effects of the
recommendation list length.
      </p>
<p>The autoencoder implementation presented in Section 3.2
has a high degree of freedom in terms of modifiable
parameters. Therefore, we decided to fix the reduction factor (code
layer size), the batch size and the optimizer. The used
values were obtained from preliminary experiments. Based on
the knowledge gained from those preliminary experiments,
we determined value ranges for the other parameters used
in the grid search.</p>
    </sec>
    <sec id="sec-8">
<title>5. RESULTS</title>
<p>In the following, we first report the recommendation
suitability of the different encoding types in Section 5.1. After
that, the results achieved by the n-gram baseline are shown
in Section 5.2. Finally, we compare the suitability of
recommendations produced by our approach on different datasets
in Section 5.3.</p>
    </sec>
    <sec id="sec-9">
<title>5.1 Encoding Type</title>
<p>In a first evaluation step, the recommendation
suitability of the different encoding types introduced in Section 3.1
was compared using the kNN baseline, based on the AotM
and LFM-1b datasets. Table 2 shows the results for kNN
using the item-item distance metric. We observe that the
normalized-order encoding outperforms both other
encodings on both datasets and for different values of k.
Interestingly, order encoding without normalization has a negative
effect on the performance of the kNN implementation. This
can be explained by the fact that the length of a playlist is
encoded as well. In addition, order encoding has a larger
vector space than binary and normalized-order encoding, as
the values in each dimension have a bigger range. For
space reasons, we do not include results for the cosine and
tf-idf distance metrics and other values of k, as these show that
normalized-order encoding works best. Further experiments
showed that the autoencoders show similar behavior when
the encoding type is changed.</p>
<p>Based on these findings, we argue that the
normalized-order encoding should be used among the three encodings
introduced in Section 3.1, which also answers RQ1.
Therefore, we use normalized-order encoding for all further
evaluations.</p>
    </sec>
    <sec id="sec-10">
<title>5.2 N-Gram Baseline</title>
<p>To evaluate the suitability of an n-gram model, we
decided to compare a bigram (n = 2) and a trigram (n = 3)
model with the best performing kNN configuration (found
using a grid search) on each dataset. The kNN baseline
using item-item as a distance metric with 20
neighbors (kNNi20) works best on the AotM dataset, while the
cosine distance with 50 neighbors (kNNc50) works best on
the LFM-1b dataset. In addition, we give the results of a
unigram (n = 1) model that always recommends the most
popular tracks.</p>
<p>Table 3 shows that bi- and trigram models are able to outperform
a unigram model. Further, it can be seen that those n-gram
models work much better on our LFM-1b dataset variation
than on the AotM dataset. This can be traced back to the
fact that each track on average occurs in 0.22 playlists in the
AotM dataset compared to 7.79 occurrences in the LFM-1b
dataset, as stated in Table 1. Therefore, common sequences
of tracks among playlists are more likely in the LFM-1b
dataset than in the AotM dataset. We argue that this is
also the reason why the absolute values of all metrics differ
that much when comparing the results on both datasets.
Compared to both kNN baselines, it can be seen that the
n-gram models work better on the LFM-1b dataset, especially
for short recommendation lengths. In contrast, they are less
effective on the AotM dataset than the kNN models.</p>
<p>It can be seen that n-gram models form a strong baseline
for the LFM-1b dataset. Note that the kNN models operate
on the normalized-order encoding while the n-gram models
utilize the track sequences directly without any encoding,
which answers RQ2.</p>
    </sec>
    <sec id="sec-11">
<title>5.3 Autoencoder Approach</title>
<p>In Table 3, we depict the results of our autoencoder
approach in comparison to the kNN and n-gram baselines. The
results show that it is possible to outperform a kNN
baseline on both datasets using the proposed autoencoder
approach. This is especially true for longer recommendation
lengths. To give a better overview of the capabilities of our
autoencoder approach, we give the results of four
autoencoder configurations. To distinguish the different parameter
configurations of our autoencoder approach, we decided to
name each configuration (AE1-AE4) for which we report
results. For each dataset, we include results for one
autoencoder including the dropout layer (see Section 3.2) and one
without dropout. AE1 without the dropout layer and AE2
with the dropout layer are respectively the best performing
autoencoder configurations for the AotM dataset, while AE3
(without dropout) and AE4 (with dropout) perform best on
LFM-1b according to the mrr@1. AE1 uses tanh as the
code layer and output layer activation function with cosine
proximity as the loss function and is trained over 5 epochs.
AE2 utilizes relu as the code layer activation and softmax
as the output layer activation with categorical cross-entropy
loss and was trained over 40 epochs. AE3 and AE4 are both
trained over 40 epochs and use tanh as the code layer activation
and cosine proximity as the loss. While AE3 is configured with a
softmax output activation, AE4 is configured to use sigmoid
for the output activation.</p>
<p>It can be seen in the results that autoencoders outperform
both the kNN and n-gram baselines on the AotM dataset,
while they are not able to outperform n-gram models on the
LFM-1b dataset. Surprisingly, AE1 works best on AotM
when trained for only 5 epochs. While AE2, AE3 and AE4 show
similar results per dataset, AE1 only produces comparable
results on the AotM dataset, which answers RQ3. One
possible explanation is that it overfits on the particular training
set, which the dropout layer attempts to prevent.</p>
<p>In the above section, the recommendation suitability
impact of multiple configurations of an autoencoder was
presented. Additionally, results of the best performing
configurations on different datasets are given. The results show
that autoencoders can be used for the playlist continuation
task when configured correctly.</p>
    </sec>
    <sec id="sec-12">
<title>6. CONCLUSIONS</title>
      <p>In this work, we proposed a novel autoencoder approach
for the playlist continuation task. To use playlists as an
input for autoencoders, we introduced a procedure to encode
playlists as vectors. The evaluation shows that the proposed
autoencoder approach outperforms a basic kNN approach.
Particularly, the results show that this is the case regardless
of the playlists/track ratio of the used dataset.</p>
      <p>
This work solely focuses on determining whether an
autoencoder approach can be used for the playlist continuation
task. We showed that outperforming basic kNN is
possible for datasets that we consider small in comparison to
the amount of data given in a real-world scenario. One
possible source of improvement is the training procedure.
Autoencoders are usually trained to reconstruct the input,
which we modified slightly. We introduced a filtering layer
in the training phase that removes the last track of the
input. This trains the autoencoder to "reconstruct" playlists
including the last track (next track) filtered from the input.
Additionally, it would be possible to specifically design
a loss function for the continuation task. Strub et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
present a loss function that disregards unknown values to
train autoencoders as a collaborative filtering method.
Applying a similar loss function to our training procedure is
part of future work.
      </p>
      <p>
Additionally, a more advanced training procedure could
lead to a well performing deep autoencoder. One way of
creating a deep autoencoder would be to first train an
autoencoder with one input and one output layer (as the ones
proposed in this work) and then use the learned code as an
input to train another autoencoder. After that, it is possible
to split the first autoencoder into its encoder and decoder
parts and insert the second autoencoder in between. The
resulting autoencoder can then be fine-tuned and extended
in the same way. This process is like the one proposed by
Vincent et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], where they stack denoising autoencoders.
Using such advanced training procedures for deep
autoencoders is part of future work.
      </p>
      <p>In addition, a user study to evaluate the proposed approaches should
be conducted in future work. This is important to gain an
impression of the user-perceived quality of the approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belanger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Ask the GRU: Multi-task Learning for Deep Text Recommendations</article-title>
          .
          <source>In 10th ACM Conf. on Rec. Sys</source>
          .,
          <source>RecSys</source>
          , pages
          <volume>107</volume>
          –
          <fpage>114</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          .
          <article-title>Representation Learning: A Review and New Perspectives</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>35</volume>
          (
          <issue>8</issue>
          ):1798–
          <fpage>1828</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haider</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Scheffer</surname>
          </string-name>
          .
          <article-title>Predicting Sentences using N-Gram Language Models</article-title>
          .
          <source>In Empirical Methods in NLP</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonnin</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          .
          <article-title>Evaluating the quality of playlists based on hand-crafted samples</article-title>
          .
          <source>In 14th Conf. of the Intl. Society for Music Information Retrieval, ISMIR</source>
          , pages
          <volume>263</volume>
          –
          <fpage>268</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonnin</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          .
          <article-title>Automated Generation of Music Playlists: Survey and Experiments</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>47</volume>
          (
          <issue>2</issue>
          ):
          <fpage>26:1</fpage>
          –
          <fpage>26:35</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Craw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Horsburgh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Massie</surname>
          </string-name>
          .
          <article-title>Music Recommenders: User Evaluation Without Real Users?</article-title>
          .
          <source>In 24th Intl. Joint Conf. on Artificial Intelligence, IJCAI</source>
          . AAAI
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kamehkhosh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonnin</surname>
          </string-name>
          .
          <article-title>Biases in Automated Music Playlist Generation: A Comparison of Next-Track Recommending Techniques</article-title>
          .
          <source>In 24th Conf. on User Modeling, Adaptation and Personalization</source>
          , UMAP, pages
          <volume>281</volume>
          –
          <fpage>285</fpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lerche</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Kamehkhosh</surname>
          </string-name>
          .
          <article-title>Beyond "Hitting the Hits": Generating Coherent Music Playlist Continuations with the Right Tracks</article-title>
          .
          <source>In 9th ACM Conf. on Rec. Sys</source>
          .,
          <source>RecSys</source>
          , pages
          <volume>187</volume>
          –
          <fpage>194</fpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ludewig</surname>
          </string-name>
          .
          <article-title>When Recurrent Neural Networks Meet the Neighborhood for Session-Based Recommendation</article-title>
          .
          <source>In 11th ACM Conf. on Rec. Sys</source>
          .,
          <source>RecSys</source>
          , pages
          <volume>306</volume>
          –
          <fpage>310</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kamehkhosh</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          .
          <article-title>User Perception of Next-Track Music Recommendations</article-title>
          .
          <source>In 25th Conf. on User Modeling, Adaptation and Personalization</source>
          , UMAP, pages
          <volume>113</volume>
          –
          <fpage>121</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Keller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Gray</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Givens</surname>
          </string-name>
          .
          <article-title>A fuzzy K-nearest neighbor algorithm</article-title>
          .
          <source>IEEE Transactions on Sys., Man, and Cybernetics</source>
          , SMC-
          <volume>15</volume>
          (
          <issue>4</issue>
          ):
          <volume>580</volume>
          –
          <fpage>585</fpage>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lanckriet</surname>
          </string-name>
          .
          <article-title>The Natural Language of Playlists</article-title>
          .
          <source>In 12th Intl. Society for Music Information Retrieval Conf., ISMIR</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>McFee</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Lanckriet</surname>
          </string-name>
          .
          <article-title>Hypergraph Models of Playlist Dialects</article-title>
          .
          <source>In 13th Intl. Society for Music Information Retrieval Conf</source>
          ., volume
          <volume>12</volume>
          <source>of ISMIR</source>
          , pages
          <volume>343</volume>
          –
          <fpage>348</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          .
          <article-title>The lfm-1b dataset for music retrieval and recommendation</article-title>
          .
          <source>In 2016 ACM on Intl. Conf. on Multimedia Retrieval, ICMR</source>
          , pages
          <volume>103</volume>
          –
          <fpage>110</fpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sedhain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          .
          <article-title>AutoRec: Autoencoders Meet Collaborative Filtering</article-title>
          .
          <source>In 24th Intl. Conf. on World Wide Web, WWW</source>
          , pages
          <volume>111</volume>
          –
          <fpage>112</fpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Strub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gaudel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mary</surname>
          </string-name>
          .
          <article-title>Hybrid recommender system based on autoencoders</article-title>
          .
          <source>In 1st Workshop on Deep Learning for Rec. Sys., DLRS</source>
          , pages
          <volume>11</volume>
          –
          <fpage>16</fpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Eghbal-zadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Widmer</surname>
          </string-name>
          .
          <article-title>Music Playlist Continuation by Learning from Hand-Curated Examples and Song Features: Alleviating the Cold-Start Problem for Rare and Out-of-Set Songs</article-title>
          .
          <source>In 2nd Workshop on Deep Learning for Rec. Sys., DLRS</source>
          , pages
          <volume>46</volume>
          –
          <fpage>54</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>van den Oord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dieleman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schrauwen</surname>
          </string-name>
          .
          <article-title>Deep content-based music recommendation</article-title>
          . In
          <string-name>
            <given-names>C. J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          , editors,
          <source>Advances in Neural Information Processing Sys</source>
          .
          <volume>26</volume>
          , NIPS, pages
          <volume>2643</volume>
          –
          <fpage>2651</fpage>
          . Curran Associates, Inc.,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Manzagol</surname>
          </string-name>
          .
          <article-title>Extracting and Composing Robust Features with Denoising Autoencoders</article-title>
          .
          <source>In 25th Intl. Conf. on Machine Learning</source>
          , ICML, pages
          <volume>1096</volume>
          –
          <fpage>1103</fpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Winder</surname>
          </string-name>
          .
          <article-title>Session-Based Track Embedding for Context-Aware Music Recommendation</article-title>
          .
          <source>Master's thesis</source>
          , University of Innsbruck,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Hybrid Collaborative Recommendation via Semi-AutoEncoder</article-title>
          .
          <source>In Intl. Conf. on Neural Information Processing, ICONIP</source>
          , pages
          <volume>185</volume>
          –
          <fpage>193</fpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>