<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Supporting Serendipitous Recommendations With Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oliver Baumann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirco Schoenfeld</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bayreuth</institution>
          ,
          <addr-line>Universitätsstraße 30, 95447 Bayreuth</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recommender systems are commonly designed and evaluated with high precision and accuracy in mind. Optimising systems for these metrics alone can, however, lead to a decrease in overall collection coverage of recommended items, and over-emphasize popular content. In order to present useful suggestions to users, it has been argued that a recommender system should also provide novel and diverse items diferent to what the user has experienced in the past. This closely ties in with the notion of serendipity, i.e., making a surprise discovery akin to a “happy accident”, that is nevertheless interesting and relevant. We implement a recommender system based on a knowledge graph of musical items with serendipity, novelty and diversity in mind. Using acoustic features as contextual information for vertices in the graph, we explicitly select content dissimilar from the user's previous experience. We compare our results to a set of baseline algorithms and find that the investigated approach is able to recommend diverse and novel items.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;knowledge graphs</kwd>
        <kwd>recommender systems</kwd>
        <kwd>serendipity</kwd>
        <kwd>novelty</kwd>
        <kwd>diversity</kwd>
        <kwd>attributed graphs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Finding relevant content in potentially vast repositories of information is a key challenge for
users. Whether it’s finding interesting products in an online shop, valuable social contacts
in an Online Social Network, or items in a digital library, the domain is often so large in size
and broad in variety that gaining a complete overview is at best hard, at worst impossible.
Recommendation systems can aid in the process of discovery by suggesting items the user has
not been exposed to previously, but that may be of interest to them. A common assessment
of recommender systems is by precision and recall, as well as top-n variants thereof that only
take into account the n highest ranked items. While precision is a measure for how relevant
the suggested items are to the user, recall measures the extent to which all relevant items are
retrieved. A system exhibiting high precision and recall is hence able to accurately predict the
user’s interests, and returns all relevant items.</p>
      <p>
        It has been argued [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], however, that optimising a recommender system for precision
and recall alone disregards a variety of other dimensions that prove useful both in assessing and
designing recommender systems. Notions such as novelty, diversity, and serendipity can benefit
users when factored into the design of such systems by enabling them to venture outside the
space of familiar content and discover new items they have not been exposed to.
      </p>
      <p>
        A serendipitous discovery can be compared to a “happy accident”: a finding that is unexpected
and novel, but nevertheless relevant. For instance, Iaquinta et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] argue that novelty is a
key contributor to serendipitous discovery, as an item the user has not experienced is more
likely to result in a “surprise finding” than one they already know, but disregarded as irrelevant
or uninteresting. As such, it is hard to measure serendipity directly, and thus previous work
[
        <xref ref-type="bibr" rid="ref1 ref2 ref5 ref6">5, 1, 2, 6</xref>
        ] commonly considers notions of serendipitous discovery alongside diversity and
novelty.
      </p>
      <p>In this work, we explore a graph-based approach to generating novel and diverse
recommendations, thus contributing to the body of work concerned with beyond accuracy measures and
serendipity in recommender systems. We abstract parts of the musical domain into a knowledge
graph, consisting of tracks and their genre-annotations. Based on this graph, relevant, yet novel
and diverse candidate tracks for recommendation are determined and ranked with scoring
functions that intentionally discount items similar to the user’s past experience. A set of
contentand graph-based measures taken from existing literature are employed to evaluate the results
in terms of novelty and diversity. We believe that our approach can be generalized to fit other
domains than music, providing the items subjected to recommendation can be compared in
terms of similarity or distance, and a simple ontology can be utilized to construct the required
graph.</p>
      <p>The remainder of this work is structured as follows: in Section 2, we provide an overview of
previous work; Section 3 outlines the datasets used in our analysis; in Section 4 we present our
method for finding diverse and novel recommendations; Section 5 presents our main findings
and compares them to a set of baselines; and Section 6 concludes this work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Graph-based approaches for recommender systems have been used to model relationships
between users, between items, and between both in conjunction, e.g., in the context of digital
libraries [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Online Social Networks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or music [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Apart from items or users, knowledge
graphs provide the possibility to model entities as part of an ontology, along with their relations
[
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>
        The knowledge graph employed in the paper at hand is an instance of an attributed graph, as
vertices carry additional context-information. Recently, Schoenfeld and Pfefer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed
an approach for calculating shortest path-based centrality-measures on attributed graphs. At
each step during path discovery between two nodes, a distance function is evaluated, its input
being vertex-attributes or attribute-vectors. Paths exceeding a provided threshold can then be
pruned from the search tree, enabling vertex-specific views of a graph and restricting graph
traversal to only a subset of vertices and incident edges that meet a custom decision criterion.
      </p>
      <p>
        Acoustic features of songs have been previously used to, e.g., analyze the quality of music
recommendations for “beyond mainstream” users [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Users are assigned a mainstreaminess
score based on a user’s artist-playcount and the global artist-playcounts. By employing a
two dimensional embedding of tracks’ acoustic features, four clusters of musical genres are
detected. Users are then assigned to clusters based on the number of tracks in the cluster they
have listened to. A set of baseline recommendation algorithms is evaluated against these
userclusters, showing statistically significant results across groups, thus highlighting the importance
of user-modelling.
      </p>
      <p>
        Zangerle et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have also used acoustic features to create culturally-aware user models.
Users’ listening preferences are modelled based on the tracks they have listened to by
aggregating the tracks’ acoustic feature vectors. The model is further enriched with Hofstede’s six
cultural dimensions based on users’ country of origin. Using XGBoost, precision and recall
are evaluated for the music-cultural and only-music models, with the combined music-cultural
model outperforming the baselines, and the music-model performing second best. These
findings indicate that modelling users through acoustic feature vectors has a positive impact on
recommendation accuracy.
      </p>
      <p>
        Ge et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] argue that metrics applied to evaluation of recommender systems should
incorporate a notion of quality of recommendations rather than merely testing for accuracy. They
present a set of measures for coverage and serendipity and approach the latter through the
notion of unexpectedness of results as compared to a primitive prediction model. The presented
measure for serendipity constrains the unexpected items to also be useful, an evaluation that
ultimately only a real user of the system can make. Concluding, they argue that in order to
foster serendipity, a system should aim for high catalog coverage, so as to include rarely or low
rated items.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>
        We base our analysis on three datasets: the LFM-1b dataset [15], the CultMRS dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and
a subset of LFM-1b annotated with musical genres1.
      </p>
      <p>The LFM-1b dataset contains more than one billion listening events across 120 000 users and
32 million tracks. The data was collected between January 2013 and August 2014 by querying
listening histories of users on the music-streaming platform last.fm through the platform’s
public Web-API. For the purposes our analysis, a listening event is a tuple (, ) consisting of
a unique, anonymous user-ID  and a unique track-ID ; additional identifiers for artists and
albums are available, but not relevant to this study. Track-IDs can be unambiguously related
to a single musical item, and tuples need not be unique, i.e., multiple events for the same user
and track are valid. Please note that, while the timestamp when the user began listening to a
track is recorded in the dataset, no indication can be made as to whether they listened to it
completely, or skipped to a new item at some point through the track.</p>
      <p>The EmoMTB dataset contains genre-annotations for a subset of 533 970 tracks in the LFM-1b
collection. Genres were fetched by querying the last.fm-API for the user-generated tags on
a track and matched against two dictionaries of musical genres. In this dataset, a track is
annotated with an average of 4.92 (± 4.29) genres, the median is 4.</p>
      <p>
        A further extension to the LFM-1b collection is the CultMRS dataset that enriches 3 471 884
tracks with acoustic features. The data was collected for 55 190 users providing
country1This dataset was kindly provided to us by the authors of the EmoMTB-project [16].
information on their last.fm profile. Acoustic features were obtained through the public Spotify
Web-API2. We defer a detailed definition of these features to Zangerle et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and the API
documentation published by Spotify. Out of the set of available features, we use acousticness,
danceability, energy, instrumentalness, liveness, speechiness, tempo, and valence and omit key,
loudness, and mode as we don’t believe these to be indicative of users’ listening behaviour. All
feature values range in [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ], with the exception of tempo, which denotes the beats-per-minute
(BPM) as a floating-point number. Using linear min-max normalization, we normalize tempo
into the [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ] range, following the approach of Zangerle et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        As Kowald et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] noted, there is a tendency for last.fm users to assign coarse-grained
genres such as “pop” and “rock” to tracks. Since these overly prevalent genres would distort our
model of the musical domain, we follow the approach from Kowald et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and use inverse
document frequency (IDF) to remove overly popular genres from the EmoMTB-dataset. To this
end, we view genres as terms and tracks as documents and determine a genre ’s IDF score as:
 () = log
      </p>
      <p>#
# ℎ_</p>
      <p>By visual inspection of the IDF scores of the top 100 genres we choose a cutof score of 0.903
(dashed line in Figure 1), and omit the genres “rock”, “pop”, “metal”, and “alternativerock” (in
ascending order of their respective IDF score) from our analysis. This removes 7 227 tracks from
the dataset (ca. 2.80%) that have no other except the excluded genres assigned. All remaining
tracks have at least one genre, with a mean of 4.59 (± 3.98) and a median of 3.</p>
      <p>For further analysis, we select all tracks that appear both in the filtered EmoMTB- and
CultMRS-dataset, which results in a total of 250 654 tracks and 2 126 genres.</p>
      <p>Prior to sampling the set of users for which we determine recommendations, we remove
users appearing as outliers in terms of the number of distinct tracks they listened to. Using
the full LFM-1b set of listening events, we count the unique tracks each user has listened
to and perform outlier detection using median absolute deviation (MAD) [17] with a “very
conservative” threshold of 3, i.e., we consider a user an outlier if the number of tracks they
have listened to falls outside of three median absolute deviations from the median. From the
resulting distribution, we draw a random sample of 1 000 users that have listened to between
200 and 500 tracks. From this sample, we select all tracks that also appear in our annotated sets,
resulting in a collection of 1 000 users that listened to 106 679 distinct tracks across 1 898 genres.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>In this section, we report on the methods employed to construct a knowledge graph of the
musical domain at hand. We then outline how we find candidate tracks for recommendation
and subsequently score these to arrive at a ranked list of tracks we recommend for a given user.
2https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features
1.8
1.6</p>
      <sec id="sec-4-1">
        <title>4.1. Knowledge Graph</title>
        <p>Tracks and genres are modelled as an undirected, bipartite graph  = (, , ) consisting
of two types of vertices: tracks  and genres  . For two vertices of distinct types the edge
(, ) ∈  if the track-genre-annotation is present in the dataset.</p>
        <p>We use the merged CultMRS and EmoMTB datasets (see Section 3) to construct this graph,
which has 252 780 vertices and 1 149 345 edges. The mean degree is 9.09 (± 201.43), the median
is 3. As can be seen in Figure 2, the degree distribution for the entire graph is long-tailed,
with the majority of vertices exhibiting a degree of 500 or less. Across all track-vertices, the
mean degree is 4.59 (± 3.98) with a median of 3, and across genre-vertices the mean is 540.61
(± 2130.59) with a median of 22.</p>
        <p>For each user, we extract a bipartite subgraph  = (, , ) from  containing only tracks
 the user has listened to, along with their genres . Furthermore, for each subgraph we
determine the one-mode projection onto genre-vertices,  = (,  ), where  denotes the
set of edges and an edge (1, 2) ∈  if these two genre-vertices share a common track-vertex
as their neighbour in the user-subgraph .</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Generating Recommendations</title>
        <p>To find candidate tracks for recommendation, we attempt to close triangles in users’
genresubgraphs . This relates to the idea of reciprocity in social networks, where closed
triangles indicate mutual ties among people knowing each other (“a friend of a friend is also
my friend”)[18]. Transferring this idea to the genre-subgraph, two genre-vertices may have
a mutual neighbour the user is unaware of. Incorporating this genre into the user’s listening
experience has the potential of venturing deeper into the global genre-graph, while maintaining
a strong connection with existing preferences and thus constituting a relevant discovery. As
we wish to form connections among genres in the one-mode projection, we need to introduce
vertices into the bipartite graph that are of the track-type in order for new edges to manifest in
the projection.</p>
        <p>Thus, we determine, for all pairs of genre-vertices ,  ∈  the user has listened to, the
set-intersection of their (open) neighborhood () in the main bipartite graph , resulting in
in a set of candidate tracks  that are annotated with both genres:
 =</p>
        <p>⋃︁
,∈</p>
        <p>() ∩ ( )</p>
        <p>From this set, we exclude any track-vertices  that are already present in the user’s listening
history:</p>
        <p>′ =  − {  :  ∈ }</p>
        <p>The remaining candidate tracks in ′ are scored to obtain a final ranked list of
recommendations.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Scoring</title>
        <p>We determine a ranked list of tracks by applying a scoring function to each candidate track. In
this work, we evaluate two scoring functions parameterized on a similarity metric and overall
popularity in order to balance similar items against popular ones.</p>
        <p>
          We use cosine similarity on vectors of acoustic features to determine how close two musical
items are perceptually. Each track is assigned a feature-vector ⃗ based on the CultMRS dataset.
For each user, we compute a profile-vector ⃗ resembling their listening profile by summarising
all feature-vectors of tracks in their history. Following Zangerle et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we apply MAD to
remove outliers from users’ listening histories in terms of acoustic features. We consider a track
an outlier if any component of its feature-vector falls outside of three MADs around the median
of that feature. The remaining tracks’ vectors are summarised into a single profile-vector using
the median of each component.
        </p>
        <p>
          As an indicator of popularity, we count in how many listening histories within our sample a
track appears, and scale this value into the range [
          <xref ref-type="bibr" rid="ref1">1, 1000</xref>
          ] using min-max normalization. In
our scoring functions, we opt for the square-root of this value to alleviate the impact of large
popularity scores.
        </p>
        <p>
          Both scoring functions are designed to assign higher scores to tracks with low popularity
and thus favour items that appear in the global long tail. In the same way as this increases the
recommender system’s coverage of the entire collection of music, it also opens up the door
for recommendations to feature items that appear novel and unexpected to the user. Here, we
follow Ge et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] who argue that serendipitous discovery may happen where relevant, yet
unexpected findings are encountered.
        </p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Popularity-discounted similarity</title>
          <p>In order to assign lower scores to highly popular tracks, we discount the cosine similarity of a
user’s profile vector ⃗ and a track’s feature vector ⃗ by the track’s popularity ():
() =
(⃗, ⃗)</p>
          <p>√︀()</p>
          <p>This scoring is applied to all tracks in the candidate set, and the resulting scores are put in
decreasing order. The top 20 tracks are then subjected to evaluation.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Logistic scoring</title>
          <p>To discount globally popular tracks while boosting low similarities and hence potentially novel
tracks, we employ scoring based on the logistic function:
() = 1 −</p>
          <p>1
1 + − (⃗,⃗)√()
1</p>
          <p>This is, in essence, a logistic function  () = 1+− (− 0) with  = (⃗, ⃗),  =
√︀() and 0 = 0; we use the complement to ensure a declining function rather than an
increasing one.
cosine similarity</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation and Discussion</title>
      <p>
        For each of the 1 000 users in our dataset, we evaluate how diverse and how novel the top-20
recommendations are using a set of measures outlined in the following sections. In addition,
we compare our results against a set of baseline algorithms, which we implement using the
Python-library Surprise [19]. As the algorithms we implement require explicit ratings, whereas
our dataset contains implicit ratings in the form of listening events, we count how often a track
appears in a user’s listening history and scale this value into the range [
        <xref ref-type="bibr" rid="ref1">1, 1000</xref>
        ] using min-max
normalization.
      </p>
      <p>The baselines that we use from Surprise are: NormalPredictor, which samples a
user-itemrating from a normal distribution with an estimated mean and standard deviation from the
existing user-item-ratings; BaselineOnly, which predicts a rating using a baseline estimate of
user-average and item-average deviations from the global mean rating [20]; and NMF,
nonnegative matrix factorization [21].</p>
      <p>All baseline algorithms are trained on the entire dataset, and predictions are generated for
the “anti-training-set”, i.e., all user-item pairs missing from the training set. NMF is trained
with the default parameters set by the Surprise library. As with our own approach, we return
the 20 highest-ranked predictions.</p>
      <p>In the following evaluation, we report measures for diversity and novelty across our two
scoring functions (see Section 4.3) “popularity-discounted similarity”, which we shall refer to
as PDS, and “logistic scoring”, which we refer to simply as logistic. The same measures are
computed for the baseline recommendations and contrasted with our approaches.</p>
      <sec id="sec-5-1">
        <title>5.1. Diversity</title>
        <p>
          Diversity of recommendations tries to capture how diferent recommendations are, either among
themselves, i.e., within a recommendation list, or on a global level, e.g., which proportion of all
items in the catalog appear as the output of a recommender system [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>We employ two measures concerned with recommendation lists: Intra-list Diversity and
Structural Diversity, a measure tailored to graphs.</p>
        <sec id="sec-5-1-1">
          <title>5.1.1. Intra-list Diversity</title>
          <p>Simply put, Intra-list Diversity (ILD) measures the average pairwise distance among all items in
a recommendation list  with respect to a distance measure :
() =</p>
          <p>1
||(|| − 1) ∈ ∈
∑︁ ∑︁ (, )</p>
          <p>In our case, || = 20, and  is the cosine distance, i.e., the complement to cosine similarity,
of the recommended tracks’ acoustic feature vectors.</p>
          <p>As Table 1 shows, NormalPredictor returns the overall most diverse recommendation lists.
All baseline algorithms perform considerably better in terms of ILD than our approach. Of our
approaches, logistic scoring performs better than PDS, indicating that by scoring dissimilar
content higher on a per-item basis, more diverse results can be achieved. To assess whether the
diferences between our scoring functions are statistically significant, we conduct a Wilcoxon
signed-rank test as the data are not normally distributed (Shapiro-Wilk test,   = 0.679,
 = 0.618,  ≤ 0.001); the diferences are significant at a level of  ≤ 0.001 ( =
211). All pairwise diferences between our own approaches and the baseline comparisons are
significant at  ≤ 0.001, as assessed through paired Wilcoxon signed rank tests.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Structural Diversity</title>
          <p>The notion of structural diversity has been used to assess connections between users in social
recommender systems [22, 23]. Sanz-Cruzado et al. [23] suggest a measure of connective
diversity based on the degree distribution of a graph and the Gini index. They argue that a
lfat or even distribution is indicative of a “distinctive social circle” of a user, whereas a highly
skewed degree distribution indicates that a few individuals form relationships to many others,
thus acting as hubs.</p>
          <p>We translate this view to our domain of tracks and genres, and argue that a highly skewed
degree distribution in a graph containing recommended tracks and genres indicates, in the most
extreme case, that all recommended tracks pertain to the same genre, resulting in a star-shaped
graph. The other extreme, a completely flat degree distribution, can be viewed as a lattice,
where each track is connected to the same number of genres, and vice versa.</p>
          <p>
            The Gini index, originally a gauge of income disparity, has been suggested to evaluate
diversity of recommendations [
            <xref ref-type="bibr" rid="ref6">24, 6, 25</xref>
            ]; we adopt the degree Gini complement (DGC) suggested
by Sanz-Cruzado et al. [23] with a slight modification for undirected graphs 3.
          </p>
          <p>Let ′ = ( ′,  ′, ′) denote a graph containing recommended tracks  ′ and genres  ′; |′|
the number of vertices in the graph; () the degree of a node  ∈ ′; and |′| the number
of edges in the graph. The degree Gini complement is then defined as:
|′|
∑︁(2 − | ′| − 1)
|′| − 1 =1</p>
          <p>1 ()
(′) = 1 − |′|
2</p>
          <p>By using the complement, we ensure that higher values indicate a more even or flat
distribution.</p>
          <p>For all recommendation lists returned by the employed algorithms, we construct an
induced subgraph ′ of the bipartite graph , containing only tracks and vertices appearing as
recommendations, and determine the mean DGC across all users (see Table 2).</p>
          <p>Logistic scoring appears to produce the most evenly distributed recommendation graphs in
terms of degree. Except NormalPredictor ( = 0.547) all other data is non-normally distributed,
according to a Shapiro-Wilk test. Therefore, we again determine statistical significance of the
results of logistic scoring against all other algorithms using a Wilcoxon signed rank test. With
the exception of PDS ( ≤ 0.001), none of the observed diferences are significant at  ≤ 0.05.</p>
          <p>We thus assess that, while not significantly outperforming the baseline algorithms, logistic
scoring tends to produce more diverse recommendations in terms of structural diversity than
the simpler discounting approach.
3We use 12 |′| in the denominator to avoid counting edges twice</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Novelty</title>
        <p>
          Diferent notions of novelty exist for recommender systems, again on a local, per-user or
peritem level, as well as a global level [23, 26]. It has also been argued that a system produces novel
recommendations if its output difers from that of a primitive model generating more “expected”
recommendations [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>In this work, we examine novelty both from a user-perspective, as well as in comparison to
the baseline algorithms.</p>
        <sec id="sec-5-2-1">
          <title>5.2.1. Unexpectedness</title>
          <p>
            The notion of unexpectedness lacks a clear definition. Whereas Ge et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] view it as the
deviation from a primitive reference system, Castells et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] approach it via user-specific
unexpectedness based on item attributes. Thus, a recommended item appears unexpected to
the user if it exhibits high distance (low similarity) to their previous experience.
          </p>
          <p>We adopt this measurement of user-specific unexpectedness and determine, for each item
in a user’s recommendation list , the mean cosine-distance  to all items they have
previously experienced in our dataset, H (their listening history):
 () =
1</p>
          <p>∑︁ ∑︁ (, ℎ)
|||| ∈ ℎ∈</p>
          <p>Table 3 lists mean   for all recommendation lists. Logistic scoring seems to return
the most unexpected tracks to users. A Shapiro-Wilk test for normality indicated non-normal
distributions for all data on the significance level  ≤ 0.001, hence we test for significance of
diferences between logistic scoring and the other algorithms using Wilcoxon signed-rank tests,
as before. All diferences are significant at  ≤ 0.001.</p>
          <p>We attribute the high unexpectedness of results from logistic scoring to the fact that items
with low similarity to the user-profile are ranked more highly, providing they also exhibit low
popularity.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Jaccard Index</title>
          <p>We further assess novelty using the Jaccard index on items produced by the PDS and logistic
scoring methods, versus those of the baseline algorithms, which we thus treat as reference</p>
          <p>BaselineOnly
NormalPredictor
logistic</p>
          <p>PDS
 (, ) = | ∩ |
| ∪ |</p>
          <p>To this end, we take the recommendation lists for a user and determine the Jaccard index
pairwise on all algorithms, to ensure that we compare the same sets of recommended tracks for
a user, and report the mean index value per algorithm. As the heatmap in Figure 4 shows, mean
Jaccard indexes tend to be close to 0.00, with the exception of the pair (NMF, BaselineOnly), for
which we observe a value of 0.055.</p>
          <p>From these results, we conclude neither popularity-discounted similarity nor logistic scoring
produces recommendations which would be expected by a baseline algorithm.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Summary of results</title>
        <p>In summary, we observe that logistic scoring, which assigns higher ranks to less popular and
more diferent items, results in a more diverse list of recommendations in terms of Intra-list
Diversity than the simpler popularity-discounted similarity. In terms of structural
degreediversity, logistic scoring outperforms all other models, although significance at  ≤ 0.001
was only observed in comparison to PDS. Logistic scoring also produces the most unexpected
recommendations when compared to the other models, with statistical significance. Using
the Jaccard index, we determined that none of our scoring functions result in “predictable”
recommendations that could be expected from the baseline algorithms.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we constructed a knowledge graph of the musical domain consisting of tracks and
their genre-annotations. Using this knowledge graph and contextual information on tracks in the
form of acoustic features, we generate a set of recommendation tracks for a sample of 1 000 users
drawn from the LFM-1b dataset. We aim to suggest items that tie in with users’ previous musical
interest, but are tailored towards novelty and diversity, as opposed to reproducing previous
experience, thus aiming to foster serendipitous discovery of tracks of music. We compare our
results to those of several baseline algorithms and find that while our recommendations exhibit
higher novelty in terms of user-specific unexpectedness, the baselines tend to perform better in
terms of Intra-list Diversity.</p>
      <p>We attribute these results to two reasons: 1), our approach assigns high scores to tracks
that are at the same time far from the user’s listening profile and unpopular on a global level,
thus favouring more novel music with relation to a user’s previous encounters; and 2), as we
determine candidate tracks for recommendations from pairs of genres the user has already
experienced, they are likely to exhibit similar acoustic features, thus exhibiting lower diversity
on a per-item level. As an example, consider two tracks being recommended that are both part
of the “jazz” and “blues” genres. These tracks are likely more similar in a musical sense than one
track from the “blues” and one from the “blackmetal” genre. This is a constraint inherent to our
approach, as we aim to suggest novel, but relevant tracks, and ensure relevance by grounding
recommendations on genres the user has previously been exposed to.</p>
      <p>It must also be noted that the datasets we base our study on may be missing tracks the user
has indeed consumed, but that either did not have any (meaningful) genres assigned in the form
of last.fm tags, or whose acoustic features were not retrievable from the Spotify-API. In addition,
as applies to many other recommendation settings, the system only has a partial view on users’
previous exposure to items. As such, recommendations may well contain content the user is
familiar with, but that they have consumed via other on- or ofline media.</p>
      <p>As we validated our results in an ofline setting, it is not possible to reliably state the utility
of our recommendations to users, i.e., if they would indeed consider listening to these tracks.
To gain deeper insight, an online evaluation in the form of a user-study would be required.</p>
      <p>Lastly, although we conducted our work in the domain of music, we believe the ideas extend to
other settings, e.g., digital archives. Given an ontology of the domain and a notion of similarity
between items, a set of candidates can be ranked to favour those more diferent to a reference
collection. Similarity may be determined in terms of documents’ metadata, or latent features,
e.g., topic- or sentiment-models.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This article is the outcome of research conducted within the Africa Multiple Cluster of Excellence
at the University of Bayreuth, funded by the Deutsche Forschungsgemeinschaft (DFG, German
Research Foundation) under Germany’s Excellence Strategy – EXC 2052/1 – 390713894.
Fusing acoustic and cultural cues, Transactions of the International Society for Music
Information Retrieval 3 (2020) 1–16. doi:10.5334/tismir.37.
[15] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of
the 2016 ACM on International Conference on Multimedia Retrieval, ACM, 2016. doi:http:
//dx.doi.org/10.1145/2911996.2912004.
[16] M. Schedl, M. Mayr, P. Knees, Music Tower Blocks: Multi-Faceted Exploration Interface
for Web-Scale Music Access, in: Proceedings of the 2020 International Conference on
Multimedia Retrieval, Association for Computing Machinery, New York, NY, USA, 2020,
pp. 388–392. doi:10.1145/3372278.3391928.
[17] C. Leys, C. Ley, O. Klein, P. Bernard, L. Licata, Detecting outliers: Do not use standard
deviation around the mean, use absolute deviation around the median, Journal of Experimental
Social Psychology 49 (2013) 764–766. doi:10.1016/j.jesp.2013.03.013.
[18] S. Wasserman, K. Faust, S. U. o. I. W. Urbana-Champaign), Social Network Analysis:</p>
      <p>Methods and Applications, Cambridge University Press, 1994.
[19] N. Hug, NicolasHug/Surprise, 2022. URL: https://github.com/NicolasHug/Surprise.
[20] Y. Koren, Factor in the neighbors, ACM Transactions on Knowledge Discovery from Data
4 (2010) 1–24. doi:10.1145/1644873.1644874.
[21] X. Luo, M. Zhou, Y. Xia, Q. Zhu, An Eficient Non-Negative Matrix-Factorization-Based
Approach to Collaborative Filtering for Recommender Systems, IEEE Transactions on
Industrial Informatics 10 (2014) 1273–1284. doi:10.1109/TII.2014.2308433.
[22] X. L. Huang, M. Tiwari, S. Shah, Structural diversity in social recommender systems, in:
Proceedings of the 5th ACM RecSys Workshop on Recommender Systems and the Social
Web, Citeseer, 2013.
[23] J. Sanz-Cruzado, S. M. Pepa, P. Castells, Structural Novelty and Diversity in Link Prediction,
in: Companion Proceedings of the The Web Conference 2018, WWW ’18, International
World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE,
2018, pp. 1347–1351. doi:10.1145/3184558.3191576.
[24] D. M. Fleder, K. Hosanagar, Recommender systems and their impact on sales diversity, in:
Proceedings of the 8th ACM conference on Electronic commerce, EC ’07, Association for
Computing Machinery, New York, NY, USA, 2007, pp. 192–199. doi:10.1145/1250910.
1250939.
[25] A. Gunawardana, G. Shani, Evaluating Recommender Systems, Springer US, Boston, MA,
2015, pp. 265–308. doi:10.1007/978-1-4899-7637-6_8.
[26] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender
systems, in: Proceedings of the fifth ACM conference on Recommender systems, RecSys
’11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 109–116. doi:10.
1145/2043932.2043955.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>McNee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , G. Lausen,
          <article-title>Improving recommendation lists through topic diversification</article-title>
          ,
          <source>in: Proceedings of the 14th international conference on World Wide Web, WWW '05</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2005</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>32</lpage>
          . doi:
          <volume>10</volume>
          .1145/1060745.1060754.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <article-title>Maximizing aggregate recommendation diversity: A graphtheoretic approach</article-title>
          ,
          <source>in: Proc. of the 1st International Workshop on Novelty and Diversity in Recommender Systems (DiveRS</source>
          <year>2011</year>
          ), Citeseer,
          <year>2011</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delgado-Battenfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>Beyond accuracy: evaluating recommender systems by coverage and serendipity</article-title>
          ,
          <source>in: Proceedings of the fourth ACM conference on Recommender systems, RecSys '10</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2010</year>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>260</lpage>
          . doi:
          <volume>10</volume>
          .1145/1864708.1864761.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Iaquinta</surname>
          </string-name>
          , M. de Gemmis, P. Lops, G. Semeraro,
          <string-name>
            <given-names>M.</given-names>
            <surname>Filannino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Molino</surname>
          </string-name>
          ,
          <article-title>Introducing serendipity in a content-based recommender system</article-title>
          , in: 2008 Eighth International Conference on Hybrid Intelligent Systems, IEEE,
          <year>2008</year>
          . doi:
          <volume>10</volume>
          .1109/his.
          <year>2008</year>
          .
          <volume>25</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y. C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Ó</given-names>
            <surname>Séaghdha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Quercia</surname>
          </string-name>
          , T. Jambor,
          <article-title>Auralist: introducing serendipity into music recommendation, in: Proceedings of the fifth ACM international conference on Web search and data mining</article-title>
          ,
          <source>WSDM '12</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2012</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          . doi:
          <volume>10</volume>
          .1145/2124295.2124300.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Hurley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vargas</surname>
          </string-name>
          ,
          <source>Novelty and Diversity in Recommender Systems</source>
          , Springer US, Boston, MA,
          <year>2015</year>
          , pp.
          <fpage>881</fpage>
          -
          <lpage>918</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-1-
          <fpage>4899</fpage>
          -7637-6_
          <fpage>26</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A graph-based recommender system for digital library</article-title>
          ,
          <source>in: Proceedings of the second ACM/IEEE-CS joint conference on Digital libraries - JCDL '02</source>
          , ACM Press,
          <year>2002</year>
          . doi:
          <volume>10</volume>
          .1145/544220.544231.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tommasel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Godoy</surname>
          </string-name>
          , I Want to Break Free!
          <article-title>Recommending Friends from Outside the Echo Chamber</article-title>
          ,
          <source>in: Fifteenth ACM Conference on Recommender Systems</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>33</lpage>
          . doi:
          <volume>10</volume>
          .1145/ 3460231.3474270.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Escaping your comfort zone: A graph-based recommender system for ifnding novel recommendations among relevant items</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>42</volume>
          (
          <year>2015</year>
          )
          <fpage>4851</fpage>
          -
          <lpage>4858</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.eswa.
          <year>2014</year>
          .
          <volume>07</volume>
          .024.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Azaria</surname>
          </string-name>
          , T. Mitchell,
          <article-title>An entity graph based recommender system</article-title>
          ,
          <source>AI</source>
          Communications
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>141</fpage>
          -
          <lpage>149</lpage>
          . doi:
          <volume>10</volume>
          .3233/AIC-170728.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Ostuni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <article-title>Sound and Music Recommendation with Knowledge Graphs</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>8</volume>
          (
          <year>2016</year>
          )
          <volume>21</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          :
          <fpage>21</fpage>
          . doi:
          <volume>10</volume>
          .1145/2926718.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schoenfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pfefer</surname>
          </string-name>
          ,
          <article-title>Shortest path-based centrality metrics in attributed graphs with node-individual context constraints, Social Networks (</article-title>
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .1016/j.socnet.
          <year>2021</year>
          .
          <volume>10</volume>
          .004.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kowald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Muellner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          , E. Lex,
          <article-title>Support the underground: characteristics of beyond-mainstream music listeners</article-title>
          ,
          <source>EPJ Data Science</source>
          <volume>10</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          . 1140/epjds/s13688-021-00268-9.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pichl</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Schedl, User models for culture-aware music recommendation:</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>