=Paper=
{{Paper
|id=Vol-1673/paper5
|storemode=property
|title=RDF Graph Embeddings for Content-based Recommender Systems
|pdfUrl=https://ceur-ws.org/Vol-1673/paper5.pdf
|volume=Vol-1673
|authors=Jessica Rosati,Petar Ristoski,Tommaso Di Noia,Renato De Leone,Heiko Paulheim
|dblpUrl=https://dblp.org/rec/conf/recsys/RosatiRNLP16
}}
==RDF Graph Embeddings for Content-based Recommender Systems==
<pdf width="1500px">https://ceur-ws.org/Vol-1673/paper5.pdf</pdf>
<pre>
 RDF Graph Embeddings for Content-based Recommender
                      Systems

                                               Jessica Rosati1,2                         Petar Ristoski
                                          1
                                           University of Camerino –             Data and Web Science Group,
                                        Piazza Cavour 19/f – 62032               University of Mannheim, B6,
                                               Camerino, Italy                      26, 68159 Mannheim,
                                       2
                                         Polytechnic University of Bari                    Germany
                                          – Via Orabona, 4 – 70125              petar.ristoski@informatik.uni-
                                                  Bari, Italy                         mannheim.de
                            jessica.rosati@unicam.it
                    Tommaso Di Noia         Renato De Leone                                                    Heiko Paulheim
              Polytechnic University of Bari                        University of Camerino –           Data and Web Science Group,
               – Via Orabona, 4 – 70125                           Piazza Cavour 19/f – 62032            University of Mannheim, B6,
                       Bari, Italy                                      Camerino, Italy                    26, 68159 Mannheim,
             tommaso.dinoia@poliba.it renato.deleone@unicam.it                                                    Germany
                                                                                                          heiko@informatik.uni-
                                                                                                              mannheim.de

ABSTRACT                                                                                    recommendation approaches is that the information on which
Linked Open Data has been recognized as a useful source                                     they rely is generally insufficient to elicit user’s interests and
of background knowledge for building content-based rec-                                     characterize all the aspects of her interaction with the sys-
ommender systems. Vast amount of RDF data, covering                                         tem. This is the main drawback of the approaches built
multiple domains, has been published in freely accessible                                   on textual and keyword-based representations, which can-
datasets. In this paper, we present an approach that uses                                   not capture complex relations among objects since they lack
language modeling approaches for unsupervised feature ex-                                   the semantics associated to their attributes. A process of
traction from sequences of words, and adapts them to RDF                                    “knowledge infusion” [40] and semantic analysis has been
graphs used for building content-based recommender sys-                                     proposed to face this issue, and numerous approaches that
tem. We generate sequences by leveraging local information                                  incorporate ontological knowledge have been proposed, giv-
from graph sub-structures and learn latent numerical rep-                                   ing rise to the newly defined class of semantics-aware content-
resentations of entities in RDF graphs. Our evaluation on                                   based recommender systems [6]. More recently the Linked
two datasets in the domain of movies and books shows that                                   Open Data (LOD) initiative [3] has opened new interesting
feature vector representations of general knowledge graphs                                  possibilities to realize better recommendation approaches.
such as DBpedia and Wikidata can be effectively used in                                     The LOD initiative in fact gave rise to a variety of open
content-based recommender systems.                                                          knowledge bases freely accessible on the Web and being
                                                                                            part of a huge decentralized knowledge base, the LOD cloud,
                                                                                            where each piece of little knowledge is enriched by links to re-
Categories and Subject Descriptors                                                          lated data. LOD is an open, interlinked collection of datasets
H.3.3 [Information Systems]: Information Search and Re-                                     in machine-interpretable form, built on World Wide Web
trieval                                                                                     Consortium (W3C) standards as RDF1 , and SPARQL2 . Cur-
                                                                                            rently the LOD cloud consists of about 1, 000 interlinked
                                                                                            datasets covering multiple domains from life science to gov-
Keywords                                                                                    ernment data [39]. It has been shown that LOD is a valu-
Recommender System; Graph Embeddings; Linked Open Data able source of background knowledge for content-based rec-
                                                                                            ommender systems in many domains [12]. Given that the
1. INTRODUCTION                                                                             items to be recommended are linked to a LOD dataset, in-
                                                                                            formation from LOD can be exploited to determine which
   One of the main limitations of traditional content-based
                                                                                            items are considered to be similar to the ones that the user
Permission to make digital or hard copies of all or part of this work for personal or       has consumed in the past, allowing to discover hidden infor-
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-      mation and implicit relations between objects [26]. While
tion on the first page. Copyrights for components of this work owned by others than         LOD is rich in high quality data, it is still challenging to
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-        find effective and efficient way of exploiting the knowledge
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
                                                                                            for content-based recommendations. So far, most of the pro-
CBRecSys 2016, September 16, 2016, Boston, MA, USA.
                                                                                  1
Copyright remains with the authors and/or original copyright holders                http://www.w3.org/TR/2004/
c 2016 ACM. ISBN 978-1-4503-2138-9.                                               REC-rdf-concepts-20040210/, 2004.
                                                                                  2
DOI: 10.1145/1235                                                                   http://www.w3.org/TR/rdf-sparql-query/, 2008
posed approaches in the literature are supervised or semi-        content becomes essential. Another hybrid approach is pro-
supervised, which means cannot work without human inter-          posed in [36], which builds on training individual base rec-
action.                                                           ommenders and using global popularity scores as generic rec-
   In this work, we adapt language modeling approaches for        ommenders. The results of the individual recommenders are
latent representation of entities in RDF graphs. To do so, we     combined using stacking regression and rank aggregation.
first convert the graph into a set of sequences of entities us-   Most of these approaches can be referred to as top-down ap-
ing graph walks. In the second step, we use those sequences       proaches [6], since they rely on the integration of external
to train a neural language model, which estimates the likeli-     knowledge and cannot work without human intervention.
hood of a sequence of entities appearing in the graph. Once       On the other side, bottom-up approaches ground on the dis-
the training is finished, each entity in the graph is repre-      tributional hypothesis [15] for language modeling, according
sented with a vector of latent numerical values. Projecting       to which the meaning of words depends on the context in
such latent representation of entities into a lower dimen-        which they occur, in some textual content. The resulting
sional feature space shows that semantically similar entities     strategy is therefore unsupervised, requiring a corpora of
appear closer to each other. Such entity vectors can be di-       textual documents for training as large as possible. Ap-
rectly used in a content-based recommender system.                proaches based on the distributional hypothesis, referred to
   In this work, we utilize two of the most prominent RDF         as discriminative models, behave as word embeddings tech-
knowledge graphs [29], i.e. DBpedia [18] and Wikidata [42].       niques where each term (and document) becomes a point
DBpedia is a knowledge graph which is extracted from struc-       in the vector space. They substitute the term-document
tured data in Wikipedia. The main source for this extraction      matrix typical of Vector Space Model with a term-context
are the key-value pairs in the Wikipedia infoboxes. Wiki-         matrix on which they apply dimensionality reduction tech-
data is a collaboratively edited knowledge graph, operated        niques such as Latent Semantic Indexing (LSI) [8] and the
by the Wikimedia foundation3 that also hosts various lan-         more scalable and incremental Random Indexing (RI) [38].
guage editions of Wikipedia.                                      The latter has been involved in [22] and [23] to define the
   The rest of this paper is structured as follows. In Sec-       so called enhanced Vector Space Model (eVSM) for content-
tion 2, we give an overview of related work. In Section 3, we     based RS, where user’s profile is incrementally built sum-
introduce our approach, followed by an evaluation in Sec-         ming the features vectors representing documents liked by
tion 4. We conclude with a summary and an outlook on              the user and a negation operator is introduced to take into
future work.                                                      account also negative preferences.
                                                                     Word embedding techniques are not limited to LSI and RI.
                                                                  The word2vec strategy has been recently presented in [19]
2.     RELATED WORK                                               and [20], and to the best of our knowldge, has been applied
   It has been shown that LOD can improve recommender             to item recommendations in a few works [21, 28]. In partic-
systems towards a better understanding and representation         ular, [21] is an empirical evaluation of LSI, RI and word2vec
of user preferences, item features, and contextual signs they     to make content-based movie recommendation exploiting
deal with. LOD has been used in content-based, collabo-           textual information from Wikipedia, while [28] deals with
rative, and hybrid techniques, in various recommendation          check-in venue (location) recommendations and adds a non-
tasks, i.e., rating prediction, top-N recommendations and         textual feature, the past check-ins of the user. They both
improving of diversity in content-based recommendations.          draw the conclusion that word2vec techniques are promising
LOD datasets, e.g. DBpedia, have been used in content-            for the recommendation task. Finally there is a single exam-
based recommender systems in [11] and [12]. The former            ple of product embedding [14], namely prod2vec, which oper-
performs a semantic expansion of the item content based on        ates on the artificial graph of purchases, treating a purchase
ontological information extracted from DBpedia and Linked-        sequence as a “sentence” and products within the sequence
MDB [16], the first open semantic web database for movies,        as words.
and tries to derive implicit relations between items. The lat-
ter involves DBpedia and LinkedMDB too, but is an adapta-
tion of the Vector Space Model to Linked Open Data: it rep-
                                                                  3.    APPROACH
resents the RDF graph as a 3-dimensional tensor where each           In our approach, we adapt neural language models for
slice is an ontological property (e.g. starring, director,...)    RDF graph embeddings. Such approaches take advantage
and represents its adjacency matrix. It has been proven           of the word order in text documents, explicitly modeling
that leveraging LOD datasets is also effective for hybrid         the assumption that closer words in the word sequence are
recommender systems [4], that is in those approaches that         statistically more dependent. In the case of RDF graphs, we
boost the collaborative information with additional knowl-        follow the approach sketched in [37], considering entities and
edge, such as the item content. In [10] the authors propose       relations between entities instead of word sequences. Thus,
SPRank, a hybrid recommendation algorithm that extracts           in order to apply such approaches on RDF graph data, we
semantic path-based features from DBpedia and uses them           have to transform the graph data into sequences of entities,
to compute top-N recommendations in a learning to rank            which can be considered as sentences. After the graph is
approach and in multiple domains, movies, books and mu-           converted into a set of sequences of entities, we can train
sical artists. SPRank is compared with numerous collab-           the same neural language models to represent each entity in
orative approaches based on matrix factorization [17, 34]         the RDF graph as a vector of numerical values in a latent
and with other hybrid RS, such as BPR-SSLIM [25], and             feature space. Such entity vectors can be directly used in a
exhibits good performance especially in those contexts char-      content-based recommender system.
acterized by high sparsity, where the contribution of the
                                                                  3.1   RDF Graph Sub-Structures Extraction
3
    http://wikimediafoundation.org/                                 We propose random graph walks as an approach for con-
verting graphs into a set of sequences of entities.               words w1 , w2 , w3 , ..., wT , and a context window c, the ob-
                                                                  jective of the CBOW model is to maximize the average log
  Definition 1. An RDF graph is a graph G = (V, E),               probability:
where V is a set of vertices, and E is a set of directed edges.                          T
                                                                                 1 X
                                                                                        log p(wt |wt−c ...wt+c ),            (1)
   The objective of the conversion functions is for each vertex                  T t=1
v ∈ V to generate a set of sequences Sv , where the first         where the probability p(wt |wt−c ...wt+c ) is calculated using
token of each sequence s ∈ Sv is the vertex v followed by a       the softmax function:
sequence of tokens, which might be edges, vertices, or any                                                      0
                                                                                                      exp(v̄ T vw   )
substructure extracted from the RDF graph, in an order                        p(wt |wt−c ...wt+c ) = PV           t
                                                                                                                         ,   (2)
                                                                                                          exp(v̄  T v0 )
that reflects the relations between the vertex v and the rest                                         w=1             w
                                                                           0
of the tokens, as well as among those tokens.                     where vw is the output vector of the word w, V is the com-
   In this approach, for a given graph G = (V, E), for each       plete vocabulary of words, and v̄ is the averaged input vector
vertex v ∈ V we generate all graph walks Pv of depth d            of all the context words:
rooted in the vertex v. To generate the walks, we use the                                   1      X
breadth-first algorithm. In the first iteration, the algorithm                        v̄ =               vwt+j               (3)
                                                                                            2c
                                                                                               −c≤j≤c,j6=0
generates paths by exploring the direct outgoing edges of the
root node vr . The paths generated after the first iteration      3.2.2      Skip-Gram Model
will have the following pattern vr ->e1i , where i ∈ E(vr ).        The Skip-Gram model does the inverse of the CBOW
In the second iteration, for each of the previously explored      model and tries to predict the context words from the tar-
edges the algorithm visits the connected vertices. The paths      get words. More formally, given a sequence of training words
generated after the second iteration will follow the following    w1 , w2 , w3 , ..., wT , and a context window c, the objective of
pattern vr ->e1i ->v1i . The algorithm continues until d          the skip-gram model is to maximize the following average
iterations are reached. The final set of sequences for the        log probability:
given graph G is the union of the sequences of all the vertices
S                                                                                    T
  v∈V Pv .
                                                                                  1 X        X
                                                                                                        log p(wt+j |wt ),      (4)
                                                                                  T t=1
                                                                                          −c≤j≤c,j6=0
3.2     Neural Language Models – word2vec                         where the probability p(wt+j |wt ) is calculated using the soft-
   Until recently, most of the Natural Language Processing        max function:
systems and techniques treated words as atomic units, rep-                                            0T
resenting each word as a feature vector using a one-hot rep-                                     exp(vwo vwi )
                                                                                  p(wo |wi ) = PV                 ,            (5)
                                                                                                    exp(v 0T
resentation, where a word vector has the same length as the                                     w=1       w vwi )
size of a vocabulary. In such approaches, there is no notion of                    0
                                                                  where vw and vw    are the input and the output vector of the
semantic similarity between words. While such approaches          word w, and V is the complete vocabulary of words.
are widely used in many tasks due to their simplicity and            In both cases, calculating the softmax function is compu-
robustness, they suffer from several drawbacks, e.g., high di-    tationally inefficient, as the cost for computing is propor-
mensionality and severe data sparsity, which limit the per-       tional to the size of the vocabulary. Therefore, two opti-
formance of such techniques. To overcome such limitations,        mization techniques have been proposed, i.e., hierarchical
neural language models have been proposed, inducing low-          softmax and negative sampling [20]. The empirical studies
dimensional, distributed embeddings of words by means of          show that in most cases negative sampling leads to better
neural networks. The goal of such approaches is to estimate       performances than hierarchical softmax, which depends on
the likelihood of a specific sequence of words appearing in a     the selected negative samples, but it has higher runtime.
corpus, explicitly modeling the assumption that closer words         Once the training is finished, semantically similar words
in the word sequence are statistically more dependent.            appear close to each other in the feature space. Furthermore,
   While some of the initially proposed approaches suffered       basic mathematical functions can be performed on the vec-
from inefficient training of the neural network models, with      tors, to extract different relations between the words.
the recent advancements in the field several efficient ap-
proaches has been proposed. One of the most popular and           4.     EVALUATION
widely used is the word2vec neural language model [19, 20].
Word2vec is a particularly computationally-efficient two-layer       We evaluate different variants of our approach on two dis-
neural net model for learning word embeddings from raw            tinct datasets, and compare them to common approaches
text. There are two different algorithms, the Continuous          for creating content-based item representations from LOD
Bag-of-Words model (CBOW) and the Skip-Gram model.                and with state of the art collaborative approaches. Further-
                                                                  more, we investigate the use of two different LOD datasets
3.2.1    Continuous Bag-of-Words Model                            as background knowledge, i.e., DBpedia and Wikidata.
   The CBOW model predicts target words from context              4.1      Datasets
words within a given window.The input layer is comprised
                                                                    In order to test the effectiveness of our proposal, we eval-
from all the surrounding words for which the input vectors
                                                                  uate it in terms of ranking accuracy and aggregate diversity
are retrieved from the input weight matrix, averaged, and
                                                                  on two datasets belonging to different domains, i.e. Movie-
projected in the projection layer. Then, using the weights
                                                                  lens 1M4 for movies and LibraryThing5 for books. The
from the output weight matrix, a score for each word in the
                                                                  4
vocabulary is computed, which is the probability of the word          http://grouplens.org/datasets/movielens/
                                                                  5
being a target word. Formally, given a sequence of training           https://www.librarything.com/
former contains 1 million 1-5 stars ratings from 6,040 users
on 3,883 movies. The LibraryThing dataset contains more                                          N
than 2 millions ratings from 7,564 users on 39,515 books.                                    1   X  2rel(u,i) − 1
                                                                              nDCG@ N =        ·                             (6)
As there are many duplicated ratings in the dataset, when                                  iDCG i=1 log2 (1 + i)
a user has rated more than once the same item, we select
                                                                    where rel(u, i) is a boolean function representing the rel-
her last rating. This choice brings to have 626,000 rat-
                                                                 evance of item i for user u and iDCG is a normalization
ings in the range from 1 to 10. The user-item interactions
                                                                 factor that sets nDCG@ N value to 1 when an ideal ranking
contained in the datasets are enriched with side informa-
                                                                 is returned [2]. As suggested in [41] and set up in [10], in
tion thanks to the item mapping and linking to DBpedia
                                                                 the computation of nDCG@N we fixed a default “neutral”
technique detailed in [27], whose dump is available at http:
                                                                 value for those items with no ratings, i.e. 3 for Movielens
//sisinflab.poliba.it/semanticweb/lod/recsys/datasets/. In
                                                                 and 5 for LibraryThing.
the attempt to reduce the popularity bias from our final
                                                                    Providing accurate recommendations has been recognized
evaluation we decided to remove the top 1% most popular
                                                                 as just one of the main task a recommender system must be
items from both datasets [5]. Moreover we keep out, from
                                                                 able to perform. We therefore evaluate the contribution of
LibraryThing, users with less than five ratings and items
                                                                 our latent features in terms of aggregate diversity, and more
rated less than five times, and to have a dataset character-
                                                                 specifically by means of catalog coverage and Gini coeffi-
ized by lower sparsity we retain for Movielens only users
                                                                 cient [1]. The catalog coverage represents the percentage of
with at least fifty ratings, as already done in [10]. Table 1
                                                                 available candidate items recommended at least once. It is
contains the final statistics for our datasets.
                                                                 an important quality dimension for both user and business
                                                                 perspective [13], since it exhibits the capacity to not settle
                          Movielens      LibraryThing
                                                                 just on a subset of items (e.g. the most popular). This met-
      Number of users       4,186            7,149
                                                                 ric however should be supported by a distribution metric
      Number of items       3,196            4,541               which has to show the ability of a recommendation engine
      Number of ratings    822,597          352,123              to equally spread out the recommendations across all users.
      Data sparsity        93.85%           98.90%               Gini coefficient [1] is used for this purpose, since it measures
                                                                 the concentration degree of top-N recommendations across
       Table 1: Statistics about the two datasets                items and is defined as
                                                                                           n                       
                                                                                          X     n+1−i          rec(i)
                                                                            Gini = 2                       ·                  (7)
4.1.1     RDF Embeddings                                                                  i=1
                                                                                                  n+1           total
   As RDF datasets we use DBpedia and Wikidata.
                                                                   In Equation (7), n is the number of candidate items avail-
   We use the English version of the 2015-10 DBpedia dataset,
                                                                 able for recommendation, total represents the total num-
which contains 4, 641, 890 instances and 1, 369 mapping-based
                                                                 ber of top-N recommendations made across all users, and
properties. In our evaluation we only consider object prop-
                                                                 rec(i) is the number of users to whom item i has been rec-
erties, and ignore the data properties and literals.
                                                                 ommended. Gini coefficient gives therefore an idea of the
   For the Wikidata dataset we use the simplified and de-
                                                                 “equity” in the distribution of the items. It is worth to re-
rived RDF dumps from 2016-03-286 . The dataset contains
                                                                 mind that we are following the notion given in [1], where
17, 340, 659 entities in total. As for the DBpedia dataset, we
                                                                 the complement of the standard Gini coefficient is used, so
only consider object properties, and ignore the data proper-
                                                                 that higher values correspond to more balanced recommen-
ties and literals.
                                                                 dations.
4.2     Evaluation Protocol                                      4.3      Experimental Setup
   As evaluation protocol for our comparison, we adopted the        The first step of our approach is to convert the RDF
all unrated items methodology presented in [41] and already      graphs into a set of sequences. Therefore, to extract the
used in [10]. Such methodology asks to predict a score for       entities embeddings for the large RDF datasets, we use only
each item not rated by a user, irrespective of the existence     random graph walks entity sequences. More precisely, we
of an actual rating, and to compare the recommendation list      follow the approach presented in [32] to generate only a lim-
with the test set.                                               ited number of random walks for each entity. For DBpedia,
   The metrics involved in the experimental comparison are       we experiment with 500 walks per entity with depth of 4
precision, recall and nDCG as accuracy metrics, and cata-        and 8, while for Wikidata, we use only 200 walks per entity
log coverage and Gini coefficient for the aggregate diversity.   with depth of 4. Additionally, for each entity in DBpedia
precision@N represents the fraction of relevant items in the     and Wikidata, we include all the walks of depth 2, i.e., di-
top-N recommendations. recall@N indicates the fraction of        rect outgoing relations. We use the corpora of sequences to
relevant items, in the user test set, occurring in the top-N     build both CBOW and Skip-Gram models with the follow-
list. As relevance threshold, we set 4 for Movielens and 8 for   ing parameters: window size = 5; number of iterations =
LibraryThing, as previously done in [10]. Although preci-        5; negative sampling for optimization; negative samples =
sion and recall are good indicators to evaluate the accuracy     25; with average input vector for CBOW. We experiment
of a recommendation engine, they are not rank-sensitive.         with 200 and 500 dimensions for the entities’ vectors. All
nDCG@N [2] instead takes into account also the position in       the models are publicly available7 .
the recommendation list, being defined as                           We compare our approach to several baselines. For gener-
6                                                                ating the data mining features, we use three strategies that
  http://tools.wmflabs.org/wikidata-exports/rdf/index.
                                                                 7
php?content=dump\ download.php\&dump=20160328                        http://data.dws.informatik.uni-mannheim.de/rdf2vec/
take into account the direct relations to other resources in      where ratedItems(u) is the set of items already evaluated
the graph [30], and two strategies for features derived from      by user u, ru,j indicates the rating for item j by user u
graph sub-structures [7]:                                         and cosineSim(j, i) is the cosine similarity score between
                                                                  items j and i. In our experiments, the size of the considered
    • Features derived from specific relations. In the ex-        neighbourhood is limited to 5. The computation of recom-
       periments we use the relations rdf:type (types), and       mendations has been done with the publicly available library
       dcterms:subject (categories) for datasets linked to DB-    RankSys9 . All the results have been computed @10, that is
       pedia.                                                     considering the top-10 lists recommended to the users: pre-
                                                                  cision, recall and nDCG are computed for each user and then
    • Features derived from generic relations, i.e., we gen-      averaged across all users, while diversity metrics are global
       erate a feature for each incoming (rel in) or outgoing     measures.
       relation (rel out) of an entity, ignoring the value of the    Tables 2 and 3 contain the values of precision, recall and
       relation.                                                  nDCG, respectively for Movielens and LibraryThing, for
                                                                  each kind of features we want to test. The best approach
    • Features derived from generic relations-values, i.e, we
                                                                  for both datasets is retrieved with a Skip-Gram model and
       generate feature for each incoming (rel-vals in) or out-
                                                                  with a size of 200 for vectors built upon DBpedia. For the
       going relation (rel-vals out) of an entity including the
                                                                  sake of truth, on the Movielens dataset the highest value
       value of the relation.
                                                                  of precision is achieved using vector size of 500, but the
    • Kernels that count substructures in the RDF graph           size 200 is prevalent according to the F1 measure, i.e. the
       around the instance node. These substructures are          harmonic mean of precision and recall. A substantial dif-
       explicitly generated and represented as sparse feature     ference however concerns the exploratory depth of the ran-
       vectors.                                                   dom walks, since for Movielens the results related to depth
                                                                  4 outdo those computed with depth 8, while the tendency
          – The Weisfeiler-Lehman (WL) graph kernel for RDF [7] is reversed for LibraryThing. The advantage of the Skip-
             counts full subtrees in the subgraph around the      Gram model over the CBOW is a constant both on DBpedia
             instance node. This kernel has two parameters,       and Wikidata. Moreover, the employment of the Wikidata
             the subgraph depth d and the number of itera-        RDF dataset turns out to be more effective for Library-
             tions h (which determines the depth of the sub-      Thing, where the Skip-Gram vectors with depth 4 exceeds
             trees). We use d = 1 and h = 2 and therefore we      the corresponding DBpedia vectors. Moving to the features
             will indicate this strategy as WL12.                 extracted from direct relations, the contribution of the “cat-
                                                                  egories” stands clearly out, together with relations-values
          – The Intersection Tree Path kernel for RDF [7]
                                                                  “rel-vals”, especially when just incoming relations are con-
             counts the walks in the subtree that span from the
                                                                  sidered. The extraction of features from graph structures,
             instance node. Only the walks that go through
                                                                  i.e. WC2 and WL12 approaches, seems not to provide sig-
             the instance node are considered. We will there-
                                                                  nificant advantages to the recommendation algorithm.
             fore refer to it as the root Walk Count (WC) ker-
             nel. The root WC kernel has one parameter: the
                                                                     To point out that our latent features are able to capture
             length of the paths l, for which we test 2. This
                                                                  the structure of the RDF graph, placing closely semantically
             strategy will be denoted accordingly as WC2.
                                                                  similar items, we provide some examples of the neighbouring
   The strategies for creating propositional features from Linked sets retrieved using our graph embeddings technique and
Open Data are implemented in the RapidMiner LOD exten-            used within the ItemKNN. Table 4 is related to movies and
     8
sion [31, 35].                                                    displays that neighboring items are highly relevant and close
                                                                  to the query item, i.e. the item for which neighbors are
4.4 Results                                                       searched for.
                                                                     To further analyse the semantics of the vector represen-
   The target of the experimental section of this paper is
                                                                  tations, we employ Principal Component Analysis (PCA)
two-fold. On the one hand, we want to prove that the la-
                                                                  to project the “high”-dimensional entities’ vectors in a two
tent features we extracted are able to subsume the other
                                                                  dimensional feature space, or 2D scatter plot. For each of
kind of features in terms of accuracy and aggregate diver-
                                                                  the query movies in Table 4 we visualize the vectors of the
sity. On the other hand we aim at qualifying our strategies
                                                                  5 nearest neighbors as shown in Figure 1. The figure illus-
as valuable means for the recommendation task, through a
                                                                  trates the ability of the model to automatically cluster the
first comparison with state of the art approaches. Both goals
                                                                  movies.
are pursued implementing an item-based K-nearest-neighbor
method, hereafter denoted as ItemKNN, with cosine simi-
                                                                  The impact on the aggregate diversity. As a further valida-
larity among features vectors. Formally, this method deter-
                                                                  tion of the interactiveness of our latent features for recom-
mines similarities between items through cosine similarity
                                                                  mendation task, we report the performances of the ItemKNN
between relative vectors and then selects a subset of them –
                                                                  approach in terms of aggregate diversity. The relation be-
the neighbors – for each item, that will be used to estimate
                                                                  tween accuracy and aggregate diversity has gained the at-
the rating of user u for a new item i as follows:
                           X                                      tention of researchers in the last few years and is generally
         r∗ (u, i) =                 cosineSim(j, i) · ru,j       characterized as a trade-off [1]. Quite surprisingly, however,
                     j∈ratedItems(u)                              the increase in accuracy, shown in Tables 2 and 3, seems not
8                                                                 to rely on a concentration on a subset of items, e.g. the most
  http://dws.informatik.uni-mannheim.de/en/research/
                                                                  9
rapidminer-lod-extension                                            http://ranksys.org/
       Strategy            P@10       R@10      nDCG@10           Query Movie               K Nearest Neighbours
 DB2vec CBOW 200 4        0.03893    0.02167     0.30782          Batman                    Batman Forever, Batman Re-
 DB2vec CBOW 500 4        0.03663    0.02088     0.30557                                    turns, Batman & Robin, Su-
  DB2vec SG 200 4         0.05681    0.03119     0.31828                                    perman IV: The Quest for
  DB2vec SG 500 4         0.05786     0.0304     0.31726                                    Peace, Dick Tracy
 DB2vec CBOW 200 8        0.01064    0.00548     0.29245          Bambi                     Cinderella, Dumbo, 101 Dal-
 DB2vec CBOW 500 8        0.01137    0.00567     0.29289                                    matians , Pinocchio, Lady and
  DB2vec SG 200 8         0.04424    0.02693     0.30997                                    the Tramp
  DB2vec SG 500 8         0.02191    0.01478     0.29863          Star Trek: Generations    Star Trek VI: The Undiscov-
 WD2vec CBOW 200 4        0.01217    0.00596     0.29362                                    ered Country, Star Trek: In-
 WD2vec CBOW 500 4        0.01027    0.00427     0.29211                                    surrection, Star Trek III: The
  WD2vec SG 200 4         0.02902    0.01479     0.30189                                    Search for Spock, Star Trek V:
                                                                                            The Final Frontier, Star Trek:
  WD2vec SG 500 4         0.02644    0.01246     0.29967
                                                                                            First Contact (1996)
          types           0.00313    0.00145     0.28864
       categories          0.0305    0.02093     0.30444
          rel in          0.01122    0.00589     0.29183         Table 4: Examples of K-nearest-neighbor sets on
         rel out          0.02844    0.01607     0.30274         Movielens, for the Skip-Gram model with depth of
      rel in & out        0.02852    0.01566      0.3006         4 and size vectors 200, on DBpedia.
       rel-vals in        0.03883    0.02293     0.29411
      rel-vals out        0.01279    0.00971     0.29378
   rel-vals in & out      0.01174    0.00913     0.29333
          WC2             0.00684    0.00343     0.29032
         WL12             0.00601    0.00288     0.28977

Table 2: Results of the ItemKNN approach on
Movielens dataset. P and R stand respectively for
precision and recall, SG indicates the Skip-Gram
model, and DB and WD represent DBpedia and
Wikidata respectively.

       Strategy            P@10       R@10      nDCG@10
 DB2vec CBOW 200 4        0.05127    0.11777     0.21244
 DB2vec CBOW 500 4        0.05065    0.11557     0.21039
  DB2vec SG 200 4         0.05719    0.12763      0.2205
  DB2vec SG 500 4         0.05811    0.12864     0.22116
 DB2vec CBOW 200 8        0.00836    0.02334     0.14147
 DB2vec CBOW 500 8        0.00813    0.02335     0.14257
  DB2vec SG 200 8         0.07681    0.17769     0.25234
  DB2vec SG 500 8         0.07446     0.1743     0.24809
 WD2vec CBOW 200 4        0.00537    0.01084     0.13524
 WD2vec CBOW 500 4        0.00444    0.00984     0.13428
  WD2vec SG 200 4         0.06416    0.14565     0.23309
  WD2vec SG 500 4         0.06031    0.14194     0.22752         Figure 1: Two-dimensional PCA projection of the
          types           0.01854    0.04535     0.16064         200-dimensional Skip-gram vectors of movies in Ta-
       categories         0.06662    0.15258     0.23733         ble 4.
          rel in          0.04577    0.10219     0.20196
         rel out          0.04118    0.09055     0.19449
      rel in & out        0.04531    0.10165     0.20115         analysis. For both movies and books domain, the best ap-
       rel-vals in        0.06176    0.14101     0.22574         proaches found on DBpedia for the accuracy metrics, i.e.
      rel-vals out        0.06163    0.13763     0.22826         respectively “DB2vec SG 200 4” and “DB2vec SG 200 8”,
   rel-vals in & out      0.06087    0.13662     0.22615         perform better also in terms of aggregate diversity. For the
          WC2             0.00159    0.00306     0.12858
                                                                 LibraryThing dataset the Skip-Gram model computed with
         WL12             0.00155    0.00389     0.12937
                                                                 random walks on Wikidata and size vector limited to 200 is
Table 3: Results of the ItemKNN approach on Li-                  very close to the highest scores retrieved in DBpedia, while
braryThing dataset.                                              for Movielens is the CBOW model, with depth 4, to gain
                                                                 the best performance on Wikidata. The contribution of the
                                                                 categories, despite being lower than the best approach on
popular ones, according to the results proposed in Tables 5      each dataset, is quite significant for diversity measures too.
and 6. Here we are reporting, for the sake of concisenesses,
only the best approaches for each kind of features. More         Comparison with state of the art collaborative approaches.
clearly, we are displaying the best approach for latent fea-     It is a quite common belief in the RS field that using pure
tures computed on DBpedia, the best approach for latent          content-based approaches would not be enough to provide
features computed on Wikidata and the values for the strat-      accurate suggestions and that the recommendation engines
egy involving categories, since it provides the highest scores   must ground on collaborative information too. This moti-
among features extracted through direct relations. We are        vated us to explicitly compare the best approaches built on
not reporting the values related to WL12 and WC2 algo-           graph embeddings technique with the well-known state of
rithms, since their contribution is rather low also in this      the art collaborative recommendation algorithms listed be-
              Strategy            Coverage       Gini                 Strategy           P@10      R@10       nDCG@10
           DB2vec SG 200 4         0.35198     0.07133              DB2vec SG 200 4      0.0568    0.0312       0.3183
          WD2vec CBOW 200 4        0.27749     0.04052                   MF              0.2522    0.1307       0.4427
              categories           0.29798     0.04714                 PopRank           0.1673    0.0787       0.3910
                                                                       BPRMF             0.2522    0.1307       0.4427
Table 5: Methods comparison in terms of aggregate                       SLIM             0.2632    0.1474      0.4599
diversity on the Movielens dataset. Coverage stands                    RankMF            0.1417    0.0704       0.3736
for catalog coverage and Gini for Gini coefficient.
                                                                   Table 7: Comparison with state of the art collabo-
              Strategy          Coverage       Gini                rative approaches on Movielens.
            DB2vec SG 200 8      0.76386     0.29534
            WD2vec SG 200 4      0.73037     0.28525                  Strategy           P@10      R@10       nDCG@10
              categories          0.7246     0.26409                DB2vec SG 200 8      0.0768    0.1777      0.2523
                                                                         MF              0.0173    0.0209       0.1423
Table 6: Methods comparison in terms of aggregate                      PopRank           0.0397    0.0452       0.1598
                                                                       BPRMF             0.0449    0.0751       0.1858
diversity on the LibraryThing dataset.
                                                                        SLIM             0.0543    0.0988       0.2317
                                                                       RankMF            0.0369    0.0459       0.1714

low, and implemented with the publicly available software          Table 8: Comparison with state of the art collabo-
library MyMediaLite10 .                                            rative approaches on LibraryThing.
      • Biased Matrix Factorization (MF) [17], recognized as
        the state of the art for rating prediction, is a ma-       5.   CONCLUSION
        trix factorization model that minimizes RMSE using            In this paper, we have presented an approach for learn-
        stochastic gradient descent and both user and item         ing low-dimensional real-valued representations of entities in
        bias.                                                      RDF graphs, in a completely domain independent way. We
      • PopRank is a baseline based on popularity. It recom-       have first converted the RDF graphs into a set of sequences
        mends the same recommendations to all users accord-        using graph walks, which are then used to train neural lan-
        ing to the overall items popularity. Recent studies have   guage models. In the experimental section we have shown
        point out that recommending the most popular items         that a content-based RS relying on the similarity between
        could already result in a high performance [5].            items computed according to our latent features vectors,
                                                                   outdo the same kind of system but grounding on explicit
      • Bayesian Personalized Ranking (BPRMF) combines a           features (e.g. types, categories,...) or features generated
        matrix factorization approach with a Bayesian Person-      with the use of kernels, from both perspectives of accuracy
        alized Ranking optimization criterion [34].                and aggregate diversity. Our purely content-based system
                                                                   has been further compared to state of the arts collaborative
      • SLIM [24] is a Sparse LInear Method for top-N recom-       approaches for rating prediction and item ranking, giving
        mendation that learns a sparse coefficient matrix for      outstanding results on a dataset with a realistic sparsity de-
        the items involved in the system by only relying on        gree.
        the users purchase/ratings profile and by solving a L1-       As future work, we intend to introduce the features vec-
        norm and L2-norm regularized optimization problem.         tors deriving from the graph embeddings technique within a
                                                                   hybrid recommender system in order to get a fair comparison
      • Soft Margin Ranking Matrix Factorization (RankMF)          against state of the art hybrids approaches such as SPRank
        is a matrix factorization approach for ranking, whose      [10] and BRP-SSLIM [25]. In this perspective we could take
        loss function is ordinal regression [43].                  advantage of the Factorization Machines [33], general pre-
                                                                   dictor working with any features vector, that combine Sup-
   Tables 7 and 8 provide the comparison results for Movie-        port Vector Machines and factorization models. We aim to
lens and LibraryThing respectively. Table 7 shows that             extend the evaluation to additional metrics, such as the in-
matrix factorization techniques and the SLIM algorithm ex-         dividual diversity [44, 9], and to provide a deeper insight
ceed our approach based only on content information. This          into cold-start users, i.e. users with a small interaction with
outcome was somehow expected, especially considering that,         the system for whom the information inference is difficult to
in our experimental setting, Movielens dataset retains only        draw and that generally benefit most of content “infusion”.
users with at least fifty ratings. The community-based in-
formation is unquestionably predominant for this dataset,          6.   REFERENCES
whose sparsity would probably be unlikely for most real-            [1] Gediminas Adomavicius and YoungOk Kwon. Improving
world scenarios. The behaviour however is completely over-              aggregate recommendation diversity using ranking-based
                                                                        techniques. IEEE Trans. on Knowl. and Data Eng.,
turned on the LibraryThing dataset, whose results are col-              24(5):896–911, May 2012.
lected in Table 8. In this case, the mere use of our features       [2] Alejandro Bellogı́n, Iván Cantador, and Pablo Castells. A
vectors (i.e. the “DB2vec SG 200 8” strategy) is able to                comparative study of heterogeneous item recommendations in
outperform the competitor algorithms, which are generally               social systems. Inf. Sci., 221:142–169, February 2013.
                                                                    [3] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked
regarded as the most efficient collaborative algorithms for             Data – The Story So Far. International journal on semantic
both rating and ranking prediction.                                     web and information systems, 5(3):1–22, 2009.
                                                                    [4] Robin Burke. Hybrid recommender systems: Survey and
                                                                        experiments. User Modeling and User-Adapted Interaction,
10
     http://www.mymedialite.net                                         12(4):331–370, November 2002.
 [5] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin.                       Heidelberg, Berlin, Heidelberg, 2013.
     Performance of recommender algorithms on top-n                      [24] Xia Ning and George Karypis. SLIM: sparse linear methods for
     recommendation tasks. In Proceedings of the Fourth ACM                   top-n recommender systems. In 11th IEEE International
     Conference on Recommender Systems, RecSys ’10, pages                     Conference on Data Mining, ICDM 2011, Vancouver, BC,
     39–46, New York, NY, USA, 2010. ACM.                                     Canada, December 11-14, 2011, pages 497–506, 2011.
 [6] Marco de Gemmis, Pasquale Lops, Cataldo Musto, Narducci             [25] Xia Ning and George Karypis. Sparse linear methods with side
     Fedelucio, and Giovanni Semeraro. Semantics-aware                        information for top-n recommendations. In Proceedings of the
     content-based recommender systems. In Francesco Ricci, Lior              Sixth ACM Conference on Recommender Systems, RecSys
     Rokach, and Bracha Shapira, editors, Recommender Systems                 ’12, pages 155–162, New York, NY, USA, 2012. ACM.
     Handbook, pages 119–159. Springer, 2nd edition, 2015.               [26] Tommaso Di Noia, Vito Claudio Ostuni, Jessica Rosati, Paolo
 [7] Gerben Klaas Dirk de Vries and Steven de Rooij. Substructure             Tomeo, Eugenio Di Sciascio, Roberto Mirizzi, and Claudio
     counting graph kernels for machine learning from rdf data. Web           Bartolini. Building a relatedness graph from linked open data:
     Semantics: Science, Services and Agents on the World Wide                A case study in the it domain. Expert Systems with
     Web, 35:71–84, 2015.                                                     Applications, 44:354 – 366, 2016.
 [8] Scott Deerwester, Susan T. Dumais, George W. Furnas,                [27] V. C. Ostuni, T. Di Noia, E. Di Sciascio, and R. Mirizzi. Top-n
     Thomas K. Landauer, and Richard Harshman. Indexing by                    recommendations from implicit feedback leveraging linked open
     latent semantic analysis. JOURNAL OF THE AMERICAN                        data. In ACM RecSys ’13, pages 85–92, 2013.
     SOCIETY FOR INFORMATION SCIENCE, 41(6):391–407,                     [28] Makbule Gulcin Ozsoy. From word embeddings to item
     1990.                                                                    recommendation. arXiv preprint arXiv:1601.01356, 2016.
 [9] T. Di Noia, V. C. Ostuni, J. Rosati, P. Tomeo, and                  [29] Heiko Paulheim. Knowledge graph refinement: A survey of
     E. Di Sciascio. An analysis of users’ propensity toward diversity        approaches and evaluation methods. Semantic Web,
     in recommendations. In ACM RecSys ’14, RecSys ’14, pages                 (Preprint):1–20, 2016.
     285–288. ACM, 2014.                                                 [30] Heiko Paulheim and Johannes Fümkranz. Unsupervised
[10] T. Di Noia, V. C. Ostuni, P. Tomeo, and E. Di Sciascio.                  generation of data mining features from linked open data. In
     Sprank: Semantic path-based ranking for top-n                            Proceedings of the 2nd international conference on web
     recommendations using linked open data. ACM Transactions                 intelligence, mining and semantics, page 31. ACM, 2012.
     on Intelligent Systems and Technology (TIST), 2016.                 [31] Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, and
[11] Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, and               Christian Bizer. Data mining with background knowledge from
     Davide Romito. Exploiting the web of data in model-based                 the web. RapidMiner World, 2014.
     recommender systems. In Proceedings of the Sixth ACM                [32] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk:
     Conference on Recommender Systems, RecSys ’12, pages                     Online learning of social representations. In Proceedings of the
     253–256, New York, NY, USA, 2012. ACM.                                   20th ACM SIGKDD international conference on Knowledge
[12] Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni,                   discovery and data mining, pages 701–710. ACM, 2014.
     Davide Romito, and Markus Zanker. Linked open data to               [33] Steffen Rendle. Factorization machines with libfm. ACM
     support content-based recommender systems. In Proceedings of             Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
     the 8th International Conference on Semantic Systems,
                                                                         [34] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and
     I-SEMANTICS ’12, pages 1–8, New York, NY, USA, 2012.
                                                                              Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from
     ACM.
                                                                              implicit feedback. In Proceedings of the Twenty-Fifth
[13] Mouzhi Ge, Carla Delgado-battenfeld, and Dietmar Jannach.                Conference on Uncertainty in Artificial Intelligence, UAI ’09,
     Beyond accuracy: evaluating recommender systems by coverage              pages 452–461, Arlington, Virginia, United States, 2009.
     and serendipity. In In RecSys ’10, page 257, 2010.
                                                                         [35] Petar Ristoski, Christian Bizer, and Heiko Paulheim. Mining
[14] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric,                   the web of linked data with rapidminer. Web Semantics:
     Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug               Science, Services and Agents on the World Wide Web,
     Sharp. E-commerce in your inbox: Product recommendations at              35:142–151, 2015.
     scale. In Proceedings of the 21th ACM SIGKDD International
                                                                         [36] Petar Ristoski, Eneldo Loza Mencı́a, and Heiko Paulheim. A
     Conference on Knowledge Discovery and Data Mining, KDD
                                                                              hybrid multi-strategy recommender system using linked open
     ’15, pages 1809–1818, New York, NY, USA, 2015. ACM.
                                                                              data. In Semantic Web Evaluation Challenge, pages 150–156.
[15] Z. S. Harris. Mathematical Structures of Language. Wiley,                Springer, 2014.
     New York, NY, USA, 1968.
                                                                         [37] Petar Ristoski and Heiko Paulheim. Rdf2vec: Rdf graph
[16] Oktie Hassanzadeh and Mariano Consens. M.: Linked movie                  embeddings for data mining. In International Semantic Web
     data base. In In: Workshop on Linked Data on the Web, 2009.              Conference (To Appear). Springer, 2016.
[17] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix               [38] Magnus Sahlgren. An introduction to random indexing. In In
     factorization techniques for recommender systems. Computer,              Methods and Applications of Semantic Indexing Workshop at
     42(8):30–37, August 2009.                                                the 7th International Conference on Terminology and
[18] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch,                    Knowledge Engineering, TKE 2005, 2005.
     Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann,          [39] Max Schmachtenberg, Christian Bizer, and Heiko Paulheim.
     Mohamed Morsey, Patrick van Kleef, SÃűren Auer, and                    Adoption of the linked data best practices in different topical
     Christian Bizer. DBpedia – A Large-scale, Multilingual                   domains. In International Semantic Web Conference, pages
     Knowledge Base Extracted from Wikipedia. Semantic Web                    245–260. Springer, 2014.
     Journal, 2013.
                                                                         [40] Giovanni Semeraro, Pasquale Lops, Pierpaolo Basile, and
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.                 Marco de Gemmis. Knowledge infusion into content-based
     Efficient estimation of word representations in vector space.            recommender systems. In Proceedings of the Third ACM
     arXiv preprint arXiv:1301.3781, 2013.                                    Conference on Recommender Systems, RecSys ’09, pages
[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and             301–304, New York, NY, USA, 2009. ACM.
     Jeff Dean. Distributed representations of words and phrases         [41] Harald Steck. Evaluation of recommendations:
     and their compositionality. In Advances in neural information            Rating-prediction and ranking. In Proceedings of the 7th ACM
     processing systems, pages 3111–3119, 2013.                               Conference on Recommender Systems, RecSys ’13, pages
[21] Cataldo Musto, Giovanni Semeraro, Marco De Gemmis, and                   213–220, New York, NY, USA, 2013. ACM.
     Pasquale Lops. Word embedding techniques for content-based          [42] Denny Vrandečić and Markus Krötzsch. Wikidata: a free
     recommender systems: an empirical evaluation. In RecSys                  collaborative knowledgebase. Communications of the ACM,
     Posters, ser. CEUR Workshop Proceedings, P. Castells, Ed,                57(10):78–85, 2014.
     volume 1441.
                                                                         [43] Markus Weimer, Alexandros Karatzoglou, and Alex Smola.
[22] Cataldo Musto, Giovanni Semeraro, Pasquale Lops, and Marco               Improving maximum margin matrix factorization. Mach.
     de Gemmis. Random Indexing and Negative User Preferences                 Learn., 72(3):263–276, September 2008.
     for Enhancing Content-Based Recommender Systems, pages
                                                                         [44] Mi Zhang and Neil Hurley. Avoiding monotony: Improving the
     270–281. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
                                                                              diversity of recommendation lists. In Proceedings of the 2008
[23] Cataldo Musto, Giovanni Semeraro, Pasquale Lops, and Marco               ACM Conference on Recommender Systems, RecSys ’08,
     de Gemmis. Contextual eVSM: A Content-Based                              pages 123–130, New York, NY, USA, 2008. ACM.
     Context-Aware Recommendation Framework Based on
     Distributional Semantics, pages 125–136. Springer Berlin

</pre>