<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automatic Selection of Linked Open Data features in Graph-based Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cataldo Musto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco de Gemmis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lops</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Rutigliano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Bari "Aldo Moro" -</institution>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>20</volume>
      <issue>2015</issue>
      <abstract>
        <p>In this paper we compare several techniques to automatically feed a graph-based recommender system with features extracted from the Linked Open Data (LOD) cloud. Specifically, we investigated whether the integration of LOD-based features can improve the e↵ectiveness of a graph-based recommender system and to what extent the choice of the features selection technique can influence the behavior of the algorithm by endogenously inducing a higher accuracy or a higher diversity. The experimental evaluation showed a clear correlation between the choice of the feature selection technique and the ability of the algorithm to maximize a specific evaluation metric. Moreover, our algorithm fed with LODbased features was able to overcome several state-of-the-art baselines: this confirmed the e↵ectiveness of our approach and suggested to further investigate this research line.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>BACKGROUND</title>
      <p>
        The Linked Open Data (LOD) cloud is a huge set of
interconnected RDF statements covering many topical domains,
ranging from government and geographical data to
structured information about media (movies, books, etc.) and
life sciences. The typical entry point to all this plethora of
data is DBpedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the RDF mapping of Wikipedia, which
is commonly considered as the nucleus of the emerging Web
of Data. Thanks to the wide-spread availability of this free
machine-readable knowledge, a big e↵ort is now spent to
investigate whether and how the data gathered from the LOD
cloud can be exploited to improve intelligent and adaptive
applications, such a Recommender System (RS).
      </p>
      <p>
        Recent attempts towards the exploitation of Linked Open
Data to build RSs are due to Passant [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], who proposed
a music recommender system based on semantic similarity
calculations based on DBpedia properties. The use of
DBpedia for similarity calculation is also the core of the work
presented by Musto et al. in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: in this paper user
preferences in music extracted from Facebook are used as input
to find other relevant artists and to build a personalized
music playlist. Recently, the use of LOD-based data sources
has been the core of the ESWC 2014 Recommender Systems
Challenge1: in that setting, the best-performing approaches
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] were based on ensembles of several widespread algorithms
running on diverse sets of features gathered from the LOD
cloud. However, none of the above described work tackles
nor the issue of automatically selecting the best subset of
LOD-based features, neither analyzes the impact of such
selection techniques on die↵rent metrics as the diversity of the
recommendations.
      </p>
      <p>
        To this end, in this paper we propose a methodology to
automatically feed a graph-based recommendation algorithm
with features extracted from the LOD cloud. We focused
our attention on graph-based approaches since they use a
uniform formalism to represent both collaborative features
(connections between users and items, expressed through
ratings) and LOD-based ones (connections between di↵erent
items, expressed through RDF statements). As graph-based
algorithm we adopted PageRank with Priors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Moreover,
in this work we compared several techniques to
automatically select the best subset of LOD-based features, with the
aim to investigate to what extent the choice of the feature
selection technique can influence the behavior of the
algorithm and can endogenously lead to a higher accuracy or a
higher diversity of the recommendations.
      </p>
      <p>The rest of the paper is organized as follows: the
description of our recommendation methodology is the core of
Section 2, while the details of the experimental evaluation we
carried out along with the discussion of the results are
provided in Section 3. Finally, Section 4 sketches conclusions
and future work.
2.</p>
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY</title>
      <p>The main idea behind our graph-based model is to
represent users and items as nodes in a graph. Formally, given
a set of users U = {u1 . . . un} and a set of items I =
{i1 . . . um}, a graph G = hV, Ei is instantiated. Given that
for each user and for each item a node is created, |V | =
|U | + |I|. Next, an edge connecting a user ui with an item ij
is created for each positive feedback expressed by that user,
so E = {(ui, ij )|likes(ui, ij ) = true}.</p>
      <p>
        Given this basic formulation, built on the ground of
simple collaborative2 data points, each item i 2 I can be
provided with a relevance score. To calculate the relevance of
each item, we used a well-known variant of the PageRank
called PageRank with Priors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Di↵erently from
PageRank, which assigns an evenly distributed prior probability
to each node ( 1 , where N is the number of nodes),
PageR
      </p>
      <p>
        N
ank with Priors adopts a non-uniform personalization vector
assigning die↵rent weights to di↵erent nodes to get a bias
towards some nodes (specifically, the preferences of a specific
user). In our algorithm the probability was distributed by
defining a simple heuristics, set after a rough tuning: 80%
of the total weight is evenly distributed among items liked
by the users (0% assigned to items disliked by the users),
while 20% is evenly distributed among the remaining nodes.
Damping factor was set equal to 0.85, as in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Given this setting, the PageRank with Priors is executed
for each user (this is mandatory, since the prior probabilities
change according to user’s feedbacks), and nodes are ranked
according to their PageRank score which is in turn
calculated on the ground of the connectivity in the graph. The
output of the PageRank is a list of nodes ranked according to
PageRank scores, labeled as L. Given L, recommendations
are built by extracting from L only those nodes i1 . . . in 2 I.
2.1</p>
    </sec>
    <sec id="sec-3">
      <title>Introducing LOD-based features</title>
      <p>As stated above, our basic formulation does not take into
account any data point di↵erent from users’ ratings. The
insight behind this work is to enrich the above described graph
by introducing some extra nodes and some extra edges,
defined on the ground of the information available in the LOD
cloud. Formally, we want to define an extended graph GLOD =
hVLOD ALL, ELOD ALLi, where VLOD ALL = V [ VLOD
and ELOD ALL = E [ ELOD. VLOD and ELOD represent
the extra nodes and the extra edges instantiated by
analyzing the data gathered from the LOD cloud, respectively.</p>
      <p>As an example, if we consider the movie The Matrix, the
property http://dbpedia.org/property/director
encoding the information about the director of the movie is
available in the LOD cloud. Consequently, an extra node The
Wachowski Brothers is added in VLOD and an extra edge,
labeled with the name of the property, is instantiated in ELOD
to connect the movie with its director. Similarly, if we
consider the property http://dbpedia.org/property/starring,
new nodes and new edges are defined, in order to model
the relationship between The Matrix and the main actors,
as Keanu Reeves, for example. In turn, given that Keanu
Reeves acted in several movies, many new edges are added
in the graph and many new paths now connect di↵erent
movies: these paths would not have been available if the
only collaborative data points were instantiated.</p>
      <p>It immediately emerges that, due to this novel enriched
representation, the structure of the graph tremendously changes
since many new nodes and many new edges are added to the
model: the first goal of our experimental session will be to
investigate whether graph-based RSs can benefit of the
introduction of novel LOD-based features.
2.2</p>
    </sec>
    <sec id="sec-4">
      <title>Selecting LOD-based features</title>
      <p>Thanks to the data points available in the LOD cloud,
many new information can be encoded in our graph.
How2We just modeled user-items couples, as in collaborative
filtering algorithms
ever, as the number of extra nodes and extra edges grows,
the computational load of the PageRank with Priors grows
as well, so it necessary to identify the subset of the most
useful properties gathered from the LOD cloud and to
investigate to what extent (if any) each of them improves the
accuracy of our recommendation strategy.</p>
      <p>A very naive approach may be to manually select the most
relevant LOD-based features, according to simple heuristics
or to domain knowledge (e.g. properties as director, starring,
composer may be considered as relevant for the Movie
domain, whereas properties as runtime or country may be not).
This basic approach has several drawbacks, since it requires
a manual e↵ort, but it is also strictly domain-dependent.</p>
      <p>To avoid this, we employed features selection techniques
to automatically select the most promising LOD-based
features. Formally, our idea is to take as input ELOD, the
overall set of LOD-based properties, and to produce as output
ELOD F ST ✓ ELOD, the set of properties a specific feature
selection technique T returned as relevant. Clearly, the
exploitation of a feature selection technique T also produces
a set VLOD F ST ✓ VLOD, containing all the LOD-based
nodes connected to the properties in ELOD F ST .</p>
      <p>
        In this setting, given a FS technique T , PageRank will be
executed against the graph GLOD T = hVLOD T , ELOD T i,
where VLOD T = V [ VLOD F ST and ELOD T = E [
ELOD F ST . In the experimental session the e↵ectiveness of
seven di↵erent techniques for automatic selection of
LODbased features: PageRank, Principal Component Analysis
(PCA), Support Vector Machines (SVM), Chi-Squared Test
(CHI), Information Gain (IG), Information Gain Ratio (GR)
and Mininum Redundancy Maximum Relevance (MRMR)
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Clearly, a complete description of these techniques is
out of the scope of this paper. We will just limit to evaluate
their impact on the overall accuracy and the overall diversity
obtained by our algorithm.
3.
      </p>
    </sec>
    <sec id="sec-5">
      <title>EXPERIMENTAL EVALUATION</title>
      <p>Our experiments were designed on the ground of four
different research questions:
1. Do graph-based recommender systems benefit of the
introduction of LOD-based features?
2. Do graph-based recommender systems exploiting LOD
features benefit of the adoption of FS techniques?
3. Is there any correlation between the choice of the FS
technique and the behavior of the algorithm?
4. How does our methodology perform with respect to
state-of-the-art techniques?</p>
      <p>Experimental design: experiments were performed by
exploiting MovieLens3 dataset, consisting of 100,000 ratings
provided by 943 users on 1,682 movies. The dataset is
positively balanced (55.17% of positive ratings) and shows an
high sparsity (93.69%). Each user voted 84.83 items on
average and each item was voted by 48.48 users, on average.</p>
      <p>Experiments were performed by carrying out a 5-folds
cross validation. Given that MovieLens preferences are
expressed on a 5-point discrete scale, we decided to consider as
positive ratings only those equal to 4 and 5. As
recommendation algorithms we used the previously described PageRank
3http://grouplens.org/datasets/movielens/
metrics. Another interesting outcome which follows the use
of FS techniques is the saving of computational resources to
run PageRank with Priors on graph GLOD P CA. As shown
in Table 1, the adoption of FS caused a huge decrease of the
run time of the algorithms equal to 33.9% for MovieLens
(from 880 to 581 minutes). This is due to the smaller
number of information which are modeled in the graphs (-8.6%
nodes and -4.8% edges).</p>
      <p>MovieLens</p>
      <sec id="sec-5-1">
        <title>Nodes</title>
        <p>Edges</p>
        <p>Run (min.)</p>
      </sec>
      <sec id="sec-5-2">
        <title>K (LOD prop.)</title>
        <p>with Priors, set as explained in Section 2. We compared the
ee↵ctiveness of our graph-based recommendation
methodology by considering three di↵erent graph topologies: G,
modeling the basics collaborative information about user ratings;
GLOD, which enrichs G by introducing LOD-based features
gathered from DBpedia, and GLOD T which lighten the load
of PageRank with Priors by relying on the features selected
by a FS technique T . In order to enrich the graph G, each
item in the dataset was mapped to a DBpedia entry. In our
experiments 1,600 MovieLens entries (95.06% of the movies)
were successfully mapped to a DBpedia node. The items for
which a DBpedia entry was not found were only represented
by using collaborative data points. Overall, MovieLens
entries were described through 60 di↵erent DBpedia
properties. As feature selection techniques all the approaches
previously mentioned were employed, while for the parameter
K (the number of LOD-based features) three di↵erent
values were compared: 10, 30 and 50. The performance of
each graph topology was evaluated in terms of F1-measure.
Moreover, we also calculated the overall running time4 of
each experiment. To answer the third research questions we
also evaluated the diversity of the recommendations,
calculated by exploiting the classical Intra-List Diversity (ILD).
Statistical significance was assessed by exploiting Wilcoxon
and Friedman tests.</p>
        <p>Discussion of the Results: in the first experiment we
evaluated the introduction of LOD-based features in
graphbased recommender systems. Results are depicted in the
first two columns of Table 1. As regards MovieLens, a
statistically significant improvement (p &lt;&lt; 0.0001, assessed
through a Wilcoxon test) was obtained for all the metrics.
As expected, the expansion of the graph caused an
exponential growth of the run time of the algorithm. This is due
to the fact that the expansion stage introduced many new
nodes and many new edges in the graph (see Table 1). The
growth is particularly significant since 50,000 new nodes and
78,000 new edges were added to the graph.</p>
        <p>Next, we evaluated the impact of all the previously
presented feature selection techniques in such recommendation
setting. By analyzing the results provided in Table 2, it
emerged that our graph-based recommendation strategy does
not often benefit of the application of FS techniques. Indeed,
when a very small number of properties is chosen (K=10 ),
none of the configurations is able to overcome the baseline.
By slightly increasing the value of parameter K (K=30 ),
only three out of seven techniques (PageRank, PCA and
mRMR) improve the F1-measure. Next, when more data
points are introduced (with K=50 ) better results are
obtained and the F1-measure of the baseline is always
overcame. Given that the overall number of LOD-based
properties was equal to 60, it is possible to state that most of the
properties encoded in the extended graph GLOD can be
considered as relevant. Clearly, this is a very domain-specific
outcome, which needs to be confirmed by more thorough
analysis on die↵rent datasets. However, it is possible to
state that the adoption of FS techniques requires a complete
analysis of the usefulness of each of the properties encoded
in the LOD. Overall, the best performing configuration was
PCA, which was the only technique always overcoming the
baseline with K = 50. A Friedman test also showed that
PCA statistically overcomes the other techniques for all the
4Experiments were run on an Intel-i7-3770 CPU3.40 gHZ,
with 32GB RAM.</p>
        <p>In Experiment 3 we shifted the attention from F1-measure
to die↵rent evaluation metrics, and we investigate whether
the adoption of a specific FS technique can endogenously
induce a higher diversity at the expense of a little F1.
Results of the experiments are provided in Figure 1a. Due to
space reasons, only the results for F1@10 are provided. In
both charts we used four di↵erent symbols to identify the
di↵erent behaviors of each technique. It emerged that CHI
was the less useful technique, since it did not provide any
significant benefit to neither F1 nor diversity. Next,
PageRank provided a (small) improvement on F1 and it did not
significantly change the diversity of the recommendations.
It is noteworth that the larger increase in accuracy of PCA
is balanced by a decrease in terms of diversity. On the other
side, Gain Ratio obtained the overall best diversity of the
recommendation but it decreases the F1 of the algorithm.
To sum up, these results show that the choice of a particular
FS technique has a significant impact on the overall behavior
of the recommendation algorithm. As shown in the
experiment, some techniques have the ability of inducing a higher
diversity (or F1) at the expense of a little of F1 (or diversity,
respectively), wheres other can provide a good compromise
between both metrics. Clearly, more investigation is needed
to deeply analyze the behavior of each technique, but these
results already give some general guidelines which can drive
the choice of the FS technique which best fits the
requirements of a specific recommendation scenario.</p>
        <p>Finally, we compared the e↵ectiveness of our
methodology to the current state of the art. As baselines
User-toUser CF (U2U-KNN), Item-to-Item CF (I2I-KNN), a
simple popularity-based approach, a random baseline and the
Bayesian Personalized Ranking Matrix Factorization (BPRMF)
were used. We adopted the implementations available in
MyMediaLite Recommender System library5. As regards
5http://www.mymedialite.net/</p>
        <p>U2U and I2I, neighborhood size was set to 80, while BPRMF
was run by setting the factor parameter equal to 100.
Results are depicted in Figure 1b. As shown in the plots, our
graph-based RS outperforms all the baselines for all the
metrics taken into account. It is worth to note that our
approach obtained a higher F1 also when compared to a
wellperfoming matrix factorization algorithm as BPRMF, thus
this definitely confirmed the e↵ectiveness of our approach.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>In this work we proposed a graph-based recommendation
methodology based on PageRank with Priors, and we
evaluated di↵erent techniques to automatically feed such a
representation with features extracted from the LOD cloud.
Results showed that graph-based RSs can benefit of the
infusion of novel knowledge coming from the LOD cloud and
that a clear correlation between the adoption of a specific FS
technique with the overall results of the recommender exitss,
since some techniques endogenously showed the ability of
increasing also the diversity of the recommendations generated
by the algorithm. We also showed that our methodology was
able to overcome several state-of-the-art baselines on both
datasets. As future work, we will validate the approach by
evaluating it on di↵erent dataset, and we will investigate
the impact of LOD-based features with di↵erent learning
approaches as Random Forest or SVM.
(a) Trade-o↵ between F1 and Diversity</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <surname>Z. Ives.</surname>
          </string-name>
          <article-title>DBpedia: A nucleus for a Web of Open Data</article-title>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          , M. de Gemmis,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Narducci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          .
          <article-title>Aggregation strategies for linked open data-enabled recommender systems</article-title>
          .
          <source>In European Semantic Web Conference</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Haveliwala</surname>
          </string-name>
          .
          <article-title>Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search</article-title>
          .
          <source>IEEE Trans. Knowl</source>
          . Data Eng.,
          <volume>15</volume>
          (
          <issue>4</issue>
          ):
          <fpage>784</fpage>
          -
          <lpage>796</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          , G. Semeraro,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          , M. de Gemmis, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Narducci</surname>
          </string-name>
          .
          <article-title>Leveraging social media sources to generate personalized music playlists</article-title>
          .
          <source>In EC-Web</source>
          <year>2012</year>
          , volume
          <volume>123</volume>
          <source>of LNBIP</source>
          , pages
          <fpage>112</fpage>
          -
          <lpage>123</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Page</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Winograd</surname>
          </string-name>
          .
          <article-title>The pageRank citation ranking: bringing order to the web</article-title>
          .
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Passant</surname>
          </string-name>
          . dbrec
          <article-title>- Music Recommendations Using DBpedia</article-title>
          . In International Semantic Web Conference,
          <source>Revised Papers</source>
          , volume
          <volume>6497</volume>
          <source>of LNCS</source>
          , pages
          <fpage>209</fpage>
          -
          <lpage>224</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Long</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          .
          <article-title>Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          , IEEE Transactions on,
          <volume>27</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1226</fpage>
          -
          <lpage>1238</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>