<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Schema-aware feature selection in Linked Data-based recommender systems (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Corrado Magarelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Azzurra Ragone</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Tomeo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Di Noia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Palmonari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Maurino</string-name>
          <email>andrea.maurino@unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Di Sciascio</string-name>
          <email>eugenio.disciascio@poliba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Polytechnic University of Bari</institution>
          ,
          <addr-line>Via Orabona, 4, 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Milan Bicocca</institution>
          ,
          <addr-line>P.zza Dell'Ateneo Nuovo, 1, 20126 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semantics-aware recommendation engines have emerged as a new family of systems able to exploit the semantics encoded in unstructured and structured information sources to provide better results in terms of accuracy, diversity and novelty, as well as to foster the provisioning of new services such as explanation. Linked Data (LD) has played an important role in the rise of these new recommender systems. However, as Linked Data is often very rich and contains much information that may prove irrelevant or noisy, an initial feature selection step may be required in order to select the most meaningful portion of the original dataset. Many feature selection approaches that exploit different statistical dimensions of the original data have been proposed in the literature. In this paper we investigate the role of the semantics encoded in an ontological hierarchy, captured via schema summarization, when exploited to select the most relevant properties for a recommendation task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years we have witnessed a flowering of semantics-aware solutions for
Recommender Systems (RSs) exploiting information held in knowledge graphs,
such as the ones available in the Linked Data (LD) Cloud. Several approaches using
LD to build RSs have been proposed in the literature. However, almost none of them
tackles the issue of automatically selecting the best subset of LD-based features.
Usually, feature selection is done manually by choosing the
properties that look most "suitable" for the scenario at hand. For example, in a
scenario related to movies, properties such as dbo:starring or dbo:director look
more relevant than dbo:releaseDate or dbo:distributor. Likewise, in the
music domain, properties such as dbo:genre and dbo:writer look more important
than dbo:producer or dbo:recordedIn. However, without an automatic feature
selection process, human intervention is required every time a new domain
is chosen, whereas it would be preferable to have a general way to select properties
regardless of the domain. Machine learning tasks require a
selection of features, and this may not be straightforward when attributes are
embedded in a knowledge graph. In many graph-based recommendation systems
the knowledge exploration starts from the data and proceeds by following the
relations between entities, without taking into account the knowledge encoded in the
ontology and, in particular, in its class hierarchy. In this paper we investigate how
ontological schema summarization can be used as a feature selection technique for
LD-based recommender systems when features are represented by RDF properties,
and we compare the results with other "classical" techniques for feature selection.
      </p>
      <p>An extended version of this paper has been published in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].</p>
    </sec>
    <sec id="sec-2">
      <title>Feature selection and recommender systems</title>
      <p>
        When dealing with recommender systems, a relevant task is to determine the
impact of a particular feature selection technique on the behavior of the underlying
algorithms. Indeed, some techniques improve the accuracy of the
recommendations, some improve diversity, while others provide a good trade-off
between diversity and accuracy. Among the different feature selection
techniques available in the literature, in our experimental setting we initially selected
Information Gain, Information Gain Ratio, Chi-squared test and Principal
Component Analysis, as their computation can be adapted to categorical features such as
the LD ones. The features selected by each technique were then used
as input for two recommendation algorithms based on graph kernels [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]:
entity-based and path-based. Experimental results showed Information Gain to be the
best performing technique<sup>1</sup>. Information Gain (IG) is defined as the expected
reduction in entropy occurring when a feature is present versus when it is absent.
For a feature f<sub>i</sub>, IG is computed as [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
      </p>
      <p>
        <disp-formula>
          <tex-math>IG(f_i) = E(I) - \sum_{v \in dom(f_i)} \frac{|I_v|}{|I|} E(I_v)</tex-math>
        </disp-formula>
      </p>
      <p>where E(I) is the entropy of the data, |I| is the total number of items,
I<sub>v</sub> is the set of items in which the feature f<sub>i</sub> (e.g., starring for movies)
has value v (e.g., Al Pacino in the movie domain), so that |I<sub>v</sub>| is the number of such items,
and E(I<sub>v</sub>) is the entropy computed on the data where
the feature f<sub>i</sub> takes value v. The lower the entropy E(I<sub>v</sub>),
the higher the IG of f<sub>i</sub>. Features are ranked according to their IG and
the top-k ones are returned.</p>
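      <p>As a minimal sketch of the IG ranking described above, the following Python code computes IG for categorical features and returns the top-k ones. It rests on assumptions that are not stated in this paper: each item carries a class label (for instance, liked or not liked) over which the entropy is computed, each feature has a single value per item, and entropy is measured in bits. The function names and the toy data are purely illustrative.</p>
      <preformat>
```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """IG(f) = E(I) - sum over v of |I_v| / |I| * E(I_v).

    values[j] is the (single) value of the feature for item j,
    labels[j] is the class label of item j (an assumption of this sketch).
    """
    groups = defaultdict(list)  # v -> labels of the items in I_v
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(features, labels, k):
    """Rank features by IG (highest first) and return the names of the top k."""
    scores = {name: information_gain(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: four movies, two candidate properties.
labels = ["like", "like", "dislike", "dislike"]
features = {
    "dbo:director": ["A", "A", "B", "B"],      # separates the labels perfectly
    "dbo:distributor": ["X", "Y", "X", "Y"],   # carries no information
}
best = top_k_features(features, labels, k=1)
```
      </preformat>
      <p>In the toy data the director values separate the two labels perfectly (IG of 1 bit), while the distributor values carry no information (IG of 0), so dbo:director is ranked first. Real LD properties are usually multi-valued, so an actual implementation would need to decide how to handle items with several values for the same property.</p>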
      <p>
        Schema summarization for feature selection. Linked Data summarization is the
process of extracting a summary of an input linked data set, such that this
summary is smaller (in size) than the input data but retains information
useful for certain tasks. Relevance-oriented summaries capture subsets of the input
data sets and/or ontologies. These subsets are estimated to be more relevant
for the users according to multidimensional relevance criteria [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Vocabulary-oriented summaries describe the usage of vocabularies, e.g., ontologies, used in
a dataset. These summaries are usually defined so as to be complete, i.e., to
provide information about every element of the vocabulary/ontology used in
the data set [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Vocabulary-oriented summaries that provide complete
descriptions of vocabulary usage may support feature selection by providing relevant
information about every possible feature, i.e., property, in the data set.
      </p>
      <p><sup>1</sup> The interested reader may refer to https://github.com/sisinflab/SAC2017/FeatureSelection for
results obtained with other feature selection techniques.</p>
      <p>
        In this paper we use summaries produced by a vocabulary-oriented
summarization framework named ABSTAT<sup>2</sup>. It takes as input a linked data set and, when
available, one or more ontologies used in this data set, and
returns a summary. The summary consists of a set of patterns of the form
⟨C, P, D⟩, with C and D being types, i.e., concepts or datatypes, and P being
an RDF property. We refer to C and D as source and target types, respectively.
Each pattern ⟨C, P, D⟩ states that there exists some instance of type C linked
to some instance of type D through the property P. For example, the pattern
⟨dbo:Film, dbo:starring, dbo:Actor⟩ states that there are instances of dbo:Film
linked to instances of type dbo:Actor through the property dbo:starring in
the data set. The summary is complete for relational assertions in an RDF data
set, i.e., assertions about individuals: every relational assertion ⟨x, p, y⟩ that
exists in the data set is represented by at least one pattern. The generation of these patterns is based
on explicit typing assertions, e.g., ⟨dbr:Tom_Cruise, rdf:type, dbo:Actor⟩, or on
implicit typing assertions (for literals), e.g., "1962-01-01"^^xsd:date, extracted from
the dataset. Differently from other approaches that also extract
vocabulary-based patterns from linked data sets [
        <xref ref-type="bibr" rid="ref3 ref4">4, 3</xref>
        ], ABSTAT applies a pattern
minimalization technique leveraging the relations between types defined in the
ontologies (when the ontologies are used in the summarization process). Additional
information provided in summaries, and of major importance for feature
selection, is pattern frequency, which counts the occurrences of patterns in the data
set. For example, ⟨dbo:Film, dbo:starring, dbo:Actor⟩ [10662] states that 10662
instances of dbo:Film are linked to instances of type dbo:Actor through the
property dbo:starring in the data set<sup>3</sup>.
      </p>
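      <p>One possible way to turn pattern frequencies into a feature ranking is sketched below in Python. This is an illustration under stated assumptions rather than a description of the procedure used in our experiments: patterns are filtered by source type, the frequencies of patterns sharing the same property are summed, and the k most frequent properties are kept. The Pattern type and the rank_properties function are hypothetical names, and every frequency except the one for dbo:starring quoted above is invented for the example.</p>
      <preformat>
```python
from collections import defaultdict
from typing import NamedTuple


class Pattern(NamedTuple):
    """A summary pattern (C, P, D) together with its frequency."""
    source: str      # source type C
    prop: str        # RDF property P
    target: str      # target type D
    frequency: int   # number of occurrences of the pattern in the data set


def rank_properties(patterns, source_type, k):
    """Return the k properties with the highest total frequency
    among the patterns whose source type is source_type."""
    totals = defaultdict(int)
    for p in patterns:
        if p.source == source_type:
            totals[p.prop] += p.frequency
    # Sort by decreasing frequency; ties are broken alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [prop for prop, _ in ranked[:k]]


patterns = [
    Pattern("dbo:Film", "dbo:starring", "dbo:Actor", 10662),   # from the text
    Pattern("dbo:Film", "dbo:director", "dbo:Person", 9000),   # invented
    Pattern("dbo:Film", "dbo:releaseDate", "xsd:date", 1200),  # invented
    Pattern("dbo:Album", "dbo:genre", "dbo:Genre", 50000),     # invented
]
selected = rank_properties(patterns, "dbo:Film", k=2)
```
      </preformat>
      <p>With these illustrative numbers the selected properties for dbo:Film are dbo:starring and dbo:director; the dbo:Album pattern is ignored because its source type does not match.</p>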
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        To evaluate the quality of a recommendation algorithm under a particular
feature selection technique, we use four metrics, each measuring a
different dimension of the final result. To evaluate recommendation accuracy,
we use Precision and Mean Reciprocal Rank (MRR). Precision@N denotes
the fraction of relevant items in the top-N recommendations, while
MRR computes the average reciprocal rank of the first relevant recommended
item, and is hence particularly meaningful when users are provided with
few but valuable recommendations (i.e., Top-1 or Top-3) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. To evaluate
aggregate diversity, we consider catalog coverage, i.e., the percentage of items in the
catalog recommended at least once, and aggregate entropy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The former
assesses the ability of a system to cover the item catalog, namely to recommend
as many items as possible, while the latter measures the distribution of the
recommendations across all the items, showing whether the recommendations are
concentrated on a few items or are better distributed.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Top-K features</th><th>Precision@10</th><th>MRR@10</th><th>itemCov@10</th><th>aggrEntropy@10</th></tr>
          </thead>
          <tbody>
            <tr><td>5</td><td>0.02035</td><td>0.14694</td><td>0.54953</td><td>9.12</td></tr>
            <tr><td>10</td><td>0.01651</td><td>0.13705</td><td>0.64346</td><td>9.42</td></tr>
            <tr><td>15</td><td>0.02062*</td><td>0.13757</td><td>0.67417</td><td>9.42</td></tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <p>Three further aggrEntropy@10 values (8.96, 10.24, 10.19) appear in the source without their accompanying columns.</p>
        </table-wrap-foot>
      </table-wrap>
      <p><sup>2</sup> ABSTAT summaries for several datasets can be explored at http://abstat.disco.unimib.it:8880/</p>
      <p><sup>3</sup> For more details about the summarization process, the impact of minimalization on
the size of extracted summaries, the use of ABSTAT summaries to support data
set understanding, and the services through which summaries are accessible via web
interfaces, we refer to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
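      <p>The four metrics can be sketched in Python as follows. This is a minimal illustration, not the evaluation code used for the experiments: the function names, the input layout (one ranked list of items per user) and the toy data are assumptions, and the base-2 logarithm in the aggregate entropy is a choice made for this sketch.</p>
      <preformat>
```python
from collections import Counter
from math import log2


def precision_at_n(recommended, relevant, n):
    """Fraction of the top-n recommended items that are relevant."""
    return sum(1 for item in recommended[:n] if item in relevant) / n


def reciprocal_rank(recommended, relevant, n):
    """1 / rank of the first relevant item in the top n, or 0 if none."""
    for rank, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(recs, relevant, catalog, n=10):
    """recs: user -> ranked list of items; relevant: user -> set of items."""
    users = list(recs)
    precision = sum(
        precision_at_n(recs[u], relevant.get(u, set()), n) for u in users
    ) / len(users)
    mrr = sum(
        reciprocal_rank(recs[u], relevant.get(u, set()), n) for u in users
    ) / len(users)
    # How many times each item appears in some user's top-n list.
    counts = Counter(item for u in users for item in recs[u][:n])
    coverage = len(counts) / len(catalog)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return {"precision": precision, "mrr": mrr,
            "item_coverage": coverage, "aggregate_entropy": entropy}


# Hypothetical toy data: two users, a catalog of four items, top-2 lists.
recs = {"u1": ["a", "b"], "u2": ["a", "c"]}
relevant = {"u1": {"b"}, "u2": {"a"}}
catalog = {"a", "b", "c", "d"}
scores = evaluate(recs, relevant, catalog, n=2)
```
      </preformat>
      <p>On the toy data, Precision@2 is 0.5, MRR@2 is 0.75, item coverage is 0.75 (three of the four catalog items are recommended at least once) and the aggregate entropy is 1.5 bits. A higher entropy indicates that recommendations are spread more evenly over the items.</p>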
      <p>
        The two feature selection methods, IG and ABSTAT, were evaluated
on the well-known Movielens 1M dataset. In order to enrich it with
information from Linked Data, we started from a dump of the DBpedia dataset<sup>4</sup>
and limited it to the movie domain by linking movies in the Movielens dataset
with their corresponding DBpedia entries. Table 1 shows the results for the
entity-based and path-based graph kernel algorithms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. When selecting
only the first 5 features, the two feature selection methods, IG and ABSTAT,
show good values of accuracy but lower values of aggregate diversity, especially
in terms of coverage. This is not really surprising: with fewer
features, the system does not have enough diversified information to select more
items, and the effect of the popularity bias is stronger. As the number of
features increases, diversity increases at the expense of accuracy. However, a
good balance remains between accuracy and diversity, thus showing a good
trade-off between the two [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The implementation of the recommendation algorithm
presented in this work and all the experimental results are available at
https://github.com/sisinflab/SAC2017.
      </p>
      <p><sup>4</sup> http://downloads.dbpedia.org/2015-10/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kwon</surname>
          </string-name>
          .
          <article-title>Improving aggregate recommendation diversity using ranking-based techniques</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>24</volume>
          (
          <issue>5</issue>
          ), May
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Hurley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vargas</surname>
          </string-name>
          .
          <article-title>Novelty and diversity in recommender systems</article-title>
          .
          <source>In Recommender Systems Handbook</source>
          . Springer US, Boston, MA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T.</given-names>
            <surname>Gottron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knauf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schaible</surname>
          </string-name>
          .
          <article-title>ELLIS: interactive exploration of linked data on the level of induced schema patterns</article-title>
          .
          <source>In Proceedings of the 2nd International Workshop on Summarizing and Presenting Entities and Ontologies, CEUR Workshop Proceedings</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poveda-Villalón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          .
          <article-title>Loupe - an online tool for inspecting datasets in the linked data cloud</article-title>
          .
          <source>In Proceedings of the ISWC 2015 Posters &amp; Demonstrations Track, CEUR Workshop Proceedings</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Gemmis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          .
          <article-title>Semantics-aware graph-based recommender systems exploiting linked open data</article-title>
          .
          <source>In Proceedings of the 24th Conference on User Modeling Adaptation and Personalization, UMAP 2016</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Ostuni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Di Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Sciascio</surname>
          </string-name>
          .
          <article-title>Sound and music recommendation with knowledge graphs</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology (TIST)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Ragone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tomeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Magarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Di Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Sciascio</surname>
          </string-name>
          .
          <article-title>Schema-summarization in linked-data-based feature selection for recommender systems</article-title>
          .
          <source>In Proceedings of the Symposium on Applied Computing, SAC '17</source>
          , pages
          <fpage>330</fpage>–<lpage>335</lpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Oliver</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          .
          <article-title>CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering</article-title>
          .
          <source>In Proceedings of the sixth ACM conference on Recommender systems. ACM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>B.</given-names>
            <surname>Spahiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Porrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          .
          <article-title>ABSTAT: ontology-driven linked data summaries with pattern minimalization</article-title>
          .
          <source>In Proceedings of the 2nd International Workshop on Summarizing and Presenting Entities and Ontologies (SumPre 2016) co-located with ESWC</source>
          , volume
          <volume>1605</volume>
          of CEUR Workshop Proceedings. CEUR-WS.org
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
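          10.
          <string-name>
            <given-names>G.</given-names>
            <surname>Troullinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kondylakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Daskalaki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Plexousakis</surname>
          </string-name>
          .
          <article-title>RDF Digest: Efficient Summarization of RDF/S KBs</article-title>
          .
          <source>In ESWC</source>
          ,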
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>