<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Hybrid Method for Rating Prediction Using Linked Data Features and Text Reviews</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Semih Yumusak</string-name>
          <email>semih.yumusak@karatay.edu.tr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emir Muñoz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Minervini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erdogan Dogdu</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Halife Kodaz</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fujitsu Ireland Limited</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Insight Centre for Data Analytics, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>KTO Karatay University</institution>
          ,
          <addr-line>Konya</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Selcuk University</institution>
          ,
          <addr-line>Konya</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>TOBB University of Economics and Technology</institution>
          ,
          <addr-line>Ankara</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our entry for the Linked Data Mining Challenge 2016, which poses the problem of classifying music albums as 'good' or 'bad' by mining Linked Data. The original labels are assigned according to aggregated critic scores published by the Metacritic website. To this end, the challenge provides datasets that contain the DBpedia reference for each music album. Our approach benefits from Linked Data (LD) and free text to extract meaningful features that help distinguish between these two classes of music albums. Our features can be summarized as follows: (1) direct object LD features, (2) aggregated count LD features, and (3) textual review features. To build unbiased models, we filtered out properties related to scores and Metacritic. Using these sets of features, we trained seven models and estimated their accuracy with 10-fold cross-validation. Our best average accuracy on the training data was 87.81%, obtained with a Linear SVM model and all our features, while we reached 90% on the testing data.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked data</kwd>
        <kwd>SPARQL</kwd>
        <kwd>Classification</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>#Know@LOD2016</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We start from the DBpedia knowledge base as the source of metadata about all albums in the training and testing datasets. By leveraging this knowledge base, we defined a set of features that are potentially relevant to the classification task. As shown in [<xref ref-type="bibr" rid="ref1">1</xref>], features coming from textual data (such as reviews) are also relevant for a classification problem. Therefore, in addition to pure Linked Data features, we collected the textual reviews from the Metacritic website and use the words they contain as features. The steps of our approach (shown in Figure 1) can be summarized as follows.</p>
      <p><bold>Data Collection.</bold> First, we collected and analysed the DBpedia knowledge base and the Metacritic reviews. For each music album, we crawled the summaries of the corresponding Metacritic reviews for the album and artist<sup>8</sup>. The critic reviews were scraped and saved as text, converted into RDF, and linked to DBpedia using the dbp:rev<sup>9</sup> property in a Jena Fuseki instance.</p>
      <p><bold>Feature Extraction.</bold> Starting from the DBpedia knowledge base, we manually selected predicates, leaving out infrequent and irrelevant ones. With the remaining predicates, we defined a set of questions and hypotheses that we later test (see Table 1). Based on our two sources, our features are divided into two sets: (1) Linked Data-based features, and (2) text-based features. Set (1) is further divided into: (1-1) Linked Data object-specific features, where the values of specific predicates are used directly; and (1-2) aggregating features, where we use the count of values of given predicates. For the Metacritic reviews, set (2), we follow a Bag-of-Words approach to find the most discriminant words for each class. Formally, we generate the following feature vectors: <inline-formula><tex-math>x^{(\mathrm{LD})} = (f_1, \ldots, f_m)</tex-math></inline-formula> to represent the (1-1) features (t1 to t14), where <inline-formula><tex-math>m = 15009</tex-math></inline-formula>; <inline-formula><tex-math>x^{(\mathrm{LDA})} = (f_1, \ldots, f_n)</tex-math></inline-formula> to represent the (1-2) features (t15, t16), where <inline-formula><tex-math>n = 4</tex-math></inline-formula>; and <inline-formula><tex-math>x^{(\mathrm{TEXT})} = (f_1, \ldots, f_q)</tex-math></inline-formula> to represent the (2) features (t17), where <inline-formula><tex-math>q = 21973</tex-math></inline-formula> is the cardinality of the extracted vocabulary.</p>
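      <p>As a rough illustration of how the three feature groups described above could be assembled into a single vector per album, the following Python sketch concatenates an LD part, an LDA part, and a TEXT part. The binary indicator encoding of (predicate, object) pairs, the use of raw term frequencies, and all names and toy values (build_vocabulary, feature_vector, the example album) are assumptions made for illustration; the paper does not specify these details.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def build_vocabulary(items):
    """Map each distinct item to a column index, in sorted order."""
    return {item: i for i, item in enumerate(sorted(set(items)))}


def feature_vector(ld_values, ld_counts, review_tokens, ld_vocab, text_vocab):
    """Concatenate the LD, LDA and TEXT parts for a single album."""
    # LD part: one binary indicator per known (predicate, object) pair
    # (assumed encoding, not confirmed by the paper).
    x_ld = [0] * len(ld_vocab)
    for pair in ld_values:
        if pair in ld_vocab:
            x_ld[ld_vocab[pair]] = 1
    # LDA part: aggregated counts, used as given.
    x_lda = list(ld_counts)
    # TEXT part: bag-of-words term frequencies over the known vocabulary.
    x_text = [0] * len(text_vocab)
    for token, freq in Counter(review_tokens).items():
        if token in text_vocab:
            x_text[text_vocab[token]] = freq
    return x_ld + x_lda + x_text


# Hypothetical toy data, not taken from the challenge dataset.
ld_vocab = build_vocabulary(
    [("dbo:genre", "dbr:Rock_music"), ("dbo:genre", "dbr:Jazz")]
)
text_vocab = build_vocabulary(["brilliant", "dull", "album"])
x = feature_vector(
    ld_values=[("dbo:genre", "dbr:Jazz")],
    ld_counts=[3, 12],
    review_tokens=["brilliant", "album", "brilliant"],
    ld_vocab=ld_vocab,
    text_vocab=text_vocab,
)
print(x)
```
]]></preformat>
      <p>With the vocabulary sizes reported above, a dense list like this would be wasteful; a sparse representation would be the more practical choice at that scale.</p>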
      <p>To answer each question in Table 1, we submitted SPARQL queries to our enriched DBpedia knowledge base. For example, the following query retrieves a direct object feature, the genre(s) of the album &lt;AlbumURI&gt;:</p>
      <preformat>SELECT ?o WHERE { &lt;AlbumURI&gt; dbo:genre ?o . }</preformat>
      <p>Similarly, we obtain the aggregation features, e.g., the number of albums by the producer of album &lt;AlbumURI&gt;:</p>
      <preformat>SELECT (COUNT(?s) AS ?n) WHERE { &lt;AlbumURI&gt; dbo:producer ?o1 . ?s dbo:producer ?o1 . ?s a dbo:Album . }</preformat>
      <p><sup>8</sup> We use URIs of the form http://www.metacritic.com/music/AlbumName/ArtistName/critic-reviews</p>
      <p><sup>9</sup> URI namespaces are shortened according to the prefixes in http://prefix.cc/</p>
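      <p>The two queries above could be generated per album and wrapped in SPARQL Protocol requests along the following lines. This is a minimal sketch using only the Python standard library: the endpoint URL, the example album URI, and the function names are hypothetical, and the request is only prepared, not sent.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlencode
from urllib.request import Request

# The dbo: prefix expands to the DBpedia ontology namespace.
PREFIXES = "PREFIX dbo: <http://dbpedia.org/ontology/>\n"


def genre_query(album_uri):
    """Direct object feature: genre(s) of one album."""
    return PREFIXES + "SELECT ?o WHERE { <%s> dbo:genre ?o . }" % album_uri


def producer_album_count_query(album_uri):
    """Aggregation feature: number of albums sharing the album's producer."""
    return PREFIXES + (
        "SELECT (COUNT(?s) AS ?n) WHERE { "
        "<%s> dbo:producer ?o1 . ?s dbo:producer ?o1 . ?s a dbo:Album . }"
        % album_uri
    )


def sparql_request(endpoint, query):
    """Prepare (but do not send) a GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Hypothetical endpoint and album URI, for illustration only.
ENDPOINT = "http://localhost:3030/dbpedia/sparql"
ALBUM = "http://dbpedia.org/resource/Example_Album"
request = sparql_request(ENDPOINT, producer_album_count_query(ALBUM))
print(request.full_url)
```
]]></preformat>
      <p>Passing the prepared request to urllib.request.urlopen would execute it against a running endpoint; that step is left out here because it depends on the local setup.</p>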
      <p>During our manual analysis, we noticed that some properties (e.g., dbp:extra, dbp:source, dbp:collapsed, dbp:extraColumn, dbp:type) correlate strongly with one of the two classes, 'good' or 'bad'. These properties were also collected and added to the LD feature set. Moreover, some properties are directly related to Metacritic scores (dbp:mc is the actual Metacritic score) or to other critic scores, like dbp:revNscore, whose values range from 1 to 15. To keep our models unbiased, we excluded these score-related properties from our extraction.</p>
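      <p>The exclusion of score-related properties can be pictured as a simple filter over the (predicate, object) pairs of each album. In the sketch below, the regular expression used to recognise the dbp:revNscore family is an assumption about how those predicates are spelled, and the function names and the example album are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

# dbp:mc is named in the text as the Metacritic score. The pattern for the
# dbp:revNscore family (N a number) is an assumed spelling, not confirmed.
EXCLUDED_EXACT = {"dbp:mc"}
EXCLUDED_PATTERN = re.compile(r"^dbp:rev\d+[Ss]core$")


def is_score_related(predicate):
    """True if the predicate would leak a critic score into the features."""
    return predicate in EXCLUDED_EXACT or bool(EXCLUDED_PATTERN.match(predicate))


def drop_score_predicates(pairs):
    """Keep only the (predicate, object) pairs that are not score-related."""
    return [(p, o) for p, o in pairs if not is_score_related(p)]


# Hypothetical album description, for illustration only.
album = [
    ("dbo:genre", "dbr:Jazz"),
    ("dbp:mc", "85"),
    ("dbp:rev3score", "4"),
    ("dbp:type", "studio"),
]
print(drop_score_predicates(album))
```
]]></preformat>
      <p>Note that the dbp:rev link to the review text is deliberately not matched by the pattern, since the reviews themselves are used as features.</p>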
      <p>Besides regular DBpedia properties, we also selected features from textual reviews. For each review, we use Bag-of-Words with lower-casing, removal of non-alphanumeric characters, and stop-word removal. The NLTK library was used for stemming and lemmatization of words longer than 2 characters. In [<xref ref-type="bibr" rid="ref1">1</xref>], the authors also show that aggregation features provide better results when discretized, e.g., based on their numeric range. For instance, the award feature of an artist could be marked as 'high' if the number of awards is more than one, and 'low' otherwise. For other numeric (property) values, we identified the average values and used them to discretize the values as 'high' (above average) and 'low' (below average). Examples of these averages are: runtime, 2800 s; number of albums per producer, 40; total length, 2900 s.</p>
      <p><bold>Classification.</bold> We trained the seven different models listed in Table 2 using k-fold cross-validation (k = 10). Each model was trained with five different sets of features and evaluated using accuracy, <inline-formula><tex-math>\mathrm{Acc} = \frac{tp + tn}{tp + fp + fn + tn}</tex-math></inline-formula>. The hyperparameters for each model were determined manually via incremental tests, with results extracted from the training set. For example, for SVM we tested a linear kernel with <inline-formula><tex-math>C \in [0.001, 0.1]</tex-math></inline-formula> and found 0.025 to be the best-performing value.</p>
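      <p>The review normalization and the average-based discretization described above might look roughly as follows. This sketch uses only the Python standard library: the stop-word list is a tiny stand-in for the NLTK one, stemming and lemmatization are omitted, and treating a value equal to the average as 'low' is an assumption, since the text only defines 'above' and 'below' average.</p>
      <preformat><![CDATA[
```python
import re

# Tiny illustrative stop-word list; the paper relies on NLTK instead.
STOPWORDS = {"the", "and", "this", "that", "with", "for", "are", "was"}


def normalize(review):
    """Lower-case, replace non-alphanumerics by spaces, then drop
    stop words and words of 2 characters or fewer."""
    tokens = re.sub(r"[^a-z0-9]+", " ", review.lower()).split()
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def discretize(value, average):
    """Label a numeric feature 'high' if above the average, else 'low'.
    Values equal to the average fall into 'low' (an assumption)."""
    return "high" if value > average else "low"


print(normalize("This album is BRILLIANT -- a bold, moving record!"))
print(discretize(3100, 2800))  # runtime in seconds vs. average runtime
print(discretize(12, 40))      # albums per producer vs. average
```
]]></preformat>
      <p>The two thresholds in the usage lines are the averages quoted in the text (2800 s runtime, 40 albums per producer); the input values are made up.</p>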
    </sec>
    <sec id="sec-2">
      <title>Experimental Results and Analysis</title>
      <p>For our experiments we used the scikit-learn library, which supports training the seven proposed classifiers with different combinations of our features. Table 2 shows the best validation accuracy for each of the seven models with each set of features. Our best cross-validation accuracy on the training set is 87.81%, whilst the challenge system reports 90% for our submission on the testing set. This suggests that our models did not overfit the training data and are able to generalise to unseen data. We attribute this mainly to our decision to leave out predicates that are directly or indirectly related to scores for the music albums. We would also like to highlight the use of textual features to increase the true positives and true negatives. Using LD features alone reached up to 76.64% accuracy, while using TEXT features alone reached up to 85%, both with the SVM model. This shows that, for a classification problem like this one, DBpedia still does not provide enough meta-information about the entities, and other sources must be taken into account. We also tested our hypotheses with the best-performing model and report the accuracy for each one in Table 1.</p>
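      <p>To make the evaluation protocol concrete, the sketch below computes an average k-fold accuracy in plain Python. It is not the scikit-learn procedure used for the reported numbers: the fold assignment is a simple unshuffled, unstratified split, the majority-class predictor is only a placeholder for a real model such as a linear SVM, and all names and toy labels are hypothetical.</p>
      <preformat><![CDATA[
```python
def accuracy(tp, tn, fp, fn):
    """Acc = (tp + tn) / (tp + fp + fn + tn)."""
    return (tp + tn) / (tp + fp + fn + tn)


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; fold i holds every k-th sample
    starting at position i. Requires n_samples >= k."""
    indices = list(range(n_samples))
    for i in range(k):
        test = indices[i::k]
        held_out = set(test)
        train = [j for j in indices if j not in held_out]
        yield train, test


def cross_val_accuracy(X, y, fit_predict, k=10):
    """Average accuracy over k folds, with 'good' as the positive class."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predicted = fit_predict(
            [X[j] for j in train], [y[j] for j in train], [X[j] for j in test]
        )
        tp = tn = fp = fn = 0
        for j, label in zip(test, predicted):
            if label == "good" and y[j] == "good":
                tp += 1
            elif label == "good":
                fp += 1
            elif y[j] == "bad":
                tn += 1
            else:
                fn += 1
        scores.append(accuracy(tp, tn, fp, fn))
    return sum(scores) / len(scores)


def majority_class(X_train, y_train, X_test):
    """Placeholder model: always predict the most frequent training label."""
    label = max(set(y_train), key=y_train.count)
    return [label] * len(X_test)


# Hypothetical toy labels: 14 'good' and 6 'bad' albums.
y = ["good"] * 14 + ["bad"] * 6
X = [[i] for i in range(len(y))]
result = cross_val_accuracy(X, y, majority_class, k=10)
print(result)
```
]]></preformat>
      <p>Any function with the same (X_train, y_train, X_test) signature can be passed in place of majority_class, which keeps the evaluation loop independent of the model being compared.</p>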
      <p>[Table 2: accuracy of the seven models (Linear SVM, KNN, RBF SVM, Decision Tree, Random Forest, AdaBoost, Naïve Bayes) for each set of features.]</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we addressed the classification problem by using features from Linked Data and text reviews. We experimented with several properties related to music albums; however, we noticed that by also considering textual features we could reach higher accuracy. We enriched our knowledge base with textual critic reviews and used them as a Bag of Words. We selected our model using 10-fold cross-validation; our best model also showed good predictive accuracy on the test set, as reported by the challenge system. This indicates that our manual analysis and feature selection were a useful pre-processing step. For reproducibility, all source files, crawler code and reviews, the enriched knowledge base in RDF, and intermediate files are published as an open-source repository.</p>
      <p>Acknowledgement. This research is partly supported by The Scientific and Technological Research Council of Turkey (Ref.No: B.14.2. TBT.0.06.01-21514107-020-155998).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aldarra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Muñoz, E.:
          <article-title>A Linked Data-Based Decision Tree Classifier to Review Movies</article-title>
          .
          <source>In: Proc. of the 4th International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data at ESWC 2015. CEUR Workshop Proceedings</source>
          , vol.
          <volume>1365</volume>
          .
          Portorož, Slovenia (May
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Kleef</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia</article-title>
          .
          <source>Semantic Web</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>167</fpage>
          –
          <lpage>195</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Semantic web in data mining and knowledge discovery: A comprehensive survey</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>