=Paper=
{{Paper
|id=Vol-1586/ldmc3
|storemode=property
|title=A Hybrid Method for Rating Prediction Using Linked Data Features and Text Reviews
|pdfUrl=https://ceur-ws.org/Vol-1586/ldmc3.pdf
|volume=Vol-1586
|authors=Semih Yumuşak,Emir Muñoz,Pasquale Minervini,Erdogan Dogdu,Halife Kodaz
|dblpUrl=https://dblp.org/rec/conf/esws/YumusakMMDK16
}}
==A Hybrid Method for Rating Prediction Using Linked Data Features and Text Reviews==
Semih Yumusak 1,2, Emir Muñoz 2,3, Pasquale Minervini 2, Erdogan Dogdu 4, and Halife Kodaz 5

1 KTO Karatay University, Konya, Turkey, semih.yumusak@karatay.edu.tr
2 Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
3 Fujitsu Ireland Limited
4 TOBB University of Economics and Technology, Ankara, Turkey
5 Selcuk University, Konya, Turkey

Abstract. This paper describes our entry for the Linked Data Mining Challenge 2016, which poses the problem of classifying music albums as 'good' or 'bad' by mining Linked Data. The original labels are assigned according to aggregated critic scores published by the Metacritic website. To this end, the challenge provides datasets that contain the DBpedia reference for each music album. Our approach draws on both Linked Data (LD) and free text to extract meaningful features that help distinguish between these two classes of music albums. Our features can be summarized as follows: (1) direct object LD features, (2) aggregated count LD features, and (3) textual review features. To build unbiased models, we filtered out properties related to scores and to Metacritic. Using these sets of features, we trained seven models with 10-fold cross-validation to estimate accuracy. We reached a best average accuracy of 87.81% on the training data using a linear SVM model and all of our features, and 90% on the testing data.

Keywords: Linked Data, SPARQL, Classification, Machine Learning, #Know@LOD2016

1 Introduction

The potential of the datasets available in the Linked Open Data (LOD) cloud6 to support several tasks in Data Mining (DM) has been pointed out several times (see [3] for a survey). For instance, the rich content of domain-specific and general-domain datasets can be used to generate semantically meaningful feature sets.
The linked nature of the datasets in the LOD cloud allows features for a given entity to be queried from different sources. The Linked Data Mining Challenge provides DBpedia URIs that we use to query the DBpedia knowledge base and extract features of the considered entity. The DBpedia [2] knowledge base contains descriptive information about albums that can be extracted using SPARQL queries over the relevant triple patterns. For instance, we can start from a DBpedia music album URI and access all related metadata. Furthermore, we can access extra information by navigating the links in the graph and obtain, for example, information about the artist(s) or band that recorded the album, the number of awards of the album or artist(s), and information about producers, among others. Although users are empowered with the ability to navigate Linked Data, they still face the classical challenges associated with DM, such as feature selection and model selection. Previous work on this task [1] highlights the limitations of features coming solely from DBpedia; extra information can come from textual critic reviews on Metacritic7. Here, we follow a similar approach, enriching DBpedia to find the best set of features for distinguishing between the two classes (§ 2). The results of our experiments show that taking all considered features into account yields the best classification performance (§ 3). Conclusions and final remarks are reported in Section 4.

6 http://lod-cloud.net/
7 http://www.metacritic.com

Fig. 1. System architecture

2 Methodology

We start from the DBpedia knowledge base to reference metadata about all albums in the training and testing datasets. By leveraging this knowledge base, we defined a set of features that are potentially relevant to the classification task. As shown in [1], features coming from textual data (such as reviews) are also relevant for this kind of classification problem.
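The per-predicate feature extraction described above amounts to instantiating SPARQL query templates for each album URI. The following sketch shows how such query strings could be generated; the helper names are hypothetical and the album URI is purely illustrative, not taken from the challenge data:

```python
# Sketch: build the two kinds of SPARQL queries used for feature extraction.
# Function names and prefix handling are illustrative, not the paper's actual code.

PREFIXES = "PREFIX dbo: <http://dbpedia.org/ontology/>\n"

def direct_object_query(album_uri: str, predicate: str) -> str:
    """Direct-object feature, e.g. the genre(s) of an album."""
    return PREFIXES + f"SELECT ?o WHERE {{ <{album_uri}> {predicate} ?o . }}"

def producer_album_count_query(album_uri: str) -> str:
    """Aggregation feature: how many albums share a producer with this album."""
    return (PREFIXES
            + "SELECT (COUNT(?s) AS ?count) WHERE { "
            + f"<{album_uri}> dbo:producer ?p . "
            + "?s dbo:producer ?p . ?s a dbo:Album . }")

album = "http://dbpedia.org/resource/Thriller_(album)"
print(direct_object_query(album, "dbo:genre"))
```

Such strings would then be submitted to a SPARQL endpoint (the paper uses an enriched DBpedia knowledge base served by Jena Fuseki) and the bindings collected as feature values.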
Therefore, in addition to pure Linked Data features, we collected the textual reviews from the Metacritic website and consider their word content as features. The steps of our approach (shown in Figure 1) can be summarized as follows:

Data Collection. First, we collected and analysed the DBpedia knowledge base and the Metacritic reviews. For each music album, we crawled the summaries of the corresponding Metacritic reviews for the album and artist8. The critic reviews were scraped and saved as text, converted into RDF, and linked to DBpedia using the dbp:rev9 property in a Jena Fuseki instance.

Feature Extraction. Starting from the DBpedia knowledge base, a manual selection of predicates was carried out, leaving out infrequent and irrelevant predicates. With the remaining predicates, we defined a set of questions and hypotheses that we later test (see Table 1). Based on our two sources, our features are divided into two sets: (1) Linked Data-based features, and (2) text-based features. Set (1) is further divided into: (1-1) Linked Data object-specific features, where the values of specific predicates are used directly; and (1-2) aggregating features, where we use the count of values of given predicates. For the Metacritic reviews in set (2), we follow a Bag of Words approach to find the most discriminant words for each class. Formally, we generate the following feature vectors: x(LD) = (f1, ..., fm) to represent the (1-1) features (t1 to t14), where m = 15009; x(LDA) = (f1, ..., fn) to represent the (1-2) features (t15, t16), where n = 4; and x(TEXT) = (f1, ..., fq) to represent the (2) features (t17), where q = 21973 is the cardinality of the extracted vocabulary.

To answer each question in Table 1, we submitted SPARQL queries to our enriched DBpedia knowledge base. For example, the query to get a direct object feature, such as the genre(s) of an album, is:

SELECT ?o WHERE { <AlbumURI> dbo:genre ?o . }

Similarly, we get the aggregation features, e.g., the number of other albums by the producer of an album:

SELECT (COUNT(?s) AS ?count) WHERE { <AlbumURI> dbo:producer ?o1 . ?s dbo:producer ?o1 . ?s a dbo:Album . }

8 We use URIs of the form http://www.metacritic.com/music/AlbumName/ArtistName/critic-reviews
9 URI namespaces are shortened according to the prefixes in http://prefix.cc/

Table 1. Domain-specific questions, hypotheses, and predicates with their accuracy

# | Question | Hypothesis | Predicate | SVM Acc.
t1 | What are the topics (dct:subject) of the album? (baseline) | Some albums belong to successful subjects, and vice versa. | dct:subject | 58.05%
t2 | Who is the artist of the album? | Some artists are more famous than others. | dbo:artist | 48.91%
t3 | Is the artist a band, a single artist, etc.? | Bands are more successful than single artists. | rdf:type of dbo:artist | 61.95%
t4 | What genres does the album belong to? | Some genres are more popular than others. | dbo:genre | 66.33%
t5 | What are the language(s) of the album? | Albums in English are more likely to be popular. | dbo:language | 47.27%
t6 | Who recorded this album? | Some labels are more popular and record more albums. | dbo:recordLabel | 49.06%
t7 | Are long albums more popular? | Long albums tend to be more popular. | dbo:runtime | 46.48%
t8 | Who is the director of the album? | Certain directors/artists are more successful. | dbp:director | 47.19%
t9 | What is the region of the album? | Albums created in certain regions are more likely to be successful. | dbp:region | 51.72%
t10 | What studio created the album? | Some studios create high-quality works, some do not. | dbp:studio | 47.19%
t11 | What is the total length of the album? | Shorter albums are likely to be worse. | dbp:totalLength | 54.69%
t12 | Who are the songwriters of the album? | The songwriters of the album affect its popularity. | dbp:writer | 47.19%
t13 | Who are the reviewers of the album? | Some reviewers are likely to review only good or bad albums. | dbp:rev | 71.41%
t14 | What are the topics (dct:subject) for the artist? | Particular artists are likely to be categorized under certain subjects. | dct:subject of dbo:artist | 68.59%
t15 | How many awards does an artist have? | Albums of award-winning artists are likely to be more successful. | # awards of dbo:artist | 47.19%
t16 | How many other albums does a producer of this album have? | Some producers are more successful and produce more albums than others. | # albums by dbo:producer | 54.53%
t17 | Are textual reviews useful for the classification? | A Bag of Words approach can help to separate the classes. | BoW | 85.00%

During our manual analysis, we noticed that some properties (e.g., dbp:extra, dbp:source, dbp:collapsed, dbp:extraColumn, dbp:type) have a strong correlation with the class 'good' over 'bad', and vice versa. These properties were also collected and added to the LD feature set. Moreover, some properties are directly related to Metacritic scores (dbp:mc is the actual Metacritic score) or to other (critic) scores, such as dbp:revNscore, whose values range from 1 to 15. To keep our models unbiased, we excluded these from our extraction.

Besides regular DBpedia properties, we also selected features from the textual reviews. For each review, we use a Bag of Words with lower-casing, removal of non-alphanumeric characters, and stop-word removal. The NLTK library10 was used for stemming and lemmatization of words longer than 2 characters.

In [1], the authors also show that aggregation features provide better results when discretized, e.g., based on their numeric range. For instance, the award feature of an artist could be marked as 'high' if the number of awards is more than one, and 'low' otherwise. For other numeric (property) values, we identified the average values and used them to discretize the values as 'high' (above average) and 'low' (below average).
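This discretization rule can be sketched in a few lines; the function names are hypothetical, and the thresholds follow the averages reported by the authors:

```python
# Sketch of the paper's discretization of aggregate features into 'high'/'low'.
# The awards rule is fixed (> 1 award); other features are split at their average.

def discretize_awards(num_awards: int) -> str:
    """An artist's awards count is 'high' when they have more than one award."""
    return "high" if num_awards > 1 else "low"

def discretize_by_average(value: float, average: float) -> str:
    """Other numeric features are 'high' above their average, 'low' otherwise."""
    return "high" if value > average else "low"

# Average values as reported in the paper (seconds for the length features).
AVERAGES = {"runtime": 2800, "albums_per_producer": 40, "total_length": 2900}

print(discretize_awards(3))                               # high
print(discretize_by_average(3100, AVERAGES["runtime"]))   # high
print(discretize_by_average(20, AVERAGES["albums_per_producer"]))  # low
```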
For example, the average runtime is 2800 sec., the average number of albums per producer is 40, and the average total length is 2900 sec.

Classification. We trained the seven models listed in Table 2 using k-fold cross-validation (k = 10). Each model was trained with five different sets of features and evaluated using accuracy, Acc = (TP + TN) / (TP + FP + TN + FN). The hyperparameters for each model were determined manually via incremental tests on the training set. For example, for the SVM we tested a linear kernel with C ∈ [0.001, 0.1] and found 0.025 to be the best-performing value.

10 http://www.nltk.org/

3 Experimental Results and Analysis

For our experiments we used the scikit-learn library11, which supports training the seven proposed classifiers with different combinations of our features. Table 2 shows the best validation accuracy for all seven models with each set of features. Our best cross-validation accuracy is 87.81% on the training set, whilst the challenge system reports 90% for our submission on the testing set. This can be seen as an indication that our models did not overfit the training data and are able to generalise to unseen data. We attribute this mainly to our decision to leave out predicates that are directly or indirectly related to the scores of the music albums. We would also like to highlight the contribution of the textual features: considering solely LD features reached up to 76.64% accuracy, while considering solely TEXT features reached up to 85%, both using the SVM model. This shows that, for a classification problem like this one, DBpedia alone does not yet provide enough meta-information about the entities, and other sources must be taken into account. We also tested our hypotheses with the best-performing model and report the accuracy of each in Table 1.

Table 2. Comparative analysis of feature sets and classifiers

Feature Set | Linear SVM | KNN | RBF SVM | Dec. Tree | Rand. Forest | AdaBoost | Naïve Bayes
LD | 76.64% | 60.47% | 48.05% | 72.66% | 53.91% | 75.00% | 76.41%
LDA | 54.53% | 52.58% | 54.69% | 54.45% | 48.91% | 54.53% | 52.89%
LD+LDA | 76.72% | 60.23% | 48.05% | 72.66% | 52.34% | 75.00% | 76.41%
TEXT | 85.00% | 50.00% | 47.27% | 67.27% | 52.81% | 78.91% | 68.44%
LD+LDA+TEXT | 87.81% | 52.81% | 47.27% | 72.03% | 52.58% | 82.50% | 77.19%

4 Conclusion

In this paper, we addressed a classification problem using features from Linked Data and text reviews. We experimented with several properties related to music albums; however, we noticed that by also considering textual features we could reach higher accuracies. We enriched our knowledge base with textual critic reviews and used them as a Bag of Words. We selected our model using 10-fold cross-validation: our best model also showed good predictive accuracy on the test set, as reported by the challenge system. This is an indication that our manual analysis and feature selection were a useful pre-processing step. For reproducibility, all source files, the crawler code and reviews, the enriched knowledge base in RDF, and intermediate files are published in an open-source repository12.

Acknowledgement. This research is partly supported by The Scientific and Technological Research Council of Turkey (Ref. No: B.14.2.TBT.0.06.01-21514107-020-155998).

References

1. Aldarra, S., Muñoz, E.: A Linked Data-Based Decision Tree Classifier to Review Movies. In: Proc. of the 4th International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data at ESWC 2015. CEUR Workshop Proceedings, vol. 1365. Portoroz, Slovenia (May 2015)
2. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S.: DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web 6(2), 167-195 (2015)
3. Ristoski, P., Paulheim, H.: Semantic web in data mining and knowledge discovery: A comprehensive survey.
Web Semantics: Science, Services and Agents on the World Wide Web (2016)

11 http://scikit-learn.org/
12 https://github.com/semihyumusak/KNOW2016