
A Hybrid Method for Rating Prediction Using Linked Data Features and Text Reviews

Semih Yumusak (1, 2), semih.yumusak@karatay.edu.tr

Emir Muñoz (0, 1)

Pasquale Minervini

Erdogan Dogdu

Halife Kodaz (3)

(0) Fujitsu Ireland Limited
(1) Insight Centre for Data Analytics, National University of Ireland, Galway
(2) KTO Karatay University, Konya, Turkey
(3) Selcuk University, Konya, Turkey
(4) TOBB University of Economics and Technology, Ankara, Turkey

This paper describes our entry for the Linked Data Mining Challenge 2016, which poses the problem of classifying music albums as 'good' or 'bad' by mining Linked Data. The original labels are assigned according to aggregated critic scores published by the Metacritic website. To this end, the challenge provides datasets that contain the DBpedia reference for each music album. Our approach benefits from Linked Data (LD) and free text to extract meaningful features that help distinguish between these two classes of music albums. Our features can be summarized as follows: (1) direct object LD features, (2) aggregated count LD features, and (3) textual review features. To build unbiased models, we filtered out properties related to scores and to Metacritic. Using these sets of features, we trained seven models and used 10-fold cross-validation to estimate accuracy. Our best average accuracy on the training data was 87.81%, obtained with a Linear SVM model and all our features, while we reached 90% on the testing data.

Keywords: Linked Data, SPARQL, Classification, Machine Learning, #Know@LOD2016

Introduction

We start from the DBpedia knowledge base as the reference source of metadata for all albums in the training and testing datasets. By leveraging this knowledge base, we defined a set of features that are potentially relevant to the classification task. As shown in [1], features coming from textual data (such as reviews) are also relevant for a classification problem. Therefore, in addition to pure Linked Data features, we collected the textual reviews from the Metacritic website and used their words as features. The steps of our approach (shown in Figure 1) can be summarized as follows:

Data Collection. First, we collected and analysed the DBpedia knowledge base and the Metacritic reviews. For each music album, we crawled the summaries of the corresponding Metacritic reviews for the album and artist⁸. The critic reviews were scraped and saved as text, converted into RDF, and linked to DBpedia using the dbp:rev property⁹ in a Jena Fuseki instance.

Feature Extraction. Starting from the DBpedia knowledge base, we manually selected predicates, leaving out infrequent and irrelevant ones. With the remaining predicates, we defined a set of questions and hypotheses that we later test (see Table 1). Based on our two sources, our features are divided into two sets: (1) Linked Data-based features, and (2) text-based features. Set (1) is further divided into: (1-1) Linked Data object-specific features, where the values of specific predicates are used directly; and (1-2) aggregating features, where we use the count of values of given predicates. For the Metacritic reviews, set (2), we follow a Bag-of-Words approach to find the most discriminant words for each class.
Formally, we generate the following feature vectors: $x^{(LD)} = (f_1, \ldots, f_m)$ to represent the (1-1) features (t1 to t14), where $m = 15009$; $x^{(LDA)} = (f_1, \ldots, f_n)$ to represent the (1-2) features (t15, t16), where $n = 4$; and $x^{(TEXT)} = (f_1, \ldots, f_q)$ to represent the (2) features (t17), where $q = 21973$ is the cardinality of the extracted vocabulary.
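
To make the three vectors concrete, the following is a minimal sketch of how they could be assembled for one album. It assumes that the vectors are simply concatenated when "all features" are used, and it keeps raw counts for the aggregated part; the toy vocabularies, the album record, and all function names are hypothetical and only illustrate the shape of the data (the real dimensions are m = 15009, n = 4 and q = 21973).

```python
from collections import Counter


def one_hot(values, vocabulary):
    """Binary indicator vector: 1 where a vocabulary entry occurs in `values`."""
    present = set(values)
    return [1 if v in present else 0 for v in vocabulary]


def bag_of_words(tokens, vocabulary):
    """Term-count vector over a fixed word vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def build_features(album, ld_vocab, agg_keys, text_vocab):
    """Concatenate x(LD), x(LDA) and x(TEXT) for one album (assumed layout)."""
    x_ld = one_hot(album["objects"], ld_vocab)          # (1-1) direct object features
    x_lda = [album["counts"][k] for k in agg_keys]      # (1-2) aggregated counts
    x_text = bag_of_words(album["tokens"], text_vocab)  # (2) review words
    return x_ld + x_lda + x_text


# Hypothetical toy data, not taken from the challenge datasets.
ld_vocab = ["dbo:genre=dbr:Rock", "dbo:genre=dbr:Jazz", "dbo:recordLabel=dbr:Columbia_Records"]
agg_keys = ["albums_by_producer", "artist_awards"]
text_vocab = ["brilliant", "dull", "melodi"]
album = {
    "objects": ["dbo:genre=dbr:Rock"],
    "counts": {"albums_by_producer": 12, "artist_awards": 0},
    "tokens": ["brilliant", "melodi", "brilliant"],
}

x = build_features(album, ld_vocab, agg_keys, text_vocab)
print(x)  # [1, 0, 0, 12, 0, 2, 0, 1]
```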

To answer each question in Table 1, we submitted SPARQL queries to our enriched DBpedia knowledge base. For example, the following query retrieves a direct object feature, the genre(s) of the album <AlbumURI>:

    SELECT ?o WHERE { <AlbumURI> dbo:genre ?o . }

Similarly, we obtain the aggregation features, e.g., the number of albums by the producer of album <AlbumURI>:

    SELECT (COUNT(?s) AS ?n) WHERE { <AlbumURI> dbo:producer ?o1 . ?s dbo:producer ?o1 . ?s a dbo:Album . }

8 We use URIs of the form http://www.metacritic.com/music/AlbumName/ArtistName/critic-reviews
9 URI namespaces are shortened according to the prefixes in http://prefix.cc/
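
The two query templates above can be filled in programmatically for each album. The sketch below only builds the query strings and does not contact an endpoint; the function names and the example album URI are hypothetical, and the count query uses the standard SPARQL 1.1 aggregate form.

```python
# Namespace of the DBpedia ontology, as registered on prefix.cc.
PREFIXES = "PREFIX dbo: <http://dbpedia.org/ontology/>\n"


def direct_object_query(album_uri, predicate="dbo:genre"):
    """Query for the values of one predicate of an album (a (1-1) feature)."""
    return PREFIXES + f"SELECT ?o WHERE {{ <{album_uri}> {predicate} ?o . }}"


def producer_album_count_query(album_uri):
    """Query for the number of albums sharing a producer with the album (a (1-2) feature)."""
    return PREFIXES + (
        "SELECT (COUNT(?s) AS ?n) WHERE { "
        f"<{album_uri}> dbo:producer ?o1 . "
        "?s dbo:producer ?o1 . "
        "?s a dbo:Album . }"
    )


# Hypothetical example URI, used only to show the generated text.
uri = "http://dbpedia.org/resource/Abbey_Road"
genre_query = direct_object_query(uri)
count_query = producer_album_count_query(uri)
print(genre_query)
print(count_query)
```

The resulting strings can then be sent to any SPARQL endpoint, such as the Jena Fuseki instance mentioned earlier.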

During our manual analysis, we noticed that some properties (e.g., dbp:extra, dbp:source, dbp:collapsed, dbp:extraColumn, dbp:type) correlate strongly with one of the two classes, 'good' or 'bad'. These properties are also collected and added to the LD feature set. Moreover, some properties are directly related to Metacritic scores (dbp:mc is the actual Metacritic score) or to other critic scores, like dbp:revNscore, whose values range from 1 to 15. To keep our models unbiased, we decided to exclude them from our extraction.
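
The exclusion step can be sketched as a simple predicate filter applied before feature extraction. This is an illustration under assumptions: the regular expression for dbp:revNscore treats N as a numeric index (e.g., dbp:rev1score), and the sample triples are invented.

```python
import re

# dbp:mc is named in the text as the Metacritic score.
EXCLUDED = {"dbp:mc"}
# Assumed naming pattern for critic score properties such as dbp:rev1score.
REV_SCORE = re.compile(r"^dbp:rev\d+[Ss]core$")


def is_score_related(predicate):
    """True if the predicate would leak rating information into the model."""
    return predicate in EXCLUDED or bool(REV_SCORE.match(predicate))


def filter_triples(triples):
    """Drop (subject, predicate, object) triples whose predicate is score-related."""
    return [(s, p, o) for (s, p, o) in triples if not is_score_related(p)]


# Hypothetical triples for a single album.
triples = [
    ("dbr:Some_Album", "dbo:genre", "dbr:Rock"),
    ("dbr:Some_Album", "dbp:mc", "85"),
    ("dbr:Some_Album", "dbp:rev3score", "4"),
    ("dbr:Some_Album", "dbp:type", "studio"),
]
kept = filter_triples(triples)
print([p for (_, p, _) in kept])  # ['dbo:genre', 'dbp:type']
```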

Besides regular DBpedia properties, we also selected features from textual reviews. For each review, we use Bag-of-Words with lower-casing, removal of non-alphanumeric characters, and stopword removal. The NLTK library¹⁰ was used for stemming and lemmatization of words longer than two characters.

In [1], the authors also show that aggregation features provide better results when discretized, e.g., based on their numeric range. For instance, the award feature of an artist could be marked as 'high' if the number of awards is more than one, and 'low' otherwise. For other numeric (property) values, we identified the average values and used them to discretize the values as 'high' (above average) and 'low' (below average). Examples of these averages are: runtime, 2800 s; number of albums per producer, 40; total length, 2900 s.

Classification. We trained the seven different models listed in Table 2 using k-fold cross-validation (k = 10). Each model was trained with five different sets of features and evaluated using accuracy, $Acc = \frac{tp + tn}{tp + fp + fn + tn}$. The hyperparameters for each model were determined manually through incremental tests, with results obtained on the training set. For example, for the SVM we tested a linear kernel with $C \in [0.001, 0.1]$ and found 0.025 to be the best-performing value.
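
The text normalization and the discretization rules described above can be sketched as follows. This is a simplified stand-in that uses only the standard library: stemming and lemmatization are omitted, the stopword list is a tiny illustrative one rather than the NLTK list, and values equal to the average are assigned to 'low' as an assumption.

```python
import re

# Tiny illustrative stopword list; the paper relies on NLTK instead.
STOPWORDS = {"the", "and", "this", "with", "for"}


def normalize(review):
    """Lower-case, split on non-alphanumeric characters, drop stopwords and short words."""
    tokens = re.split(r"[^a-z0-9]+", review.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def discretize(value, average):
    """'high' if the value is above the average, 'low' otherwise."""
    return "high" if value > average else "low"


def discretize_awards(n_awards):
    """Award rule from the text: 'high' if the artist has more than one award."""
    return "high" if n_awards > 1 else "low"


tokens = normalize("This album is a BRILLIANT, genre-defying return to form!")
print(tokens)  # ['album', 'brilliant', 'genre', 'defying', 'return', 'form']
print(discretize(3100, 2800))  # high (runtime above the 2800 s average)
print(discretize_awards(1))    # low
```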

Experimental Results and Analysis

For our experiments we used the scikit-learn library¹¹, which supports training the seven proposed classifiers on different combinations of our features. Table 2 shows the best validation accuracy of each of the seven models with each set of features. Our best cross-validation accuracy on the training set is 87.81%, whilst the challenge system reports 90% for our submission on the testing set. This can be seen as an indication that our models did not overfit the training data and are able to generalise to unseen data. We attribute this mainly to our decision to leave out predicates that are directly or indirectly related to scores for the music albums. We would also like to highlight that textual features increase the number of true positives and true negatives. Using only LD features, accuracy reached up to 76.64%, while using only TEXT features it reached up to 85%, both with the SVM model. This shows that, for a classification problem like this one, DBpedia still does not provide enough meta-information about the entities, and other sources must be taken into account. We also tested each of our hypotheses with the best-performing model; the resulting accuracies are reported in Table 1.
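
The evaluation protocol, 10-fold cross-validation scored by accuracy, can be sketched without any external library. In this sketch a trivial one-feature threshold classifier stands in for the Linear SVM, and the data are synthetic; only the fold logic and the accuracy formula correspond to the procedure described in the text.

```python
import random


def accuracy(y_true, y_pred):
    """Acc = (tp + tn) / (tp + fp + fn + tn) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return (tp + tn) / (tp + fp + fn + tn)


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        preds = [predict(model, X[j]) for j in test_idx]
        scores.append(accuracy([y[j] for j in test_idx], preds))
    return sum(scores) / k


def fit_threshold(X, y):
    """Stand-in classifier: midpoint between the class means of feature 0."""
    pos = [x[0] for x, label in zip(X, y) if label == 1]
    neg = [x[0] for x, label in zip(X, y) if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else 0


# Synthetic, well-separated data: 10 'bad' (0) and 10 'good' (1) examples.
X = [[v] for v in range(10)] + [[v] for v in range(100, 110)]
y = [0] * 10 + [1] * 10
mean_acc = cross_validate(X, y, fit_threshold, predict_threshold, k=10)
print(mean_acc)  # 1.0
```

Swapping `fit_threshold` and `predict_threshold` for a real classifier, and repeating the call for each candidate hyperparameter value, gives the kind of manual search over C described for the SVM.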

Table 2: Accuracy for each model and feature set. Models: Linear SVM, KNN, RBF SVM, Decision Tree, Random Forest, AdaBoost, Naive Bayes.

Conclusion

In this paper, we addressed the problem of classification using features from Linked Data and text reviews. We experimented with several properties related to music albums; however, we noticed that by also considering textual features we could reach higher accuracies. We enriched our knowledge base with textual critic reviews and used them as a Bag of Words. We selected our model using 10-fold cross-validation; our best model also showed good predictive accuracy on the test set, as reported by the challenge system. This is an indication that our manual analysis and feature selection were a useful pre-processing step. For reproducibility, all source files, crawler code and reviews, the enriched knowledge base in RDF, and intermediate files are published as an open-source repository¹².

Acknowledgement. This research is partly supported by The Scientific and Technological Research Council of Turkey (Ref. No: B.14.2.TBT.0.06.01-21514107-020-155998).

References

1. Aldarra, S., Muñoz, E.: A Linked Data-Based Decision Tree Classifier to Review Movies. In: Proc. of the 4th International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data at ESWC 2015. CEUR Workshop Proceedings, vol. 1365. Portorož, Slovenia (May 2015)

2. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S.: DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web 6(2), 167-195 (2015)

3. Ristoski, P., Paulheim, H.: Semantic web in data mining and knowledge discovery: A comprehensive survey. Web Semantics: Science, Services and Agents on the World Wide Web (2016)