=Paper= {{Paper |id=Vol-1586/ldmc1 |storemode=property |title=The Linked Data Mining Challenge 2016 |pdfUrl=https://ceur-ws.org/Vol-1586/ldmc1.pdf |volume=Vol-1586 |authors=Petar Ristoski,Heiko Paulheim,Vojtěch Svátek,Václav Zeman |dblpUrl=https://dblp.org/rec/conf/esws/RistoskiPSZ16 }} ==The Linked Data Mining Challenge 2016== https://ceur-ws.org/Vol-1586/ldmc1.pdf
       The Linked Data Mining Challenge 2016

    Petar Ristoski1 , Heiko Paulheim1 , Vojtěch Svátek2 , and Václav Zeman2
                       1
                         University of Mannheim, Germany
                      Research Group Data and Web Science
              {petar.ristoski,heiko}@informatik.uni-mannheim.de
                2
                  University of Economics, Prague, Czech Republic
               Department of Information and Knowledge Engineering
                         {svatek,vaclav.zeman}@vse.cz



      Abstract. The 2016 edition of the Linked Data Mining Challenge, con-
      ducted in conjunction with Know@LOD 2016, has been the fourth edi-
      tion of this challenge. This year’s dataset collected music album ratings,
      where the task was to classify well and badly rated music albums. The
      best solution submitted reached an accuracy of almost 92.5%, which is
      a clear advancement over the baseline of 69.38%.


1    The Linked Data Mining Challenge Overview

Linked Open Data [9] has been recognized as a valuable source of background
knowledge in many data mining tasks and knowledge discovery in general [7].
Augmenting a dataset with features taken from Linked Open Data can, in many
cases, improve the results of a data mining problem at hand, while externalizing
the cost of maintaining that background knowledge [4]. Hence, the primary goal
of the Linked Data Mining Challenge 2016 is to show how Linked Open Data
and Semantic Web technologies could be used in a real-world data mining task.
    This year, the Linked Data Mining Challenge was held for the fourth time,
following past editions co-located with DMoLD (at ECML/PKDD) [11], Know@LOD
2014 [10] and Know@LOD 2015 [8]. The challenge consists of one task,
which is the prediction of the review class of music albums. The dataset is
generated from real-world observations, linked to a LOD dataset and it is used
for a two-class classification problem.
    The rest of this paper is structured as follows. Section 2 discusses the dataset
construction and the task to be solved. In Section 3, we discuss the entrants to the
challenge and their results. We conclude with a short summary and an outlook
on future work.


2    Task and Dataset

The 2016 edition of the challenge used a dataset built from music album
recommendations, turned into a two-class classification problem.

2.1     Dataset
The task concerns the prediction of the review class of music albums, i.e., “good”
or “bad”. The initial dataset is retrieved from Metacritic.com3 , which offers an
average rating of all time reviews for a list of music albums4 . Each album is
linked to DBpedia [3] using the album’s title and the album’s artist. The initial
dataset contained around 10,000 music albums, from which we selected 800
albums from the top of the list, and 800 albums from the bottom of the list.
The ratings were used to divide the albums into classes, i.e., albums with a score
above 79 are regarded as “good” albums, while albums with a score below 63
are regarded as “bad” albums. For each album we provide the corresponding
DBpedia URI. The mappings can be used to extract semantic features from
DBpedia or other LOD repositories to be exploited in the learning approaches
proposed in the challenge.
    The dataset was split into a training and a test set using a random stratified
80/20 split, i.e., the training dataset contains 1,280 instances, and the test
dataset contains 320 instances. The training dataset, which contains the target
variable, was provided to the participants to train predictive models. The test
dataset, from which the target label is removed, is used for evaluating the built
predictive models.
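
The labeling and splitting steps described above can be sketched as follows. This is a minimal illustration, not the code used to build the challenge dataset: the function names, the random seed, and the synthetic scores are assumptions introduced here, and only the thresholds (above 79, below 63) and the 80/20 stratified split come from the text.

```python
import random
from collections import defaultdict


def label_album(score):
    """Map a rating score to a class; scores between the thresholds get no class."""
    if score > 79:
        return "good"
    if score < 63:
        return "bad"
    return None


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train, test) index lists that keep the class proportions intact."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # assumed seed, only for reproducibility
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test


# Synthetic scores standing in for the 800 top and 800 bottom albums.
scores = [80 + i % 15 for i in range(800)] + [40 + i % 23 for i in range(800)]
labels = [label_album(s) for s in scores]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 800 albums per class, the sketch yields 1,280 training and 320 test instances, matching the sizes stated above.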

2.2     Task
The task concerns the prediction of the review class of albums, i.e., “good” or “bad”,
as a classification task. The performance of the approaches is evaluated with
respect to accuracy, calculated as:

\[
\mathrm{Accuracy} = \frac{\#\text{true positives} + \#\text{true negatives}}{\#\text{true positives} + \#\text{false positives} + \#\text{false negatives} + \#\text{true negatives}} \tag{1}
\]
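
As a small worked illustration of Equation (1), the following sketch counts the four confusion-matrix cells and computes the accuracy. The function name and the choice of “good” as the positive class are assumptions made for this example; the toy labels are not challenge data.

```python
def accuracy(y_true, y_pred, positive="good"):
    """Accuracy as in Equation (1), computed from the confusion-matrix cells."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / (tp + fp + fn + tn)


# Toy example: three of four predictions are correct.
y_true = ["good", "good", "bad", "bad"]
y_pred = ["good", "bad", "bad", "bad"]
print(accuracy(y_true, y_pred))  # 0.75
```

On the test set of 320 instances, an accuracy of 92.50% corresponds to 296 correctly classified albums.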

2.3     Submission
The participants were asked to submit the predicted labels for the instances in
the test dataset. The submissions were performed through an online submission
system. The users could upload their prediction and get the results instantly.
Furthermore, the results of all participants were made completely transparent
by publishing them on an online real-time leader board (Figure 1). The number
of submissions per user was not constrained.
    To increase the amount of Linked Open Data [9] available as a side effect
of the challenge, we also allowed users to exploit non-LOD data sources,
provided that they transform the datasets they use to RDF and make them
publicly available. Since the Metacritic dataset is publicly available, the participants were
asked not to use the Metacritic music albums’ rating score to tune the predictor
for the albums in the test set.
3
    http://www.metacritic.com/
4
    http://www.metacritic.com/browse/albums/score/metascore/all




                            Fig. 1: Participants Results


3      The Linked Data Mining Challenge Results

In total, three parties participated in the challenge. We compare those results
against a baseline approach.


3.1     Baseline Models

We provide a simple classification model that serves as a baseline. In this baseline
approach, we use the albums’ DBpedia URIs to extract the direct types and
categories of each album. On the resulting dataset, we built a k-NN classifier (k=3)
and applied it to the test set, scoring an accuracy of 69.38%. The model is
implemented in the RapidMiner platform5 , using the Linked Open Data extension
[7], and it was made publicly available to the participants.
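
The idea behind the baseline can be sketched in plain Python as follows. This is not the RapidMiner process itself: the Jaccard distance over sets of types and categories is an assumption chosen for this illustration (the distance measure of the original process is not stated here), and all feature strings and labels are hypothetical.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Distance between two feature sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_predict(train, query_features, k=3):
    """Predict by majority vote among the k nearest training albums.

    train is a list of (feature_set, label) pairs.
    """
    neighbours = sorted(
        train, key=lambda item: jaccard_distance(item[0], query_features)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical albums described by their direct types and categories.
train = [
    ({"dbo:Album", "cat:A", "cat:B"}, "good"),
    ({"dbo:Album", "cat:A", "cat:C"}, "good"),
    ({"dbo:Album", "cat:D"}, "bad"),
    ({"dbo:Album", "cat:D", "cat:E"}, "bad"),
]
print(knn_predict(train, {"dbo:Album", "cat:A"}))           # good
print(knn_predict(train, {"dbo:Album", "cat:D", "cat:E"}))  # bad
```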


3.2     Participants’ Approaches

During the submission period, three teams completed the challenge by
submitting a solution to the online evaluation system and describing their approach
in a paper. In the following, we describe and compare the final approaches of
the participants. A summary of all approaches is given in Table 1.


Jedrzej Potoniec. Not-So-Linked Solution to the Linked Data Mining
Challenge 2016 [6]

      By Jedrzej Potoniec

In this approach, the authors extract features from several non-LOD datasets,
which are then used to build a Logistic Regression model for classification of
albums. To extract the features, the authors start by scraping the Wikipedia
pages for the given albums using the Scrapy tool6 . From the collected data, the
5
    http://www.rapidminer.com/
6
    http://scrapy.org/

authors focus on the album reviews and ratings. Furthermore, reviews and rat-
ings are collected from Amazon 7 and Discogs 8 , while MusicBrainz 9 is used to
obtain the number of users owning an album and its average score. The final
dataset contains 94 numerical attributes in total.
    To train the classification model, the authors use logistic regression, using
the RapidMiner platform. Before training the model, a Z-transformation is
performed on all attributes, so that each attribute has a mean of 0 and a standard
deviation of 1. The authors perform 10-fold cross-validation on the training dataset,
achieving an accuracy of 91.7%. This value is consistent with the 92.5% on the test set
reported by the challenge submission system.
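
The Z-transformation mentioned above can be illustrated with a short sketch. It is written in plain Python rather than RapidMiner, the function name is hypothetical, and the use of the population standard deviation is an assumption; the attribute values are made up.

```python
from statistics import mean, pstdev


def z_transform(column):
    """Rescale one numerical attribute to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Made-up values of a single attribute across four albums.
z = z_transform([2.0, 4.0, 6.0, 8.0])
print(z)
```

In a train/test setting, the mean and standard deviation would typically be estimated on the training data and then reused to transform the test data.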
    Furthermore, the authors provide some insights on the relevance of the fea-
tures for the classification task, based on the learned logistic regression coeffi-
cients for each attribute. For example, the results show that Metacritic ratings
highly correlate with ratings from other sources, like Pitchfork 10 , AllMusic 11 ,
Stylus 12 , and others.
    The code and the data can be found online13 .


Semih Yumusak. A Hybrid Method for Rating Prediction Using Linked
Data Features and Text Reviews [12]

   By Semih Yumusak, Emir Muñoz, Pasquale Minervini, Erdogan Dogdu, and
Halife Kodaz

In this approach, the authors use Linked Open Data features in combination
with album reviews to build seven different classification models. DBpedia is
used as a main source for LOD features. More precisely, the authors manually
select predicates that might be relevant for the given classification task. Along
with the direct predicate values, aggregate count features are used as well. Besides the
LOD features, the authors also use albums’ reviews retrieved from Metacritic as
textual features. The reviews are first preprocessed, i.e., lower-case transformation,
non-alphanumeric normalization, stopword removal, and stemming; then a
standard Bag-of-Words model is used to represent each review.
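
The preprocessing chain described above can be sketched as follows. This is only an illustration under stated assumptions: the stopword list is a tiny hypothetical one, and the suffix-stripping function is a crude stand-in for a real stemmer, not the one used by the authors.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the authors' list).
STOPWORDS = {"a", "an", "the", "is", "of", "and", "this"}


def crude_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def bag_of_words(review):
    """Lower-case, normalize non-alphanumerics, drop stopwords, stem, count."""
    text = review.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return Counter(crude_stem(t) for t in tokens)


bow = bag_of_words("This album is AMAZING: soaring melodies, amazing hooks!")
print(bow)
```

Each review is thus reduced to a term-count vector that can be combined with the LOD features.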
    Furthermore, the authors identify that discretizing some of the features leads
to better representation of the data, e.g., the award feature of an artist could be
marked as “high” if the number of awards is more than one, and “low” otherwise.
    In the next step, the authors experiment with seven different classification
models, i.e., linear SVM, KNN, RBF SVM, Decision Trees, Random Forest,
AdaBoost, and Naive Bayes. The hyperparameters for each model are determined
7
   http://www.amazon.com/
8
   https://www.discogs.com/
 9
   https://musicbrainz.org/doc/MusicBrainz_Database/Download
10
   http://pitchfork.com/
11
   http://www.allmusic.com/
12
   http://www.stylusmagazine.com/
13
   https://github.com/jpotoniec/LDMC2016

manually via incremental tests, based on results obtained on the training set. They
evaluate each model on the training dataset using 10-fold cross-validation. The
experiments were performed using the scikit-learn library14 . The best
performance on the training dataset is achieved using a linear SVM, with an accuracy
of 87.81%. Applying the same model to the test set yields an accuracy of 90.00%,
which suggests that the model does not overfit the training dataset. The
authors evaluate the relevance of the different feature groups separately, showing that
most of the models perform best when using both LOD and text-based
features.
    Furthermore, the authors provide some interesting observations about the
task. For example, “Bands are more successful than single artists”, “Shorter
albums are likely to be worse”, “The genre of the album indicates if the album
is good or bad”, and others.
    The source files, the crawler code and the reviews, the enriched knowledge
base in RDF, and the intermediate files are published as an open-source reposi-
tory15 .


Petar P. Can you judge a music album by its cover? [5]

     By Petar Petrovski and Anna Lisa Gentile

In this approach, the authors use an unconventional method for the task of
music album classification. They explore the potential role of music album cover
arts for the task of predicting the overall rating of music albums and investigate
if one can judge a music album by its cover alone. The proposed approach for
album classification consists of three main steps. Given a collection of music
albums, the authors first obtain the image of their cover art using DBpedia. Then,
using off-the-shelf tools, they obtain a feature vector representation of the images. In
the final step, a classifier is trained to label each album, exploiting only the
feature space obtained from its cover art.
    To extract the image features, the authors use the Caffe deep learning
framework [2], which also provides a collection of reference models that can be
used for retrieving image feature vectors. More precisely, the authors use the bvlc
model16 , which consists of five convolutional layers and three fully-connected
layers, and is trained on 1.2 million labeled images from the ILSVRC2012
challenge17 . To obtain features for each image, the output vectors of the
second fully-connected layer of the model are used. Such features capture different
characteristics of images, e.g., colors, shapes, edges, etc.
    To build a classification model, the authors use a linear SVM. The model
is evaluated on the training set using 10-fold cross-validation, achieving an accuracy
of 58.30%. The accuracy of the model on the test set is 60.31%. Hence, the results
14
   http://scikit-learn.org/
15
   https://github.com/semihyumusak/KNOW2016
16
   bvlc_reference_caffenet from caffe.berkeleyvision.org
17
   http://image-net.org/challenges/LSVRC/2012/


                Table 1: Comparison of the participants’ approaches.
Approach          Classification methods                                          Knowledge source                          Tools                              Score   Rank
Jedrzej Potoniec  Logistic Regression                                             Wikipedia, Amazon, Discogs, MusicBrainz   RapidMiner, Scrapy                 92.50%  1
Semih Yumusak     SVM, KNN, Decision Trees, Random Forest, AdaBoost, Naive Bayes  DBpedia, Metacritic                       scikit-learn                       90.00%  2
Petar P.          SVM                                                             DBpedia                                   RapidMiner, LOD extension, Caffe   60.31%  3



show that using only features extracted from the album cover arts is not sufficient
for the given classification task.
    The dataset is available online18 , along with the extracted feature vectors
and the processes used.


3.3     Meta Learner

We conducted a few more experiments in order to analyze the agreement of the three
submissions, as well as the headroom for improvement.
   For the agreement of the three submissions, we computed the Fleiss’ kappa
score [1], which is 0.373. This means that the three approaches do not agree
well on what makes good and bad albums. We also calculated
the Fleiss’ kappa score for the top two approaches, which is 0.687. This means
that there is a good, although not perfect, agreement between the top two approaches.
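
For readers who want to reproduce such an agreement analysis, the following is a minimal sketch of the standard Fleiss’ kappa computation, with each submission treated as one rater. The function name and the toy predictions are assumptions for illustration; the actual submission files are not reproduced here.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each holding one label per rater.

    Every item must be rated by the same number of raters. The value is
    undefined (division by zero) if all ratings fall into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_items  # mean observed agreement
    p_e = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Toy predictions of three approaches for four albums.
toy = [
    ["good", "good", "good"],
    ["bad", "bad", "bad"],
    ["good", "good", "bad"],
    ["bad", "bad", "good"],
]
print(fleiss_kappa(toy))
```

For the toy data, the observed agreement is 2/3 and the chance agreement is 1/2, giving a kappa of 1/3.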
   To exploit the advantages of the three approaches and mitigate their
disadvantages, we analyzed how a majority vote of the three submissions would perform.
The resulting accuracy is 90.00%, which is lower than that of the best solution
submitted. This shows that the majority vote is highly influenced by the low-scoring
submission using the image features, which does not outperform the baseline.
We also performed a weighted majority vote, using the achieved accuracy of each
approach as the weight. The resulting accuracy is 92.50%, which is the same as that of the
best solution submitted.
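
A voting scheme of this kind can be sketched as follows. The function name is hypothetical, and since the exact way the accuracies were turned into weights is not detailed here, the weights are simply passed in as a parameter; equal weights give the plain majority vote.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across the approaches."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# One album, predicted by three approaches (toy predictions).
predictions = ["good", "bad", "bad"]

# Plain majority vote: every approach counts equally.
print(weighted_vote(predictions, [1, 1, 1]))  # bad

# Hypothetical weights under which the first approach dominates.
print(weighted_vote(predictions, [3, 1, 1]))  # good
```

Note that a weighted vote only differs from the plain majority vote if one weight exceeds the sum of the other two; otherwise any two agreeing approaches outvote the third.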


4      Conclusion

In this paper, we have discussed the task, dataset, and results of the Linked
Data Mining Challenge 2016. The submissions show that Linked Open Data is a
18
     https://github.com/petrovskip/know-lod2016

useful source of information for data mining, and that it can help build good
predictive models.
     One problem to address in future editions is the presence of false predictors.
The dataset at hand, originating from Metacritic, averages several ratings on
albums into a final score. Some of the LOD datasets used by the competitors
contained a few of those original ratings, which means that they implicitly used
parts of the ground truth in their predictive models (which, to a certain extent,
explains the high accuracy values). Since all of the participants had access to
that information, a fair comparison of approaches is still possible; but in a real-
life setting, the predictive model would perform sub-optimally, e.g., when trying
to forecast the rating of a new music album.
     In summary, this year’s edition of the Linked Data Mining Challenge showed
some interesting cutting-edge approaches for using Linked Open Data in data
mining. As the dataset is publicly available, it can be used for benchmarking
future approaches as well.


Acknowledgements

We thank all participants for their interest in the challenge and their submissions.
The preparation of the Linked Data Mining Challenge and of this paper has been
partially supported by the German Research Foundation (DFG) under
grant number PA 2373/1-1 (Mine@LOD), and by long-term institutional support
of research activities by the Faculty of Informatics and Statistics, University of
Economics, Prague.


References

 1. Joseph L Fleiss and Jacob Cohen. The equivalence of weighted kappa and the intra-
    class correlation coefficient as measures of reliability. Educational and psychological
    measurement, 1973.
 2. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long,
    Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional ar-
    chitecture for fast feature embedding. In Proceedings of the ACM International
    Conference on Multimedia, pages 675–678. ACM, 2014.
 3. Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas,
    Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören
    Auer, and Christian Bizer. DBpedia – A Large-scale, Multilingual Knowledge Base
    Extracted from Wikipedia. Semantic Web Journal, 2013.
 4. Heiko Paulheim. Exploiting linked open data as background knowledge in data
    mining. In Workshop on Data Mining on Linked Open Data, 2013.
 5. Petar Petrovski and Anna Lisa Gentile. Can you judge a music album by its
    cover? In 5th Workshop on Knowledge Discovery and Data Mining meets Linked
    Open Data (Know@LOD), 2016.
 6. Jedrzej Potoniec. Not-so-linked solution to the linked data mining challenge 2016.
    In 5th Workshop on Knowledge Discovery and Data Mining meets Linked Open
    Data (Know@LOD), 2016.

 7. Petar Ristoski, Christian Bizer, and Heiko Paulheim. Mining the web of linked
    data with rapidminer. Web Semantics: Science, Services and Agents on the World
    Wide Web, 35:142–151, 2015.
 8. Petar Ristoski, Heiko Paulheim, Vojtěch Svátek, and Václav Zeman. The linked
    data mining challenge 2015. In 4th Workshop on Knowledge Discovery and Data
    Mining meets Linked Open Data (Know@LOD), 2015.
 9. Max Schmachtenberg, Christian Bizer, and Heiko Paulheim. Adoption of the linked
    data best practices in different topical domains. In The Semantic Web–ISWC 2014,
    pages 245–260. Springer, 2014.
10. Vojtěch Svátek, Jindřich Mynarz, and Heiko Paulheim. The linked data mining
    challenge 2014: Results and experiences. In 3rd International Workshop on Knowl-
    edge Discovery and Data Mining meets Linked Open Data, 2014.
11. Vojtěch Svátek, Jindřich Mynarz, and Petr Berka. Linked Data Mining Challenge
    (LDMC) 2013 Summary. In International Workshop on Data Mining on Linked
    Data (DMoLD 2013), 2013.
12. Semih Yumusak, Emir Muñoz, Pasquale Minervini, Erdogan Dogdu, and Halife
    Kodaz. A hybrid method for rating prediction using linked data features and text
    reviews. In 5th Workshop on Knowledge Discovery and Data Mining meets Linked
    Open Data (Know@LOD), 2016.