<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Linked Data Mining Challenge 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Petar Ristoski</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heikog@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojtech Svatek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vaclav Zeman</string-name>
          <email>vaclav.zemang@vse.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Economics, Prague, Department of Information and Knowledge Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Mannheim, Germany, Research Group Data and Web Science</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>The 2015 edition of the Linked Data Mining Challenge, conducted in conjunction with Know@LOD 2015, was the third edition of this challenge. This year's dataset was built from movie ratings, where the task was to classify well and badly rated movies. The solutions submitted reached an accuracy of almost 95%, which is a clear advance over the baseline of 60%. However, there is still headroom for improvement, as the majority vote of the three best systems reaches an even higher accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Picking up on two issues raised in past editions, namely using a dataset from
a popular domain and using a standard classification or regression task, we used
a dataset for movie rating prediction this year, instead of data from the public
procurement and research collaboration domains, as in the past editions.
Furthermore, the dataset was built as a standard two-class classification problem
with balanced data for both classes.</p>
      <p>The rest of this paper is structured as follows. Section 2 discusses the dataset
construction and the task to be solved. In Section 3, we discuss the entrants to the
challenge and their results. We conclude with a short summary and an outlook
on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Task and Dataset</title>
      <p>The 2015 edition of the challenge used a dataset built from movie
recommendations, turned into a two-class classification problem.</p>
      <sec id="sec-2-1">
        <title>Dataset</title>
        <p>The task concerns the prediction of movie reviews, i.e., "good" and "bad".
The initial dataset is retrieved from Metacritic.com3, which offers an average
rating of all-time reviews for a list of movies4. Each movie is linked to DBpedia
using the movie's title and the movie's director. The initial dataset contained
around 10,000 movies, from which we selected 1,000 movies from the top of the
list, and 1,000 movies from the bottom of the list. The ratings were used to divide
the movies into classes, i.e., movies with a score above 60 are regarded as "good"
movies, while movies with a score below 40 are regarded as "bad" movies. For
each movie we provide the corresponding DBpedia URI. The mappings can be
used to extract semantic features from DBpedia or other LOD repositories to be
exploited in the learning approaches proposed in the challenge.</p>
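        <p>The labeling rule above can be sketched in a few lines of Python; the function name is illustrative, not part of the challenge tooling:</p>

```python
def label_movie(metascore):
    """Map a Metacritic score to the challenge classes."""
    if metascore > 60:
        return "good"
    if metascore >= 40:
        return None  # mid-range movies were excluded from the dataset
    return "bad"

print(label_movie(80))  # good
print(label_movie(20))  # bad
```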
        <p>The dataset was split into training and test sets using a random stratified
80/20 split, i.e., the training dataset contains 1,600 instances, and the test dataset
contains 400 instances. The training dataset, which contains the target variable,
was provided to the participants to train predictive models. The test dataset,
from which the target label is removed, is used for evaluating the built predictive
models.</p>
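        <p>A stratified 80/20 split as described can be sketched in plain Python; the synthetic movie list below mirrors the 1,000/1,000 class balance of the challenge dataset:</p>

```python
import random

def stratified_split(instances, labels, test_ratio=0.2, seed=42):
    """Split instances so each class keeps the same proportion in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for inst, lab in zip(instances, labels):
        by_class.setdefault(lab, []).append(inst)
    train, test = [], []
    for lab, members in sorted(by_class.items()):
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend((inst, lab) for inst in members[:n_test])
        train.extend((inst, lab) for inst in members[n_test:])
    return train, test

# 1,000 "good" and 1,000 "bad" movies, as in the challenge dataset
movies = [f"movie_{i}" for i in range(2000)]
labels = ["good"] * 1000 + ["bad"] * 1000
train, test = stratified_split(movies, labels)
print(len(train), len(test))  # 1600 400
```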
      </sec>
      <sec id="sec-2-2">
        <title>Task</title>
        <p>The task concerns the prediction of movie reviews, i.e., "good" and "bad",
as a classification task. The performance of the approaches is evaluated with
respect to accuracy, calculated as:</p>
        <p>Accuracy = (#true positives + #true negatives) / (#true positives + #false positives + #false negatives + #true negatives)    (1)</p>
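        <p>In code, this measure amounts to the fraction of correctly labeled instances; a minimal Python sketch:</p>

```python
def accuracy(y_true, y_pred):
    """Fraction of correct predictions: (TP + TN) over all instances."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = ["good", "good", "bad", "bad"]
y_pred = ["good", "bad", "bad", "bad"]
print(accuracy(y_true, y_pred))  # 0.75
```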
      </sec>
      <sec id="sec-2-3">
        <title>Submission</title>
        <p>The participants were asked to submit the predicted labels for the instances in
the test dataset. The submissions were performed through an online submission
system. The users could upload their predictions and get the results instantly.
Furthermore, the results of all participants were made completely transparent
by publishing them on an online real-time leaderboard (Figure 1). The number
of submissions per user was not constrained.</p>
        <p>
          In order to foster the growth of Linked Open Data [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] as a side
effect of the challenge, we allowed users to also exploit non-LOD data sources,
given that they transform the datasets they use to RDF and provide them
publicly.
        </p>
        <sec id="sec-2-3-1">
          <title>3 http://www.metacritic.com/</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>4 http://www.metacritic.com/browse/movies/score/metascore/all</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The Linked Data Mining Challenge results</title>
      <p>In total, four parties participated in the challenge, three of which finally
submitted results and a paper. We compare those results against two baselines.</p>
      <sec id="sec-3-1">
        <title>Baseline Models</title>
        <p>
          We provide a simple classification model that serves as a baseline. The model
is implemented in the RapidMiner platform, using the Linked Open Data
extension [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ]. In this process we use the movies' DBpedia URIs to extract the direct
types and categories of each movie. On the resulting dataset we built a k-NN
classifier (k=3) and applied it to the test set, scoring an accuracy of 60.25%.
        </p>
        <p>In addition, we built the trivial model ZeroR, which simply predicts the
majority class. The model achieved an accuracy of 49.75%.</p>
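        <p>ZeroR is trivial to reproduce; a minimal Python sketch (the near-balanced label counts below are illustrative, not the actual challenge counts):</p>

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: always predict the most frequent class in the training data."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda _instance: majority

predict = zero_r(["good"] * 801 + ["bad"] * 799)
print(predict("any movie"))  # good
```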
      </sec>
      <sec id="sec-3-2">
        <title>Participants' Approaches</title>
        <p>During the submission period, four approaches participated in the challenge.
Finally, three teams completed the challenge by submitting a solution to the
online evaluation system and describing the used approach in a paper. In the
following, we describe and compare the final participants' approaches.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Topper. Utilizing the Open Movie Data Base for Predicting the Review Class of Movies [6]</title>
        <p>By Johann Schaible, Zeljko Carevic, Oliver Hopt, and Benjamin Zapilko (GESIS
- Leibniz Institute for the Social Sciences, Cologne, Germany)</p>
        <p>In this approach, the authors use features extracted from the Open Movie
Database5 (OMDB) to build several predictive models, and compare their
results. The OMDB database contains much information about the quality of a
movie. The authors extract the following features: number of awards, number
of nominations, IMDB movie ratings, IMDB number of votes, Rotten
Tomatoes Tomatometer, Tomato Rating, Tomato User Meter, Tomato User Rating,
and Tomato number of reviews. Furthermore, aggregation features are used,</p>
        <sec id="sec-3-3-1">
          <title>5 http://www.omdbapi.com/</title>
          <p>tomatoFreshRatio, which is calculated as the quotient of the number of "fresh"
Tomato ratings and the Tomatometer, and tomatoRottenRatio, which is the
quotient of the number of "rotten" Tomato ratings and the Tomatometer. The data
is converted into RDF, resulting in 36,020 RDF triples.6</p>
          <p>To build the predictive models, the authors use RapidMiner including the
RapidMiner Linked Open Data Extension. Moreover, they build three classifiers,
i.e., Naive Bayes, k-NN, and Decision Trees. To compare the performances of
the classifiers, the authors first perform 10-fold cross-validation on the training
dataset, using different combinations of features. The best results are achieved
when using the Decision Trees classifier with all features, scoring an accuracy of
97% on the training data. In comparison, the Naive Bayes classifier scored 95%,
and the k-NN classifier only 51% accuracy on the training data. The Decision
Trees classifier scored an accuracy of 94.75% on the test dataset, taking the first
place in the challenge. The decrease of 2% on the test set is explained by the
authors as an overfitting problem. The authors state that the reason for the
bad performance of the k-NN classifier might be that they did not normalize
the features, so the distance measure might have been dominated by features
with large scales.</p>
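          <p>The normalization issue is easy to see: an unscaled feature such as the IMDB vote count (hundreds of thousands) dwarfs a 0-10 rating in any Euclidean distance. A minimal min-max scaler sketch (the feature pairing below is illustrative, not the authors' exact attribute set):</p>

```python
def min_max_scale(rows):
    """Rescale each feature column to [0, 1] so no single scale dominates k-NN distances."""
    cols = list(zip(*rows))
    scaled_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = hi - lo if hi != lo else 1.0
        scaled_cols.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*scaled_cols)]

# [imdb_votes, imdb_rating] -- wildly different scales before rescaling
rows = [[500000, 8.1], [1200, 3.4], [250000, 6.7]]
print(min_max_scale(rows))
```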
          <p>Furthermore, the authors provide some insights on the relevance of the
features for the classification task. For example, the authors observe that
user-generated ratings, such as the IMDB score and the Tomato user rating/meter, provide
a 10 to 20 percent lower prediction accuracy than "official" critics' ratings like the
Tomatometer. Also, the number of awards won or nominated for provides a
decent prediction accuracy as well.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Meyer Bossert. Predicting Metacritic Film Reviews Using Linked</title>
      </sec>
      <sec id="sec-3-5">
        <title>Open Data and Semantic Technologies [2]</title>
        <p>By Meyer Bossert (Cray Inc., Seattle, Washington, USA)</p>
        <p>In his approach, the author solves the classification task by only using
SPARQL. To start with, using the Cray Urika GD7 graph appliance, the
author loads the complete DBpedia and Freebase datasets as well as the challenge
training and test datasets into a single graph. Next, all predicates irrelevant to
the task are removed from the graph. To implement the classification task, an
approach similar to the Naive Bayes method is used, i.e., for each attribute
associated with a movie, the author determines how often, on average, that
attribute is associated with a "good" or "bad" movie. The average of these
per-attribute values then serves as an indicator of the likelihood of a film
receiving positive or negative reviews. Furthermore, the author makes the
assumption that some specific properties, like awards, should get higher weights
than the rest of the properties. The code and the data can be found online8. This
approach scored an accuracy of 92.25%, taking the second place in the challenge.</p>
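        <p>The scoring scheme can be paraphrased in a few lines of Python (a simplification of the author's SPARQL approach, without the property weighting; the attribute names below are invented):</p>

```python
from collections import defaultdict

def attribute_scores(training):
    """For every attribute, the fraction of movies carrying it that are 'good'."""
    good = defaultdict(int)
    total = defaultdict(int)
    for attributes, label in training:
        for attr in attributes:
            total[attr] += 1
            if label == "good":
                good[attr] += 1
    return {attr: good[attr] / total[attr] for attr in total}

def classify(attributes, scores, default=0.5):
    """Average the per-attribute 'good' ratios; above 0.5 predicts 'good'."""
    vals = [scores.get(attr, default) for attr in attributes]
    avg = sum(vals) / len(vals)
    return "good" if avg > 0.5 else "bad"

training = [
    ({"award_winner", "drama"}, "good"),
    ({"drama", "sequel"}, "good"),
    ({"sequel", "video_game_adaptation"}, "bad"),
]
scores = attribute_scores(training)
print(classify({"award_winner", "drama"}, scores))  # good
```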
        <sec id="sec-3-5-1">
          <title>6 http://lod.gesis.org/gmovies/</title>
        </sec>
        <sec id="sec-3-5-2">
          <title>7 http://www.cray.com/products/analytics/urika-gd</title>
        </sec>
        <sec id="sec-3-5-3">
          <title>8 https://github.com/mabossert/LDMC 2015</title>
          <p>Furthermore, the author provides some interesting observations about the
task. For example, films featured at a film festival are disproportionately well
reviewed by critics; however, the experiments showed little correlation
between film festivals and good critical reviews, despite the fact that the
average percentage of good vs. bad films that had properties associated with
film festivals was 80.34% for the training dataset. Next, regardless of the film,
those that were identified as documentaries received overwhelmingly high praise
from critics. Finally, the author observes that it is slightly easier to predict the
review class of good movies than bad ones. The hypothesis is that the reason for
the imbalance is that good movies tend to have a wide variety of information
entered into DBpedia and Freebase, while bad movies tend to have less effort put
into their documentation.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>Emir Munoz. A Linked Data Based Decision Tree Classi er to Review</title>
      </sec>
      <sec id="sec-3-7">
        <title>Movies [1]</title>
        <p>By Suad Aldarra and Emir Muñoz (Insight Centre for Data Analytics, National
University of Ireland, Galway)</p>
        <p>In this approach, the authors use several sources to extract useful features and
build a decision tree classifier to predict the class of the movies. The features
used for building the predictive model are extracted from multiple Linked Open
Data sources, as well as semi-structured information from HTML pages. The
features were extracted from five different sources: DBpedia, Freebase, IMDB,
OMDB, and Metacritic. First, DBpedia is used to extract the categories (i.e.,
dcterms:subject ) of the movies and to explore the owl:sameAs links to Freebase. From
Freebase, personal information about actors and directors is retrieved, such as
genre, nationality, date of birth, and IMDB ID, among others. The IMDB ID is used
as a link to retrieve features from IMDB: actor, director, and movie awards,
movie budget, gross, common languages, countries, and IMDB keywords. The
authors use the OMDB API to query for further movie data, including MPAA
ratings. Finally, for each movie the authors collected textual critics' reviews
from the Metacritic website and applied an existing API for sentiment analysis
using NLTK9, which returns either a positive, negative, or neutral sentiment label
for a given text. In order to reduce the feature space, the authors applied
feature aggregation over actors, directors, and critics' reviews. The collected data
is transformed into RDF, resulting in 338,140 RDF triples10.</p>
        <p>The authors use the previously extracted features to build a C4.5 decision
tree, using the Weka11 J48 implementation. This approach scored an accuracy
of 91.75%, taking the third place in the challenge.</p>
        <p>Furthermore, the authors provide a solid analysis of the relevance of the
features for the classification task. The sentiment analysis over critics' reviews
generates the attributes with the highest information gain.</p>
        <sec id="sec-3-7-1">
          <title>9 http://text-processing.com/docs/sentiment.html</title>
          <p>10 https://github.com/emir-munoz/ldmc2015
11 http://www.cs.waikato.ac.nz/ml/weka/</p>
          <table-wrap id="tab1">
            <caption>
              <p>Results of the Linked Data Mining Challenge 2015</p>
            </caption>
            <table>
              <thead>
                <tr><th>Rank</th><th>Tool</th><th>Score</th></tr>
              </thead>
              <tbody>
                <tr><td>1</td><td>RapidMiner, LOD extension</td><td>94.75%</td></tr>
                <tr><td>2</td><td>Urika</td><td>92.25%</td></tr>
                <tr><td>3</td><td>Weka (J48)</td><td>91.75%</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>The negative-review sentiment attribute has an information gain of 0.71886 bits
and is thus selected as the root of the decision tree. Experiments removing all
sentiment features from the training data show that accuracy is reduced by ca. 9%,
while removing only the positive or only the negative sentiment features does not
affect the accuracy severely. That shows the relevance of sentiment analysis-based
features for this task, which are directly related to the taste of users.</p>
          <p>Movie keywords are the features with the next-highest information gain, and their
analysis provides interesting insights to be considered by writers and directors:
(i) bad movies are based on video games, with someone critically bashed, using a
taser, pepper spray, or hanged upside down, with a dark heroine involved; and (ii)
good movies include family relationships, frustration, crying, melancholy, very
little dialogue, and some sins with moral ambiguity.</p>
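          <p>Information gain over a discrete split, as used to rank these attributes, can be computed directly; a generic entropy-based sketch, not the authors' Weka run:</p>

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    probs = [labels.count(lab) / n for lab in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(labels, feature_values):
    """Entropy reduction from splitting the labels by a discrete feature."""
    n = len(labels)
    split = {}
    for lab, val in zip(labels, feature_values):
        split.setdefault(val, []).append(lab)
    remainder = sum(len(subset) / n * entropy(subset) for subset in split.values())
    return entropy(labels) - remainder

labels = ["good", "good", "bad", "bad"]
sentiment = ["pos", "pos", "neg", "neg"]  # a perfectly informative feature
print(information_gain(labels, sentiment))  # 1.0
```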
        </sec>
      </sec>
      <sec id="sec-3-8">
        <title>Meta Learner</title>
        <p>We ran a few more experiments in order to analyze the agreement of the three
submissions, as well as the headroom for improvement.</p>
        <p>
          For the agreement of the three submissions, we computed the Fleiss' kappa
score [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which is 0.757. This means that there is good, although not perfect,
agreement between the three approaches about what makes good and bad movies.
        </p>
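        <p>Fleiss' kappa for several systems labeling the same test instances can be computed directly from a count matrix; a small Python sketch with invented votes:</p>

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a matrix where ratings[i][j] is the number of raters
    assigning category j to item i (here: 3 raters, 2 categories)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # observed per-item agreement
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items
    # chance agreement from the overall category proportions
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# 4 test instances, 3 systems each voting "good" (column 0) or "bad" (column 1)
votes = [[3, 0], [3, 0], [0, 3], [2, 1]]
print(round(fleiss_kappa(votes), 3))  # 0.625
```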
        <p>To exploit the advantages of the three approaches, and mitigate their
disadvantages, we analyzed how a majority vote of the three submissions would perform.
The accuracy totals 97%, which is higher than the best solution
submitted. This shows that there is still headroom for improvement by combining the
different approaches pursued by the challenge participants.</p>
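        <p>The majority-vote combination itself is a one-liner per instance; a sketch with made-up predictions from three systems:</p>

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-instance labels from several systems by majority."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

system_a = ["good", "good", "bad", "good"]
system_b = ["good", "bad", "bad", "good"]
system_c = ["bad", "good", "bad", "bad"]
print(majority_vote([system_a, system_b, system_c]))  # ['good', 'good', 'bad', 'good']
```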
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we have discussed the task, dataset, and results of the Linked
Data Mining Challenge 2015. The submissions show that Linked Open Data
is a useful source of information for data mining, and that it can help to build
good predictive models. On the other hand, the experiment with majority voting
shows that there is still some headroom for improvement.</p>
      <p>One problem to address in future editions is the presence of false predictors.
The dataset at hand, originating from Metacritic, averages several ratings on
movies into a final score. Some of the LOD datasets used by the competitors
contained a few of those original ratings, which means that they implicitly used
parts of the ground truth in their predictive models (which, to a certain extent,
explains the high accuracy values). Since all of the participants had access to
that information, a fair comparison of approaches is still possible; but in a
real-life setting, the predictive model would perform sub-optimally, e.g., when trying
to forecast the rating of a new movie.</p>
      <p>In summary, this year's edition of the Linked Data Mining challenge showed
some interesting cutting-edge approaches for using Linked Open Data in data
mining. As the dataset is publicly available, it can be used for benchmarking
future approaches as well.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>We thank all participants for their interest in the challenge and their submissions.
The preparation of the Linked Data Mining Challenge and of this paper has been
partially supported by the German Research Foundation (DFG) under
grant number PA 2373/1-1 (Mine@LOD), and by long-term institutional support
of research activities by the Faculty of Informatics and Statistics, University of
Economics, Prague.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Suad</given-names>
            <surname>Aldarra</surname>
          </string-name>
          and
          <article-title>Emir Muñoz. A linked data-based decision tree classifier to review movies</article-title>
          .
          <source>In 4th International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Meyer</given-names>
            <surname>Bossert</surname>
          </string-name>
          .
          <article-title>Predicting metacritic film reviews using linked open data and semantic technologies</article-title>
          .
          <source>In 4th International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Joseph L Fleiss and
          <string-name>
            <surname>Jacob Cohen</surname>
          </string-name>
          .
          <article-title>The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability</article-title>
          .
          <source>Educational and psychological measurement</source>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          , Petar Ristoski, Evgeny Mitichkin, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Data mining with background knowledge from the web</article-title>
          . In RapidMiner World,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Petar</given-names>
            <surname>Ristoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Mining the web of linked data with rapidminer</article-title>
          .
          <source>In Semantic Web challenge at ISWC</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Johann</given-names>
            <surname>Schaible</surname>
          </string-name>
          , Zeljko Carevic, Oliver Hopt, and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Zapilko</surname>
          </string-name>
          .
          <article-title>Utilizing the open movie database api for predicting the review class of movies</article-title>
          .
          <source>In 4th International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Max</given-names>
            <surname>Schmachtenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Adoption of the linked data best practices in different topical domains</article-title>
          .
          <source>In The Semantic Web - ISWC</source>
          <year>2014</year>
          , pages
          <fpage>245</fpage>
          -
          <lpage>260</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Vojtech</given-names>
            <surname>Svatek</surname>
          </string-name>
          , Jindrich Mynarz, and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>The linked data mining challenge 2014: Results and experiences</article-title>
          .
          <source>In 3rd International Workshop on Knowledge Discovery and Data Mining meets Linked Open Data</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Vojtech</given-names>
            <surname>Svatek</surname>
          </string-name>
          , Jindrich Mynarz, and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Berka</surname>
          </string-name>
          .
          <article-title>Linked Data Mining Challenge (LDMC) 2013 Summary</article-title>
          . In
          <source>International Workshop on Data Mining on Linked Data (DMoLD</source>
          <year>2013</year>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>