<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Linked Data-Based Decision Tree Classifier to Review Movies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suad Aldarra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emir Muñoz</string-name>
          <email>Emir.Munozg@ie.fujitsu.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fujitsu Ireland Limited</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Insight Centre for Data Analytics, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our contribution to the 2015 Linked Data Mining Challenge. The proposed task concerns the prediction of movie reviews as "good" or "bad", as done by the Metacritic website based on critics' reviews. First, we describe the sources used to build the training data. Although several sources on the Web provide data about movies in different formats, including RDF, data from HTML pages had to be gathered to fulfill some of our features. We then describe our experiment training a decision tree model on 241 features derived from our RDF knowledge base, achieving an accuracy of 0.94.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper we describe the method used in our submission to the 2015 Linked
Data Mining Challenge (http://knowalod2015.informatik.uni-mannheim.de/en/linkeddataminingchallenge/)
at the Know@LOD Workshop. The challenge proposes
the task of predicting whether a movie is "good" or "bad" based on the values of its
RDF properties. These labels are the ones used by the Metacritic website
(http://www.metacritic.com/), based on critics' reviews submitted to their system. Metacritic originally
uses three categories based on the critics' scores: positive, negative, and mixed, according to
a score ranging from 0 to 100. For simplicity, in this challenge only two classes
are used: movies with a score above 60 are regarded as "good", while
movies with a score below 40 are regarded as "bad". To achieve this goal we
learn a decision tree classifier [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which can efficiently assign a binary label to
incoming unlabeled/unseen movies.
      </p>
      <p>To design our classifier, we solved two main challenges: 1) the
collection/transformation of relevant data about movies, and 2) the design of features from
RDF data to train our classifier. We addressed the two challenges in this work
with an estimated 70-30% effort split, respectively. First, we collected data from several
sources, including HTML pages, and converted it to RDF. Second, we used SPARQL
queries to generate a suitable data format for the learning process.</p>
      <p>In the remainder of this paper, we describe how we address both challenges:
the construction of our RDF knowledge base, feature extraction,
and the experiment to learn the decision tree with its corresponding evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>RDF knowledge base construction</title>
      <p>The provided data comprises 2,000 movies along with their name, release date,
DBpedia URI, class (good/bad), and ID. From the data, 80% (1,600 movies) is used
during the training step, and 20% (400 movies) during the testing step. The DBpedia
URIs are used to access the LOD cloud to collect further data about the movies.
Although several LOD datasets contain relevant data for this task, namely
DBpedia (http://dbpedia.org/), LinkedMDB (http://www.linkedmdb.org/), and
Freebase (http://www.freebase.com/), none of them contains high-quality, complete,
and up-to-date data in one place. Thus, we were forced to build our own RDF
knowledge base, gathering facts from different RDF sources plus other
(semi/un-)structured data sources. The final list of sources included in our knowledge
base comprises: IMDB (http://www.imdb.com/), OMDB (http://www.omdbapi.com/), Metacritic, Freebase, and DBpedia.</p>
      <p>We start by retrieving dcterms:subject values for a movie from DBpedia. We
use DBpedia sameAs links to Freebase to get a movie's IMDB ID. Movie data
(e.g., year, release date, genre, director, starring, MPAA rating) was collected from
OMDB in JSON format and then converted into RDF programmatically. We
queried OMDB using the movie's IMDB ID instead of the provided movie title,
since the search was more accurate in most cases. We retrieved data about
actors and directors from Freebase using OpenRefine (http://openrefine.org/). Thus, we could collect
personal information about actors and directors, such as gender, nationality, date
of birth, and IMDB ID, among others. Other information was extracted from IMDB:
actors', directors', and movies' awards, movie budgets, gross, common languages,
and countries. For each movie, we also extracted its IMDB keywords, which are
later used to determine common keywords among good and bad movies.</p>
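      <p>The JSON-to-RDF conversion described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the predicate namespace, the subject IRI pattern, and the example record and IMDB ID are assumptions for illustration, not the vocabulary actually used in our knowledge base.</p>

```python
import json

# Illustrative namespace; the actual vocabulary of the knowledge base differs.
MOVIE_NS = "http://example.org/movie/"


def omdb_json_to_ntriples(imdb_id: str, payload: str) -> list:
    """Convert an OMDB-shaped JSON record into N-Triples lines."""
    record = json.loads(payload)
    subject = "<http://www.imdb.com/title/%s>" % imdb_id
    triples = []
    for key, value in record.items():
        predicate = "<%s%s>" % (MOVIE_NS, key)
        # OMDB returns comma-separated lists for fields like Genre; split them
        # so each value becomes its own triple.
        for part in str(value).split(", "):
            escaped = part.replace('"', '\\"')
            triples.append('%s %s "%s" .' % (subject, predicate, escaped))
    return triples


# Hand-made record in OMDB's general shape (illustrative values).
sample = '{"Title": "Amores Perros", "Year": "2000", "Genre": "Drama, Thriller"}'
for line in omdb_json_to_ntriples("tt0245712", sample):
    print(line)
```

      <p>Splitting multi-valued fields at conversion time keeps the knowledge base queryable with plain triple patterns instead of string matching.</p>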
      <p>Finally, for each movie we collected textual critics' reviews from the Metacritic
website and applied an existing NLTK-based sentiment analysis API
(http://text-processing.com/docs/sentiment.html), which
returns a positive, negative, or neutral sentiment label for a given text.</p>
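      <p>The per-review labels are later aggregated into the percentage features of Section 3.1. A minimal sketch of that mapping and aggregation; the shape of the probability response and the neutrality threshold are assumptions for illustration, not the API's documented behavior.</p>

```python
def sentiment_label(probabilities: dict, neutral_band: float = 0.2) -> str:
    """Map class probabilities to a positive/negative/neutral label.

    `probabilities` mimics a {"pos": ..., "neg": ..., "neutral": ...}
    response; the field names and the neutral threshold are assumptions.
    """
    pos = probabilities.get("pos", 0.0)
    neg = probabilities.get("neg", 0.0)
    if probabilities.get("neutral", 0.0) >= max(pos, neg):
        return "neutral"
    # Treat near-ties between positive and negative as neutral.
    margin = pos - neg
    if abs(margin) < neutral_band:
        return "neutral"
    return "positive" if margin > 0 else "negative"


def review_percentages(labels: list) -> dict:
    """Aggregate per-review labels into the three percentage features."""
    n = len(labels) or 1
    return {c: 100.0 * labels.count(c) / n
            for c in ("positive", "negative", "neutral")}


print(sentiment_label({"pos": 0.78, "neg": 0.12, "neutral": 0.10}))  # → positive
print(review_percentages(["positive", "positive", "negative", "neutral"]))
```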
      <p>Our resulting RDF knowledge base comprises 338,140 RDF triples, which are
accessed using SPARQL queries to generate the set of features used to train a decision
tree model. (All data in RDF, the decision tree model and diagram, and the feature
vectors are available at https://github.com/emir-munoz/ldmc2015.)</p>
    </sec>
    <sec id="sec-3">
      <title>Experiment</title>
      <p>In the following we present our experimental setup to train and evaluate the
proposed decision tree. Figure 1 shows a flow diagram of the data and processes
involved. In order to train a decision tree classifier, we first define a set of features
to be extracted from our RDF knowledge base (Movies DB). Movies DB is stored
in a Virtuoso Server running on a CentOS Linux virtual machine (with a 4.0 GHz
CPU and 7.5 GB of RAM), and queried via HTTP.</p>
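      <p>Querying a Virtuoso server over HTTP can be sketched as follows with only the standard library. The endpoint URL below assumes Virtuoso's conventional /sparql path and default port 8890; the count query is a placeholder, not one of our feature queries.</p>

```python
import urllib.parse
import urllib.request


def sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build an HTTP GET request against a SPARQL endpoint.

    The query goes in the standard `query` parameter; JSON results are
    requested via the Accept header.
    """
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})


query = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"
req = sparql_request("http://localhost:8890/sparql", query)
print(req.full_url)
# To execute:  bindings = json.load(urllib.request.urlopen(req))["results"]["bindings"]
```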
      <sec id="sec-3-1">
        <title>3.1 Feature set</title>
        <p>
          Once the RDF KB was finished, we defined a set of 241 features. Our features
mix continuous (numerical) and dichotomous (categorical) types, both of which
can be handled by the C4.5 algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The following list summarizes the features
used in this work (* = the feature considers the release/record date of the movie):
- dcterms:subject values
- genres of a movie
- countries of a movie
- languages of a movie
- MPAA rating
- # of directors' Oscar/Golden Globe awards won/nominated (*)
- # of actors' Oscar/Golden Globe awards won/nominated (*)
- runtime
- release week/weekend day
- # of bad/good/neutral/mostly-good/mostly-bad keywords
- # of female/male actors
- # of directors younger than 30 (*)
- # of directors between 30 and 50 (*)
- # of directors older than 50 (*)
- # of actors younger than 30 (*)
- # of actors between 30 and 50 (*)
- # of actors older than 50 (*)
- is the movie from a common country?
- is the movie in a common language?
- low or high amount of budget?
- is the gross higher than the budget?
- % of positive critics' reviews
- % of negative critics' reviews
- % of neutral critics' reviews
- is the movie based on a book?
- is the movie a sequel?
- is the movie an independent film?
        </p>
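        <p>Several of the features above are counts over age buckets, evaluated at the movie's release/record date. A minimal sketch of that bucketing; the treatment of the boundary ages 30 and 50 is an assumption, since the original list does not specify it.</p>

```python
def age_bucket_counts(ages, role="actors"):
    """Count people per age bucket, mirroring the three list features:
    younger than 30, between 30 and 50 (boundaries included here, an
    assumption), older than 50."""
    buckets = {"younger than 30": 0, "between 30 and 50": 0, "older than 50": 0}
    for age in ages:
        if age < 30:
            buckets["younger than 30"] += 1
        elif age <= 50:
            buckets["between 30 and 50"] += 1
        else:
            buckets["older than 50"] += 1
    # Name the features after the role, e.g. "# actors younger than 30".
    return {"# %s %s" % (role, name): count for name, count in buckets.items()}


print(age_bucket_counts([27, 34, 61], role="actors"))
```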
        <p>Features are extracted from the data using SELECT and ASK
SPARQL queries. For instance, the query below gets the age value
for each actor involved in the movie "Amores Perros"; these values are
then used to generate three of our features. A similar query is
performed to get the ages of directors.</p>
        <p>SELECT ?age WHERE {
  dbr:Amores_perros rdf:type dbo:Film .
  dbr:Amores_perros dbp:recorded ?recorded .
  dbr:Amores_perros dbp:starring ?actor .
  ?actor dbp:dateOfBirth ?dob .
  BIND (?recorded - YEAR(?dob) AS ?age)
}</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Learning process</title>
        <p>After all features are extracted for both the train and test sets, we use the J48
classifier, the Weka implementation of the C4.5 algorithm. The decision tree settings
enable pruning of the tree, with a confidence factor equal to 0.25.</p>
        <p>[Figure 2: (a) confusion matrix; (b) accuracy metric.]</p>
        <p>Using the equation in Figure 2b to compute the accuracy of our system, we
achieve Acc = 0.94 on the train set. The challenge system reports Acc = 0.9175
for our submission on the test set.</p>
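        <p>The accuracy metric of Figure 2b is the standard ratio of correct predictions over all predictions. A small sketch; the confusion-matrix counts in the example are hypothetical numbers chosen only to illustrate a 400-movie test set at the reported accuracy, not our actual per-class results.</p>

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy: correct predictions (true positives plus true
    negatives) over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical split of a 400-movie test set (367 correct, 33 wrong).
print(accuracy(tp=190, tn=177, fp=23, fn=10))
```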
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have described our submission to the 2015 Linked Data Mining Challenge,
presenting a decision tree classifier to solve the movie review prediction problem.
We trained this decision tree on 1,600 examples, with input features
extracted from a purpose-built RDF knowledge base using SPARQL queries.</p>
      <p>
        In order to reduce the feature space, feature aggregation was applied over
actors, directors, and critics' reviews. The sentiment analysis over critics' reviews
generates the attributes with the highest information gain [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Negative critics' reviews have an
information gain of 0.71886 bits, and were thus selected as the root of the decision tree.
Experiments removing all sentiment features from the training show that accuracy
is reduced by ca. 9%, while removing only the positive or only the negative ones does not affect the
accuracy severely. This shows the relevance of sentiment analysis-based features
for this task, which are directly related to the taste of users.
      </p>
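      <p>Information gain, as used above to rank attributes, is the reduction in class entropy achieved by splitting on a feature. A minimal sketch of the standard computation; the counts in the example are illustrative, not taken from our data.</p>

```python
import math


def entropy(pos, neg):
    """Shannon entropy (in bits) of a binary class distribution."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(parent, splits):
    """Information gain of a split: parent entropy minus the
    size-weighted entropy of the child nodes.

    `parent` and each entry of `splits` are (good_count, bad_count) pairs.
    """
    total = sum(sum(s) for s in splits)
    weighted = sum(sum(s) / total * entropy(*s) for s in splits)
    return entropy(*parent) - weighted


# A perfectly separating binary feature recovers the full parent entropy:
print(information_gain((8, 8), [(8, 0), (0, 8)]))  # → 1.0
```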
      <p>Movie keywords are the next features with the highest information gain, and their
analysis provides interesting insights to be considered by writers and directors:
a) bad movies are based on video games, with someone critically bashed, using a
taser, pepper spray, or hanged upside down, with a dark heroine involved; and b)
good movies include family relationships, frustration, crying, melancholy, very
little dialogue, and some sins with moral ambiguity; yes, people like drama.</p>
      <p>Acknowledgments. This work has been supported by the KI2NA project funded by
Fujitsu Laboratories Limited and the Insight Centre for Data Analytics at NUI Galway.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Quinlan</surname>
          </string-name>
          , J.:
          <article-title>Simplifying decision trees</article-title>
          .
          <source>International Journal of Man-Machine Studies</source>
          <volume>27</volume>
          (
          <issue>3</issue>
          ) (
          <year>1987</year>
          )
          <fpage>221</fpage>
          -
          <lpage>234</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Data Mining: Practical Machine Learning Tools and Techniques. 3rd edn</article-title>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Basuroy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chatterjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravid</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          :
          <article-title>How Critical Are Critical Reviews? The Box Office Effects of Film Critics, Star Power, and Budgets</article-title>
          .
          <source>Journal of Marketing</source>
          <volume>67</volume>
          (
          <issue>4</issue>
          ) (
          <year>October 2003</year>
          )
          <fpage>103</fpage>
          -
          <lpage>117</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>