<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Utilizing the Open Movie Database API for Predicting the Review Class of Movies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johann Schaible</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zeljko Carevic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Hopt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Zapilko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GESIS - Leibniz Institute for the Social Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present our contribution to the Linked Data Mining Challenge 2015. Our approach predicts the review class of movies using external data from the Open Movie Database API (OMDb API). We select specific features, such as movie ratings and box office information, that are very likely to describe the quality of a movie. With RapidMiner we utilize these features and apply three basic classification algorithms to train and validate the prediction model using a 10-fold cross-validation. The results of our evaluation are interesting in a two-fold way: (i) few movie ratings from professional critics provide a higher accuracy (accuracy 0.94) than many ratings from users (accuracy 0.7), and (ii) the Decision Tree classifier (accuracy 0.83) outperforms Naive Bayes (accuracy 0.73), whereas k-NN is not suitable at all (accuracy 0.53).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the Linked Data Mining Challenge 2015¹, participants were asked to predict a movie’s
review class, i.e. to identify whether a movie is labeled as good or as bad.
The training set contains solely the movie title, its release date, its DBpedia² URI, as
well as the actual label.</p>
      <p>Instead of developing sophisticated data mining algorithms or adapting existing
ones to the challenge task, we focus on selecting and calculating specific features out
of particular data sets that can be used by state-of-the-art classification algorithms to
provide a statement about a movie’s quality. In detail, we extend the training set with
the publicly available data from the Open Movie Database API³ (OMDb API),
containing various movie ratings and box office information. With RapidMiner⁴, we apply the
Naive Bayes, the k-NN, and the Decision Tree classifier to train and evaluate the
prediction model using a 10-fold cross-validation. Each feature is evaluated both alone and
in combination with the others.</p>
      <p>In the following section, we describe the utilized data set and our evaluation setup.
We present and discuss our results in detail in Section 3.</p>
      <p>¹ http://knowalod2015.informatik.uni-mannheim.de/en/linkeddataminingchallenge/
² http://de.dbpedia.org/
³ http://www.omdbapi.com/
⁴ https://rapidminer.com/</p>
    </sec>
    <sec id="sec-2">
      <title>Our Approach</title>
      <sec id="sec-2-1">
        <title>Extending the Data</title>
        <p>
          The information retrieved from the OMDb API makes it possible to assess a
movie’s quality. For example, it states how many awards a movie has won or was
nominated for, and includes movie ratings such as the IMDB⁵ rating and several Rotten
Tomatoes⁶ ratings. The information also contains Metacritic’s⁷ Metascore, which is
used as ground truth for the challenge. However, we did not make use of the Metascore
for tuning the prediction model in any way. The API allows querying the data source
by various criteria, of which we used the movie title and release year. All its content is
licensed under the Creative Commons Attribution 4.0 International Public License.⁸ Hence,
we were allowed to publish the relevant parts of this data as Linked Data, which we
did by following the guidelines of Heath and Bizer [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. To express the full semantics
of the data, we had to define a few datatype properties of our own under the namespace
gmovies: http://lod.gesis.org/gmovies/. Listing 1.1 shows an excerpt
of the published data in Turtle syntax for the movie “The Godfather”.
&lt;http://lod.gesis.org/gmovies/The_Godfather&gt;
  a &lt;http://dbpedia.org/ontology/Film&gt; ;
  dcterms:title "The Godfather" ;
  owl:sameAs &lt;http://dbpedia.org/resource/The_Godfather&gt; ;
  ...
  gmovies:numberOfAwards "52" ;
  gmovies:tomatoMeter "99" ;
  gmovies:tomatoFreshRatio "0.9879518072289156" ;
  gmovies:tomatoRottenRatio "0.012048192771084338" ;
  rdfs:seeAlso &lt;http://www.omdbapi.com/?t=The+Godfather&amp;y=1972&amp;tomatoes=true&gt; ;
  foaf:page &lt;http://www.imdb.com/title/tt0068646&gt; .
        </p>
        <p>Listing 1.1. An excerpt of the RDF representation of the data from the OMDb API for the movie
“The Godfather” in Turtle syntax.</p>
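        <p>The data above was retrieved by querying the OMDb API by movie title and release year. As a minimal sketch of how such a request URL can be built (note that the present-day API additionally requires an apikey parameter, which the challenge-era API did not):

```python
from urllib.parse import urlencode

def omdb_query_url(title, year):
    """Build an OMDb API lookup URL by title and release year.

    The tomatoes=true flag requests the Rotten Tomatoes fields shown
    in Listing 1.1; the current API also expects an apikey parameter.
    """
    params = {"t": title, "y": year, "tomatoes": "true"}
    return "http://www.omdbapi.com/?" + urlencode(params)
```

For “The Godfather” (1972) this yields the URL referenced via rdfs:seeAlso in the listing.
        </p>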
        <p>The title, release date, and DBpedia link are obtained from the data provided by the
challenge organizers. The various ratings, meters, and the IMDB page are retrieved directly from the
OMDb API. We additionally defined the metrics numberOfAwards, tomatoFreshRatio,
and tomatoRottenRatio. The number of awards counts the awards the movie has won or
was nominated for. The two ratios are based on the Tomatometer and are defined as
the number of critics rating a movie as fresh or rotten, respectively, divided by the total number of
critics.</p>
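        <p>The two ratios can be computed directly from the critic counts, as this small sketch shows:

```python
def tomato_ratios(fresh, rotten):
    """Fresh and rotten ratios as defined above: the number of critics
    rating a movie fresh (or rotten) divided by the total number of critics."""
    total = fresh + rotten
    return fresh / total, rotten / total
```

The values in Listing 1.1 correspond to 82 fresh and 1 rotten critic review (82/83 ≈ 0.988).
        </p>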
        <p>The RDF representation of the additional data for all movies contained in the
challenge is published as LOD.⁹ The example for the movie “Skyfall”¹⁰ illustrates the
individual datatype properties and the possibility to download the data as Turtle or
RDF/XML, or to query it via a SPARQL endpoint.</p>
        <p>⁵ http://www.imdb.com/
⁶ http://www.rottentomatoes.com/
⁷ http://www.metacritic.com/
⁸ http://creativecommons.org/licenses/by/4.0/legalcode
⁹ http://lod.gesis.org/gmovies/
¹⁰ http://lod.gesis.org/pubby/page/gmovies/Skyfall</p>
      </sec>
      <sec id="sec-2-2">
        <title>Evaluation Setup</title>
        <p>
          To train and evaluate the prediction model, we used RapidMiner Studio (free
edition). The Linked Open Data Extension¹¹ was used to query the previously defined
RDF data from the OMDb API. Subsequently, the extended data set is forwarded to the
RapidMiner process X-Validation, i.e. the built-in 10-fold cross-validation. Three
different prediction models were trained and evaluated, based on Naive Bayes, k-NN, and
a Decision Tree classifier. For better comparability of the additional features, we used a
3-NN classifier, as provided by the baseline of the challenge. For the Naive Bayes
classifier we used the Laplace correction to prevent a high influence of zero probabilities.
Apart from that, we used RapidMiner’s default settings for each classifier. RapidMiner’s
decision tree learner works similarly to Quinlan’s C4.5 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] or CART [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
with a maximal depth of 20. The criterion determining the type of the tree is set to
“gain ratio”.
        </p>
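        <p>The cross-validation scheme itself is a standard procedure. As an illustration of the scheme (not of RapidMiner’s exact implementation, which for example samples its folds differently), a minimal pure-Python sketch with hypothetical train_fn and predict_fn callables:

```python
def kfold_split(data, k=10):
    """Yield (train, test) pairs for a k-fold cross-validation:
    item i goes to fold i mod k; each fold serves once as the test set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def cross_val_accuracy(data, train_fn, predict_fn, k=10):
    """Average accuracy over k folds; data is a list of (features, label)
    pairs, train_fn builds a model, predict_fn labels a feature vector."""
    accs = []
    for train, test in kfold_split(data, k):
        model = train_fn(train)
        hits = sum(1 for feats, label in test if predict_fn(model, feats) == label)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)
```

Each of the three classifiers would be plugged in as a train_fn/predict_fn pair and scored over the same ten folds.
        </p>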
        <p>The entire RapidMiner process as well as the XML test set, including the predicted
labels, can be downloaded at the GESIS data repository service “datorium”.¹²</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>Results. The results of the 10-fold cross-validation on the training data set are shown
in Table 1. It lists the accuracy ACC of each combination of the three classifiers
and the various features, which is defined as follows:</p>
      <p>ACC = (Σ true positives + Σ true negatives) / (Σ true positives + Σ false positives + Σ true negatives + Σ false negatives)</p>
      <p>The 3-NN classifier did not reach a prediction accuracy of 60%, whereas the Naive
Bayes (ACC 0.73) and the Decision Tree (ACC 0.83) approaches performed
quite well. Regarding the overall precision and recall when predicting the labels good
and bad, the results are as follows: The 3-NN has a precision of p = 0.51 and a recall
of r = 0.99 when predicting the label good. When predicting the label bad, it has a
precision of p = 0.64 and a recall of r = 0.1. The Naive Bayes classifier has in both
cases a precision of p = 0.95 and a recall of r = 0.93. Finally, RapidMiner’s Decision Tree
has a precision of p = 0.86 and a recall of r = 0.92 when predicting the label good. When
predicting the label bad, its precision is p = 0.9 and its recall r = 0.84.</p>
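      <p>The accuracy, precision, and recall values above follow directly from the confusion counts of a classifier; as a minimal sketch of the definitions:

```python
def accuracy(tp, fp, tn, fn):
    # ACC: correctly labeled items (true positives plus true negatives)
    # divided by the total number of predictions, as defined above.
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    # Share of items predicted with a label that truly carry it.
    return tp / (tp + fp)

def recall(tp, fn):
    # Share of items truly carrying a label that are predicted with it.
    return tp / (tp + fn)
```

      </p>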
      <p>Considering the different features, with an accuracy of about 90% the
Tomatometer/rating scores provide a better accuracy than the user-generated Tomato scores or
the IMDB score (accuracy between 70% and 80%). The box office information (ACC
50%) does not seem to provide an appropriate feature for predicting a movie’s review
class at all. Winning awards or being nominated for awards indicates the label of a
movie, but with about 65% it does not make as clear a statement as the other features.</p>
      <p>
        Discussion. The submitted configuration, which achieved an accuracy of
ACCt = 0.97 on the training data, reached only an accuracy of ACCe = 0.95 on
the evaluation set. Such overfitting is quite typical for Decision Tree algorithms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
¹¹ http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
¹² http://dx.doi.org/10.7802/78
      </p>
      <sec id="sec-3-1">
        <title>Table 1</title>
        <p>Prediction accuracy of the Naive Bayes, 3-NN, and Decision Tree classifiers for each feature: awards won or nominated for; box office information; IMDB rating; IMDB rating + number of votes; Tomatometer; Tomato rating; Tomato fresh/rotten ratio; Tomato user meter; Tomato user rating; Tomato user rating + number of reviews; and all of the above features combined.</p>
      </sec>
      <sec id="sec-3-3">
        <p>
          As the model is trained by maximizing its prediction performance, the number and
performance of the provided features might lead to memorizing the training data, which
decreases the prediction performance on new and previously unseen data. However,
decision trees use the “divide and conquer” method, so they tend to perform well on a few
highly relevant features, as in our use case [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In contrast, Naive Bayes takes
features and values into account that Decision Trees have already eliminated [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
However, as we use only a small number of features, the Naive Bayes approach cannot
exploit this advantage to outperform the Decision Tree algorithm. The k-NN classifier
cannot make use of the additional features in the way the other classifiers can. Taking a
closer look at its predictions, we observed that the 3-NN classifier predicted the label good
in over 95% of the cases. This observation correlates with the precision and recall values of
its predictions. The most probable reason for this is the dimensionality of the data and the
missing normalization of the distances between single data values. In detail, we did not normalize the data, so
that the distance measure might have been dominated by features with a large scale [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
Thus, the various features did not play a similar role in determining the distance, so that
no good prediction could be produced.
        </p>
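        <p>The suspected failure mode, distances dominated by large-scale features such as the number of votes, can be countered by min-max scaling of each feature column to the unit range before computing the k-NN distances; a sketch of such a normalization:

```python
def min_max_normalize(rows):
    """Rescale each feature column to the range 0..1 so that no single
    large-scale feature (e.g. a vote count) dominates a k-NN distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) or 1 for c in cols]  # guard constant columns
    return [
        tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]
```

After this rescaling, a rating on a 0..10 scale and a vote count in the hundreds of thousands contribute comparably to the distance.
        </p>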
        <p>Regarding the various rating features, we observed that user-generated ratings, such
as the IMDB score and the Tomato user rating/meter, provide a 10 to 20 percent lower
prediction accuracy than “official” critics like the Tomatometer. The reason for this
is that Metacritic’s Metascore, which is used as ground truth, is a weighted average of
scores from top critics.¹³ These critics are 30 to 50 writers from the most recognizable
journals in the movie industry. Similarly, the Tomatometer reflects the percentage
of up to 200 critics, who are involved in print or online publications and maintain a
certain level of quality and consistency.¹⁴ Thus, a high Tomatometer value reflects the
Metascore better than a high user-generated IMDB value. Regarding the other features,
the number of won or nominated awards provides a decent prediction accuracy as
well. Distinguishing between a movie having no award nomination and a missing value
might increase the accuracy, as currently the value for both is “N/A”. Such data
sparseness is also the reason for the low prediction accuracy using the box office information
(missing values for about 50% of the movies). Using LOD sources like DBpedia or
LinkedMDB posed the same problem. To generate good predictions, it is crucial for the
additional data to be as little sparse as possible. DBpedia did not provide information,
e.g. a movie’s budget and gross income, for almost half of the movies in the
challenge. One could interlink LOD sources with each other to overcome data sparseness.
However, such a process is quite challenging: first, one needs to find LOD sources
containing similar data; second, one must be familiar with the source’s schema; and third,
one might have to deal with different data formats, such as digits vs. text describing a
movie’s gross income, in order to apply classification algorithms.
¹³ http://www.metacritic.com/about-metascores
¹⁴ http://www.rottentomatoes.com/about/</p>
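        <p>Distinguishing a genuinely missing value from a zero, as suggested above, amounts to parsing OMDb’s “N/A” marker into a missing-value indicator rather than a number; a small sketch:

```python
def parse_omdb_count(raw):
    """Parse a numeric OMDb field such as the award count.

    OMDb returns the string "N/A" both for a movie without awards and
    for a genuinely unknown value; mapping it to None (missing) instead
    of 0 keeps the two cases separable for a downstream classifier."""
    if raw == "N/A":
        return None
    return int(raw.replace(",", ""))
```

        </p>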
        <p>In our evaluation we primarily used features that describe a movie’s quality via some
sort of rating. These features are likely to be very close to the metric defining the ground
truth. Thus, our approach is only applicable if such features already exist for a given
movie. Predicting the review class of a movie that is yet to come out, i.e. one that does not have
any ratings yet, is impossible with our approach. However, more sophisticated
mining algorithms might increase the prediction accuracy for several features that are
already known before a movie is rated. For example, some features could express
the reputation of a movie’s cast and crew, e.g. awards of its actors and directors. Using
such features and sophisticated data mining algorithms might provide a quite decent
prediction of the movie’s quality.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The results of our evaluation show that using state-of-the-art classifiers makes it possible
to achieve a high prediction accuracy if one uses various ratings that were generated by
critics maintaining a certain level of quality and consistency. Furthermore, information
such as award nominations is likely to provide adequate results if the data is not too
sparse. To follow up on this initiative, we will publish the generated data set as LOD and
extend it with further data from other sources, e.g. links to persons and other relevant
classes from the movies domain.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Linked Data: Evolving the Web into a Global Data Space</article-title>
          .
          <source>Synthesis Lectures on the Semantic Web</source>
          . Morgan &amp; Claypool Publishers (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <source>C4.5: Programs for Machine Learning</source>
          . Elsevier
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>An introduction to classification and regression tree (CART) analysis</article-title>
          .
          <source>In: Annual Meeting of the Society for Academic Emergency Medicine in San Francisco</source>
          . (
          <year>2000</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Entezari-Maleki</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezaei</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minaei-Bidgoli</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Comparison of classification methods based on the type of attributes and sample size</article-title>
          .
          <source>Journal of Convergence Information Technology</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ) (
          <year>2009</year>
          )
          <fpage>94</fpage>
          -
          <lpage>102</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>