<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Linked Data-Based Decision Tree Classifier to Review Movies</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Suad</forename><surname>Aldarra</surname></persName>
							<email>suad.aldarra@ie.fujitsu.com</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">National University of Ireland</orgName>
								<address>
									<settlement>Galway</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Emir</forename><surname>Muñoz</surname></persName>
							<email>emir.munoz@ie.fujitsu.com</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">National University of Ireland</orgName>
								<address>
									<settlement>Galway</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fujitsu</forename><forename type="middle">Ireland</forename><surname>Limited</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">National University of Ireland</orgName>
								<address>
									<settlement>Galway</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Linked Data-Based Decision Tree Classifier to Review Movies</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">81168AEE08F2E36B01304BE313646E44</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T06:24+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we describe our contribution to the 2015 Linked Data Mining Challenge. The task is to predict whether a movie is reviewed as "good" or "bad", as the Metacritic website does based on critics' reviews. First, we describe the sources used to build the training data. Although several sources provide data about movies on the Web in different formats, including RDF, data from HTML pages had to be gathered to compute some of our features. We then describe our experiment, in which we train a decision tree model on 241 features derived from our RDF knowledge base and achieve an accuracy of 0.94 on the training data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In this paper, we describe the method used in our submission to the 2015 Linked Data Mining Challenge<ref type="foot" target="#foot_0">1</ref> at the Know@LOD Workshop. The challenge proposes the task of predicting whether a movie is "good" or "bad" based on the values of its RDF properties. These labels follow those used on the Metacritic<ref type="foot" target="#foot_1">2</ref> website, which are based on critics' reviews submitted to its system. Metacritic originally uses three categories (positive, mixed, and negative), assigned according to a score ranging from 0 to 100. For simplicity, only two classes are required in this challenge: movies with a score above 60 are regarded as "good", while movies with a score below 40 are regarded as "bad". To achieve this goal, we learn a decision tree classifier <ref type="bibr" target="#b0">[1]</ref>, which can efficiently assign a binary label to incoming unlabeled/unseen movies.</p><p>To design our classifier, we addressed two main challenges: 1) the collection and transformation of relevant data about movies, and 2) the design of features from RDF data to train our classifier. These two challenges took an estimated 70% and 30% of the effort, respectively. First, we collect data from several sources, including HTML pages, and convert it to RDF. Second, we use SPARQL queries to generate data in a format suitable for the learning process.</p><p>In the remainder of this paper, we describe how we address both challenges: the construction of our RDF knowledge base, the feature extraction, and the experiment to learn the decision tree, together with its evaluation.</p><p>The provided data comprises 2,000 movies along with their name, release date, DBpedia URI, class (good/bad), and ID. Of these, 80% (1,600) are used during the training step and 20% (400) during the testing step. 
The DBpedia URIs are used to access the LOD cloud for collecting further data about movies. Although several LOD datasets contain relevant data for this task, namely DBpedia<ref type="foot" target="#foot_2">3</ref>, LinkedMDB<ref type="foot" target="#foot_3">4</ref>, and Freebase<ref type="foot" target="#foot_4">5</ref>, none of them contains high-quality, complete, and up-to-date data in one place. Thus, we were forced to build our own RDF knowledge base, gathering facts from different RDF sources plus other (semi-/un-)structured data sources. The final list of sources included in our knowledge base comprises IMDB<ref type="foot" target="#foot_5">6</ref>, OMDB<ref type="foot" target="#foot_6">7</ref>, Metacritic, Freebase, and DBpedia.</p><p>We start by retrieving dcterms:subject values for a movie from DBpedia. We then use DBpedia sameAs links to Freebase to get the movie's IMDB ID. Movie data (e.g., year, release, genre, director, starring, MPAA rating) were collected from OMDB in JSON format and then converted into RDF programmatically. We queried OMDB using the movie's IMDB ID instead of the provided movie title, since the search was more accurate in most cases. We retrieved data about actors and directors from Freebase using OpenRefine<ref type="foot" target="#foot_7">8</ref>. Thus, we could collect personal information about actors and directors, such as gender, nationality, date of birth, and IMDB ID, among others. Other information was extracted from IMDB: awards of actors, directors, and movies; movie budget and gross; and common languages and countries. 
For each movie, we also extracted its IMDB keywords, which are later used to determine common keywords among good and bad movies.</p><p>Finally, for each movie, we collected critics' textual reviews from the Metacritic website and applied an existing NLTK-based sentiment analysis API<ref type="foot" target="#foot_8">9</ref>, which returns a positive, negative, or neutral sentiment label for a given text.</p><p>Our resulting RDF knowledge base comprises 338,140 RDF triples, which are accessed using SPARQL queries to generate the set of features used to train a decision tree model. (All data in RDF, the decision tree model and diagram, and the feature vectors are available at https://github.com/emir-munoz/ldmc2015.)</p></div>
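<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the JSON-to-RDF conversion described above, the following Python sketch turns one OMDB-style JSON record into subject-predicate-object triples. The ex: prefix, the predicate names, the sample record, and the function omdb_record_to_triples are hypothetical choices made for this example; they are not necessarily the vocabulary used in the knowledge base.</p>

```python
import json

# Hypothetical mapping from OMDB-style JSON fields to predicates (assumed names).
SINGLE_VALUED = {"Year": "ex:year", "Rated": "ex:mpaaRating"}
MULTI_VALUED = {"Genre": "ex:genre", "Director": "ex:director", "Actors": "ex:starring"}


def omdb_record_to_triples(raw_json):
    """Convert one JSON movie record into a list of (subject, predicate, object) tuples."""
    record = json.loads(raw_json)
    subject = "ex:" + record["imdbID"]  # the IMDB ID identifies the movie
    triples = [(subject, "rdf:type", "ex:Movie")]
    for field, predicate in SINGLE_VALUED.items():
        value = record.get(field)
        if value and value != "N/A":  # skip missing values
            triples.append((subject, predicate, value))
    for field, predicate in MULTI_VALUED.items():
        value = record.get(field)
        if value and value != "N/A":
            # Comma-separated lists become one triple per item.
            for item in value.split(","):
                triples.append((subject, predicate, item.strip()))
    return triples


# Illustrative sample record, not taken from the challenge data.
sample = json.dumps({"imdbID": "tt0000001", "Year": "2000",
                     "Genre": "Drama, Thriller", "Rated": "N/A"})
triples = omdb_record_to_triples(sample)
```

<p>Splitting multi-valued fields into one triple per value keeps each genre, director, or actor individually addressable in later SPARQL queries.</p></div>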
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiment</head><p>In the following, we present the experimental setup used to train and evaluate the proposed decision tree. Figure <ref type="figure" target="#fig_0">1</ref> shows a flow diagram of the data and processes involved. In order to train a decision tree classifier, we first define a set of features to be extracted from our RDF knowledge base (Movies DB). Movies DB is stored in a Virtuoso server running on a CentOS Linux virtual machine (with a 4.0 GHz CPU and 7.5 GB of RAM) and is queried via HTTP.</p></div>
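<div xmlns="http://www.tei-c.org/ns/1.0"><p>Querying a SPARQL endpoint via HTTP amounts to sending the query text in the query parameter of a request. The Python sketch below only builds such a request URL and sends nothing. The endpoint address (a local Virtuoso instance on its default port 8890), the format parameter, and the helper name build_sparql_url are assumptions made for illustration, not details of the actual setup.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint: a local Virtuoso instance on its default port.
ENDPOINT = "http://localhost:8890/sparql"


def build_sparql_url(query, endpoint=ENDPOINT):
    """Return a GET URL that carries the SPARQL query in the 'query' parameter."""
    params = urlencode({
        "query": query,
        # Virtuoso-style parameter asking for JSON results (assumption).
        "format": "application/sparql-results+json",
    })
    return endpoint + "?" + params


# Example: count all triples in the store.
count_query = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"
url = build_sparql_url(count_query)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)
```

<p>The resulting URL can be passed to any HTTP client; percent-encoding the query keeps braces, spaces, and question marks intact in transit.</p></div>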
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Feature set</head><p>Once the RDF KB was finished, we defined a set of 241 features. Our features are a mix of continuous (numerical) and dichotomous (categorical) types, which can be handled by the C4.5 algorithm <ref type="bibr" target="#b1">[2]</ref>. The following list summarizes the features used in this work. (* = feature considers the release/record date of the movie.)</p><list>
<item>dcterms:subject values</item>
<item>genres of a movie</item>
<item>countries of a movie</item>
<item>languages of a movie</item>
<item>MPAA rating</item>
<item># of directors' Oscar/Golden Globe awards won/nominated (*)</item>
<item># of actors' Oscar/Golden Globe awards won/nominated (*)</item>
<item>runtime</item>
<item>release week/weekend day</item>
<item># of bad/good/neutral/mostly-good/mostly-bad keywords</item>
<item># of female/male actors</item>
<item># directors younger than 30 (*)</item>
<item># directors between 30 and 50 (*)</item>
<item># directors older than 50 (*)</item>
<item># actors younger than 30 (*)</item>
<item># actors between 30 and 50 (*)</item>
<item># actors older than 50 (*)</item>
<item>is the movie from a common country?</item>
<item>is the movie in a common language?</item>
<item>low or high amount of budget?</item>
<item>is the gross higher than the budget?</item>
<item>% of positive critics' reviews</item>
<item>% of negative critics' reviews</item>
<item>% of neutral critics' reviews</item>
<item>is the movie based on a book?</item>
<item>is the movie a sequel?</item>
<item>is the movie an independent film?</item>
</list><p>SELECT ?age WHERE {
  dbr:Amores_perros rdf:type dbo:Film .
  dbr:Amores_perros dbp:recorded ?recorded .
  dbr:Amores_perros dbp:starring ?actor .
  ?actor dbp:dateOfBirth ?dob .
  BIND (?recorded - YEAR(?dob) AS ?age)
}</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Features are extracted from the data using SELECT and ASK SPARQL queries. For instance, the query above retrieves the age of each actor involved in the movie "Amores Perros" at recording time. These values are then used to generate three of our features (the numbers of actors younger than 30, between 30 and 50, and older than 50). A similar query is performed to obtain the ages of directors.</p></div>
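<div xmlns="http://www.tei-c.org/ns/1.0"><p>The step from the ages returned by such a query to the three age-based count features can be sketched in Python as follows. The function name, the dictionary keys, and the treatment of the boundary ages (30 and 50 are counted in the middle bucket) are assumptions of this sketch; the text above does not specify how boundaries are handled.</p>

```python
def age_bucket_counts(recorded_year, birth_years):
    """Count people per age bucket at the time the movie was recorded."""
    counts = {"younger_than_30": 0, "between_30_and_50": 0, "older_than_50": 0}
    for birth_year in birth_years:
        age = recorded_year - birth_year  # mirrors ?recorded - YEAR(?dob)
        if age > 50:
            counts["older_than_50"] += 1
        elif age >= 30:
            # Assumption: ages 30 and 50 fall into the middle bucket.
            counts["between_30_and_50"] += 1
        else:
            counts["younger_than_30"] += 1
    return counts


# Hypothetical cast with birth years 1978, 1960, and 1940 for a movie recorded in 2000.
features = age_bucket_counts(2000, [1978, 1960, 1940])
# Ages are 22, 40, and 60, so each bucket receives one actor.
```

<p>Aggregating individual ages into three counts gives every movie the same number of features regardless of cast size.</p></div>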
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Learning process</head><p>After all features are extracted for both the training and test sets, we use the J48 classifier, Weka's implementation of the C4.5 algorithm. The decision tree settings enable pruning of the tree, with a confidence factor equal to 0.25. Using the equation in Figure <ref type="figure" target="#fig_2">2b</ref> to compute the accuracy of our system, we achieve Acc = 0.94 on the training set. The challenge system reports Acc = 0.9175 for our submission on the test set.</p></div>
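<div xmlns="http://www.tei-c.org/ns/1.0"><p>Assuming the equation referred to above is the standard definition of accuracy, i.e., the fraction of correctly classified movies, it can be computed from the four cells of a binary confusion matrix as in the Python sketch below. The function name and the counts are made up for illustration; they are not the values of our confusion matrix.</p>

```python
def accuracy(true_good, true_bad, false_good, false_bad):
    """Fraction of correct predictions in a binary good/bad confusion matrix."""
    correct = true_good + true_bad
    total = correct + false_good + false_bad
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return correct / total


# Hypothetical counts: 45 good and 40 bad movies classified correctly,
# 5 bad movies predicted as good, and 10 good movies predicted as bad.
acc = accuracy(true_good=45, true_bad=40, false_good=5, false_bad=10)
# 85 correct predictions out of 100 movies.
```

<p>Because the challenge data contains only the two classes "good" and "bad", a single accuracy figure summarizes both kinds of error; it does not show which class is misclassified more often.</p></div>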
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>We have described our submission to the 2015 Linked Data Mining Challenge, presenting a decision tree classifier to solve the problem of predicting movie reviews. We trained this decision tree on 1,600 examples, with input features extracted from a purpose-built RDF knowledge base using SPARQL queries.</p><p>In order to reduce the feature space, feature aggregation was applied over actors, directors, and critics' reviews. The sentiment analysis over critics' reviews generates the attributes with the highest information gain <ref type="bibr" target="#b2">[3]</ref>. The negative-reviews feature has an information gain of 0.71886 bits and is thus selected as the root of the decision tree. Experiments removing all sentiment features from the training show that accuracy is reduced by ca. 9%, whereas removing only the positive or only the negative feature does not affect the accuracy severely. This shows the relevance of sentiment analysis-based features for this task, which are directly related to the taste of users.</p><p>Movie keywords are the features with the next-highest information gain, and their analysis provides interesting insights to be considered by writers and directors: a) bad movies are based on video games, with someone critically bashed, using a taser or pepper spray, or hanged upside down, with a dark heroine involved; and b) good movies include family relationships, frustration, crying, melancholy, very little dialogue, and some sins with moral ambiguity. Yes, people like drama.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: System architecture with training and evaluation parts.</figDesc><graphic coords="3,143.41,115.84,328.53,136.09" type="bitmap" /></figure>
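<div xmlns="http://www.tei-c.org/ns/1.0"><p>The notion of information gain used in the conclusion can be illustrated with a small Python sketch for a categorical feature: the gain is the entropy of the class labels minus the weighted entropy that remains after splitting on the feature. This is a generic textbook computation on hypothetical toy data, not a reimplementation of J48, which additionally handles numerical attributes.</p>

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four movies, two per class.
labels = ["bad", "bad", "good", "good"]
informative = ["high", "high", "low", "low"]  # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]          # carries no class information

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
```

<p>On a balanced two-class problem the gain is bounded by 1 bit, which puts a value such as 0.71886 bits into perspective.</p></div>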
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure</head><label></label><figDesc>Figure 2a reports the confusion matrix resulting from a 10-fold cross-validation over the training data (prediction columns: good, bad).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Model evaluation.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://knowalod2015.informatik.uni-mannheim.de/en/linkeddataminingchallenge/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.metacritic.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://dbpedia.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://www.linkedmdb.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://www.freebase.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://www.imdb.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">http://www.omdbapi.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">http://openrefine.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">http://text-processing.com/docs/sentiment.html</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. This work has been supported by KI2NA project funded by Fujitsu Laboratories Limited and Insight Centre for Data Analytics at NUI Galway.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Simplifying decision trees</title>
		<author>
			<persName><forename type="first">J</forename><surname>Quinlan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Man-Machine Studies</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="221" to="234" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Data Mining: Practical Machine Learning Tools and Techniques</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hall</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
			<pubPlace>San Francisco, CA, USA</pubPlace>
		</imprint>
	</monogr>
	<note>3rd edn</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">How Critical Are Critical Reviews? The Box Office Effects of Film Critics, Star Power, and Budgets</title>
		<author>
			<persName><forename type="first">S</forename><surname>Basuroy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chatterjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Ravid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Marketing</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="103" to="117" />
			<date type="published" when="2003-10">October 2003</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
