<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Augmenting a Feature Set of Movies Using Linked Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaroslav Kuchar</string-name>
          <email>jaroslav.kuchar@fit.cvut.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Web Intelligence Research Group, Faculty of Information Technology, Czech Technical University in Prague</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Augmenting a feature set using mappings to the Web of Data is an up-and-coming way to enrich data in the original dataset. Such enrichments are especially valuable for recent preference learning algorithms and recommender systems. In this paper, we describe the process of mapping and augmenting the movie ratings dataset MovieTweetings from the perspective of the RecSysRules 2015 Challenge. Ad-hoc queries to DBpedia serve as the underlying concept. To the best of our knowledge, there is no existing mapping dataset of movies for MovieTweetings. We also provide a brief discussion of the benefits of the augmented feature set for an elementary rule-based representation of user preferences.</p>
      </abstract>
      <kwd-group>
        <kwd>web of data</kwd>
        <kwd>mapping</kwd>
        <kwd>user preferences</kwd>
        <kwd>association rules</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper, we focus on a new type of problem that uses the Web of
Data to augment the feature set. Data in the original dataset are automatically
mapped to Linked Open Data (LOD) identifiers, and additional features
are then generated from public knowledge bases such as DBpedia. The large number
of additional features that can be obtained provides valuable information for various
applications. Recommender systems and their preference learning algorithms
have adopted feature set augmentation. The main goal is to overcome
the low granularity of available content descriptions on the one hand
and data volume on the other [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Since association rules are recognized
as one of the most suitable and understandable forms to represent knowledge
and relations in data, we place emphasis on the benefits of enrichments for
user preferences represented by a set of rules. Rule-based representations of
user preferences can thus provide a desirable balance between the quality of the
representation and the understandability of the explanation for the human user
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The main contribution of this paper is that it presents an approach to
mapping an existing movie ratings dataset, MovieTweetings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to DBpedia, makes
the mapping dataset available and discusses its benefits for rule-based user
preferences. The presented approach relies on ad-hoc SPARQL queries instead
of "guessing" URIs [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or downloading all possible data to a local database and
processing the data locally [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To the best of our knowledge, there is no
existing mapping dataset of movies for MovieTweetings to the Web of Data.
      </p>
      <p>This paper is organized as follows. Section 2 examines the connection to the
RecSysRules challenge and provides an overview of the dataset used for the challenge.
Section 3 presents automatically generated mappings to the LOD cloud for an
existing dataset, including details of the results. Section 4 briefly discusses the
benefits of the mappings for rule-based representation of user preferences. Finally,
Section 5 summarizes the results.</p>
    </sec>
    <sec id="sec-2">
      <title>Connection to RecSysRules 2015</title>
      <p>
        The RecSysRules 2015 challenge<sup>1</sup> has two focus areas: 1) rule learning
algorithms applied to recommender problems, and 2) using the Linked Open Data cloud
for feature set extension. Since the mappings for the MovieTweetings dataset (as
described in the rest of this paper) were not available at the time of organizing
this challenge, the challenge uses a semantically enriched version of the
MovieLens dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For the mapping of MovieLens to Linked Open Data, the DBpedia
mappings for the MovieLens1M dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] were used. Please note that, because not all movies are available
in DBpedia, the mapping is missing for a fraction of the movies.
For each movie in the mapping dataset, the organizers extracted a set
of categories and datatype properties (e.g. release date or gross) as an example
of the augmented feature set. The DBpedia URI identifiers were used to
extract those features. To facilitate distribution, the organizers do not
provide the final dataset. Nevertheless, a Python script to download and build
the dataset is available. This script downloads all necessary dependencies and
creates the train CSV file as follows:
      </p>
      <list list-type="order">
        <list-item><p>Download all dependencies, including MovieLens ratings, mappings to DBpedia, augmented feature sets and configurations.</p></list-item>
        <list-item><p>Filter ratings: select only ratings that correspond to a predefined set of users (1000 users randomly selected by the challenge organizers). The last 10 ratings of each selected user were removed and moved to a test set. The test set was used to evaluate the submitted results.</p></list-item>
        <list-item><p>Augment the feature set of movies: for each movie that appears in the filtered ratings, merge the movie with categories and properties from DBpedia. Entries without any available mapping are removed.</p></list-item>
        <list-item><p>Export the train dataset as a CSV file.</p></list-item>
      </list>
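      <p>Steps 2 and 3 above can be sketched roughly as follows. This is only an illustration and not the organizers' script: the function names, the tuple layout and the toy data are assumptions made for this example, and "last" ratings are assumed to mean the most recent ones by timestamp.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

def split_train_test(ratings, selected_users, holdout=10):
    """Keep ratings of selected users; move each user's last `holdout` ratings to test.

    `ratings` is a list of (user_id, movie_id, rating, timestamp) tuples.
    Assumption: "last" means most recent by timestamp.
    """
    per_user = defaultdict(list)
    for row in ratings:
        if row[0] in selected_users:
            per_user[row[0]].append(row)
    train, test = [], []
    for rows in per_user.values():
        rows.sort(key=lambda r: r[3])          # oldest first
        cut = max(len(rows) - holdout, 0)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test

def augment(train, features):
    """Attach DBpedia features to each rating; drop movies without a mapping."""
    return [row + (features[row[1]],) for row in train if row[1] in features]

# Toy data (hypothetical ids and features, not taken from the challenge files).
ratings = [(1, "m1", 5, 10), (1, "m2", 4, 20), (1, "m3", 3, 30), (2, "m1", 2, 5)]
features = {"m1": ["Category:Batman_films"]}
train, test = split_train_test(ratings, selected_users={1}, holdout=1)
rows = augment(train, features)
print(rows)
```
]]></preformat>
      <p>Writing the augmented rows to a CSV file (step 4) is omitted here; the standard csv module would be one option.</p>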
      <p>The rest of this paper focuses on providing mappings of movies
to DBpedia for another dataset: MovieTweetings. The linking of movies is
performed in a similar way to the mappings for MovieLens. The paper also discusses
the benefits of the available links for preference learning.</p>
      <p><sup>1</sup> http://2015.ruleml.org/recsysrules-2015.html</p>
    </sec>
    <sec id="sec-3">
      <title>Dataset Mapping</title>
      <p>
        The goal is to provide a one-to-one mapping of movies from the MovieTweetings
dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to the Linked Open Data cloud as URI identifiers. The dataset contains
movie ratings extracted from Twitter for movies released from the 1900s to the
present. Each movie is represented by a title, release date and a set of assigned
genres (Example: Rocky (1976), Drama | Sport). The main advantage, compared
to other existing datasets (MovieLens [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Last.fm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Jester [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or Book-Crossing
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), is the availability of daily updates. Because the dataset is based
on ratings extracted from Twitter users around the world and is updated
daily, we have to deal with the following issues: multilingualism in titles,
freshness, inaccuracies and incompleteness of data.
      </p>
      <sec id="sec-3-1">
        <title>URI Alignment</title>
        <p>Our proposed approach queries DBpedia using a set of
predefined SPARQL queries performed in the following order:</p>
        <p>Perfect match of a title: Listing 1.1 presents a SPARQL query that performs the
perfect matching of the title and year according to the existing conventions for
titles of movies in DBpedia (Example: Rocky, Rocky (film) and Rocky (1976
film)).</p>
        <p>Partial match of a title: Listing 1.2 describes a modification of the FILTER
condition as a relaxation of the patterns in titles.</p>
        <preformat>...
FILTER regex(?title, "%s", "i") .
FILTER regex(?year, "%s", "i")
...</preformat>
        <p>Listing 1.2. SPARQL query - Partial match of the title and year</p>
        <p>Pattern-based match of an abstract: Because DBpedia abstracts follow common
formatting conventions, we use the abstract as a candidate for pattern matching.
The common format of an abstract is: Rocky is a 1976 film ... or ... Rocky
... released 1976 ....</p>
        <preformat>...
FILTER (
  regex(?abstract, "^%s is a %s", "i")
  ||
  regex(?abstract, "^%s.*releas.*%s", "i")
)
...</preformat>
        <p>Listing 1.3. SPARQL query - Pattern-based match of the abstract</p>
        <p>Any match of an abstract: The last case applies when there is no match for any of the previously
described patterns. For foreign-language titles, the abstract usually contains textual
mentions of the titles of the movie in other languages (Example: ... also known as
... or ... (Italian: ..., German: ...)).</p>
        <preformat>...
FILTER regex(?abstract, "%s", "i") .
FILTER regex(?year, "%s", "i")
...</preformat>
        <p>Listing 1.4. SPARQL query - Any match of the abstract</p>
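        <p>The order in which the four methods are tried can be approximated locally, without a SPARQL endpoint. The following Python sketch applies equivalent regular expressions to a single candidate record. The record keys ("label", "year", "abstract"), the function name and the escaping of the title are assumptions made for this illustration; Python and SPARQL regular expressions also differ in some details, so this is not the implementation used for the mapping.</p>
        <preformat><![CDATA[
```python
import re

def match_method(title, year, cand):
    """Return the first method that accepts candidate `cand`, or None.

    `cand` is a dict with hypothetical keys "label", "year" and "abstract".
    """
    t, y = re.escape(title), str(year)   # escaping the title is an assumption
    flags = re.IGNORECASE                # mirrors the "i" flag in the listings
    year_ok = re.search(y, cand["year"], flags) is not None

    # 1. Perfect match: the DBpedia title conventions named in the text.
    labels = {title, f"{title} (film)", f"{title} ({y} film)"}
    if cand["label"] in labels and year_ok:
        return "perfect"
    # 2. Partial match: title pattern anywhere in the label (Listing 1.2).
    if re.search(t, cand["label"], flags) and year_ok:
        return "partial"
    # 3. Pattern-based match on the abstract (Listing 1.3).
    if (re.search(f"^{t} is a {y}", cand["abstract"], flags)
            or re.search(f"^{t}.*releas.*{y}", cand["abstract"], flags)):
        return "pattern"
    # 4. Any mention of the title in the abstract (Listing 1.4).
    if re.search(t, cand["abstract"], flags) and year_ok:
        return "any"
    return None

candidate = {
    "label": "Cinema Paradiso",
    "year": "1988",
    "abstract": "Cinema Paradiso (Italian: Nuovo Cinema Paradiso) is a 1988 Italian drama film.",
}
print(match_method("Nuovo Cinema Paradiso", 1988, candidate))
```
]]></preformat>
        <p>Returning the method name alongside the match mirrors the mapping dataset, where the method name can later be used for filtering.</p>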
      </sec>
      <sec id="sec-3-2">
        <title>Confidence Values</title>
        <p>To express the basic relevance of the mapping to URI identifiers from DBpedia,
we provide a set of confidence values. Title Confidence (tc) is computed using the
Levenshtein distance of titles, Year Confidence (yc) is computed as a simple
distance of years and Genre Confidence (gc) uses the number of common genres.
Those values are available in the final mapping dataset and can be used together
with the method name to filter the results. The setting of the filtering is left to
the end-user of the mapping dataset.</p>
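        <p>A minimal sketch of how such values could be computed is given below. It returns the raw quantities named above (edit distance, year difference, number of shared genres); whether and how the published values are normalized is not specified here, and the dictionary keys and function names are assumptions made for this example.</p>
        <preformat><![CDATA[
```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def confidence_values(movie, candidate):
    """Raw tc / yc / gc for a movie and a mapped candidate.

    Both arguments are dicts with hypothetical keys "title", "year", "genres".
    """
    return {
        "tc": levenshtein(movie["title"], candidate["title"]),
        "yc": abs(movie["year"] - candidate["year"]),
        "gc": len(set(movie["genres"]) & set(candidate["genres"])),
    }

movie = {"title": "Rocky", "year": 1976, "genres": ["Drama", "Sport"]}
candidate = {"title": "Rocky (1976 film)", "year": 1976, "genres": ["Drama"]}
print(confidence_values(movie, candidate))
```
]]></preformat>
        <p>With raw values, lower tc and yc and higher gc indicate a more plausible mapping; an end-user could threshold each value separately.</p>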
      </sec>
      <sec id="sec-3-3">
        <title>Results and Statistics</title>
        <p>In this section we briefly describe the results of the mapping. We use a snapshot
of the dataset downloaded on June 1, 2015. It contains over 21,000 movies. At the
time of publishing this paper, the mapping provides URIs for 71.3% of the movies.
The remaining movies were not mapped due to the issues mentioned at the
beginning of this section.</p>
        <p>Figure 1 depicts the distribution of years for movies that were not successfully
mapped to any URI. A large number of recent movies
were not mapped because they are not available in DBpedia. The
reason is that the current version of DBpedia was published on September 9, 2014
(based on Wikipedia dumps from April/May 2014)<sup>2</sup>. Figure 2 shows how often
each method produced a successful mapping. The method that performs the
perfect match of a title and a year is the most frequent (perfect: 86.61%, pattern:
4.38%, partial: 3.66%, any: 5.35%). Figure 3 provides an overview of the language
distribution in titles.<sup>3</sup> This summary presents the availability of mappings to
DBpedia for various languages.</p>
        <p><sup>2</sup> http://wiki.dbpedia.org/news/dbpedia-version-2014-released</p>
        <p><sup>3</sup> Languages detected in titles using LangID: https://github.com/saffsd/langid.py</p>
        <p>
          We also evaluated our approach using another existing mapping dataset for
MovieLens [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We selected this dataset because both original datasets
(MovieLens and MovieTweetings) are provided in the same format and the authors of
the mapping dataset for MovieLens deal with the same task: mapping of movies
to DBpedia. Furthermore, the dataset was manually corrected, so we can
use it as a ground truth. We ran the proposed mapping algorithm and
compared its output to the available mappings. Our approach achieved a match of over 98.5%, where
the incorrectly mapped values were either missing URIs or incorrect links that
can be filtered using the confidence values.
        </p>
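        <p>A comparison of this kind can be sketched as follows. The function name, the dictionary layout and the toy URIs are hypothetical; the sketch only illustrates how a match rate and the two error types mentioned above (missing URIs and incorrect links) could be counted against a ground truth.</p>
        <preformat><![CDATA[
```python
def compare_mappings(predicted, truth):
    """Compare predicted movie -> URI mappings against a ground truth."""
    matched = missing = incorrect = 0
    for movie, uri in truth.items():
        got = predicted.get(movie)
        if got is None:
            missing += 1          # no URI was produced for this movie
        elif got == uri:
            matched += 1
        else:
            incorrect += 1        # a URI was produced, but a different one
    total = len(truth)
    return {
        "match_rate": matched / total if total else 0.0,
        "missing": missing,
        "incorrect": incorrect,
    }

# Toy example with placeholder URIs.
truth = {"a": "u1", "b": "u2", "c": "u3", "d": "u4"}
predicted = {"a": "u1", "b": "wrong", "d": "u4"}
print(compare_mappings(predicted, truth))
```
]]></preformat>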
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Rule-based User Preferences</title>
      <p>In this section we briefly discuss the benefits of the augmented dataset from
the perspective of the RecSysRules 2015 challenge.</p>
      <p>
        Association rules are recognized as one of the most suitable and
understandable forms to represent knowledge and relations in data. Rule-based
representations of user preferences can thus provide a desirable balance between the
quality of the representation and the understandability of the explanation for
the human user. The user preferences may be used in different scenarios or use
cases, from elementary user profile representations to rating predictions and
recommendations. In this paper we consider a subset of association rules called
class association rules (CARs) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Those rules have a specific format, where
the right-hand side of a rule (consequent) contains only one attribute and this
attribute is the classification class attribute.
      </p>
      <sec id="sec-4-1">
        <title>Illustrative Example</title>
        <p>Let us consider the domain of movies and the ratings provided by
users in the MovieTweetings dataset. The presence of a user rating for a specific
movie can be considered an interest clue: implicit information about the
positive user preference for the movie. For rating prediction tasks, the provided
ratings can be considered a level of interest. However, it is beyond the scope of
this paper to elaborate on all possible tasks. The rest of this illustrative example
focuses on positive-only feedback and the item recommendation task. Each
movie is represented by a set of features, namely its associated genres. Table 1
provides an example for one user from the MovieTweetings dataset.</p>
        <p>
          The elementary rule-based user preferences can be mined using an association
rule mining algorithm (e.g. the R arules package [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]). Examples of extracted rules that
represent the user preferences of one specific user (User Id: 455, minConfidence:
0.1, minSupport: 0.1):
          <list list-type="bullet">
            <list-item><p>{Action} → {positive} (support=1.0, confidence=1.0)</p></list-item>
            <list-item><p>{Action &amp; Thriller} → {positive} (support=1.0, confidence=1.0)</p></list-item>
            <list-item><p>{Crime} → {positive} (support=0.67, confidence=1.0)</p></list-item>
          </list>
        </p>
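        <p>To make the support and confidence values of such rules concrete, the sketch below computes them for a rule of the form antecedent → {positive}. The transactions are hypothetical and are not the contents of Table 1; they are only chosen so that the resulting numbers have the same shape as those above. Since only positive feedback is considered, every transaction contains the class item and the confidence is always 1.0.</p>
        <preformat><![CDATA[
```python
def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of transactions containing antecedent and consequent
    confidence = that count divided by the count containing the antecedent
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n if n else 0.0
    confidence = n_ac / n_a if n_a else 0.0
    return round(support, 2), round(confidence, 2)

# Hypothetical transactions for one user: genres of each rated movie plus the class.
transactions = [
    {"Action", "Thriller", "Crime", "positive"},
    {"Action", "Thriller", "Crime", "positive"},
    {"Action", "Thriller", "positive"},
]
for antecedent in ({"Action"}, {"Action", "Thriller"}, {"Crime"}):
    print(sorted(antecedent), rule_stats(antecedent, {"positive"}, transactions))
```
]]></preformat>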
        <p>The drawback of the preferences described above is that they consider
only genres as a key component. This limits the representation
because genres are too general. The total number of unique genres in the
dataset is 28. If we use those rules to find candidates for
other movies of interest to the user, the rules match too many movies as
possible candidates (2952, 1130 and 2717 matched movies, respectively).</p>
        <p>The mappings of movies to Linked Open Data (see Section 3 for
more details) can help to overcome this issue. The Linked Open Data cloud contains
relevant information to augment the feature set and increase the granularity.
The URI that identifies the data related to a movie can be used
to extract additional features; in this example, a set of assigned categories<sup>4</sup>.
Table 2 shows an excerpt of the augmented feature set for the movies from
our example. We use a basic SPARQL query to extract all categories associated
with the specific movies.</p>
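        <p>A query of this kind might look like the one built by the following Python sketch. The exact query used for the dataset is not reproduced in this paper, so the query text, the function names and the example URIs are assumptions; only the predicate http://purl.org/dc/terms/subject is taken from the text. The helper that strips the "Category:" prefix is likewise just one possible way to obtain readable feature names.</p>
        <preformat><![CDATA[
```python
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

def categories_query(movie_uri):
    """Build a SPARQL query that lists all categories of one movie URI."""
    return (
        "SELECT DISTINCT ?category WHERE { "
        f"<{movie_uri}> <{DCT_SUBJECT}> ?category . "
        "}"
    )

def category_label(category_uri):
    """Turn a category URI into a readable feature name."""
    name = category_uri.rsplit("Category:", 1)[-1]
    return name.replace("_", " ")

print(categories_query("http://dbpedia.org/resource/Rocky"))
print(category_label("http://dbpedia.org/resource/Category:Batman_films"))
```
]]></preformat>
        <p>The resulting query string can be sent to any SPARQL endpoint that serves DBpedia data; executing it is left out of the sketch.</p>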
        <p>A sample of three representative rules mined on the augmented feature set
(User Id: 455, minConfidence: 0.1, minSupport: 0.1):
          <list list-type="bullet">
            <list-item><p>{Warner Bros. films} → {positive} (support=0.67, confidence=1.0)</p></list-item>
            <list-item><p>{Batman films} → {positive} (support=0.67, confidence=1.0)</p></list-item>
            <list-item><p>{The Hunger Games (film series)} → {positive} (support=0.33, confidence=1.0)</p></list-item>
          </list>
        </p>
        <p><sup>4</sup> Categories are identified by the predicate http://purl.org/dc/terms/subject</p>
        <p>Using the Linked Open Data cloud, we get more granular features for the
representation of movies. In total there are 10,950 unique categories for all movies in
the dataset. With a set of more granular categories assigned to each
movie, and rule-based user preferences that consider those categories, the number
of movies that match the preferences should decrease. For our illustrative
experiment, the numbers of matching movies are 859, 9 and 4, respectively. The first
rule contains a more general category, but the remaining two provide an
adequate number of candidates based on the preferences.</p>
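        <p>The effect of feature granularity on the number of candidates can be illustrated with a small sketch. The catalogue below is hypothetical and far smaller than the real dataset; a movie is counted as a candidate when its feature set contains every item of the rule antecedent.</p>
        <preformat><![CDATA[
```python
def matching_movies(antecedent, catalogue):
    """Return ids of movies whose features contain every item of the antecedent."""
    return sorted(m for m, feats in catalogue.items() if antecedent <= feats)

# Hypothetical catalogue mixing genres and DBpedia categories as features.
catalogue = {
    "m1": {"Action", "Thriller", "Batman films"},
    "m2": {"Action", "Crime"},
    "m3": {"Action", "Thriller"},
    "m4": {"Drama"},
}
print(len(matching_movies({"Action"}, catalogue)))        # broad genre rule
print(len(matching_movies({"Batman films"}, catalogue)))  # granular category rule
```
]]></preformat>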
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper we demonstrate an approach to augmenting the existing movie
ratings dataset MovieTweetings from the perspective of the RecSysRules 2015
challenge. We provide the dataset as a mapping of movies to DBpedia for further
experiments. It is available for download on GitHub<sup>5</sup>. It can be used for
other content-based recommender systems as well. We also discussed the
benefits of augmented feature sets for elementary rule-based representations of
user preferences. We plan to perform extensive experiments with rule-based user
preferences boosted by the augmented feature set. Last but not least, we plan
to improve the mapping patterns, offer mappings to other knowledge bases
and provide updates of the mapping dataset on a regular basis.</p>
      <p>Acknowledgments. This work was supported by the Grant Agency of the
Czech Technical University in Prague, grant No. SGS14/104/OHK3/1T/18.</p>
      <p><sup>5</sup> http://github.com/jaroslav-kuchar/MovieTweetingsMappings</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Celma</surname>
          </string-name>
          .
          <source>Music Recommendation and Discovery: The Long Tail, Long Fail, and Long Play in the Digital Music Space</source>
          .
          <publisher-name>Springer Publishing Company, Incorporated</publisher-name>
          , 1st edition
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Di Noia</surname>
          </string-name>
          , Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Zanker</surname>
          </string-name>
          .
          <article-title>Linked open data to support content-based recommender systems</article-title>
          .
          <source>In Proceedings of the 8th International Conference on Semantic Systems, I-SEMANTICS '12</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Simon</given-names>
            <surname>Dooms</surname>
          </string-name>
          , Toon De Pessemier, and
          <string-name>
            <given-names>Luc</given-names>
            <surname>Martens</surname>
          </string-name>
          .
          <article-title>MovieTweetings: a movie rating dataset collected from Twitter</article-title>
          .
          <source>In Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys 2013</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Ken</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , Theresa Roeder, Dhruv Gupta, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Perkins</surname>
          </string-name>
          .
          <article-title>Eigentaste: A constant time collaborative filtering algorithm</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <fpage>133</fpage>
          –
          <lpage>151</lpage>
          , July
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hahsler</surname>
          </string-name>
          , Bettina Grün, and Kurt Hornik.
          <article-title>arules - a computational environment for mining association rules and frequent item sets</article-title>
          .
          <source>Journal of Statistical Software</source>
          ,
          <volume>14</volume>
          (
          <issue>15</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>25</lpage>
          , 9
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Jonathan L.</given-names>
            <surname>Herlocker</surname>
          </string-name>
          , Joseph A. Konstan, Al Borchers, and John Riedl.
          <article-title>An algorithmic framework for performing collaborative filtering</article-title>
          .
          <source>In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99</source>
          , pages
          <fpage>230</fpage>
          –
          <lpage>237</lpage>
          , New York, NY, USA,
          <year>1999</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Kliegr</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jaroslav</given-names>
            <surname>Kuchar</surname>
          </string-name>
          .
          <article-title>Orwellian eye: Video recommendation with Microsoft Kinect</article-title>
          .
          <source>In Proceedings of the Conference on Prestigious Applications of Intelligent Systems (PAIS'14) collocated with European Conference on Artificial Intelligence (ECAI'14)</source>
          , pages
          <fpage>1227</fpage>
          –
          <lpage>1228</lpage>
          . IOS Press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Jaroslav</given-names>
            <surname>Kuchar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Kliegr</surname>
          </string-name>
          .
          <article-title>Bag-of-entities text representation for client-side recommender systems</article-title>
          .
          <source>In First Workshop on Recommender Systems for Television and online Video (RecSysTV)</source>
          ,
          <source>ACM RecSys</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          , Wynne Hsu, and Yiming Ma.
          <article-title>Integrating classification and association rule mining</article-title>
          . In R. Agrawal, P. Stolorz, and G. Piatetsky-Shapiro, editors,
          <source>Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98)</source>
          , pages
          <fpage>80</fpage>
          –
          <lpage>86</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          Heiko Paulheim and Johannes Fürnkranz.
          <article-title>Unsupervised generation of data mining features from linked open data</article-title>
          .
          <source>In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, WIMS '12</source>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Rowe</surname>
          </string-name>
          .
          <article-title>SemanticSVD++: Incorporating semantic taste evolution for predicting ratings</article-title>
          .
          <source>In Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 01, WI-IAT '14</source>
          , pages
          <fpage>213</fpage>
          –
          <lpage>220</lpage>
          , Washington, DC, USA,
          <year>2014</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Cai-Nicolas</given-names>
            <surname>Ziegler</surname>
          </string-name>
          , Sean M. McNee, Joseph A. Konstan, and Georg Lausen.
          <article-title>Improving recommendation lists through topic diversification</article-title>
          .
          <source>In Proceedings of the 14th International Conference on World Wide Web, WWW '05</source>
          , pages
          <fpage>22</fpage>
          –
          <lpage>32</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>