Augmenting a Feature Set of Movies Using
              Linked Open Data

                                 Jaroslav Kuchař

       Web Intelligence Research Group, Faculty of Information Technology
         Czech Technical University in Prague, Prague, Czech Republic
                         jaroslav.kuchar@fit.cvut.cz


      Abstract. Augmenting a feature set using mappings to the Web of Data
      is an emerging way to enrich the data in an original dataset. Such
      enrichments are especially valuable for recent preference learning
      algorithms and recommender systems. In this paper, we describe the
      process of mapping and augmenting the movie ratings dataset
      MovieTweetings from the perspective of the RecSysRules 2015 Challenge.
      The approach is built on ad-hoc queries to DBpedia. To the
      best of our knowledge, there is no existing mapping dataset of movies
      for MovieTweetings. We also briefly discuss the benefits
      of the augmented feature set for an elementary rule-based representation
      of user preferences.

      Keywords: web of data, mapping, user preferences, association rules


1   Introduction
In this paper, we focus on a class of problems that use the Web of Data to
augment a feature set. Data in the original dataset are automatically mapped
to Linked Open Data (LOD) identifiers, and additional features are then
generated from public knowledge bases such as DBpedia. The large number of
additional features obtainable this way can provide valuable information for
various applications. Recommender systems and their preference learning
algorithms have adopted feature set augmentation, mainly to overcome the low
granularity of available content descriptions on the one hand and data volume
on the other [8]. Since association rules are recognized as one of the most
suitable and understandable forms of representing knowledge and relations in
data, we place emphasis on the benefits of enrichments for user preferences
represented by a set of rules. Rule-based representations of user preferences
can thus provide a desirable balance between the quality of the representation
and the understandability of the explanation for the human user [7].
     The main contribution of this paper is an approach to mapping the existing
movie ratings dataset MovieTweetings [3] to DBpedia. We also make the mapping
dataset available and discuss its benefits for rule-based user preferences.
The presented approach relies on ad-hoc SPARQL queries instead of "guessing"
URIs [10] or downloading all possible data to a local database and processing
the data locally [2], [11]. To the best of our knowledge, there is no existing
dataset that maps the movies of MovieTweetings to the Web of Data.
    This paper is organized as follows. Section 2 examines the connection to the
RecSysRules challenge and provides an overview of the dataset used for the challenge.
Section 3 presents automatically generated mappings to the LOD cloud for an
existing dataset, including the details on results. Section 4 briefly discusses the
benefits of mappings for rule-based representation of user preferences. Finally,
Section 5 summarizes the results.


2     Connection to RecSysRules 2015

The RecSysRules 2015 challenge 1 has two focus areas: 1) rule learning
algorithms applied to recommender problems and 2) using the Linked Open Data
cloud for feature set extension. Since the mappings for the MovieTweetings
dataset (as described in the rest of this paper) were not available when the
challenge was organized, the challenge uses a semantically enriched version of
the MovieLens dataset [6]. The existing DBpedia mappings for the MovieLens1M
dataset [2] were used to link MovieLens to Linked Open Data. Please note that
not all movies are available in DBpedia, so the mapping is missing for a
fraction of the movies. For each movie in the mapping dataset, the organizers
extracted a set of categories and datatype properties (e.g. release date or
gross) as an example of the augmented feature set. The DBpedia URI identifiers
were used to extract those features. To simplify distribution, the organizers
do not provide the final dataset. Instead, a Python script that downloads all
necessary dependencies and builds the train CSV file is available. It proceeds
as follows:

 1. Download all dependencies including MovieLens ratings, mappings to DB-
    pedia, augmented feature sets and configurations.
 2. Filter ratings - select only ratings that correspond to a predefined set of
    users (1000 users randomly selected by the challenge organizers). The last
    10 ratings of each selected user were removed and moved to a test set,
    which was used to evaluate submitted results.
 3. Augment a feature set of movies - for each movie that appeared in the fil-
    tered ratings, merge the movie with categories and properties from DBpedia.
    Entries without any available mapping are removed.
 4. Export the train dataset as a CSV file.
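
The steps above can be sketched in Python as follows. This is a minimal illustration and not the organizers' script: the record layout (user, movie, rating, timestamp), the names build_train_test and to_csv, and the in-memory inputs are assumptions made for the example. The download step is omitted, and "last 10 ratings" is interpreted here as the 10 most recent ratings by timestamp.

```python
import csv
import io
from collections import defaultdict

def build_train_test(ratings, selected_users, features, holdout=10):
    """Split ratings into train/test and attach DBpedia features to train rows.

    ratings: iterable of (user, movie, rating, timestamp) tuples (assumed layout)
    selected_users: set of user ids kept for the challenge
    features: dict movie -> list of feature strings; movies without a
              mapping are simply absent from the dict
    """
    by_user = defaultdict(list)
    for user, movie, rating, ts in ratings:
        if user in selected_users:               # step 2: keep selected users only
            by_user[user].append((ts, movie, rating))

    train, test = [], []
    for user, rows in by_user.items():
        rows.sort()                              # chronological order
        split = max(len(rows) - holdout, 0)
        for ts, movie, rating in rows[split:]:   # last `holdout` ratings -> test
            test.append((user, movie, rating))
        for ts, movie, rating in rows[:split]:
            if movie in features:                # step 3: drop unmapped entries
                train.append((user, movie, rating, "|".join(features[movie])))
    return train, test

def to_csv(train):
    """Step 4: serialize the train rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["user", "movie", "rating", "features"])
    writer.writerows(train)
    return buf.getvalue()
```

Joining the features with "|" mirrors the genre notation used in MovieTweetings; any other separator that does not occur in category names would work equally well.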

   The rest of this paper is focused on a way to provide mappings of movies
to DBpedia for another dataset: MovieTweetings. The linking of movies is per-
formed in a similar way as mappings for MovieLens. The paper also discusses
the benefits of available links for preference learning.
1 http://2015.ruleml.org/recsysrules-2015.html


    3     Dataset Mapping
    The goal is to provide a one-to-one mapping of movies from the MovieTweetings
    dataset [3] to URI identifiers in the Linked Open Data cloud. The dataset
    contains movie ratings extracted from Twitter for movies released from the
    1900s to the present. Each movie is represented by a title, release date and
    a set of assigned genres (Example: Rocky (1976), Drama | Sport). The main
    advantage compared to other existing datasets (MovieLens [6], Last.fm [1],
    Jester [4] or Book-Crossing [12]) is that it is updated daily. Because the
    dataset is built from ratings extracted from Twitter users around the world
    and is updated daily, we have to deal with the following issues: multilingual
    titles, freshness, inaccuracies and incompleteness of data.

    3.1   URI Alignment
    Our proposed approach queries DBpedia using a set of predefined
    SPARQL queries performed in the following order:

    Perfect match of a title: Listing 1.1 presents a SPARQL query to perform the
    perfect matching of the title and year according to the existing conventions for
    titles of movies in DBpedia (Example: Rocky, Rocky (film) and Rocky (1976
    film)).
        SELECT DISTINCT ?movie ?title ?category WHERE {
          ?movie rdf:type dbpedia-owl:Film ;
                 rdfs:label ?title .
          ?movie dcterms:subject ?category .
          ?category rdfs:label ?year .
          FILTER (
            (
              (str(?title)="%s" || str(?title)="%s (film)")
              &&
              regex(?year, "^%s film", "i")
            )
            ||
            str(?title)="%s (%s film)"
          )
        }
        ORDER BY ASC(?movie)

                Listing 1.1. SPARQL query - Perfect match of the title and year

    Partial match of a title: Listing 1.2 describes a modification of the FILTER
    condition as a relaxation of the patterns in titles.
        ...
        FILTER regex(?title, "%s", "i") .
        FILTER regex(?year, "%s", "i")
        ...

         Listing 1.2. SPARQL query - Partial match of the title and year
    Pattern-based match of an abstract: DBpedia abstracts tend to follow a
    common format, so we use the abstract as a candidate for pattern matching.
    The common format of an abstract is: Rocky is a 1976 film . . . or . . . Rocky
    . . . released 1976 . . . .
        ...
        FILTER (
          regex(?abstract, "^%s is a %s", "i")
          ||
          regex(?abstract, "^%s.*releas.*%s", "i")
        )
        ...

             Listing 1.3. SPARQL query - Pattern-based match of the abstract


    Any match of an abstract: The last case applies when none of the previously
    described patterns matches. For foreign-language movies, the abstract usually
    mentions the titles of the movie in other languages (Example: . . . also known as
    . . . or . . . (Italian: . . . , German: . . . )).
        ...
        FILTER regex(?abstract, "%s", "i") .
        FILTER regex(?year, "%s", "i")
        ...

                   Listing 1.4. SPARQL query - Any match of the abstract
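
    The ordered fallback through the four queries can be sketched in Python as
    shown here. This is only an illustration under stated assumptions: the names
    STRATEGIES, escape_literal and align are hypothetical, the two separate
    FILTER clauses of Listings 1.2 and 1.4 are folded into one expression with
    &&, and run_filter stands for any caller-supplied function that embeds the
    filter expression in the full query, sends it to a SPARQL endpoint and
    returns the list of matching URIs. Regular-expression metacharacters in
    titles are not escaped in this sketch.

```python
# Ordered (method name, FILTER expression) pairs following Listings 1.1-1.4.
# {t} is the movie title and {y} the release year.
STRATEGIES = [
    ("perfect",
     '((str(?title)="{t}" || str(?title)="{t} (film)") '
     '&& regex(?year, "^{y} film", "i")) '
     '|| str(?title)="{t} ({y} film)"'),
    ("partial",
     'regex(?title, "{t}", "i") && regex(?year, "{y}", "i")'),
    ("pattern",
     'regex(?abstract, "^{t} is a {y}", "i") '
     '|| regex(?abstract, "^{t}.*releas.*{y}", "i")'),
    ("any",
     'regex(?abstract, "{t}", "i") && regex(?year, "{y}", "i")'),
]

def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def align(title, year, run_filter, strategies=STRATEGIES):
    """Try each strategy in order; return (method, uris) for the first hit."""
    t = escape_literal(title)
    for method, template in strategies:
        uris = run_filter(template.format(t=t, y=year))
        if uris:
            return method, uris
    return None, []
```

    Returning the method name together with the URIs keeps the information
    needed to filter mappings by the method that produced them.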


    3.2   Confidence Values
    To express the basic relevance of a mapping to a DBpedia URI identifier,
    we provide a set of confidence values. Title Confidence (tc) is computed using
    the Levenshtein distance of titles, Year Confidence (yc) is computed as a simple
    distance of years and Genre Confidence (gc) uses the number of common genres.
    These values are available in the final mapping dataset and can be used together
    with the method name to filter results. The filter settings are left to
    the end user of the mapping dataset.
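
    A possible way to compute the three values is sketched below in Python. The
    text does not specify any normalization, so this sketch returns raw values
    (edit distance, absolute year difference, count of shared genres). The
    lowercasing of titles and the function names are assumptions made for
    illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def confidences(title, year, genres, cand_title, cand_year, cand_genres):
    """Return raw (tc, yc, gc); lower tc and yc and higher gc mean a closer match."""
    tc = levenshtein(title.lower(), cand_title.lower())
    yc = abs(year - cand_year)
    gc = len(set(genres) & set(cand_genres))
    return tc, yc, gc
```

    For instance, comparing the title "Rocky" with the label "Rocky (film)"
    gives a title distance of 7, because the seven characters of " (film)" have
    to be inserted.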

    3.3   Results and Statistics
    In this section we briefly describe the results of the mapping. We use a snapshot
    of the dataset downloaded on June 1, 2015. It contains over 21000 movies. At the
    time of publishing this paper, the mapping provides URIs for 71.3% of the movies.
    The remaining movies were not mapped due to the issues mentioned at the
    beginning of this section.
        Figure 1 depicts the distribution of release years for movies that were not
    successfully mapped to any URI. A large number of recent movies were not
    mapped because they are not available in DBpedia. The reason is that the
    current version of DBpedia was published on September 9, 2014 (based on
    Wikipedia dumps from April/May 2014)2. Figure 2 shows how often each method
    produced a successful mapping. The method that performs the perfect match of
    a title and a year is the most frequent (perfect: 86.61%, pattern: 4.38%,
    partial: 3.66%, any: 5.35%). Figure 3 provides an overview of the language
    distribution in titles.3 This summary presents the availability of mappings
    to DBpedia for various languages.

Fig. 1. Distribution of years for unmapped movies

Fig. 2. Overview of methods used for successful mapping


Fig. 3. Distribution of mapped/unmapped movies with respect to languages detected
in movie titles


2 http://wiki.dbpedia.org/news/dbpedia-version-2014-released
3 Languages detected in titles using LangID: https://github.com/saffsd/langid.py

   We also evaluated our approach using an existing mapping dataset for
MovieLens [2]. We selected this dataset because both original datasets
(MovieLens and MovieTweetings) are provided in the same format and the authors
of the MovieLens mapping dataset address the same task: mapping movies to
DBpedia. Furthermore, the dataset was manually corrected, so we can use it as
a ground truth. We ran the proposed mapping algorithm and compared its output
to the available mappings. Our approach achieved a match of over 98.5%; the
mismatches were either missing URIs or incorrect links that can be filtered
out using the confidence values.
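
One way to compute such a match rate is sketched below in Python. The function name and the dictionary layout (movie id to URI) are assumptions for illustration. A movie without a predicted URI counts as a mismatch, in line with the two error types named above.

```python
def agreement(predicted, ground_truth):
    """Fraction of ground-truth movies whose predicted URI equals the reference URI."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for movie, uri in ground_truth.items()
               if predicted.get(movie) == uri)
    return hits / len(ground_truth)
```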


4     Rule-based User Preferences
In this section we will briefly discuss the benefits of the augmented dataset from
the perspective of the challenge RecSysRules 2015.
    Association rules are recognized as one of the most suitable and understand-
able forms to represent knowledge and relations in data. Rule-based represen-
tations of user preferences can thus provide a desirable balance between the
quality of the representation and the understandability of the explanation for
the human user. User preferences may be used in different scenarios and use
cases, from elementary user profile representations to rating predictions and
recommendations. In this paper we consider a subset of association rules called
class association rules (CARs) [9]. These rules have a specific format in which
the right-hand side of a rule (the consequent) contains only one attribute, and
this attribute is the class attribute.

4.1     Illustrative Example
Let us consider the domain of movies and information about ratings provided by
users from MovieTweetings dataset. The presence of a user rating for a specific
movie can be considered as an interest clue - the implicit information about the
positive user preference for the movie. For ratings prediction tasks, the provided
ratings can be considered as a level of interest. However, it is beyond the scope of
this paper to elaborate on all possible tasks. The rest of this illustrative example
is focused on the positive-only feedback and the item recommendation task. Each
movie is represented by a set of features, namely its associated genres. Table 1
provides an example for one user from the MovieTweetings dataset.


      Table 1. Example of input data from MovieTweetings dataset (User Id: 455)

MovieId  Title                                   Features (Genres)                            Interest
468569   The Dark Knight (2008)                  "Action", "Crime", "Drama", "Thriller"       positive
1345836  The Dark Knight Rises (2012)            "Action", "Crime", "Thriller"                positive
1951264  The Hunger Games: Catching Fire (2013)  "Action", "Adventure", "Sci-Fi", "Thriller"  positive


    Elementary rule-based user preferences can be mined using an association
rule mining algorithm (e.g. the R arules package [5]). Examples of extracted rules
that represent the preferences of one specific user (User Id: 455, minConfidence:
0.1, minSupport: 0.1):

 – {Action} → {positive} (support=1.0, confidence=1.0)
 – {Action & Thriller} → {positive} (support=1.0, confidence=1.0)
 – {Crime} → {positive} (support=0.67, confidence=1.0)
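
The support and confidence values of these rules can be reproduced with a few lines of Python. The sketch below enumerates antecedents by brute force, which is enough for three movies; it is not the algorithm implemented in the arules package, and the function name mine_rules and the max_len limit are assumptions for illustration.

```python
from itertools import combinations

def mine_rules(transactions, target="positive", min_support=0.1,
               min_confidence=0.1, max_len=2):
    """Enumerate rules {features} -> {target} with their support and confidence.

    transactions: list of (set_of_features, label) pairs
    """
    n = len(transactions)
    items = sorted(set().union(*(feats for feats, _ in transactions)))
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(items, size):
            a = set(antecedent)
            covered = [label for feats, label in transactions if a <= feats]
            if not covered:
                continue
            both = sum(1 for label in covered if label == target)
            support = both / n                # rows with antecedent and target
            confidence = both / len(covered)  # target share among antecedent rows
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, support, confidence))
    return rules

# Input data from Table 1 (User Id: 455)
table1 = [
    ({"Action", "Crime", "Drama", "Thriller"}, "positive"),
    ({"Action", "Crime", "Thriller"}, "positive"),
    ({"Action", "Adventure", "Sci-Fi", "Thriller"}, "positive"),
]
rules = {antecedent: (s, c) for antecedent, s, c in mine_rules(table1)}
```

Here rules[("Action",)] and rules[("Action", "Thriller")] both evaluate to (1.0, 1.0), and the support of ("Crime",) is 2/3, which rounds to the 0.67 listed above.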

    The drawback of these preferences is that they consider only genres. This
limits the representation, since genres are too general: the total number of
unique genres in the dataset is 28. If we used those rules to find other movies
that may interest the user, the rules would match too many candidate movies
(2952, 1130 and 2717 matched movies respectively).


Table 2. Excerpt from an augmented feature set for MovieTweetings (User Id: 455)

Title                                   Features
The Dark Knight (2008)                  "American action thriller films", "Batman films",
                                        "Films directed by Christopher Nolan", . . .
The Dark Knight Rises (2012)            "Batman films", "Warner Bros. films", . . .
The Hunger Games: Catching Fire (2013)  "2010s science fiction films", "The Hunger Games (film series)",
                                        "American fantasy adventure films", . . .


    The mappings of movies to Linked Open Data (see the previous section for
more details) can help to overcome this issue. The Linked Open Data cloud contains
relevant information to augment the feature set and increase its granularity.
The URI that identifies the data related to a movie can be used
to extract additional features; in this example, the set of assigned categories4.
Table 2 shows an excerpt of the augmented feature set for the movies from
our example. We use a basic SPARQL query to extract all categories associated
with the specific movies.
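
Such a category query could be built as in the following Python sketch. It relies only on the dcterms:subject predicate (http://purl.org/dc/terms/subject) that identifies categories; the function name category_query and the basic character check on the URI are assumptions for illustration, and the exact query used to build Table 2 may differ.

```python
def category_query(movie_uri):
    """Build a SPARQL query listing the dcterms:subject categories of one movie."""
    # Reject characters that are not allowed inside an IRI reference <...>.
    if any(ch in movie_uri for ch in '<>"{}|^` \\'):
        raise ValueError("not a valid IRI: %r" % movie_uri)
    return (
        "SELECT DISTINCT ?category WHERE { "
        "<%s> <http://purl.org/dc/terms/subject> ?category }" % movie_uri
    )
```

The movie_uri argument is the DBpedia URI taken from the mapping dataset; the resulting string can be sent to any SPARQL endpoint that serves DBpedia data.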
    Sample of three representative rules mined on the augmented feature set
(User Id: 455, minConfidence: 0.1, minSupport: 0.1):

 – {Warner Bros. films} → {positive} (support=0.67, confidence=1.0)
 – {Batman films} → {positive} (support=0.67, confidence=1.0)
 – {The Hunger Games (film series)} → {positive} (support=0.33, confidence=1.0)

4 Categories are identified by predicate http://purl.org/dc/terms/subject

   Using the Linked Open Data cloud, we get more granular features for
representing movies. In total there are 10 950 unique categories for all movies
in the dataset. With a set of more granular categories assigned to each movie,
and rule-based user preferences that consider those categories, the number of
movies that match the preferences should decrease. For our illustrative
experiment, the numbers of matching movies are 859, 9 and 4 respectively. The
first rule contains a more general category, but the remaining two yield
an adequate number of candidates based on the preferences.
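
Counting the candidates for a rule amounts to a subset test between the rule antecedent and the feature set of each movie. A minimal Python sketch, with a hypothetical function name and a toy catalogue in place of the full dataset:

```python
def matching_movies(antecedent, catalogue):
    """Return ids of movies whose feature set contains every item of the antecedent."""
    wanted = set(antecedent)
    return [movie_id for movie_id, feats in catalogue.items()
            if wanted <= set(feats)]

# Toy catalogue built from the excerpt in Table 2 (features are incomplete).
catalogue = {
    468569: ["American action thriller films", "Batman films",
             "Films directed by Christopher Nolan"],
    1345836: ["Batman films", "Warner Bros. films"],
    1951264: ["2010s science fiction films", "The Hunger Games (film series)",
              "American fantasy adventure films"],
}
batman_candidates = matching_movies(["Batman films"], catalogue)
```

On this toy catalogue, batman_candidates contains the two Batman movies; on the full dataset the same test yields the candidate counts reported above.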


5     Conclusion and Future Work

In this paper we demonstrated an approach to augmenting the existing movie
ratings dataset MovieTweetings from the perspective of the RecSysRules 2015
challenge. We provide the mapping of movies to DBpedia as a dataset for further
experiments. It is available for download on GitHub5 and can be used for
other content-based recommender systems as well. We also discussed the benefits
of augmented feature sets for elementary rule-based representations of
user preferences. We plan to perform extensive experiments with rule-based user
preferences boosted by the augmented feature set. Last but not least, we plan
to improve the mapping patterns, offer mappings to other knowledge bases
and provide regular updates of the mapping dataset.


Acknowledgments. This work was supported by the Grant Agency of the
Czech Technical University in Prague, grant No. SGS14/104/OHK3/1T/18.


References

 1. Oscar Celma. Music Recommendation and Discovery: The Long Tail, Long Fail,
    and Long Play in the Digital Music Space. Springer Publishing Company, Incor-
    porated, 1st edition, 2010.
 2. Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, and
    Markus Zanker. Linked open data to support content-based recommender sys-
    tems. In Proceedings of the 8th International Conference on Semantic Systems,
    I-SEMANTICS ’12, pages 1–8, New York, NY, USA, 2012. ACM.
 3. Simon Dooms, Toon De Pessemier, and Luc Martens. MovieTweetings: a movie
    rating dataset collected from Twitter. In Workshop on Crowdsourcing and Human
    Computation for Recommender Systems, CrowdRec at RecSys 2013, 2013.
 4. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A
    constant time collaborative filtering algorithm. Inf. Retr., 4(2):133–151, July 2001.
 5. Michael Hahsler, Bettina Grün, and Kurt Hornik. arules - a computational envi-
    ronment for mining association rules and frequent item sets. Journal of Statistical
    Software, 14(15):1–25, 9 2005.
 6. Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. An
    algorithmic framework for performing collaborative filtering. In Proceedings of the
    22nd Annual International ACM SIGIR Conference on Research and Development
    in Information Retrieval, SIGIR ’99, pages 230–237, New York, NY, USA, 1999.
    ACM.
5 http://github.com/jaroslav-kuchar/MovieTweetingsMappings
 7. Tomáš Kliegr and Jaroslav Kuchař. Orwellian eye: Video recommendation with
    Microsoft Kinect. In Proceedings of the Conference on Prestigious Applications
    of Intelligent Systems (PAIS’14) collocated with European Conference on Artificial
    Intelligence (ECAI’14), pages 1227–1228. IOS Press, 2014.
 8. Jaroslav Kuchař and Tomáš Kliegr. Bag-of-entities text representation for client-
    side recommender systems. In First Workshop on Recommender Systems for Tele-
    vision and online Video (RecSysTV), ACM RecSys, 2014.
 9. Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association
    rule mining. In R. Agrawal, P. Stolorz, and G. Piatetsky-Shapiro, editors, Proceedings
    of the 4th International Conference on Knowledge Discovery and Data Mining
    (KDD-98), pages 80–86, 1998.
10. Heiko Paulheim and Johannes Fürnkranz. Unsupervised generation of data mining
    features from linked open data. In Proceedings of the 2nd International Conference
    on Web Intelligence, Mining and Semantics, WIMS ’12, New York, NY, USA, 2012.
    ACM.
11. Matthew Rowe. Semanticsvd++: Incorporating semantic taste evolution for pre-
    dicting ratings. In Proceedings of the 2014 IEEE/WIC/ACM International Joint
    Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) -
    Volume 01, WI-IAT ’14, pages 213–220, Washington, DC, USA, 2014. IEEE Com-
    puter Society.
12. Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Im-
    proving recommendation lists through topic diversification. In Proceedings of the
    14th International Conference on World Wide Web, WWW ’05, pages 22–32, New
    York, NY, USA, 2005. ACM.