
Augmenting a Feature Set of Movies Using Linked Open Data

Jaroslav Kuchar

jaroslav.kuchar@fit.cvut.cz, Web Intelligence Research Group, Faculty of Information Technology, Czech Technical University in Prague, Prague, Czech Republic

Augmenting a feature set using mappings to the Web of data is an up-and-coming way to enrich data in the original dataset. Those enrichments are especially valuable for recent preference learning algorithms and recommender systems. In this paper, we describe the process of mapping and augmenting the movie ratings dataset MovieTweetings from the perspective of the RecSysRules 2015 Challenge. Ad-hoc queries to DBpedia are used as the underlying concept. To the best of our knowledge, there is no existing mapping dataset of movies for MovieTweetings. We also provide a brief discussion of the benefits of the augmented feature set for an elementary rule-based representation of user preferences.

Keywords: Web of data, mapping, user preferences, association rules

1 Introduction

In this paper, we focus on a new type of problem that uses the Web of data to augment the feature set. Data in the original dataset are automatically mapped to Linked Open Data (LOD) identifiers, and then additional features are generated from public knowledge bases such as DBpedia. The large number of obtainable additional features can provide valuable information for various applications. Recommender systems and their preference learning algorithms have adopted feature set augmentation. The main goal is to overcome the issues with low granularity of available content descriptions on the one hand and data volume on the other hand [8]. Since association rules are recognized as one of the most suitable and understandable forms to represent knowledge and relations in data, we place emphasis on the benefits of enrichments for user preferences represented by a set of rules. Rule-based representations of user preferences can thus provide a desirable balance between the quality of the representation and the understandability of the explanation for the human user [7].

The main contribution of this paper is that it presents an approach for mapping the existing movie ratings dataset MovieTweetings [3] to DBpedia, makes the mapping dataset available, and discusses its benefits for rule-based user preferences. The presented approach relies on ad-hoc SPARQL queries instead of "guessing" URIs [10] or downloading all possible data to a local database and processing the data locally [2, 11]. To the best of our knowledge, there is no existing mapping dataset of movies from MovieTweetings to the Web of Data.

This paper is organized as follows. Section 2 examines the connection to the RecSysRules challenge and provides an overview of the dataset used for the challenge. Section 3 presents automatically generated mappings to the LOD cloud for an existing dataset, including details on the results. Section 4 briefly discusses the benefits of the mappings for rule-based representation of user preferences. Finally, Section 5 summarizes the results.

2 Connection to RecSysRules 2015

The RecSysRules 2015 challenge¹ has two focus areas: (1) rule learning algorithms applied to recommender problems and (2) using the Linked Open Data cloud for feature set extension. Since the mappings for the MovieTweetings dataset (as described in the rest of this paper) were not available at the time the challenge was organized, the challenge uses a semantically enriched version of the MovieLens dataset [6]. The DBpedia mappings for the MovieLens1M dataset [2] were used to link MovieLens to Linked Open Data. Please note that because not all movies are available in DBpedia, the mapping is missing for a fragment of the movies. For each movie in the mapping dataset, the organizers extracted a set of categories and datatype properties (e.g. release date or gross) as an example of the augmented feature set. The DBpedia URI identifiers were used to extract those features. In order to facilitate the distribution, the organizers do not provide the final dataset. Nevertheless, a Python script to download and build the dataset is available. This script downloads all necessary dependencies and creates the train CSV file as follows:

1. Download all dependencies, including the MovieLens ratings, mappings to DBpedia, augmented feature sets and configurations.
2. Filter ratings: select only ratings that correspond to a predefined set of users (1000 users randomly selected by the challenge organizers). The last 10 ratings of each selected user were also removed and moved to a test set, which was used to evaluate submitted results.
3. Augment the feature set of movies: for each movie that appears in the filtered ratings, merge the movie with its categories and properties from DBpedia. Entries without any available mapping are removed.
4. Export the train dataset as a CSV file.
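Steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the organizers' script: the function names, the tuple layout (user_id, movie_id, rating, timestamp) and the reading of "last 10 ratings" as the 10 most recent by timestamp are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def split_train_test(ratings, selected_users, holdout=10):
    """Step 2: keep selected users and hold out each user's last `holdout` ratings.

    `ratings` holds (user_id, movie_id, rating, timestamp) tuples; "last" is
    taken to mean most recent by timestamp, which is an assumption.
    """
    by_user = defaultdict(list)
    for row in ratings:
        if row[0] in selected_users:
            by_user[row[0]].append(row)
    train, test = [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[3])  # oldest first
        train.extend(rows[:-holdout])
        test.extend(rows[-holdout:])
    return train, test


def augment(train, features):
    """Step 3: merge each rating with its movie's features; drop unmapped movies."""
    return [row + ("|".join(sorted(features[row[1]])),)
            for row in train if row[1] in features]


def to_csv(rows):
    """Step 4: serialise the augmented train set as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["user_id", "movie_id", "rating", "timestamp", "features"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Here `features` is assumed to be a dictionary from movie id to a set of DBpedia-derived feature strings; movies absent from it are treated as having no mapping.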

The rest of this paper focuses on providing mappings of movies to DBpedia for another dataset: MovieTweetings. The linking of movies is performed in a similar way to the mappings for MovieLens. The paper also discusses the benefits of the available links for preference learning.

¹ http://2015.ruleml.org/recsysrules-2015.html

3 Dataset Mapping

The goal is to provide a one-to-one mapping of movies from the MovieTweetings dataset [3] to the Linked Open Data cloud in the form of URI identifiers. The dataset contains movie ratings extracted from Twitter for movies released from the 1900s to the present. Each movie is represented by a title, release date and a set of assigned genres (Example: Rocky (1976), Drama | Sport). The main advantage compared to other existing datasets (MovieLens [6], Last.fm [1], Jester [4] or Book-Crossing [12]) is the availability of daily updates. Because the dataset is based on ratings extracted from Twitter users around the world and is updated daily, we have to deal with the following issues: multilingualism in titles, freshness, inaccuracies and incompleteness of data.

3.1 URI Alignment

Our proposed approach queries DBpedia using a set of predefined SPARQL queries performed in the following order:

Perfect match of a title: Listing 1.1 presents a SPARQL query that performs a perfect match of the title and year according to the existing conventions for titles of movies in DBpedia (Example: Rocky, Rocky (film) and Rocky (1976 film)).

Partial match of a title: Listing 1.2 describes a modification of the FILTER condition as a relaxation of the patterns in titles.

    ...
    FILTER regex(?title, "%s", "i") .
    FILTER regex(?year, "%s", "i")
    ...

Listing 1.2. SPARQL query - Partial match of the title and year

Pattern-based match of an abstract: Based on the way DBpedia abstracts are formatted, we use the abstract as a possible candidate for pattern matching. The common format of an abstract is: Rocky is a 1976 film ... or ... Rocky ... released 1976 ....

    ...
    FILTER (
      regex(?abstract, "^%s is a %s", "i")
      ||
      regex(?abstract, "^%s.*releas.*%s", "i")
    )
    ...

Listing 1.3. SPARQL query - Pattern-based match of the abstract

Any match of an abstract: The last case applies when there is no match for any of the previously described patterns. For foreign-language titles, the abstract usually contains textual mentions of the titles of the movie in other languages (Example: ... also known as ... or ... (Italian: ..., German: ...)).

    ...
    FILTER regex(?abstract, "%s", "i") .
    FILTER regex(?year, "%s", "i")
    ...

Listing 1.4. SPARQL query - Any match of the abstract
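The ordered fallback between the four methods can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the mapping: `run_query` is a hypothetical callable that embeds a FILTER clause in a full query, sends it to DBpedia and returns a URI or `None`; the "perfect" clause is an assumed stand-in built from the title conventions named above, since that query is not reproduced here; the `.*` wildcards in the second abstract pattern are likewise an assumption; and titles are inserted without escaping regex or quote characters.

```python
from typing import Callable, List, Optional, Tuple


def perfect_labels(title: str, year: int) -> List[str]:
    # Title conventions named in the text: "Rocky", "Rocky (film)", "Rocky (1976 film)".
    return [title, f"{title} (film)", f"{title} ({year} film)"]


def build_filters(title: str, year: int) -> List[Tuple[str, str]]:
    """Return (method, FILTER clause) pairs in the order the methods are tried."""
    labels = ", ".join(f'"{label}"' for label in perfect_labels(title, year))
    return [
        # Assumed stand-in: the exact perfect-match query is not shown in the text.
        ("perfect", f"FILTER (str(?title) IN ({labels}))"),
        ("partial", f'FILTER regex(?title, "{title}", "i") . '
                    f'FILTER regex(?year, "{year}", "i")'),
        ("pattern", f'FILTER (regex(?abstract, "^{title} is a {year}", "i") || '
                    f'regex(?abstract, "^{title}.*releas.*{year}", "i"))'),
        ("any", f'FILTER regex(?abstract, "{title}", "i") . '
                f'FILTER regex(?year, "{year}", "i")'),
    ]


def align(title: str, year: int,
          run_query: Callable[[str], Optional[str]]) -> Optional[Tuple[str, str]]:
    """Try each method in turn and stop at the first one that yields a URI."""
    for method, clause in build_filters(title, year):
        uri = run_query(clause)
        if uri is not None:
            return method, uri
    return None
```

Returning the method name together with the URI mirrors the idea that the method used for a match can later serve as a filtering criterion.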

3.2 Confidence Values

To express the basic relevance of a mapping to a URI identifier from DBpedia, we provide a set of confidence values. Title Confidence (tc) is computed using the Levenshtein distance of titles, Year Confidence (yc) is computed as a simple distance of years, and Genre Confidence (gc) uses the number of common genres. These values are available in the final mapping dataset and can be used together with the method name for filtering of results. The setting of the filtering is left to the end user of the mapping dataset.
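A minimal Python sketch of the three values is given below. The text does not state whether the values are normalized, so the sketch returns the raw quantities (edit distance, absolute year difference, count of shared genres); the dictionary keys and the case-insensitive title comparison are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def confidence(movie: dict, candidate: dict) -> dict:
    """Raw tc, yc and gc for a dataset movie and a DBpedia candidate (unnormalized)."""
    return {
        "tc": levenshtein(movie["title"].lower(), candidate["title"].lower()),
        "yc": abs(movie["year"] - candidate["year"]),
        "gc": len(set(movie["genres"]) & set(candidate["genres"])),
    }
```

With raw values, lower tc and yc and higher gc indicate a more trustworthy mapping; an end user could, for example, keep only mappings with yc of at most 1.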

3.3 Results and Statistics

In this section we briefly describe the results of the mapping. We use a snapshot of the dataset downloaded on June 1, 2015. It contains over 21,000 movies. At the time of publishing of this paper, the mapping provides URIs for 71.3% of the movies. The remaining movies were not mapped due to the issues mentioned at the beginning of this section.

Figure 1 depicts the distribution of years for movies that were not successfully mapped to any URI. A large number of recent movies were not mapped because they are not available in DBpedia. The reason is that the current version of DBpedia was published on September 9, 2014 (based on Wikipedia dumps from April/May 2014)². Figure 2 shows how often each method produced a successful mapping. The method that performs the perfect match of a title and a year is the most frequent (perfect: 86.61%, pattern: 4.38%, partial: 3.66%, any: 5.35%). Figure 3 provides an overview of the language distribution in titles.³ This summary presents the availability of mappings to DBpedia for various languages.

We also evaluated our approach using an existing mapping dataset for MovieLens [2]. We selected this dataset because both original datasets (MovieLens and MovieTweetings) are provided in the same format and the authors of the mapping dataset for MovieLens deal with the same task: mapping of movies to DBpedia. Furthermore, the dataset was manually corrected, so we can use it as a ground truth. We ran the proposed mapping algorithm and compared its output with the available mappings. Our approach achieved a match of over 98.5%; the mismatches were either missing URIs or incorrect links that can be filtered out using the confidence values.

² http://wiki.dbpedia.org/news/dbpedia-version-2014-released
³ Languages detected in titles using LangID: https://github.com/saffsd/langid.py

4 Rule-based User Preferences

In this section we briefly discuss the benefits of the augmented dataset from the perspective of the RecSysRules 2015 challenge.

Association rules are recognized as one of the most suitable and understandable forms to represent knowledge and relations in data. Rule-based representations of user preferences can thus provide a desirable balance between the quality of the representation and the understandability of the explanation for the human user. The user preferences may be used in different scenarios or use cases, from elementary user profile representations to rating predictions and recommendations. In this paper we consider a subset of association rules called class association rules (CARs) [9]. These rules have a specific format in which the right-hand side of a rule (the consequent) contains only one attribute, and this attribute is the classification class attribute.

4.1 Illustrative Example

Let us consider the domain of movies and the information about ratings provided by users from the MovieTweetings dataset. The presence of a user rating for a specific movie can be considered an interest clue: implicit information about a positive user preference for the movie. For rating prediction tasks, the provided ratings can be considered a level of interest. However, it is beyond the scope of this paper to elaborate on all possible tasks. The rest of this illustrative example focuses on positive-only feedback and the item recommendation task. Each movie is represented by a basic set of features: its associated genres. Table 1 provides an example for one user from the MovieTweetings dataset.

The elementary rule-based user preferences can be mined using an association rule mining algorithm (e.g. the R arules package [5]). Examples of extracted rules that represent the user preferences of one specific user (User Id: 455, minConfidence: 0.1, minSupport: 0.1):

- {Action} → {positive} (support=1.0, confidence=1.0)
- {Action & Thriller} → {positive} (support=1.0, confidence=1.0)
- {Crime} → {positive} (support=0.67, confidence=1.0)
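How support and confidence arise for such class association rules can be shown with a small Python sketch. It is a brute-force enumeration for illustration, not the Apriori-based procedure of the arules package, and the three transactions below are hypothetical data chosen so that the resulting values match the rules listed above.

```python
from itertools import combinations


def mine_cars(transactions, min_support=0.1, min_confidence=0.1, max_len=2):
    """Enumerate class association rules of the form antecedent -> class label.

    `transactions` is a list of (feature_set, class_label) pairs.
    support    = share of all transactions containing antecedent and label
    confidence = share of antecedent-containing transactions carrying the label
    """
    n = len(transactions)
    items = sorted({f for feats, _ in transactions for f in feats})
    labels = sorted({label for _, label in transactions})
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(items, size):
            covered = [label for feats, label in transactions
                       if set(antecedent) <= feats]
            if not covered:
                continue
            for label in labels:
                hits = covered.count(label)
                support = hits / n
                confidence = hits / len(covered)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((antecedent, label,
                                  round(support, 2), round(confidence, 2)))
    return rules


# Hypothetical positive-only feedback of one user: every rated movie is "positive".
rated = [
    ({"Action", "Thriller", "Crime"}, "positive"),
    ({"Action", "Thriller", "Crime"}, "positive"),
    ({"Action", "Thriller"}, "positive"),
]
rules = mine_cars(rated)
```

Because the feedback is positive-only, every rule has confidence 1.0, and only the support distinguishes rules from one another.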

The drawback of the preferences described above is that they consider only genres as the key component. This is a limiting factor of the representation, since genres are too general. The total number of unique genres in the dataset is 28. If we want to use those rules to find other movies that may interest the user, the rules match too many movies as possible candidates (2952, 1130 and 2717 matched movies, respectively).

The mappings of movies to Linked Open Data (see Section 3 for more details) can help to overcome this issue. The Linked Open Data cloud contains relevant information to augment the feature set and increase the granularity. The URI, as an identifier of the data related to the associated movie, can be used to extract additional features; in this example, the set of assigned categories⁴. Table 2 shows an excerpt of the augmented feature set for the movies from our example. We use a basic SPARQL query to extract all categories associated with the specific movies.
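A category query of this kind might look like the Python sketch below, which builds the query text for one movie URI and reads the categories out of a response in the standard SPARQL JSON results format. The function names are hypothetical, the sketch does not contact any endpoint, and the exact query used for the mapping dataset may differ.

```python
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def category_query(movie_uri: str) -> str:
    """Build a SPARQL query selecting all categories attached to one movie URI."""
    return (f"SELECT ?category WHERE {{ "
            f"<{movie_uri}> <{DCTERMS_SUBJECT}> ?category }}")


def parse_categories(result: dict) -> list:
    """Extract category URIs from a SPARQL JSON result (results -> bindings)."""
    return [binding["category"]["value"]
            for binding in result["results"]["bindings"]]


query = category_query("http://dbpedia.org/resource/Rocky")
```

The resulting category URIs can then be appended to the genre features of each movie before rule mining.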

A sample of three representative rules mined on the augmented feature set (User Id: 455, minConfidence: 0.1, minSupport: 0.1):

- {Warner Bros. films} → {positive} (support=0.67, confidence=1.0)
- {Batman films} → {positive} (support=0.67, confidence=1.0)
- {The Hunger Games (film series)} → {positive} (support=0.33, confidence=1.0)

Using the Linked Open Data cloud, we get more granular features for representations of movies. In total there are 10 950 unique categories for all movies in the dataset. Given a set of more granular categories assigned to each movie and rule-based user preferences that consider those categories, the number of movies that match the preferences should decrease. For our illustrative experiment, the numbers of matching movies are 859, 9 and 4, respectively. The first rule contains a more general category, but the remaining two provide an adequate number of candidates based on the preferences.

⁴ Categories are identified by the predicate http://purl.org/dc/terms/subject
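The narrowing effect can be made concrete with a short Python sketch that counts the movies matched by a rule antecedent. The four-movie catalogue is hypothetical toy data (the identifiers and feature assignments are invented for illustration) and is unrelated to the counts reported above.

```python
def candidates(antecedent, catalogue):
    """Return ids of movies whose feature set contains every item of the antecedent."""
    wanted = set(antecedent)
    return sorted(movie for movie, features in catalogue.items()
                  if wanted <= features)


# Hypothetical toy catalogue: genres and DBpedia categories share one feature set.
catalogue = {
    "m1": {"Action", "Thriller", "Batman films", "Warner Bros. films"},
    "m2": {"Action", "Crime", "Warner Bros. films"},
    "m3": {"Action", "Drama"},
    "m4": {"Comedy"},
}

genre_matches = candidates({"Action"}, catalogue)           # broad genre rule
category_matches = candidates({"Batman films"}, catalogue)  # narrow category rule
```

In this toy setting the genre rule matches three of the four movies while the category rule matches one, which is the same qualitative effect as in the experiment.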

5 Conclusion and Future Work

In this paper we demonstrate an approach to augmenting the existing movie ratings dataset MovieTweetings from the perspective of the RecSysRules 2015 challenge. We provide the dataset as a mapping of movies to DBpedia for further experiments. It is available for download on GitHub⁵. It can be used for other content-based recommender systems as well. We also discussed the benefits of augmented feature sets for elementary rule-based representations of user preferences. We plan to perform extensive experiments with rule-based user preferences boosted by the augmented feature set. Last but not least, we plan to improve the mapping patterns, offer mappings to other knowledge bases and provide updates of the mapping dataset on a regular basis.

Acknowledgments. This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS14/104/OHK3/1T/18.

⁵ http://github.com/jaroslav-kuchar/MovieTweetingsMappings

References

1. Oscar Celma. Music Recommendation and Discovery: The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer Publishing Company, Incorporated, 1st edition, 2010.

2. Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, and Markus Zanker. Linked open data to support content-based recommender systems. In Proceedings of the 8th International Conference on Semantic Systems, I-SEMANTICS '12, pages 1-8, New York, NY, USA, 2012. ACM.

3. Simon Dooms, Toon De Pessemier, and Luc Martens. MovieTweetings: a movie rating dataset collected from Twitter. In Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys 2013, 2013.

4. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133-151, July 2001.

5. Michael Hahsler, Bettina Grün, and Kurt Hornik. arules - a computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15):1-25, September 2005.

6. Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 230-237, New York, NY, USA, 1999. ACM.

7. Tomas Kliegr and Jaroslav Kuchar. Orwellian eye: Video recommendation with Microsoft Kinect. In Proceedings of the Conference on Prestigious Applications of Intelligent Systems (PAIS'14) collocated with European Conference on Artificial Intelligence (ECAI'14), pages 1227-1228. IOS Press, 2014.

8. Jaroslav Kuchar and Tomas Kliegr. Bag-of-entities text representation for client-side recommender systems. In First Workshop on Recommender Systems for Television and online Video (RecSysTV), ACM RecSys, 2014.

9. Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In Agrawal R., Stolorz P., and Piatetsky-Shapiro G., editors, Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 80-86, 1998.

10. Heiko Paulheim and Johannes Fürnkranz. Unsupervised generation of data mining features from linked open data. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, WIMS '12, New York, NY, USA, 2012. ACM.

11. Matthew Rowe. SemanticSVD++: Incorporating semantic taste evolution for predicting ratings. In Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 01, WI-IAT '14, pages 213-220, Washington, DC, USA, 2014. IEEE Computer Society.

12. Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web, WWW '05, pages 22-32, New York, NY, USA, 2005. ACM.