
Mariano Consens
University of Toronto, 10 King's College Rd., Toronto, Ontario, M5S-3G4, Canada
consens@cs.toronto.edu

Oktie Hassanzadeh
University of Toronto, 10 King's College Rd., Toronto, Ontario, M5S-3G4, Canada
oktie@cs.toronto.edu

April 20, 2009

The Linked Movie Database (LinkedMDB) project provides a demonstration of the first open linked dataset connecting several major existing (and highly popular) movie web resources. The database exposed by LinkedMDB contains millions of RDF triples with hundreds of thousands of RDF links to existing web data sources that are part of the growing Linking Open Data cloud, as well as to popular movie-related web pages such as IMDb. LinkedMDB uses a novel way of creating and maintaining large quantities of high quality links: it employs state-of-the-art approximate join techniques for finding links, and provides additional RDF metadata about the quality of the links and the techniques used for deriving them.

1. INTRODUCTION

Movies are highly popular on the Web. There are several web resources dedicated to movies and many others containing movie-related information. Creating a single source of information about movies that contains information from existing open web data sources and links to other related data sources is a challenging task and the goal of the Linked Movie Data Base (LinkedMDB) project. LinkedMDB provides a high quality source of RDF data about movies (http://linkedmdb.org) that appeals to a wide audience, enabling further demonstrations of the linked data capabilities. Furthermore, LinkedMDB demonstrates the value of a novel class of tool to facilitate high volume and dense interlinking of RDF datasets.

Figure 1 shows an example of the entities and the interlinking in LinkedMDB. Identifying the entities in different data sources that should be interlinked involves several challenges. In some cases, access to the data in the target data source is limited; for example, only the titles of the movies and their associated URLs can be obtained. In such cases, matching only the titles may not be sufficient, because the same title can be represented differently. Exact matching of the movie title “The Shining” in LinkedMDB would miss the owl:sameAs link to the movie title “The Shining (film)” in DBpedia. Similarly, the movie titles “A Thousand and One Nights” and “1001 Nights” would not match. Many non-English movie titles have different spellings in English, e.g., the titles “Adu Puli Attam” and “Sacco and Vanzetti” in LinkedMDB are written as “Aadu Puli Aattam” and “Sacco e Vanzetti” in DBpedia. This calls for approximate (or fuzzy) string matching for finding owl:sameAs links between the two sources.

[Figure 1: Example of entities and interlinking in LinkedMDB. Labels recoverable from the figure: MusicBrainz “Béla Bartók”; Geonames “United States (US)” and “Great Britain (GB)” (foaf:based_near); Lingvoj “English”; IMDb, RottenTomatoes and Freebase “The Shining”; Wikipedia “The Shining (film)”; owl:sameAs targets “The_Shining_(film)”, “Stanley_Kubrick” and “Béla_Bartók”.]

However, exact or approximate matching of movie titles could result in false matches. With exact matching, the movie “Chicago” (1927) would link to the movie “Chicago” (2002). With approximate matching, “Spiderman 1” and “Spiderman 2” have similar titles but are not the same movie. The same holds for the movies “Face to Face” and “Face to Fate”, and for some adult movies whose names are very similar to those of popular Hollywood movies. Although using a proper string similarity function and specific record matching techniques (e.g., using additional structural and co-occurrence information as in [3, 7]) could significantly reduce the number of false matches, achieving 100% accuracy is not always possible. Also, higher accuracy may result in fewer correct links, as shown in the accuracy evaluation in Section 4.3. Therefore, it is reasonable for the publisher to include metadata about the links and how they are obtained. This approach has several advantages. Users will be able to choose the type of links and the level of accuracy that suit their application. Furthermore, it simplifies quality assessment: users only need to judge the quality of existing links, rather than provide the links themselves as in User Contributed Interlinking [6].

[Table: overall statistics of LinkedMDB. Rows: total number of triples; number of interlinks to LOD cloud; number of links to movie websites; number of entities in LinkedMDB. Values not recoverable from the source.]

In this paper, we present an overview of the movie data triplification effort showcased in LinkedMDB (Section 2). We then describe the interlinking of the data sources (Section 3), give a brief overview of the approximate string matching techniques used for link discovery in relational data, and present an evaluation of the performance of some of these techniques in LinkedMDB (Section 4). We then discuss the need for linkage metadata and our approach to providing such data in LinkedMDB (Section 5). We conclude with a brief discussion of a few future directions (Section 6).

2. TRIPLIFICATION OF MOVIE DATA

2.1 Data sources

Currently there are several sources of information on the web of documents:
• IMDb is the biggest database of movies on the Web and provides a huge variety of up-to-date information about movies. Although IMDb data is available for download and personal use, it is strongly protected by copyright laws. We did transform the IMDb data to RDF, but we could not get permission to publish it. Our implementation therefore does not include any information from IMDb, although we include external links to IMDb pages whenever possible.
• Freebase is an open, shared database of the world’s knowledge. The “film” category of Freebase is one of the biggest and most complete domains in this database, with more than 38,000 movies and thousands of other data items related to movies. Freebase has open data and has recently made its data available for download. Therefore, we use Freebase as the nucleus of our database, although we do not limit our data sources to the information available on Freebase.
• OMDB is another open data source of movies. The dataset currently contains information about more than 9,000 movies, and its data is available for public use.
• DBpedia (Wikipedia) Movies: DBpedia contains a huge amount of information about more than 36,000 movies and thousands of related data items. We provide owl:sameAs links to DBpedia. Apart from extra information available in Freebase, such as movie characters and many other user-contributed data, we hope to serve additional information about movies and links to other data sources. This can be achieved due to the fact [...]
• RottenTomatoes.com is another movie website with information about movies. RottenTomatoes data is not available for download and public use; however, we include foaf:page links to the RottenTomatoes website as well.
• Stanford Movie Database is a free database of movie information, initially provided as real test data for students. This database is relatively old (last updated in November 1999) and therefore includes only a few data items that are not present in Freebase. We nevertheless plan to extend our database with the additional information that can be obtained from this source.

2.2 Entities and Facts

Our database currently contains information about several entities including but not limited to movies, actors, movie characters, directors, producers, editors, writers, music composers and soundtracks, movie ratings and festivals. Table 2 shows the statistics for major entities in LinkedMDB. A list of all entities and facts in LinkedMDB will be made available in the extended version of this paper.

[Table 2: Statistics for major entities in LinkedMDB. Rows: Film, Actor, Director, Writer, Producer, Music Contributor, Cinematographer, Interlink. Counts not recoverable from the source.]

3. INTERLINKING DATA SOURCES

LinkedMDB provides links to several Linking Open Data (LOD) cloud datasets. Among these are links to DBpedia, YAGO, flickr-wrapper, Geonames and lingvoj. Moreover, several data items are linked to external web pages such as pages on Freebase, IMDb, OMDB, RottenTomatoes and Wikipedia.

LinkedMDB is connected to the following LOD data sources:
• DBpedia/YAGO: Apart from the movie titles, person names (such as actors, writers and composers) are linked to the related resources in the DBpedia and YAGO data sources with owl:sameAs links.
• Geonames: We interlink the countries of the movies to the Geonames dataset with foaf:based_near links. This is done by matching the names of the countries in the two datasets. These links could be extended by matching featured locations of the movies to Geonames items.
• FlickrWrapper: The movies are linked to their photo collections using the FlickrWrapper web service. These links are derived from the corresponding DBpedia URIs of the movies.
• RDF book mashup: Movies are linked to their related stories.

Apart from links to LOD datasets, we have also set up foaf:page links to external webpages:
• Freebase.com pages.
• IMDb.com movies and actor profiles.
• RottenTomatoes.com movie information and reviews.

Other potential links include links to external webpages from OMDB, to box-office and movie show-time websites, and to the homepages of the movies.

4. APPROXIMATE STRING MATCHING FOR LINK DISCOVERY

As mentioned earlier, link discovery often requires approximate matching of strings. In LinkedMDB, several links to other data sources are found using string matching. In this section, we first briefly overview a set of string similarity functions and state-of-the-art approximate string join techniques that are used (or can be used) for link discovery in LinkedMDB and in similar settings. We then present the results of an evaluation of the quality of the links found using different similarity functions.

4.1 String Similarity Measures

There exists a wide variety of similarity functions for comparing strings. The similarity measures we discuss here share one or both of the following properties:
• High scalability: Various techniques proposed in the literature, described in Section 4.2, enhance the performance of the similarity join operation using q-grams along with these measures.
• High accuracy: Previous work has shown that in most scenarios these measures perform better than, or as well as, other string similarity measures in terms of accuracy. Specifically, these measures have shown good accuracy in name-matching tasks [4] and in approximate selection [5].

In the set-based measures below, r also denotes the set of q-grams (i.e., sequences of q consecutive characters of a string) of a string record r. For example, for r = ‘db lab’, the q-gram set is {‘d’, ‘db’, ‘b ’, ‘ l’, ‘la’, ‘ab’, ‘b’} for tokenization using 2-grams. In certain cases, a weight may be associated with each token.
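As an illustration, the 2-gram tokenization in the example above can be sketched in Python. The function name `qgrams` and the `#` padding character are assumptions of this sketch; the text does not state which padding symbol is used, and the padded first and last tokens appear there simply as ‘d’ and ‘b’.

```python
def qgrams(s: str, q: int = 2, pad: str = "#") -> list[str]:
    """Return the q-grams of s in order, padding both ends with q-1 pad characters.

    The '#' padding symbol is an assumption made for this sketch.
    """
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


tokens = qgrams("db lab")
print(tokens)
```

Duplicates are kept in the returned list so that term frequencies can be counted later; wrapping the result in `set(...)` gives the q-gram set used by the set-based measures.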

4.1.1 Edit Similarity

Edit distance between two string records r1 and r2 is defined as the transformation cost of r1 to r2, tc(r1, r2), which is equal to the minimum cost of the edit operations applied to r1 to transform it into r2. Edit operations include character copy, insert, delete and substitute. The edit similarity is defined as:

\[ sim_{edit}(r_1, r_2) = 1 - \frac{tc(r_1, r_2)}{\max\{|r_1|, |r_2|\}} \]

There is a cost associated with each edit operation, and several cost models have been proposed for this measure. The most commonly used one, Levenshtein edit distance, which we refer to as edit distance in this paper, uses unit cost for all operations except copy, which has cost zero.
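A minimal Python sketch of edit similarity under the Levenshtein cost model follows. The function names are chosen for this illustration, and returning 1.0 for two empty strings is an assumption, since the definition leaves that case open.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance: unit cost for insert, delete, substitute; copy is free."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitute))
        prev = curr
    return prev[-1]


def sim_edit(r1: str, r2: str) -> float:
    """Edit similarity: 1 - tc(r1, r2) / max(|r1|, |r2|)."""
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1.0 - levenshtein(r1, r2) / longest


print(sim_edit("Face to Face", "Face to Fate"))
```

The two titles differ by a single substitution, so their edit similarity is high even though they name different movies, which is the kind of false match discussed in the introduction.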

4.1.2 Jaccard and Weighted Jaccard

Jaccard similarity is the fraction of tokens in r1 and r2 that are present in both. Weighted Jaccard similarity is the weighted version of Jaccard similarity, i.e.,

\[ sim_{WJaccard}(r_1, r_2) = \frac{\sum_{t \in r_1 \cap r_2} w_R(t)}{\sum_{t \in r_1 \cup r_2} w_R(t)} \]

where $w_R(t)$ is a weight function that reflects the commonality of the token t in the relation R. We choose the RSJ (Robertson-Sparck Jones) weight for the tokens, which was shown to be more effective than the commonly used Inverse Document Frequency (IDF) weights [5]:

\[ w_R(t) = \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right) \]

where N is the number of tuples in the base relation R and $n_t$ is the number of tuples in R containing the token t.
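The two measures can be sketched in Python as follows. The helper names, the `#` padding character and the tiny example relation are assumptions of this illustration. Note that the RSJ weight becomes negative for a token that occurs in more than half of the records; the sketch does not treat that case specially.

```python
import math
from collections import Counter


def qgram_set(s, q=2, pad="#"):
    """Set of padded q-grams of s ('#' padding is an assumption of this sketch)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(r1, r2):
    """Fraction of the tokens of r1 and r2 that occur in both."""
    a, b = qgram_set(r1), qgram_set(r2)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def document_frequencies(relation):
    """n_t for every token: number of records of the base relation containing t."""
    return Counter(t for r in relation for t in qgram_set(r))


def rsj_weight(t, df, n):
    """RSJ weight log((N - n_t + 0.5) / (n_t + 0.5))."""
    nt = df.get(t, 0)
    return math.log((n - nt + 0.5) / (nt + 0.5))


def weighted_jaccard(r1, r2, df, n):
    """Weighted Jaccard: weight of shared tokens over weight of all tokens."""
    a, b = qgram_set(r1), qgram_set(r2)
    num = sum(rsj_weight(t, df, n) for t in a & b)
    den = sum(rsj_weight(t, df, n) for t in a | b)
    return num / den if den else 0.0


# Hypothetical base relation, used only to derive token weights.
relation = ["The Shining", "Chicago", "Face to Face", "Face to Fate"]
df, n = document_frequencies(relation), len(relation)
print(jaccard("The Shining", "The Shining (film)"))
print(weighted_jaccard("The Shining", "The Shining (film)", df, n))
```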

4.1.3 Measures from IR

A well-studied problem in information retrieval is, given a query and a collection of documents, to return the documents most relevant to the query. In the measures in this part, records are treated as documents and q-grams as the words (tokens) of the documents. Therefore, the same techniques used for finding documents relevant to a query can be used to return records similar to a query string. In the rest of this section, we present three measures that previous work has shown to perform well for the approximate selection problem [5].

Cosine w/tf-idf

The tf-idf cosine similarity is a well-established measure in the IR community that leverages the vector space model. This measure determines the closeness of the input strings r1 and r2 by first transforming the strings into unit vectors and then measuring the angle between their corresponding vectors. The cosine similarity with tf-idf weights is given by:

\[ sim_{Cosine}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} w_{r_1}(t) \cdot w_{r_2}(t) \]

where $w_{r_1}(t)$ and $w_{r_2}(t)$ are the normalized tf-idf weights for each common token in r1 and r2 respectively. The normalized tf-idf weight of token t in a given string record r is defined as follows:

\[ w_r(t) = \frac{w'_r(t)}{\sqrt{\sum_{t' \in r} w'_r(t')^2}}, \qquad w'_r(t) = tf_r(t) \cdot idf(t) \]

where $tf_r(t)$ is the term frequency of token t within string r and $idf(t)$ is the inverse document frequency with respect to the entire relation R.
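The following Python sketch computes this measure over 2-grams. The text does not spell out the idf formula, so the sketch assumes the common form idf(t) = log(N / n_t); tokens that never occur in the base relation are given weight 0, which is also an assumption, as are the helper names and the example relation.

```python
import math
from collections import Counter


def qgrams(s, q=2, pad="#"):
    """Padded q-grams of s, duplicates kept ('#' padding is an assumption)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def idf_table(relation):
    """Assumed idf(t) = log(N / n_t) over the base relation."""
    n = len(relation)
    df = Counter(t for r in relation for t in set(qgrams(r)))
    return {t: math.log(n / nt) for t, nt in df.items()}


def tfidf_vector(r, idf):
    """Normalized tf-idf weights w_r(t) of the tokens of r (a unit vector)."""
    tf = Counter(qgrams(r))
    raw = {t: f * idf.get(t, 0.0) for t, f in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else {}


def sim_cosine(r1, r2, idf):
    """Sum over common tokens of the product of normalized tf-idf weights."""
    v1, v2 = tfidf_vector(r1, idf), tfidf_vector(r2, idf)
    return sum(w * v2[t] for t, w in v1.items() if t in v2)


# Hypothetical base relation for the example.
relation = ["The Shining", "The Shining (film)", "Chicago"]
idf = idf_table(relation)
print(sim_cosine("The Shining", "The Shining (film)", idf))
```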

4.1.4 BM25

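The BM25 similarity score for a query r1 and a string record r2 is defined as follows:

\[ sim_{BM25}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} \hat{w}_{r_1}(t) \cdot w_{r_2}(t) \]

\[ \hat{w}_{r_1}(t) = \frac{(k_3 + 1) \cdot tf_{r_1}(t)}{k_3 + tf_{r_1}(t)}, \qquad w_{r_2}(t) = w_R^{(1)}(t) \cdot \frac{(k_1 + 1) \cdot tf_{r_2}(t)}{K(r_2) + tf_{r_2}(t)} \]

\[ w_R^{(1)}(t) = \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right), \qquad K(r) = k_1 \left( (1 - b) + b \cdot \frac{|r|}{avg_{rl}} \right) \]

where $tf_r(t)$ is the frequency of the token t in string record r, |r| is the number of tokens in r, $avg_{rl}$ is the average number of tokens per record, N is the number of records in the relation R, $n_t$ is the number of records containing the token t, and $k_1$, $k_3$ and b are independent parameters. We set these parameters as described in [5], where $k_1 \in [1, 2]$, $k_3 = 8$ and $b \in [0.6, 0.75]$.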
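A Python sketch of a BM25 scorer over 2-grams is given below. It assumes the standard Okapi BM25 form with the RSJ weight as the collection-level term; the concrete values k1 = 1.5 and b = 0.675 are simply mid-range picks from the intervals quoted above, and the function names and example relation are hypothetical.

```python
import math
from collections import Counter


def qgrams(s, q=2, pad="#"):
    """Padded q-grams of s, duplicates kept ('#' padding is an assumption)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def bm25_scorer(relation, k1=1.5, k3=8.0, b=0.675):
    """Build a BM25 scoring function from collection statistics of `relation`."""
    docs = [Counter(qgrams(r)) for r in relation]
    n = len(docs)
    avgrl = sum(sum(d.values()) for d in docs) / n  # average tokens per record
    df = Counter(t for d in docs for t in d)        # n_t per token

    def score(query, record):
        tq, tr = Counter(qgrams(query)), Counter(qgrams(record))
        k = k1 * ((1 - b) + b * sum(tr.values()) / avgrl)
        total = 0.0
        for t in tq.keys() & tr.keys():
            rsj = math.log((n - df[t] + 0.5) / (df[t] + 0.5))
            w_query = (k3 + 1) * tq[t] / (k3 + tq[t])
            w_record = rsj * (k1 + 1) * tr[t] / (k + tr[t])
            total += w_query * w_record
        return total

    return score


# Hypothetical base relation for the example.
relation = ["The Shining (film)", "Chicago", "Face to Face",
            "Sacco e Vanzetti", "Aadu Puli Aattam"]
bm25 = bm25_scorer(relation)
print(bm25("The Shining", "The Shining (film)"), bm25("The Shining", "Chicago"))
```

Unlike the previous measures, BM25 scores are not confined to [0, 1], so thresholds chosen for other measures do not carry over directly.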

4.1.5 Hidden Markov Model

Approximate string matching can be modeled by a discrete Hidden Markov process, which has shown better performance than Cosine w/tf-idf in the IR literature, and high accuracy and good running time for approximate selection [5]. The HMM similarity function accepts two string records r1 and r2 and returns the probability of generating r1 given that r2 is a similar record:

\[ sim_{HMM}(r_1, r_2) = \prod_{t \in r_1} \left( a_0 P(t|GE) + a_1 P(t|r_2) \right) \]

where $a_0$ and $a_1 = 1 - a_0$ are the transition probabilities of the Markov model, and $P(t|GE)$ and $P(t|r_2)$ are given by:

\[ P(t|r_2) = \frac{\text{number of times } t \text{ appears in } r_2}{|r_2|}, \qquad P(t|GE) = \frac{\sum_{r \in R} \text{number of times } t \text{ appears in } r}{\sum_{r \in R} |r|} \]
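The HMM measure can be sketched in Python as below. The value a0 = 0.2 is an arbitrary choice for illustration (the text does not give one), the product is taken over the distinct 2-grams of r1, and the names and example relation are hypothetical.

```python
from collections import Counter


def qgrams(s, q=2, pad="#"):
    """Padded q-grams of s, duplicates kept ('#' padding is an assumption)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def hmm_scorer(relation, a0=0.2):
    """Build sim_HMM from the token statistics of the base relation.

    a0 = 0.2 is an assumed value; a1 = 1 - a0.
    """
    a1 = 1.0 - a0
    general = Counter(t for r in relation for t in qgrams(r))
    total = sum(general.values())  # sum of |r| over all records

    def sim_hmm(r1, r2):
        counts2 = Counter(qgrams(r2))
        len2 = sum(counts2.values())
        prob = 1.0
        for t in set(qgrams(r1)):
            p_general = general[t] / total  # P(t | GE)
            p_record = counts2[t] / len2    # P(t | r2)
            prob *= a0 * p_general + a1 * p_record
        return prob

    return sim_hmm


# Hypothetical base relation for the example.
relation = ["The Shining", "The Shining (film)", "Chicago", "Face to Face"]
sim_hmm = hmm_scorer(relation)
print(sim_hmm("The Shining", "The Shining (film)"), sim_hmm("The Shining", "Chicago"))
```

A token of r1 that occurs neither in r2 nor anywhere in the base relation drives the whole product to zero, and products over long strings become very small, so a practical implementation would likely work with sums of logarithms.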

4.2 Approximate String Join Techniques

An advantage of the similarity predicates described above is that they can be implemented declaratively using standard SQL queries over any relational DBMS. This is particularly useful because many existing linked data sources are published using linked data publication tools that operate over relational data sources, such as D2R Server, Triplify or OpenLink Virtuoso. Some of the similarity predicates can be made scalable to huge web data sources using specialized, high-performance approximate join algorithms. Specifically, the Enumeration (Enum) and Weighted Enumeration (WtEnum) signature generation algorithms can be used to significantly improve the running time of the join with Jaccard and weighted Jaccard predicates [1]. In addition, novel indexing and optimization techniques can be utilized to make the join even faster [2].
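To illustrate the declarative formulation, the sketch below expresses an unweighted Jaccard join as a single SQL query, run here through Python's built-in `sqlite3` module. The table and column names are invented for this example, and the query is a plain token join; it does not implement the Enum or WtEnum signature schemes of [1].

```python
import sqlite3


def qgram_set(s, q=2, pad="#"):
    """Set of padded q-grams of s ('#' padding is an assumption of this sketch)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


JACCARD_JOIN = """
WITH q_len AS (SELECT id, COUNT(*) AS n FROM q_tok GROUP BY id),
     b_len AS (SELECT id, COUNT(*) AS n FROM b_tok GROUP BY id),
     common AS (SELECT q.id AS qid, b.id AS bid, COUNT(*) AS n
                FROM q_tok q JOIN b_tok b ON q.token = b.token
                GROUP BY q.id, b.id)
SELECT c.qid, c.bid, 1.0 * c.n / (ql.n + bl.n - c.n) AS sim
FROM common c
JOIN q_len ql ON ql.id = c.qid
JOIN b_len bl ON bl.id = c.bid
WHERE 1.0 * c.n / (ql.n + bl.n - c.n) >= ?
ORDER BY c.qid, sim DESC
"""


def jaccard_join(query_strings, base_strings, theta):
    """Return (query, base, similarity) triples with Jaccard similarity >= theta."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE q_tok (id INTEGER, token TEXT)")
    con.execute("CREATE TABLE b_tok (id INTEGER, token TEXT)")
    for table, strings in (("q_tok", query_strings), ("b_tok", base_strings)):
        rows = [(i, t) for i, s in enumerate(strings) for t in qgram_set(s)]
        con.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    result = [(query_strings[q], base_strings[b], sim)
              for q, b, sim in con.execute(JACCARD_JOIN, (theta,))]
    con.close()
    return result


links = jaccard_join(["The Shining"], ["The Shining (film)", "Chicago"], 0.4)
print(links)
```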

4.3 Evaluation

In this section, we summarize the evaluation of linkage accuracy in only one of the linkage scenarios. A more detailed comparison of the techniques in other scenarios (including other types of links, such as rdfs:seeAlso links) will be made available in the extended version of this paper. The matching procedure is as follows:
• For each string in the query source, we find all those strings in the base source which have a similarity score above a threshold θ by performing an approximate selection.
• If exactly one string is matched, we output the query and base strings as a certain match. If more than one string, or no string, has a similarity score above θ with the query string, we do not output a match for the query string.
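The two steps above can be sketched in Python as follows. The function `certain_matches` and the example titles are hypothetical; plain Jaccard over 2-grams stands in for whichever similarity predicate is being evaluated.

```python
def qgram_set(s, q=2, pad="#"):
    """Set of padded q-grams of s ('#' padding is an assumption of this sketch)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(r1, r2):
    """Stand-in similarity predicate: Jaccard over 2-gram sets."""
    a, b = qgram_set(r1), qgram_set(r2)
    return len(a & b) / len(a | b)


def certain_matches(queries, base, sim, theta):
    """Output (query, base) only when exactly one base string scores above theta."""
    links = []
    for query in queries:
        hits = [candidate for candidate in base if sim(query, candidate) > theta]
        if len(hits) == 1:  # zero or several hits: no link is output
            links.append((query, hits[0]))
    return links


queries = ["The Shining", "Chicago"]
base = ["The Shining (film)", "Chicago (1927 film)", "Chicago (2002 film)"]
print(certain_matches(queries, base, jaccard, theta=0.3))
```

In this example “Chicago” scores above the threshold against both Chicago entries, so no link is produced for it, while “The Shining” has a single candidate and is linked.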

4.3.2 Accuracy Results

In our experiments we used q = 2 for generating q-grams, as it showed better performance compared with other values of q. Here, we present brief accuracy results for matching movie titles from DBpedia to movie titles in our database. We matched 38,064 movie titles in our database with 25,424 movie titles from DBpedia using the similarity predicates described above. We need to inspect different thresholds to find the optimal one. Table 4 shows the number of matches obtained with different threshold values, as well as the accuracy obtained. Note that the accuracy reported is the precision of the links, i.e., the percentage of the output links that are correct. Recall is hard to measure since the correct number of matches is not known; however, the number of links returned gives an indication of recall. The ground truth is obtained by manually finding all the rules for matching in this scenario. For example, all underscores are replaced with whitespace, and the substring “(film)” is removed from the DBpedia movie titles. These rules were themselves discovered by running the similarity join and manually inspecting thousands of the links returned.
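The two example rules mentioned above can be sketched as a small normalization function in Python. Only those two rules come from the text; the function name and the final whitespace clean-up are assumptions of this sketch, and the full rule set used for the ground truth is not given here.

```python
import re


def normalize_dbpedia_title(title: str) -> str:
    """Apply the two example rules: underscores to spaces, drop '(film)'."""
    title = title.replace("_", " ")
    title = title.replace("(film)", "")
    # Assumption: collapse leftover runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", title).strip()


print(normalize_dbpedia_title("The_Shining_(film)"))
```

After normalization, a DBpedia title such as “The_Shining_(film)” can be compared exactly against the LinkedMDB title to decide whether a returned link is correct.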

The results in Table 4 show that the weighted Jaccard similarity outperforms other predicates in this scenario in terms of the number of correct links found. Based on these results we chose threshold θ = 0.7 with weighted Jaccard similarity for the existing links in our database.

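[Table 4: Number of matches and accuracy for different threshold values. Recoverable column labels: Jaccard, Edit Similarity, BM25. Values not recoverable from the source.]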

5. LINKAGE METADATA

As shown in the accuracy evaluation in the previous section, although using proper string matching techniques and string similarity functions could significantly reduce the number of false matches, achieving 100% accuracy is not always possible, or may result in fewer correct links. Therefore, it is reasonable for the publisher to include metadata about the links and how they are obtained. In LinkedMDB, we provide two entities, namely interlink and linkage run, for this purpose. Figures 2 and 3 show examples of these entities. This approach has several advantages. Users will be able to choose the type of links and the level of accuracy that suit their application. Furthermore, this will facilitate the process of judging the quality of the links, allowing users to provide feedback on link quality.

6. CONCLUSION

LinkedMDB provides a high quality source of RDF data about movies that appeals to a wide audience, enabling further demonstrations of linked data capabilities. Furthermore, LinkedMDB demonstrates the value of a novel way of discovering links and publishing linkage metadata to facilitate high volume and dense interlinking of RDF datasets. We plan to extend LinkedMDB in several respects. We plan to provide an easy-to-use interface that allows users to give feedback on the quality of the links. In this way, users will only need to report the quality of the links, as opposed to manually providing the links as proposed in the User Contributed Interlinking framework [6]. Apart from extending the number of external links, we plan to provide internal links (of type rdfs:seeAlso or a similar type) between related entities, such as movies with similar titles. Such links can be found using similar approximate matching techniques, and will further facilitate automatic mining of the data sources.

[1] Arasu, Ganti, and Kaushik. Efficient exact set-similarity joins. In VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases, pages 918-929, 2006.

[2] R. J. Bayardo, Ma, and Srikant. Scaling up all pairs similarity search. In WWW '07: Proceedings of the 16th International World Wide Web Conference, pages 131-140, 2007.

[3] Bhattacharya and Getoor. Collective entity resolution in relational data. IEEE Data Eng. Bull., 29(2):4-12, 2006.

[4] W. W. Cohen, Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb '03, pages 73-78, 2003.

[5] Hassanzadeh. Benchmarking declarative approximate selection predicates. Master's thesis, University of Toronto, Feb 2007.

[6] Hausenblas and Halb. Interlinking of resources with semantics. In ESWC '08 (Posters).

[7] Raimond, Sutton, and Sandler. Automatic interlinking of music datasets on the semantic web. In LDOW '08.