-

Learning Conformation Rules for Linked Data Integration

Axel-Cyrille Ngonga Ngomo

ngonga@informatik.uni-leipzig.de 0 0 Department of Computer Science University of Leipzig Johannisgasse 26 , 04103 Leipzig

Over the last years, manifold applications that consume, link and integrate Linked Data have been developed. Yet, the speci cation of integration processes for Linked Data is rendered increasingly tedious by several factors such as the great number of knowledge bases in the Linked Data Cloud as well as schema mismatches and heterogeneous conventions for property values across knowledge bases. Especially the speci cation of rules for transforming property values has been carried out mostly manually so far. In this paper, we present CaRLA, an algorithm that allows learning transformation rules for pairs of property values expressed as strings. We present both a batch and an active learning version of CaRLA. The batch version of CaRLA uses a three-step learning approach to retrieve probable transformation rules. The active learning version extends the batch version by requesting highly informative property value pairs from the user so as to improve the learning speed of the system. We evaluate both versions of our approach on four experiments with respect to runtime and accuracy. Our results show that we can improve the precision of the data integration process by up to 12% by discovering transformation rules with human accuracy even when provided with small training datasets. In addition, we can even discover rules that were missed by human experts.

The Linked Open Data (LOD) Cloud consists of more than 30 billion triples1. Making use of this large amount of domain knowledge within large-scale semantic applications is currently gaining considerable momentum. For example, the DeepQA framework [ 4 ] combines knowledge from DBpedia2, Freebase3 and several other knowledge bases to determine the answer to questions with a speed superior to that of human champions. Complex applications that rely on several sources of knowledge usually integrate them into a uni ed view by the means of 1 http://www4.wiwiss.fu-berlin.de/lodcloud/state/ 2 http://dbpedia.org

3 http://www.freebase.com

the Extract-Transform-Load (ETL) paradigm [ 7 ]. Yet, over the last few years, the automatic provision of such uni ed views on Linked datasets has been rendered increasingly tedious. The di culties behind the integration of Linked Data are not only caused by the mere growth of the datasets in the Linked Data Web but also by large disparity across these datasets.

Linked Data Integration is commonly impeded by two categories of mismatches: ontology mismatches and naming convention mismatches. The second category of common mismatches (which is the focus of this work) mostly a ects the transformation step of ETL and lies in the di erent conventions used for equivalent property values. For example, the labels of lms in DBpedia di er from the labels of lms in LinkedMDB4 in three ways: First, they contain a language tag. Second, the extension \( lm)" is added to the label of movies if another entity with the same label exists. Third, if another lm with the same label exists, the production year of the lm is added. Consequently, the lm Liberty from 1929 has the label \Liberty (1929 lm)@en" in DBpedia, while the same lm bears the label \Liberty" in LinkedMDB. A similar discrepancy in naming persons holds for lm directors and actors. Finding a conform representation of the labels of movies that maps the LinkedMDB representation would require knowing the rules replace(``@en'', ) and replace(``(*film)'', ) where stands for the empty string.

In this paper, we address the problem of discovering transformation rules by presenting CaRLA, the Canonical Representation Learning Algorithm. Our approach learns canonical (also called conform) representation of data type property values by implementing a simple, time-e cient and accurate learning approach. We present two versions of CaRLA: a batch learning and an active learning version. The batch learning approach relies on a training dataset to derive rules that can be used to generate conform representations of property values. The active version of CaRLA (aCarLa) extends CaRLA by computing unsure rules and retrieving highly informative candidates for annotation that allow the validation or negation of these candidates. One of the main advantages of CaRLA is that it can be con gured to learn transformations at character, n-gram or even word level. By these means, it can be used to improve integration and link discovery processes based on string similarity/distance measures ranging from character-based (edit distance) and n-gram-based (q-grams) to word-based (Jaccard similarity) approaches.

The rest of this paper is structured as follows: In Section 2, we give an overview of the notation used in this paper. Thereafter, we present the two di erent versions of the CaRLA algorithm: the batch version in Section 3 and the active version in Section 4. Subsequently, we evaluate our approach with respect to both runtime and accuracy in four experiments (Section 5). After a brief overview of the related work (Section 6), we present some future work and conclude in Section 7.

4 http://linkedmdb.org/ Preliminaries

In the following, we de ne terms and notation necessary to formalize the approach implemented by CaRLA. Let s 2 be a string from an alphabet . We de ne a tokenization function as follows:

De nition 1 (Tokenization Function). Given an alphabet A of tokens, a

tokenization function token : ! 2A maps any string s 2 to a subset of the token alphabet A.

Note that string similarity and distances measures rely on a large number of di erent tokenization approaches. For example, the Levenshtein similarity [ 8 ] relies on a tokenization at character level, while the Jaccard similarity [ 6 ] relies on a tokenization at word level.

De nition 2 (Transformation Rule). A transformation rule is a function

r : A ! A that maps a token from the alphabet A to another token of A.

In the following we will denote transform rules by using an arrow notation. For example, the mapping of the token \Alan" to \A." will be denoted by < \Alan" ! \A.">. For any rule r =< x ! y >, we call x the premise and y the consequence of r. We call a transformation rule trivial when it is of the form < x ! x > with x 2 A. We call two transformation rules r and r0 inverse to each other when r =< x ! y > and r0 =< y ! x >. Throughout this work, we will assume that the characters that make up the tokens of A belong to [ f g, where stands for the empty character. Note that we will consequently denote deletions by rules of the form < x ! > where x 2 A.

De nition 3 (Weighted Transformation Rule). Let be the set of all

rules. Given a weight function w : ! R, a weighted transformation rule is the pair (r; w(r)), where r 2 is a transformation rule.

De nition 4 (Transformation Function). Given a set R of (weighted) trans

formation rules and a string s, we call the function 'R : ! [f g a transformation function when it maps s to a string 'R(s) by applying all rules ri 2 R to every token of token(s) in an arbitrary order.

For example, the set R = f<\Alan"!\A.">g of transformation rules would lead to 'R(\James Alan Het eld") =\James A. Het eld". 3

Batch Learning Approach

The goal of CaRLA is two-fold: First, it aims to compute rules that allow the derivation of conform representations of property values. As entities can have several values for the same property, CaRLA also aims to detect a condition under which two property values should be merged during the integration process. In the following, we will assume that two source knowledge bases are to be integrated to one. Note that our approach can be used for any number of source knowledge bases. 3.1

Overview

Formally, CaRLA addresses the problem of nding the required transformation rules by computing an equivalence relation E between pairs of property values (p1; p2) that is such that E (p1; p2) holds when p1 and p2 should be mapped to the same canonical representation p. CaRLA computes E by generating two sets of weighted transformation function rules R1 and R2 such that for a given similarity function , E (p1; p2) ! ('R1 (p1); 'R2 (p2)) , where is a similarity threshold. The canonical representation p is then set to 'R1 (p1). The similarity condition ('R1 (pR1 ); 'R2 (p2)) is used to distinguish between the pairs of properties values that should be merged.

To detect R1 and R2, CaRLA assumes two training datasets P and N , of which N can be empty. The set P of positive training examples is composed of pairs of property value pairs (p1; p2) such that E (p1; p2) holds. The set N of negative training examples consists of pairs (p1; p2) such that E (p1; p2) does not hold. In addition, CaRLA assumes being given a similarity function and a corresponding tokenization function token. Given this input, CaRLA implements a three-step approach. It begins by computing the two sets R1 and R2 of plausible transformation rules based on the positive examples at hand (Step 1). Then it merges inverse rules across R1 and R2 and discards rules with a low weight during the rule merging and ltering step. From the resulting set of rules, CaRLA derives the similarity condition E (p1; p2) ! ('R1 (p1); 'R2 (p2)) . It then applies these rules to the negative examples in N and tests whether the similarity condition also holds for the negative examples. If this is the case, then it discards rules until it reaches a local minimum of its error function. The retrieved set of rules and the novel value of constitute the output of CaRLA and can be used to generate the canonical representation of the properties in the source knowledge bases.

In the following, we explain each of the three steps in more detail. Throughout the explanation we use the toy example shown in Table 1. In addition, we will assume a word-level tokenization function and the Jaccard similarity. The goal of the rule generation set is to compute two sets of rules R1 resp. R2 that will underlie the transformation 'R1 resp. 'R2 . We begin by tokenizing all positive property values pi and pj such that (pi; pj) 2 P . We call T1 the set of all tokens pi such that (pi; pj) 2 P , while T2 stands for the set of all pj. We begin the computation of R1 by extending the set of tokens of each pj 2 T2 by adding to it. Thereafter, we compute the following rule score function score: score(< x ! y >) = jf(pi; pj) 2 P : x 2 token(pi) ^ y 2 token(pj)gj: (1) score computes the number of co-occurrences of the tokens x and y across P . All tokens x 2 T1 always have a maximal co-occurrence with as it occurs in all tokens of T2. To ensure that we do not only compute deletions, we decrease the score of rules < x ! > by a factor 2 [0; 1]. Moreover, in case of a tie, we assume the rule < x ! y > to be more natural than < x ! y0 > if (x; y) > (x; y0). Given that is bound between 0 and 1, it is su cient to add a fraction of (x; y) to each rule < x ! y > to ensure that the better rule is chosen. Our nal score function is thus given by scorefinal(< x ! y >) = (score(< x ! y >) + (x; y)=2: if y 6= ; score(< x ! y >) else. (2) Finally, for each token x 2 T1, we add the rule r =< x ! y > to R1 i x 6= y (i.e., r is not trivial) and y = arg max scorefinal(< x ! y0 >). To compute R2 we y02T2 simply swap T1 and T2, invert P (i.e., compute the set f(pj; pi) : (pi; pj) 2 P g) and run through the procedure described above.

For the set P in our example, we get the following sets of rules: R1 = f(<\van"!\Van">, 2.08), (<\T."! >, 2)g and R2 = f(<\Van"!\van">, 2.08), (<\(actor)"! >, 2)g. 3.3

Rule Merging and Filtering

The computation of R1 and R2 can lead to a large number of inverse or improbable rules. In our example, R1 contains the rule <\van"!\Van"> while R2 contains <\Van"!\van">. Applying these rules to the data would consequently not improve the convergence of their representations. To ensure that the transformation rules lead to similar canonical forms, the rule merging step rst discards all rules < x ! y >2 R2 that are such that < y ! x >2 R1 (i.e., rules in R2 that are inverse to rules in R1). Then, low-weight rules are discarded. The idea here is that if there is not enough evidence for a rule, it might just be a random event. The initial similarity threshold for the similarity condition is nally set to = In our example, CaRLA would discard <\van"!\Van"> from R2. When assuming a threshold of 10% of P's size (i.e., 0.4), no rule would be ltered out. The output of this step would consequently be R1 = f(<\van"!\Van">, 2.08), (<\T."! >, 2)g and R2 = f(<\(actor)"! >, 2)g. 3.4

Rule Falsi cation

The aim to the rule falsi cation step is to detect a set of transformations that lead to a minimal number of elements of N having a similarity superior to via . To achieve this goal, we follow a greedy approach that aims to minimize the magnitude of the set E = f(p1; p2) 2 N : ('R1 (p1); 'R2 (p2)) (4) Our approach simply tries to discard all rules that apply to elements of E by ascending score. If E is then empty, the approach terminates. If E does not get smaller, then the change is rolled back and then the next rule is tried. Else, the rule is discarded from the set of nal rules. Note that discarding a rule can alter the value of and thus E. Once the set E has been computed, CaRLA concludes its computation by generating a nal value of the threshold .

In our example, two rules apply to the element of N . After discarding the rule <\T."! >, the set E becomes empty, leading to the termination of the rule falsi cation step. The nal set of rules are thus R1 = f<\van"!\Van">g and R2 = f<\(actor)"! >g. The value of is computed to be 0.75. Table 2 shows the canonical property values for our toy example. Note that this threshold allows to discard the elements of N as being equivalent property values. = ('R1 (p1); 'R2 (p2))g:

It is noteworthy that by learning transformation rules, we also found an initial threshold for determining the similarity of property values using as similarity function. In combination with the canonical forms computed by CaRLA, the con guration ( ; ) can be used as an initial con guration for Link Discovery frameworks such as LIMES. For example, the smallest Jaccard similarity for the pair of property values for our example lies by 1=3, leading to a precision of 0.71 for a recall of 1 (F-measure: 0.83). Yet, after the computation of the transformation rules, we reach an F-measure of 1 with a threshold of 1. Consequently, the pair ( , ) can be used for determining an initial classi er for approaches such as the RAVEN [ 11 ] algorithm implemented in LIMES [ 10 ].

Extension to Active Learning

One of the drawbacks of batch learning approaches is that they often require a large number of examples to generate good models. As our evaluation shows (see Section 5), this drawback also holds for the batch version of CaRLA, as it can easily detect very common rules but sometimes fails to detect rules that apply to less pairs of property values. In the following, we present how this problem can be addressed by extending CaRLA to aCARLA using active learning [ 16 ]. Algorithm 1 Overview of aCaRLA Require: Positive examples P0 Require: Negative examples N0 Require: Similarity function Require: Damping factor Require: Score threshold smin Require: Tokenization function token Require: Maximal number of annotation requests qtotal Require: Number of questions/iteration q

Rule sets R1 ;, R2 ; qcurrent := 0 //Current number of questions Ex := 0 //Set of examples to annotate t := 0 //Iteration counter r := null //unsure rule B := ; //set of banned rules while qcurrent qtotal do (R1, R2, ) = runCarla(Pt, Nt, , , Smin, token) // Run batch learning r =getMostUnsureRule(R1 [ R2, B, smin) if r 6= null then

Ex = computeExamples(r, q) else

Ex = getMostDi erent(q) end if (P , N ) = requestAnnotations(Ex) Pt+1 Pt [ P Nt+1 Nt [ N B updateBannedRules(B, r) t t + 1 qcurrent qcurrent + jExj end while (R1, R2, ) := runCarla(Pt, Nt, , , Smin, token) return (R1; R2; )

An overview of aCaRLA is given in Algorithm 1. The basic idea here is to begin with small training sets P0 and N0. In each iteration, all the available training data is used by the batch version of CaRLA to update the set of rules. The algorithm then tries to refute or validate rules with a score below the score threshold smin (i.e., unsure rules). For this purpose it picks the most unsure rule r that has not been shown to be erroneous in a previous iteration (i.e., that is not an element of the set of banned rules B). It then fetches a set Ex of property values that map the left side (i.e., the premise) of r. Should there be no unsure rule, then Ex is set to the q property values that are most dissimilar to the already known property values. Annotations consisting of the corresponding values for the elements of Ex in the other source knowledge bases are requested by the user and written in the set P . Property values with no corresponding values are written in N . Finally the sets of positive and negative examples are updated and the triple (R1, R2, ) is learned anew until a stopping condition such as a maximal number of questions is reached. As our evaluation shows, this simple extension of the CaRLA algorithm allows it to detect e ciently the pairs of annotations that might lead to a larger set of high-quality rules. 5 5.1

Evaluation Experimental Setup

In the experiments reported in this section, we evaluated CaRLA by two means: First we aimed to measure how well CaRLA could compute transformations created by experts. To achieve this goal, we retrieved transformation rules from four link speci cations de ned manually by experts within the LATC project5. An overview of these speci cations is given in Table 3. Each link speci cation aimed to compute owl:sameAs links between entities across two knowledge bases by rst transforming their property values and by then computing the similarity of the entities based on the similarity of their property values. For example, the computation of links between lms in DBpedia and LinkedMDB was carried out by rst applying the set of R1 = f< (f ilm) ! >g to the labels of lms in DBpedia and R2 = f< (director) ! >g to the labels of their directors. We ran both CaRLA and aCaRLA on the property values of the interlinked entities and measured how fast CaRLA was able to reconstruct the set of rules that were used during the Link Discovery process.

In addition, we quanti ed the quality of the rules learned by CaRLA. In each experiments, we computed the boost in the precision of the mapping of property pairs with and without the rules derived by CaRLA. The initial precision was

5 http://latc-project.eu

computed as jjMPjj , where M = f(pi; pj ) : (pi; pj ) min(p1;p2)2P (p1; p2)g.

The precision after applying CaRLA's results was computed as jP j where jM0j M 0 = f(pi; pj ) : ('R1 (pi); 'R2 (pj )) min ('R1 (p1); 'R2 (p2))g. Note (p1;p2)2P that in both cases, the recall was 1 given that 8(pi; pj ) 2 P : (pi; pj ) min (p1; p2). In all experiments, we used the Jaccard similarity metric and (p1;p2)2P a word tokenizer with = 0:8. All runs were carried on a notebook running Windows 7 Enterprise with 3GB RAM and an Intel Dual Core 2.2GHz processor. Each of the algorithms was ran ve times. We report the rules that were discovered by the algorithms and the number of experiments within which they were found.

100% 80% iin 60% sco reP 40% 20% 100% 80% lseodh 60% rhT 40% 20% Baseline CaRLA

Baseline CaRLA 0% Actors Directors Directors_clean Movies Movies_clean Producers (a) Precision 0% Actors Directors Directors_clean Movies Movies_clean Producers (b) Thresholds Table 4 shows the union of the rules learned by the batch version of CaRLA in all ve runs. Note that the computation of a rule set lasted under 0.5s even for the largest dataset, Movies. The columns Pn give the probability of nding a rule for a training set of size n in our experiments. R2 is not reported because it was empty in all setups. Our results show that in all cases, CaRLA converges quickly and learns rules that are equivalent to those utilized by the LATC experts with a sample set of 5 pairs. Note that for each rule of the form <\@en" ! y > with y 6= that we learned, the experts used the rule < y ! > while the linking platform automatically removed the language tag. We experimented with the same datasets without language tags and computed exactly the same rules as those devised by the experts. In some experiments (such as Directors), CaRLA was even able detect rules that where not included in the set of rules generated by human experts. For example, the rule <\( lmmaker)" ! \(director)"> is not very frequent and was thus overlooked by the experts. In Table 4, we marked such rules with an asterisk. The director and the movies datasets contained a large number of typographic errors of di erent sort (incl. misplaced hyphens, character repetitions such as in the token \Neilll", etc.) which led to poor precision scores in our experiments. We cleaned the rst 250 entries of these datasets from these errors and obtained the results in the rows labels Directors clean and Movies clean. The results of CaRLA on these datasets are also shown in Table 4. We also measured the improvement in precision that resulted from applying CaRLA to the datasets at hand (see Figure 1). For that the precision remained constant across the di erent dataset sizes. In the best case (cleaned Directors dataset), we are able to improve the precision of the property mapping by 12.16%. Note that we can improve the precision of the mapping of property values even on the noisy datasets.

Experiment

Actors Directors

Directors Directors clean

Movies Movies

Movies Movies clean Movies clean Movies clean

Producers

Interestingly, when used on the Movies dataset with a training dataset size of 100, our framework learned low-con dence rules such as <\(1999" ! >, which were yet discarded due to a too low score. These are the cases where aCaRLA displayed its superiority. Thanks to its ability to ask for annotation when faced with unsure rules, aCaRLA is able to validate or negate unsure rules. As the results on the Movies example show, aCaRLA is able to detect several supplementary rules that were overlooked by human experts. Especially, it clearly shows that deleting the year of creation of a movie can improve the conformation process. aCaRLA is also able to generate a signi cantly larger number of candidates rules for the user's convenience. 6

Related Work

Linked Data Integration is an important topic for all applications that rely on a large number of knowledge bases and necessitate a uni ed view on this data, e.g., Question Answering frameworks [ 4 ] and semantic mashups [ 13 ]. In recent work, several challenges and requirements to Linked Data consumption and integration have been pointed out [ 9 ]. Recently, several approaches and frameworks have been developed with the aim of addressing many of these challenges. For example, the R2R framework [ 2 ] allows the speci cation and publication of mappings that allow mapping resources and literals across knowledge bases. The Linked Data Integration Framework LDIF [ 15 ], whose goal is to support the integration of RDF data, builds upon R2R mappings and technologies such as SILK [ 5 ] and LDSpider6. The concept behind the framework is to enable users to create periodic integration jobs via simple XML con gurations. KnoFuss [ 12 ] addresses data integration from the point of view of link discovery by monitoring the interaction between instance and dataset matching (which is similar to ontology matching [ 3 ]). Also worth mentioning id Semantic Web Pipes7 [ 13 ], which follows the idea of Yahoo Pipes8 to enable the integration of data in formats such as RDF and XML. In all of these systems, the con guration of the data transformations has to be carried out manually.

Some approaches to transformation rule learning approaches have previously been proposed in the record linkage area. For example, [ 14 ] presents an interactive approach to data cleaning while [ 1 ] developed an approach to learn string transformation from examples. Yet, none of these approaches uses active learning to improve the number of rules it detects. To the best of our knowledge, CaRLA is the rst approach tailored towards Linked Data that allows the active learning of data conformation rules for property values expressed as strings. 7

Conclusion and Future Work

In this work, we presented two algorithms for learning data transformations. We present a batch learning approach that allows to detect rules by performing a cooccurrence analysis. This version is particularly useful when the knowledge bases to be integrated are already linked. We also present an active learning approach 6 http://code.google.com/p/ldspider/ 7 http://pipes.deri.org/ 8 http://pipes.yahoo.com/pipes/ that allows to discover a large number of rules e ciently. We evaluated our approach with respect to the quality of the rules it discovers and to the e ect of the rules on matching property values. We showed that the data transformation achieved by our approach allows to increase the precision of property mapping by up to 12% when the recall is set to 1. In future work, we aim to extend our approach to learning transformation rules with several tokenizers concurrently.

Arvind

Arasu , Surajit Chaudhuri, and

Raghav

Kaushik . Learning string transformations from examples . Proc. VLDB Endow ., 2 ( 1 ): 514 { 525 , August 2009 .

Christian

Bizer and

Andreas

Schultz . The R2R Framework: Publishing and Discovering Mappings on the Web . In Proceedings of COLD , 2010 .

3. Jero ^me Euzenat and Pavel Shvaiko . Ontology matching. Springer-Verlag, Heidelberg (DE), 2007 .

4. David

Ferrucci , Eric W. Brown, Jennifer Chu-Carroll, James

Fan , David Gondek, Aditya

Kalyanpur , Adam Lally, J. William

Murdock , Eric Nyberg, John M. Prager, Nico Schlaefer, and Christopher

Welty . Building watson: An overview of the deepqa project . AI Magazine , 31 ( 3 ): 59 { 79 , 2010 .

Robert

Isele and

Christian

Bizer . Learning Linkage Rules using Genetic Programming . In Sixth International Ontology Matching Workshop , 2011 .

Paul

Jaccard . Etude comparative de la distribution orale dans une portion des Alpes et des Jura . Bulletin del la Societe Vaudoise des Sciences Naturelles , 37 : 547 { 579 , 1901 .

Ralph

Kimball and

Joe

Caserta . The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning , Conforming, and

Delivering

Data . Wiley, 2004 .

8. Vladimir

I. Levenshtein.

Binary codes capable of correcting deletions, insertions, and reversals . Technical Report 8 , 1966 .

Ian

Millard , Hugh Glaser, Manuel Salvadores, and

Nigel

Shadbolt . Consuming multiple linked data sources: Challenges and experiences . In First International Workshop on Consuming Linked Data , 2010 .

10. Axel-Cyrille Ngonga Ngomo. A Time-E cient Hybrid Approach to Link Discovery . In Sixth International Ontology Matching Workshop , 2011 .

11. Axel-Cyrille Ngonga

Ngomo

, Jens Lehmann, Soren Auer, and Konrad Ho ner. RAVEN { Active Learning of Link Speci cations . In Proceedings of OM@ISWC , 2011 .

12. Andriy

Nikolov

, Victoria Uren, Enrico Motta, and

Anne

Roeck . Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution . In Proceedings of the 4th Asian Conference on The Semantic Web , pages 332 { 346 , 2009 .

13. Danh Le

Phuoc

, Axel Polleres, Manfred Hauswirth, Giovanni Tummarello, and

Christian

Morbidoni . Rapid prototyping of semantic mash-ups through semantic web pipes . In WWW , pages 581 { 590 , 2009 .

14.

Vijayshankar

Raman and Joseph M. Hellerstein . Potter's wheel: An interactive data cleaning system . In VLDB , pages 381 { 390 , 2001 .

15. Andreas

Schultz

, Andrea Matteini, Robert Isele,

Christian

Bizer , and

Christian

Becker . Ldif - linked data integration framework . In COLD , 2011 .

16.

Burr

Settles . Active learning literature survey . Technical Report 1648 , University of Wisconsin-Madison, 2009 .