
Time-Efficient Execution of Bounded Jaro-Winkler Distances

Kevin Dreßler

Axel-Cyrille Ngonga Ngomo

University of Leipzig, AKSW Research Group, Augustusplatz 10, 04103 Leipzig, Germany

Over the last years, time-efficient approaches for the discovery of links between knowledge bases have been regarded as a key requirement towards implementing the idea of a Data Web. A considerable portion of the information available as RDF on the Web pertains to persons. Thus, efficient and effective measures for comparing names are central to facilitating the integration of information about persons on the Web of Data. The Jaro-Winkler measure was developed especially for the purpose of comparing person names. Hence, we present a novel approach for the efficient comparison of sets of strings using this measure. We evaluate our approach on several datasets derived from DBpedia 3.9 and containing up to $10^5$ strings and show that it scales linearly with the size of the data for large thresholds. We also evaluate our approach against SILK and show that we outperform it even on small datasets.

1 Introduction

The Linked Open Data Cloud (LOD Cloud) has grown into a compendium of more than 2000 datasets over the last few years.¹ Currently, datasets pertaining to more than 14 million persons have already been made available on the Linked Data Web.² While this number is impressive on its own, it is well known that the population of the planet has surpassed 7 billion people. Hence, the Web of Data contains information on less than 1% of the overall population of the planet (counting both the living and the dead). The output of open-government movements,³ scientific conferences,⁴ health data⁵ and similar endeavours promises to make massive amounts of data pertaining to persons available in the near future. Dealing with this upcoming increase in the number of person-related resources requires means to integrate these datasets with the aim of facilitating statistical analysis, data mining, personalization, etc. However, while the number of datasets on the Linked Data Web grows drastically, the number of links between datasets still stagnates.⁶ Addressing this lack of links requires solving two main problems: the quadratic time complexity of link discovery (efficiency) and the automatic support of the detection of link specifications (effectiveness). In this paper, we address the efficiency of the execution of bounded Jaro-Winkler measures,⁷ which are known to be effective when comparing person names [10]. To this end, we derive equations that allow discarding a large number of computations while executing bounded Jaro-Winkler comparisons with high thresholds.

¹ See http://stats.lod2.eu for an overview of the current state of the Cloud. Last access: July 11th, 2014.
² Data collected from http://stats.lod2.eu. Last access: July 11th, 2014.
³ See for example http://data.gov.uk/.
⁴ See for example http://data.semanticweb.org/.
⁵ http://aksw.org/Projects/GHO

The contributions of this paper are as follows:

1. We derive length- and range-based filters that allow reducing the number of strings t that are compared with a string s.
2. We present a character-based filter that allows detecting whether two strings s and t share enough resemblance to be similar according to the Jaro-Winkler measure.
3. We evaluate our approach w.r.t. its runtime and its scalability with several threshold settings and dataset sizes.

The rest of this paper is structured as follows: In Section 2, we present the problem we tackled as well as the formal notation necessary to understand this work. In the subsequent Section 3, we present the three approaches we developed to reduce the runtime of bounded Jaro-Winkler computations. We then evaluate our approach in Section 4. Related work is presented in Section 5, where we focus on approaches that aim to improve the time-efficiency of link discovery. We conclude in Section 6. The approach presented herein is now an integral part of LIMES.⁸

2 Preliminaries

In the following, we present some of the symbols and terms used within this work.

2.1 Link Discovery

In this work, we use link discovery as a hypernym for deduplication, record linkage, entity resolution and similar terms used across literature. The formal specification of link discovery adopted herein is tantamount to the definition proposed in [16]: Given a set $S$ of source resources, a set $T$ of target resources and a relation $R$, our goal is to find the set $M \subseteq S \times T$ of pairs $(s, t)$ such that $R(s, t)$. If $R$ is owl:sameAs, then we are faced with a deduplication task. Given that the explicit computation of $M$ is usually a very complex endeavour, $M$ is most commonly approximated by a set $M' = \{(s, t, \sigma(s, t)) \in S \times T \times \mathbb{R}^+ : \sigma(s, t) \geq \theta\}$, where $\sigma$ is a (potentially complex) similarity function and $\theta \in [0, 1]$ is a similarity threshold. Given that this problem is in $O(n^2)$, using naïve algorithms to compare large $S$ and $T$ is most commonly impracticable. Thus, time-efficient approaches for the computation of bounded measures have been developed over the last years for measures such as the Levenshtein distance, Minkowski distances, trigrams and many more [15].

⁶ http://linklion.org
⁷ We use bounded measures in the same sense as [13], i.e., to mean that we are only interested in pairs of strings whose similarity is greater than or equal to a given lower bound.
⁸ http://limes.sf.net
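To make the quadratic baseline concrete, the following minimal Python sketch computes the bounded result set by brute force. The names `naive_bounded_matches` and `exact_match` are hypothetical and only serve as illustration; any similarity function with values in [0, 1] could be plugged in.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple


def naive_bounded_matches(
    source: Iterable[str],
    target: Iterable[str],
    sim: Callable[[str, str], float],
    theta: float,
) -> List[Tuple[str, str, float]]:
    """Return all triples (s, t, sim(s, t)) with sim(s, t) >= theta.

    Every pair is scored, so the cost is |source| * |target| calls to sim.
    """
    matches = []
    for s, t in product(list(source), list(target)):
        score = sim(s, t)
        if score >= theta:
            matches.append((s, t, score))
    return matches


def exact_match(s: str, t: str) -> float:
    """Toy similarity used for illustration only (an assumption, not from the paper)."""
    return 1.0 if s == t else 0.0
```

Every filter discussed later aims to skip calls to `sim` in the inner loop without changing the returned set.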

In this paper, we thus study the following problem: Given a threshold $\theta \in [0, 1]$ and two sets of strings $S$ and $T$, compute the set $M' = \{(s, t, \sigma(s, t)) \in S \times T \times \mathbb{R}^+ : \sigma(s, t) \geq \theta\}$. Two categories of approaches can be considered to improve the runtime of measures: Lossy approaches return a subset $M''$ of $M'$ which can be calculated efficiently but for which there are no guarantees that $M'' = M'$. Lossless approaches on the other hand ensure that their result set $M''$ is exactly the same as $M'$. In this paper, we present a lossless approach. To the best of our knowledge, only one other link discovery framework implements a lossless approach that has been designed to exploit the bound defined by the threshold $\theta$ to ensure a more efficient computation of the Jaro-Winkler distance, i.e., the SILK framework with the approach MultiBlock [9]. We thus compare our approach with SILK 2.6.0 in the evaluation section of this paper.

2.2 The Jaro-Winkler Similarity

Let $\Sigma^*$ be the set of all strings that can be generated by using an alphabet $A$. The Jaro measure $d_j : \Sigma^* \times \Sigma^* \to [0, 1]$ is a string similarity measure which was developed originally for name comparison in the U.S. Census. This measure takes into account the number of character matches $m$ and the ratio of their transpositions $t$:

$$d_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise} \end{cases} \tag{1}$$

Here two characters are considered to be a match if and only if (1) they are the same and (2) they are at most at a distance $w = \left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$ from each other. For example, for $s_1$ = "Spears" and $s_2$ = "Pears", the second s of $s_1$ matches the s of $s_2$ while the first s of $s_1$ does not match the s of $s_2$.
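The definition can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in LIMES: the function name `jaro` is hypothetical, the window is assumed to follow the common convention $w = \lfloor \max(|s_1|, |s_2|)/2 \rfloor - 1$ (which yields $w = 4$ for strings of lengths 9 and 10), and $t$ is taken as half the number of matched characters that appear in a different order.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity of two strings (sketch under the assumptions stated above)."""
    if not s1 or not s2:
        return 0.0
    # Matching window; clamped at 0 for very short strings.
    w = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)   # positions of s2 that are already matched
    matched1 = []              # matched characters of s1, in order of appearance
    for i, c in enumerate(s1):
        lo, hi = max(0, i - w), min(len(s2), i + w + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched1.append(c)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    # Matched characters of s2, in order of appearance.
    matched2 = [c for j, c in enumerate(s2) if used[j]]
    # t is half the number of positions at which the two sequences differ.
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3
```

The nested loop makes the cost of a single comparison grow with the product of string length and window size, which is why avoiding the call altogether pays off.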

The Jaro-Winkler measure [27] is an extension of the Jaro distance. This extension is based on Winkler's observation that typing errors occur most commonly in the middle or at the end of a word, but very rarely in the beginning. Hence, it is legitimate to put more emphasis on matching prefixes if the Jaro distance exceeds a certain "boost threshold" $b_t$, originally set to 0.7.

$$d_w = \begin{cases} d_j & \text{if } d_j < b_t \\ d_j + \ell p (1 - d_j) & \text{otherwise} \end{cases} \tag{2}$$

Here, $\ell$ denotes the length of the common prefix and $p$ is a weighting factor. Winkler uses $p = 0.1$ and $\ell \leq 4$. Note that $\ell p$ must not be greater than 1.

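For the strings $s_1$ = "DEMOCRACY" and $s_2$ = "DEMOGARPHY" (with $s_2$ being intentionally misspelled), we get the following output of the Jaro-Winkler measure.

– $|s_1| = 9$, $|s_2| = 10$
– $w = 4$
– $m = 7$
– $t = 1$
– $d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) = \frac{1}{3}\left(\frac{7}{9} + \frac{7}{10} + \frac{6}{7}\right) \approx 0.778$
– $d_w = d_j + \ell p (1 - d_j) = 0.867$ (with $\ell = 4$ and $p = 0.1$)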
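The arithmetic of this example can be replayed with a short Python sketch. The helper names `jaro_from_counts`, `common_prefix_length` and `winkler_boost` are hypothetical; the sketch takes $m$ and $t$ as given and assumes Winkler's defaults $p = 0.1$, $b_t = 0.7$ and a prefix length capped at 4.

```python
def jaro_from_counts(len1: int, len2: int, m: int, t: float) -> float:
    """Jaro score from string lengths, match count m and transposition count t."""
    if m == 0:
        return 0.0
    return (m / len1 + m / len2 + (m - t) / m) / 3


def common_prefix_length(s1: str, s2: str, max_len: int = 4) -> int:
    """Length of the common prefix, capped at max_len."""
    n = 0
    while n < min(len(s1), len(s2), max_len) and s1[n] == s2[n]:
        n += 1
    return n


def winkler_boost(dj: float, prefix_len: int, p: float = 0.1,
                  boost_threshold: float = 0.7) -> float:
    """Apply the prefix boost only if the Jaro score reaches the boost threshold."""
    if dj < boost_threshold:
        return dj
    return dj + prefix_len * p * (1 - dj)


s1, s2 = "DEMOCRACY", "DEMOGARPHY"
dj = jaro_from_counts(len(s1), len(s2), m=7, t=1)
dw = winkler_boost(dj, common_prefix_length(s1, s2))
print(round(dj, 3), round(dw, 3))
```

Because the boost is the affine map $d_j \mapsto (1 - \ell p)\, d_j + \ell p$ with $\ell p \leq 1$, it is monotone in $d_j$, so any upper bound on $d_j$ carries over to an upper bound on $d_w$.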

3 Improving the Runtime of Bounded Jaro-Winkler

The main principle behind reducing the runtime of the computation of measures is to reduce their reduction ratio. Here, we use a sequence of filters that allow discarding similarity computations while being sure that they would have led to a similarity score below our threshold $\theta$. To this end, we regard the problem as that of finding filters that return an upper bound estimation $e(s_1, s_2) \geq d_w(s_1, s_2)$ for some properties of the input strings that can be computed in constant time. For a given threshold $\theta$, if $e(s_1, s_2) < \theta$, then we can safely ignore the input $(s_1, s_2)$.

3.1 Length-based filters

In the following, we denote the length of a string $s$ with $|s|$. Our first filter is based on the insight that large length differences are a guarantee for poor similarity. For example, the strings "a" and "alpha" cannot have a Jaro-Winkler similarity of 1 by virtue of their length difference. We can formalize this idea as follows: Let $s_1$ and $s_2$ be strings with respective lengths $|s_1|$ and $|s_2|$. Without loss of generality, we will assume that $|s_1| \leq |s_2|$. Moreover, let $m$ be the number of matches across $s_1$ and $s_2$. Because $m \leq |s_1|$, we can substitute $m$ with $|s_1|$ and gain the following upper bound estimation for $d_j(s_1, s_2)$:

$$d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) \leq \frac{1}{3}\left(1 + \frac{|s_1|}{|s_2|} + \frac{|s_1| - t}{|s_1|}\right) \tag{3}$$

Now the lower bound for the number $t$ of transpositions is 0. Thus, we obtain the following equation.

$$d_j \leq \frac{1}{3}\left(1 + \frac{|s_1|}{|s_2|} + 1\right) = \frac{2}{3} + \frac{|s_1|}{3|s_2|} \tag{4}$$

The application of this approximation on Winkler's extension is trivial:

$$d_w = d_j + \ell p (1 - d_j) \leq \frac{2}{3} + \frac{|s_1|}{3|s_2|} + \ell p \left(\frac{1}{3} - \frac{|s_1|}{3|s_2|}\right) \tag{5}$$

Consider the pair $s_1$ = "bike" and $s_2$ = "bicycle" and a threshold $\theta = 0.9$. Applying the estimation for Jaro we get $d_j \leq \frac{2}{3} + \frac{4}{3 \cdot 7} = 0.857$. This exceeds the boost threshold, so we use Equation 5 with the common prefix length $\ell = 2$ to compute $e(s_1, s_2) \approx 0.886$. Now we do not have to actually compute $d_w(s_1, s_2)$, since $e(s_1, s_2) < \theta$.
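The length-based estimate can be sketched in Python as follows. The names `length_upper_bound` and `can_skip` are hypothetical, and the defaults $p = 0.1$, $b_t = 0.7$ and a prefix cap of 4 are assumed from Winkler's settings; the sketch is meant to illustrate the bound, not to mirror the Java implementation.

```python
def common_prefix_length(s1: str, s2: str, max_len: int = 4) -> int:
    """Length of the common prefix, capped at max_len."""
    n = 0
    while n < min(len(s1), len(s2), max_len) and s1[n] == s2[n]:
        n += 1
    return n


def length_upper_bound(len1: int, len2: int, prefix_len: int,
                       p: float = 0.1, boost_threshold: float = 0.7) -> float:
    """Upper bound on the Jaro-Winkler score that only uses the string lengths."""
    shorter, longer = sorted((len1, len2))
    if shorter == 0:
        return 0.0  # no character can match, so the Jaro score is 0
    dj_bound = 2 / 3 + shorter / (3 * longer)
    if dj_bound < boost_threshold:
        return dj_bound  # the true Jaro score cannot reach the boost threshold
    return dj_bound + prefix_len * p * (1 - dj_bound)


def can_skip(s1: str, s2: str, theta: float) -> bool:
    """True if the pair cannot reach theta and the full comparison may be skipped."""
    prefix_len = common_prefix_length(s1, s2)
    return length_upper_bound(len(s1), len(s2), prefix_len) < theta


print(round(length_upper_bound(4, 7, 2), 3), can_skip("bike", "bicycle", 0.9))
```

If the Jaro bound stays below the boost threshold, the true Jaro score does as well, so returning the unboosted bound in that branch keeps the filter lossless.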

By using this approach we can decide in $O(1)$⁹ whether a given pair's score can reach a given threshold, which saves us the much more expensive score computation for a large number of pairs, provided that the input strings vary sufficiently in length.

⁹ In most programming languages, especially Java (which we used for our implementation), the length of a string is stored in a variable and can thus be accessed in constant time.

3.2 Filtering ranges by length

The approach described above can be reversed to limit the number of pairs that we iterate over. To this end, we can construct an index that maps string lengths $l \in \mathbb{N}$ to all strings $s$ with $|s| = l$. With the help of this index, we can now determine the set of strings $t$ that should be compared with the subset $S(l)$ of $S$ that only contains strings of length $l$. We go about using this insight by computing the upper and lower bound for the length of a string $t$ that should be compared with a string $s$. This is basically equivalent to asking what is the maximum length difference $\big||s| - |t|\big|$ so that $e(s, t) \geq \theta$ is still satisfied. We transpose Equation 5 to the following for our lower bound:

Analogously, we can derive the following upper bound:
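One way to carry out this transposition, offered as a sketch rather than as the authors' exact formulation: writing $r = |s_1| / |s_2| \leq 1$, Equation 5 becomes $e = \frac{2 + \ell p}{3} + r \cdot \frac{1 - \ell p}{3}$, so for $\ell p < 1$ the condition $e \geq \theta$ holds exactly when $r \geq \rho$ with $\rho = \frac{3\theta - 2 - \ell p}{1 - \ell p}$. A string of length $n$ therefore only needs to be compared with strings whose length lies between $\lceil \rho n \rceil$ and $\lfloor n / \rho \rfloor$. The Python code below assumes the most permissive prefix length $\ell = 4$ and always uses the boosted bound, which can only widen the range; the names `length_range`, `build_length_index` and `candidates` are hypothetical.

```python
import math
from collections import defaultdict


def length_range(n, theta, max_prefix=4, p=0.1):
    """Lengths (lo, hi) that can still reach theta against a string of length n."""
    lp = max_prefix * p
    if lp >= 1.0:
        return 1, math.inf          # the bound is always 1: nothing can be pruned
    rho = (3 * theta - 2 - lp) / (1 - lp)
    if rho <= 0:
        return 1, math.inf          # threshold too low for length-based pruning
    eps = 1e-9                      # widen slightly to guard against rounding errors
    lo = max(1, math.ceil(rho * n - eps))
    hi = math.floor(n / rho + eps)
    return lo, hi


def build_length_index(strings):
    """Map each string length to the list of strings of that length."""
    index = defaultdict(list)
    for s in strings:
        index[len(s)].append(s)
    return index


def candidates(s, index, theta):
    """All indexed strings whose length falls into the admissible range for s."""
    lo, hi = length_range(len(s), theta)
    return [t for length, bucket in index.items() if lo <= length <= hi for t in bucket]


print(length_range(7, 0.9))
```

For $\theta = 0.9$ the ratio is $\rho = 0.5$, so a string of length 7 is only paired with strings of lengths 4 to 14; whole buckets of the index outside that range are never touched.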

For example, consider a list of strings $S$ with equally distributed, distinct string lengths (4, 7, 11, 18). Using Equation 6 and Equation 7 we obtain Table 1. Taking into account the last column of the table, we will save a total of 38 comparisons.

Table 1. Columns: $|t|$, $|s|_{min}$, $|s|_{max}$, sizes in range.

An even more fine-grained approach can be chosen to filter out computations. Let $e : \Sigma^* \times A \to \mathbb{N}$ be the function which returns the number of occurrences of a given character $c$ in a string $s$. For the strings $s_1$ and $s_2$, the number of maximum possible matches $m_{max}$ can be expressed as

$$m_{max} = \sum_{c \in s_1} \min(e(s_1, c), e(s_2, c)) \geq m \tag{8}$$

Consequently, we can now substitute $m_{max}$ for $m$ in the Jaro distance computation:

$$d_j(s_1, s_2) \leq \frac{1}{3}\left(\frac{m_{max}}{|s_1|} + \frac{m_{max}}{|s_2|} + 1\right) \tag{9}$$

This upper bound reaches the threshold $\theta$ iff

$$m_{max} \geq \frac{(3\theta - 1)\,|s_1|\,|s_2|}{|s_1| + |s_2|} \tag{10}$$
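A minimal Python sketch of this character-based check is given below. The names `max_possible_matches` and `passes_char_filter` are hypothetical, the sum is taken over the distinct characters of $s_1$, and the test is applied to the Jaro bound only (the Winkler boost is left out for brevity).

```python
from collections import Counter


def max_possible_matches(s1: str, s2: str) -> int:
    """Upper bound m_max on the number of matches: shared characters with multiplicity."""
    # The multiset intersection keeps, for each character, the smaller of both counts.
    return sum((Counter(s1) & Counter(s2)).values())


def passes_char_filter(s1: str, s2: str, theta: float) -> bool:
    """True if the pair may still reach a Jaro score of theta according to m_max."""
    if not s1 or not s2:
        return theta <= 0
    required = (3 * theta - 1) * len(s1) * len(s2) / (len(s1) + len(s2))
    return max_possible_matches(s1, s2) >= required


print(max_possible_matches("astronaut", "astrochimp"))
```

For the strings "astronaut" and "astrochimp", the shared characters are a, s, t, r and o, each counted once, so at most 5 characters can match; for $\theta = 0.9$ at least $1.7 \cdot 90 / 19 \approx 8.05$ matches would be required, and the pair can be discarded.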

For instance, let $s_1$ = "astronaut" and $s_2$ = "astrochimp". The retrieval of $m_{max}$ is shown in Table 2.

4 Evaluation

The aim of our evaluation was to study how well our approach performs on real data. We chose DBpedia 3.9 as a source of data for our experiments as it contains data pertaining to 1.1 million persons and thus allows for both fine-grained evaluations and scalability evaluations. All experiments were deduplication experiments, i.e., $S = T$. We considered the list of all rdfs:label in DBpedia in our runtime evaluation and scalability experiments. We also computed the runtime of our approach on up to $10^5$ labels for our scalability experiments. All experiments were performed on a 2.5 GHz Intel Core i5 machine with 16GB RAM running OS X 10.9.3.

In our first series of experiments, we evaluated the runtime of all filter combinations against the naïve approach on a small dataset containing 1000 labels from DBpedia. The results of our evaluation are shown in Figure 4.1. This evaluation suggests that all filters outperform the naïve approach. Moreover, the combination of all filters leads to the best overall runtime in most cases. Interestingly, the character-based filter leads to a significant reduction of the number of comparisons (see Figure 2) by more than 2 orders of magnitude. However, the runtime improvement is not as substantial. This result seems to indicate that the lookup in the character indexes is very time-demanding. We will thus aim to improve our character indexing in future work. Overall, the results on this dataset already show that we outperform the naïve approach by more than an order of magnitude when $\theta$ is high. The runtimes on a larger sample of size $10^4$ show an even better improvement (see Figure 3). This suggests that the relative improvement of our approach grows with the size of the problem.


The aim of the scalability evaluation was to measure how well our approach deals with datasets of growing size. In our first set of experiments, we looked at the growth of the runtime of our approach on datasets of growing sizes. Our results suggest that the runtime of our approach grows linearly with the number of labels contained in $S$ and $T$ (see Figure 5). This suggests that the runtime of our approach can be easily predicted for large datasets, which is of importance when asking users to wait for the results of the computation. The second series of scalability experiments looked at the runtime behaviour of our approach on a large dataset with $10^5$ labels. Our results suggest that the runtime of our approach falls superlinearly with an increase of the threshold $\theta$ (see Figure 4). This behaviour suggests that our approach is especially useful on clean datasets, where high thresholds can be used for link discovery.

Figure legend: naïve, range (r), length (l), freq. (f), r+l, r+f, r+l+f. Horizontal axis: Threshold.

Fig. 5. Runtimes with multiple thresholds for growing input sizes. Series: naïve, r+l ($\theta$ = 0.91, 0.95, 0.99), r+l+f ($\theta$ = 0.91, 0.95, 0.99).

We compared our approach with SILK 2.6.0. To this end, we retrieved all rdfs:label of instances of subclasses of Person. We only compared with SILK on small datasets (i.e., on classes with small numbers of instances) as the results on these small datasets already showed that we outperform SILK consistently.¹⁰ Our results are shown in Table 3. They suggest that the absolute difference in runtime grows with the size of the datasets. Thus, we did not consider testing larger datasets against SILK as in the best case, we were already 4.7 times faster than SILK (Architect dataset, $\theta = 0.95$).

Table 3. Columns: DBpedia Class, Size, OA(0.8), OA(0.9), OA(0.95), SILK(0.8), SILK(0.9), SILK(0.95). Rows: Actors, Architect, Criminal.

5 Related Work

The work presented herein is related to record linkage, deduplication, link discovery and the efficient computation of Hausdorff distances. An extensive amount of literature has been published by the database community on record linkage (see [11, 6] for surveys). With regard to time complexity, time-efficient deduplication algorithms such as PPJoin+ [29], EDJoin [28], PassJoin [12] and TrieJoin [26] were developed over the last years. Several of these were then integrated into the hybrid link discovery framework LIMES [16]. Moreover, dedicated time-efficient approaches were developed for link discovery. For example, RDF-AI [24] implements a five-step approach that comprises the preprocessing, matching, fusion, interlinking and post-processing of datasets. [17] presents an approach based on the Cauchy-Schwarz inequality that allows discarding a large number of unnecessary computations. The approaches HYPPO [14] and HR3 [15] rely on space tiling in spaces with measures that can be split into independent measures across the dimensions of the problem at hand. In particular, HR3 was shown to be the first approach that can achieve a relative reduction ratio $r'$ less than or equal to any given relative reduction ratio $r > 1$. Standard blocking approaches were implemented in the first versions of SILK and later replaced with MultiBlock [9], a lossless multi-dimensional blocking technique. KnoFuss [20] also implements blocking techniques to achieve acceptable runtimes. Further approaches can be found in [25, 4, 21, 22, 7].

In addition to addressing the runtime of link discovery, several machine-learning approaches have been developed to learn link specifications (also called linkage rules) for link discovery. For example, machine-learning frameworks such as FEBRL [2] and MARLIN [1] rely on models such as Support Vector Machines [3] and decision trees [23] to detect classifiers for record linkage. RAVEN [18] relies on active learning to detect linear or Boolean classifiers. The EAGLE approach [19] combines active learning and genetic programming to detect link specifications. KnoFuss [20] goes a step further and presents an unsupervised approach based on genetic programming for finding accurate link specifications. Other record deduplication approaches based on active learning and genetic programming are presented in [5, 8].

¹⁰ We ran SILK with -Dthreads=1 for the sake of fairness.

6 Conclusion and Future Work

In this paper, we presented a novel approach for the efficient execution of bounded Jaro-Winkler computations. Our approach is based on three filters which allow discarding a large number of comparisons. While our evaluation suggests that the filters are complementary, the character-based filter does not seem to contribute to a significant reduction of the runtime once we deal with large datasets. We showed that our approach scales linearly with the amount of data it is faced with. Moreover, we showed that our approach can make effective use of large thresholds by reducing the total runtime considerably. We also compared our approach with the state-of-the-art framework SILK 2.6.0 and showed that we outperform it on all datasets. In future work, we will study the character-based filter in more detail and aim to pinpoint and remove its performance bottleneck. Moreover, we will evaluate partitioning of datasets and parallelization of filters to further improve the runtime on large datasets. Finally, we will test whether our approach improves the accuracy of specification detection algorithms such as EAGLE.

References

1. Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '03, pages 39-48, New York, NY, USA, 2003. ACM.
2. Peter Christen. Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. In KDD, pages 1065-1068, 2008.
3. Nello Cristianini and Elisa Ricci. Support vector machines. In Encyclopedia of Algorithms. 2008.
4. Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, and Hermann de Meer. idMesh: graph-based disambiguation of linked data. In WWW, pages 591-600, 2009.
5. J. De Freitas, G.L. Pappa, A.S. da Silva, M.A. Gonçalves, E. Moura, Veloso, A.H.F. Laender, and M.G. de Carvalho. Active learning genetic programming for record deduplication. In Evolutionary Computation (CEC), 2010 IEEE Congress on, pages 1-8. IEEE, 2010.
6. Ahmed Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1-16, 2007.
7. Jérôme Euzenat, Alfio Ferrara, Willem Robert van Hage, et al. Results of the Ontology Alignment Evaluation Initiative 2011. 2011.
8. Robert Isele and Christian Bizer. Learning expressive linkage rules using genetic programming. PVLDB, 5(11):1638-1649, 2012.
9. Robert Isele, Anja Jentzsch, and Christian Bizer. Efficient Multidimensional Blocking for Link Discovery without losing Recall. In WebDB, 2011.
10. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.
11. Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data Knowl. Eng., 69(2):197-210, 2010.
12. Guoliang Li, Dong Deng, Jiannan Wang, and Jianhua Feng. Pass-join: a partition-based method for similarity joins. Proc. VLDB Endow., 5(3):253-264, November 2011.
13. Axel-Cyrille Ngonga Ngomo. Orchid - reduction-ratio-optimal computation of geo-spatial distances for link discovery. In International Semantic Web Conference (1), pages 395-410, 2013.
14. Axel-Cyrille Ngonga Ngomo. A Time-Efficient Hybrid Approach to Link Discovery. In OM, 2011.
15. Axel-Cyrille Ngonga Ngomo. Link discovery with guaranteed reduction ratio in affine spaces with Minkowski measures. In International Semantic Web Conference (1), pages 378-393, 2012.
16. Axel-Cyrille Ngonga Ngomo. On link discovery using a hybrid approach. J. Data Semantics, 1(4):203-217, 2012.
17. Axel-Cyrille Ngonga Ngomo and Sören Auer. LIMES - A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data. In IJCAI, pages 2312-2317, 2011.
18. Axel-Cyrille Ngonga Ngomo, Jens Lehmann, Sören Auer, and Konrad Höffner. Raven - active learning of link specifications. In OM, 2011.
19. Axel-Cyrille Ngonga Ngomo and Klaus Lyko. Eagle: Efficient active learning of link specifications using genetic programming. In ESWC, pages 149-163, 2012.
20. Andriy Nikolov, Mathieu d'Aquin, and Enrico Motta. Unsupervised learning of link discovery configuration. In ESWC, pages 119-133, 2012.
21. Andriy Nikolov, Victoria Uren, Enrico Motta, and Anne Roeck. Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution. In Proceedings of the 4th Asian Conference on The Semantic Web, ASWC '09, pages 332-346, Berlin, Heidelberg, 2009. Springer-Verlag.
22. George Papadakis, Ekaterini Ioannou, Claudia Niederée, Themis Palpanas, and Wolfgang Nejdl. Eliminating the redundancy in blocking-based entity resolution methods. In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, pages 85-94, New York, NY, USA, 2011. ACM.
23. S. R. Safavian and Landgrebe. A survey of decision tree classifier methodology. Systems, Man and Cybernetics, IEEE Transactions on, 21(3):660-674, 1991.
24. Francois Scharffe, Yanbin Liu, and Chuguang Zhou. RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In Proc. IJCAI 2009 workshop on Identity, reference, and knowledge representation (IR-KR), Pasadena (CA US), 2009.
25. Dezhao Song and Jeff Heflin. Automatically generating data linkages using a domain-independent candidate selection approach. In International Semantic Web Conference (1), pages 649-664, 2011.
26. Jiannan Wang, Guoliang Li, and Jianhua Feng. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219-1230, 2010.
27. William Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354-359, 1990.
28. Chuan Xiao, Wei Wang, and Xuemin Lin. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933-944, 2008.
29. Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient similarity joins for near duplicate detection. In WWW, pages 131-140, 2008.