<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Rapid Execution of Weighted Edit Distances</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Soru</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
          <email>ngongag@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Computer Science, University of Leipzig, Augustusplatz 10</institution>
          ,
          <addr-line>04109 Leipzig</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>The comparison of large numbers of strings plays a central role in ontology matching, record linkage and link discovery. While several standard string distance and similarity measures have been developed with these explicit goals in mind, similarities and distances learned out of the data have been shown to often perform better with respect to the F-measure that they can achieve. Still, the practical use of data-specific measures is often hindered by one major factor: their runtime. While time-efficient algorithms that allow scaling to millions of strings have been developed for standard metrics over the last years, data-specific versions of these measures are usually slow to run and require significantly more time for the same task. In this paper, we present an approach for the time-efficient execution of weighted edit distances. Our approach is based on a sequence of efficient filters that allow reducing the number of candidate pairs for which the weighted edit distance has to be computed. We also show how existing time-efficient deduplication approaches based on the edit distance can be extended to deal with weighted edit distances. We compare our approach with such an extension of PassJoin on benchmark data and show that we outperform it by more than one order of magnitude.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The computation of string similarities plays a central role in manifold disciplines, especially in ontology matching and link discovery on the Web of Data [20]. (Throughout this paper, we use the expression “link discovery” to mean the discovery of typed relations that link instances from knowledge bases on the Web of Data.) Over the last decades, manifold domain-specific string similarities have been developed for improving the accuracy of automatic techniques that rely on them. For example, the Jaro-Winkler similarity was developed especially to perform well on person names [22]. Still, newer works in machine learning have shown that learning string similarities directly from data can lead to algorithms with a performance superior to that of approaches which rely on standard similarity measures. Especially, work on link discovery on the Web of Data has shown that data-specific weighted edit distances can lead to higher F-measures [21].</p>
<p>One main problem has, however, plagued the approaches which rely on string similarity measures learned from data: their runtime. While dedicated algorithms for the time-efficient comparison of large volumes of data have been developed over the last years (e.g., PPJoin+ [24], EDJoin [23], PassJoin [12] and TrieJoin [6]), little attention has been paid to the time-efficient computation of data-specific string similarities. Thus, running the data-specific counterparts of standard similarity measures is often orders of magnitude slower. Previous work has circumvented this problem in manifold ways, including the execution of approximations of the data-specific similarity measure. For example, weighted edit distances are sometimes approximated by first computing the edit distance between two strings A and B and only subsequently applying the weight of each of the edit operations [10]. Other approximations can be found in [3, 2].</p>
<p>In this paper, we address the problem of the time-efficient computation of weighted edit distances by presenting a novel approach, REEDED. Our approach uses weight bounds on the input cost matrix to efficiently discard similarity computations that would lead to dissimilar pairs. By these means, REEDED can outperform state-of-the-art approaches for the computation of edit distances by more than one order of magnitude on real datasets. We explain our approach by using an example from link discovery based on the data shown in Table 1. Here, the task is to detect possible pairs (s, t) ∈ S × T such that s owl:sameAs t, where S is a set of source resources and T is a set of target resources.</p>
<p>This paper is structured as follows: In Section 2, we present preliminaries to our work. Thereafter, we present the REEDED approach for the time-efficient computation of weighted edit distances (Section 3). In Section 4, we evaluate our approach on four datasets and show that we outperform a weighted version of the state-of-the-art approach PassJoin by more than one order of magnitude. Finally, we conclude with Section 6 after giving a brief overview of related work in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <sec id="sec-2-1">
        <title>Notation and Problem Statement</title>
<p>Let Σ be an alphabet and Σ* be the set of all sequences that can be generated by using elements of Σ. We call the elements of Σ characters and assume that Σ contains the empty character ε. The edit distance, or Levenshtein distance, of two strings A ∈ Σ* and B ∈ Σ* is the minimum number of edit operations that must be performed to transform A into B [11]. An edit operation can be the insertion or the deletion of a character, or the substitution of a character with another one. In a plain edit distance environment, all edit operations have a cost of 1. Thus, the distance between the strings “Generalized” and “Generalised” is the same as the distance between “Diabetes Type I” and “Diabetes Type II”. Yet, while the first pair of strings is clearly semantically equivalent for most applications, the elements of the second pair bear related yet significantly different semantics (especially for medical applications). To account for the higher probability of edit operations on certain characters bearing a higher semantic difference, weighted edit distances were developed. In a weighted edit distance environment, a cost function cost : Σ × Σ → [0, 1] is assigned to each of the possible edit operations. The totality of all costs can be encoded in a cost matrix M. The cost matrix is quadratic and of dimensions |Σ| × |Σ|, and the following holds:

∀i ∈ {1, …, |Σ|} : m_ii = 0.  (1)

The entry m_ij is the cost for substituting the ith character c_i of Σ with the jth character c_j of the same set. Note that if c_i = ε, m_ij encodes the insertion of c_j. On the other hand, if c_j = ε, m_ij encodes the deletion of c_i.</p>
<p>In most applications which require comparing large sets of strings, string similarities are used to address the following problem: Given a set S of source strings and a set T of target strings, find the set R(S, T, δ_p, θ) of all pairs (A, B) ∈ S × T such that

δ_p(A, B) ≤ θ,  (2)

where θ is a distance threshold and δ_p is the plain edit distance. Several scalable approaches have been developed to tackle this problem for plain edit distances [12, 24, 23]. Still, to the best of our knowledge, no scalable approach has been proposed for finding all (A, B) ∈ S × T such that δ(A, B) ≤ θ for weighted edit distances δ. In this paper we address exactly this problem by presenting REEDED. This approach assumes that the computation of weighted edit distances can be carried out by using an extension of the dynamic programming approach used for the plain edit distance.</p>
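<p>To make this concrete, the following minimal sketch (our own illustration, not the authors' implementation) shows how the dynamic programming scheme of the plain edit distance extends to the weighted case. The function cost(a, b) is an assumed interface returning the entry of M for characters a and b, with the empty character ε modelled as the empty string:</p>
<preformat>
# Minimal sketch of a weighted edit distance, assuming cost(a, b)
# returns a value in [0, 1] with cost(c, c) = 0 and '' playing the
# role of the empty character epsilon.
def weighted_edit_distance(a, b, cost):
    m, n = len(a), len(b)
    # d[i][j]: cheapest cost of transforming a[:i] into b[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(a[i - 1], '')   # delete a[i-1]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost('', b[j - 1])   # insert b[j-1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + cost(a[i - 1], ''),            # deletion
                d[i][j - 1] + cost('', b[j - 1]),            # insertion
                d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]),  # substitution
            )
    return d[m][n]
</preformat>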
      </sec>
      <sec id="sec-2-2">
        <title>Extension of Non-Weighted Approaches</title>
<p>All of the approaches developed to address the problem at hand with the plain edit distance can be easily extended to deal with weighted edit distances for which the dynamic programming approach underlying the computation of the plain edit distance still holds. Such an extension can be carried out in the following fashion: Let

µ = min { m_ij : 1 ≤ i, j ≤ |Σ| ∧ i ≠ j }.  (3)

Then, if the weighted edit distance between two strings A and B is d, at most d/µ edit operations were carried out to transform A into B. By using this insight, we can postulate that for any weighted edit distance δ with cost matrix M, the following holds:

∀A ∈ Σ* ∀B ∈ Σ*: δ(A, B) ≤ θ → δ_p(A, B) ≤ θ/µ.  (4)

Thus, we can reduce the task of finding the set R(S, T, δ, θ) to that of first finding R(S, T, δ_p, θ/µ) and subsequently filtering R(S, T, δ_p, θ/µ) by using the condition δ(A, B) ≤ θ. To the best of our knowledge, PassJoin [12] is currently the fastest approach for computing R(S, T, δ_p, θ) with plain edit distances. We thus extended it to deal with weighted edit distances and compared it with our approach. Our results show that we outperform the extension of PassJoin by more than one order of magnitude.</p>
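<p>The reduction just described can be sketched as follows. Here plain_edit_join is a hypothetical stand-in for any scalable plain edit distance join (e.g., an implementation of PassJoin) returning all pairs within an integer edit distance bound, and weighted_edit_distance is the sketch from Section 2.1:</p>
<preformat>
import math

# Sketch: run a plain edit distance join with the relaxed bound
# floor(theta / mu), then verify the candidates with the weighted
# distance. plain_edit_join is a placeholder, not a real API.
def weighted_join(S, T, cost, theta, mu, plain_edit_join):
    tau = math.floor(theta / mu)     # at most theta/mu edit operations
    candidates = plain_edit_join(S, T, tau)
    return [(a, b) for (a, b) in candidates
            if weighted_edit_distance(a, b, cost) &lt;= theta]
</preformat>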
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The REEDED Approach</title>
      <sec id="sec-3-1">
        <title>Overview</title>
<p>Our approach REEDED (Rapid Execution of Weighted Edit Distances) aims to compute similar strings using weighted edit distances within a practicable amount of time. The REEDED approach is basically composed of three nested filters as shown in Figure 1, where each filter takes a set of pairs as input and yields a subset of the input set according to a predefined rule. In the initial step of REEDED, the input data is loaded from S, a source data set, and T, a target data set. Their Cartesian product S × T is the input of the first, length-aware filter. The output of the first filter, L, is the input of the second, character-aware filter. The weighted edit distance is calculated only for the pairs that pass through the second filter, i.e., the set N. The final result, A, is the set of pairs whose weighted edit distance is less than or equal to a threshold θ. Note that pairs are processed one by one. This ensures that our algorithm performs well with respect to its space complexity.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Length-aware Filter</title>
<p>The length-aware filter is the first filter of REEDED. Once the data sets have been loaded, the Cartesian product

S × T = {⟨s, t⟩ : s ∈ S, t ∈ T}  (5)

is computed, which in our example corresponds to {⟨s1, t1⟩, ⟨s1, t2⟩, …, ⟨s5, t5⟩}. The basic insight behind the first filter is that given two strings s and t with lengths |s| resp. |t|, we need at least ||s| − |t|| edit operations to transform s into t. Now given that each edit operation costs at least µ, the cost of transforming s into t will be at least µ||s| − |t||. Consequently, the rule which the filter relies on is the following:

⟨s, t⟩ ∈ L ⟺ ⟨s, t⟩ ∈ S × T ∧ ||s| − |t|| ≤ τ.  (6)

In the following, we will set τ = θ/µ, where µ is as defined in Equation (3). In our example, let us assume θ = 1 and m_ij ∈ (0.5, 1.0]. Then τ = 2. If we assume that S.name has been mapped to T.name, then at the end of this step 13 of the 25 initial pairs in S × T are dismissed. The remaining 8 pairs are:

L = {⟨s1, t1⟩, ⟨s2, t2⟩, ⟨s3, t3⟩, ⟨s4, t4⟩, ⟨s5, t5⟩, ⟨s2, t3⟩, ⟨s4, t5⟩, ⟨s5, t4⟩}.  (7)</p>
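<p>A sketch of this filter in Python (our own illustration; the pair-at-a-time processing of REEDED is collapsed into a list comprehension for brevity):</p>
<preformat>
# Length-aware filter: a pair can only be within the weighted
# threshold theta if its length difference is at most tau = theta/mu.
def length_aware_filter(S, T, theta, mu):
    tau = theta / mu
    return [(s, t) for s in S for t in T
            if abs(len(s) - len(t)) &lt;= tau]
</preformat>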
      </sec>
      <sec id="sec-3-3">
        <title>Key Assumption</title>
<p>Similarly to the extensions of plain edit distances to weighted edit distances, REEDED assumes that the dynamic programming approach commonly used for computing plain edit distances can be used for computing the weighted edit distance described by the cost matrix M. With respect to M, this assumption translates to the weights in the matrix being such that there is no sequence of two edit operations m_ij and m_i′j′ that is equivalent to a third edit operation m_i″j″ for (i ≠ j) ∧ (i′ ≠ j′) ∧ (i″ ≠ j″). Formally, we can enforce this condition on the cost matrix M by ensuring that

∃k &gt; 0 ∀m_ij : k &lt; m_ij ≤ 2k.  (8)

Given that the entries in cost matrices are usually bound to have a maximal value of 1, we will assume without restriction of generality that

∀i ∈ {1, …, |Σ|} ∀j ∈ {1, …, |Σ|} : i ≠ j → 0.5 &lt; m_ij ≤ 1.  (9)

Thus, in the following, we will assume that ∀m_ij with i ≠ j : m_ij &gt; 0.5.</p>
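<p>Under this assumption, checking whether a given cost matrix is admissible for REEDED reduces to a bounds test, sketched below for a matrix given as a nested list with the diagonal set to 0 (our own illustration):</p>
<preformat>
# Sketch: verify that all off-diagonal weights lie in (0.5, 1], so
# that no two edit operations can be as cheap as a single one.
def satisfies_key_assumption(M):
    return all(0.5 &lt; M[i][j] &lt;= 1.0
               for i in range(len(M))
               for j in range(len(M))
               if i != j)
</preformat>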
      </sec>
      <sec id="sec-3-4">
        <title>Character-aware Filter</title>
<p>The second filter is the character-aware filter, which only selects the pairs of strings that do not differ by more than a given number of characters. The intuition behind the filter is that given two strings s and t, if |C| is the number of characters that do not belong to both strings, we need at least ⌈|C|/2⌉ edit operations to transform s into t. As above, the cost of transforming s into t will be at least µ⌈|C|/2⌉. The characters of each string are collected into two sets, respectively C_s for the source string and C_t for the target string. Since s and t may contain more than one occurrence of a single character, characters in C_s and C_t are enumerated. Then, the algorithm computes their exclusive disjunction

C = C_s ⊕ C_t.  (10)

Finally, the filter performs the selection by applying the rule:

⟨s, t⟩ ∈ N ⟺ ⟨s, t⟩ ∈ L ∧ ⌈|C|/2⌉ ≤ τ.  (11)

In our example, a further pair can be dismissed by these means, leading to the set of remaining pairs being as follows:

N = {⟨s1, t1⟩, ⟨s2, t2⟩, ⟨s3, t3⟩, ⟨s4, t4⟩, ⟨s5, t5⟩, ⟨s4, t5⟩, ⟨s5, t4⟩}.  (12)

The pair that is rejected is ⟨s2, t3⟩, for which C = {h1, i1, o1, i2, a1, s1}, which leads to the rule not being satisfied.</p>
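<p>The enumeration of repeated characters amounts to comparing the character multisets of the two strings; the following sketch uses multiset (Counter) arithmetic, with names of our own choosing:</p>
<preformat>
from collections import Counter
from math import ceil

# Character-aware filter: characters occurring in only one of the two
# strings (counting multiplicity) form the exclusive disjunction C;
# bridging them needs at least ceil(|C|/2) edit operations.
def character_aware_filter(pairs, theta, mu):
    tau = theta / mu
    kept = []
    for s, t in pairs:
        cs, ct = Counter(s), Counter(t)
        c_size = sum((cs - ct).values()) + sum((ct - cs).values())
        if ceil(c_size / 2) &lt;= tau:
            kept.append((s, t))
    return kept
</preformat>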
      </sec>
      <sec id="sec-3-5">
<title>Verification</title>
<p>For all the pairs left in N, the weighted edit distance between s and t is calculated. After that, the third filter selects the pairs whose distance is less than or equal to θ:

⟨s, t⟩ ∈ A ⟺ ⟨s, t⟩ ∈ N ∧ δ(s, t) ≤ θ.  (13)

In our example data sets, the set

A = {⟨s1, t1⟩, ⟨s2, t2⟩, ⟨s3, t3⟩, ⟨s4, t4⟩, ⟨s5, t5⟩}  (14)

is the final result of the selection. Note that the pairs ⟨s4, t5⟩ and ⟨s5, t4⟩ are discarded because their distance (1.2) is greater than the threshold (1.0). A proof of the correctness of REEDED can be found in the extended version of this paper at http://svn.aksw.org/papers/2013/OM_Reeded/public.pdf.</p>
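<p>Putting the three steps together, the whole pipeline can be sketched by composing the helper sketches given above, where µ is the smallest off-diagonal weight of the cost matrix as in Equation (3):</p>
<preformat>
# End-to-end sketch of the REEDED pipeline (illustration only).
def reeded(S, T, cost, theta, mu):
    pairs = length_aware_filter(S, T, theta, mu)        # set L
    pairs = character_aware_filter(pairs, theta, mu)    # set N
    return [(s, t) for (s, t) in pairs                  # set A
            if weighted_edit_distance(s, t, cost) &lt;= theta]
</preformat>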
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
<p>The goal of our evaluation was to quantify how well REEDED performs in comparison to the state of the art. We thus compared REEDED with the extension of PassJoin as proposed in [12]. We chose PassJoin because it was shown to outperform other approaches for the efficient computation of edit distances, including EDJoin [23] and TrieJoin [6]. Note that we did run an implementation of EDJoin on the DBLP dataset presented below and it required approximately twice the runtime of PassJoin.</p>
<p>We compared the approaches across several distance thresholds on four different datasets that were extracted from real data (see Figure 2); the data used for the evaluation is publicly available at http://dbs.uni-leipzig.de/en/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution. The first two of these data sets contained publication data from the datasets DBLP and ACM. The third and fourth dataset contained product labels from the product catalogs Google Products and ABT [9]. We chose these datasets because they were extracted from real data sources and because of the different string length distributions across them. By running our experiments on these data sets, we could thus ensure that our results are not only valid on certain string length distributions. As weight matrix we used a confusion matrix built upon the frequency of typographical errors presented in [8]. The original confusion matrices report the number of occurrences f for each error:</p>
<p>S = { f_ij : substitution of i (incorrect) for j (correct) }.  (15)</p>
<p>For insertion and deletion, we calculate the total frequencies:</p>
<p>I = { f_ij : insertion of j after i },  (16)
D = { f_ij : deletion of j after i },  (17)
ω_j^I = Σ_i f_ij^I,   ω_j^D = Σ_i f_ij^D.  (18)</p>
<p>The weights of our weight matrix are thus defined as:

m_ij = 1 − f_ij^S / (2 max(f^S))  if i ≠ ε ∧ j ≠ ε,
m_ij = 1 − (ω_j^I − min(ω^I)) / (2 (max(ω^I) − min(ω^I)))  if i = ε ∧ j ≠ ε,
m_ij = 1 − (ω_i^D − min(ω^D)) / (2 (max(ω^D) − min(ω^D)))  if i ≠ ε ∧ j = ε.  (19)

In other words, the higher the probability of an error encoded in m_ij, the lower its weight.</p>
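<p>A sketch of how such weights can be derived from the confusion-matrix counts; the dictionaries sub_freq, ins_freq and del_freq are our own assumed encoding of the frequencies f (and the frequencies are assumed not to be all equal):</p>
<preformat>
# Sketch: map error frequencies to weights in [0.5, 1]; the more
# frequent an error, the lower the cost of the corresponding edit.
def substitution_weight(i, j, sub_freq):
    max_s = max(sub_freq.values())
    return 1.0 - sub_freq.get((i, j), 0) / (2 * max_s)

def insertion_weight(j, ins_freq):
    lo, hi = min(ins_freq.values()), max(ins_freq.values())
    return 1.0 - (ins_freq.get(j, lo) - lo) / (2 * (hi - lo))

def deletion_weight(i, del_freq):
    lo, hi = min(del_freq.values()), max(del_freq.values())
    return 1.0 - (del_freq.get(i, lo) - lo) / (2 * (hi - lo))
</preformat>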
<p>All experiments were carried out on a 64-bit server running Ubuntu 10.04 with 4 GB of RAM and a 2.5 GHz Xeon CPU. Each experiment was run 5 times.</p>
      <sec id="sec-4-1">
        <title>Results</title>
<p>In Figure 2 we show the string length distribution in the data sets. The results of our experiments are shown in Table 2.</p>
<p>[Figure 2: String length distribution per data set; the panels plot frequency against string length, including (a) data set DBLP.title and (b) data set ACM.authors.]</p>
<p>Our results show clearly that REEDED
outperforms PassJoin in all experimental settings. On the DBLP dataset (average string length = 56.36), REEDED is already 2 times faster than PassJoin for the threshold θ = 2. For θ = 4, we reach an order of magnitude of difference, with runtimes of 25.91 (REEDED) and 246.93 (PassJoin). The runtime of REEDED seems to grow quasi-linearly with increasing values of θ. The results on ACM corroborate the results for the two algorithms. Here, we are 2.16 times faster than PassJoin for θ = 2 and 6.57 times faster for θ = 5. We achieve similar results on the Google Products dataset and are an order of magnitude faster than PassJoin for θ = 4 already. The results we achieve on the ABT dataset allow deriving further characteristics of REEDED. Here, the algorithm scales best and requires for θ = 5 solely 1.58 times the runtime it required for θ = 1. This is clearly due to the considerably longer strings contained in this dataset.</p>
<p>We analyzed the results of each of the filters in our approach and measured the reduction ratio (given by 1 − |N|/|S × T|) achieved by the length-aware and character-aware filters. Table 3 shows the set sizes at each filtering step. Both the first and the second filter reduce the number of selected pairs by one or two orders of magnitude for all the datasets. As expected, the length-aware filter is most effective on datasets with large average string lengths. For example, only 1.9% of the Cartesian product of the ABT dataset makes it through the first filter for θ = 1, while the filter allows 6.8% of the DBLP Cartesian product through for θ = 1. One interesting characteristic of the approach is that the size of L grows quasi-linearly with the value of θ. The character-aware filter displays the opposite behavior to the length-aware filter and can discard more string pairs on data with small average string lengths. For example, less than 1% of L makes it through the filter for θ = 1 on the DBLP dataset, while 5.1% of L makes it through the same filter for θ = 1 on ABT.</p>
<p>We also measured the runtime improvement as well as the precision and recall we achieved by combining REEDED with the ACIDS approach and applying this combination to the datasets reported in [21]. The results are shown in Table 4. For the datasets on which the edit distance can be used, the approach achieves higher precision and recall than state-of-the-art approaches (such as MARLIN [4] and Febrl [5]) which do not rely on data-specific measures. Yet, on noisier datasets, the approach leads to poorer results. In particular, the edit distance has been shown not to be a good measure when the strings to be compared are too long. Also, the words contained in the source string may be completely different from the words contained in the target string, yet refer to the same meaning. A notable shortcoming of the ACIDS approach is its runtime: the learning system iterated for at least 7 hours to find the weight configuration of the weighted edit distance and optimize the classification [21]. As shown in Table 4, REEDED enhances the execution time of ACIDS, reducing the total runtime by 3 orders of magnitude on the DBLP-ACM and the ABT-Buy datasets.</p>
    </sec>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>Our work is mostly related to the rapid execution of similarity functions and link discovery. Time-efficient string comparison algorithms such as PPJoin+ [24], EDJoin [23], PassJoin [12] and TrieJoin [6] were developed for the purpose of entity resolution and were integrated into frameworks such as LIMES [15]. In addition to time-efficient string similarity computation approaches for entity resolution, approaches for the efficient computation of string and numeric similarities were developed in the area of link discovery. For example, [16] presents an approach based on the Cauchy-Schwarz inequality. The approaches HYPPO [13] and HR3 [14] rely on space tiling in spaces with measures that can be split into independent measures across the dimensions of the problem at hand. Especially, HR3 was shown to be the first approach that can achieve a relative reduction ratio r′ less than or equal to any given relative reduction ratio r &gt; 1. Another way to go about computing R(S, T, δ, θ) lies in the use of lossless blocking approaches such as MultiBlock [7].</p>
<p>Manifold approaches have been developed for string similarity learning (see, e.g., [19, 4, 3, 2]). [4], for example, learns edit distances by employing batch learning and SVMs for record deduplication and points out that domain-specific similarities can improve the quality of classifiers. [3, 2] rely on a theory of good edit distances developed by [1] to determine classifiers based on edit distances that are guaranteed to remain under a given classification error. Yet, to the best of our knowledge, REEDED is the first approach for the time-efficient execution of weighted edit distances.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we presented REEDED, an approach for the time-efficient comparison of sets of strings using weighted edit distances. After presenting the intuitions behind our approach, we proved that it is both correct and complete. We compared our approach with an extension of PassJoin for weighted edit distances and showed that we are more than an order of magnitude faster on 4 different data sets. REEDED is the cornerstone of a larger research agenda. As it now enables running weighted edit distances on large datasets within acceptable times, it is also the key to developing active learning systems for link discovery that do not only learn link specifications but also similarity measures directly out of the data. As shown in [21], this combination promises to outperform the state of the art, which has relied on standard measures so far. In future work, we will thus combine REEDED with specification learning approaches such as EAGLE [18] and RAVEN [17] and study the effect of weighted edit distances on these approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Maria-Florina</surname>
            <given-names>Balcan</given-names>
          </string-name>
          , Avrim Blum, and
          <string-name>
            <given-names>Nathan</given-names>
            <surname>Srebro</surname>
          </string-name>
          .
          <article-title>Improved guarantees for learning via similarity functions</article-title>
          .
          <source>In COLT</source>
          , pages
          <volume>287</volume>
          {
          <fpage>298</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Aurelien</given-names>
            <surname>Bellet</surname>
          </string-name>
          , Amaury Habrard, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Sebban</surname>
          </string-name>
          .
          <article-title>Good edit similarity learning by loss minimization</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>89</volume>
          (
          <issue>1-2</issue>
          ):5{
          <fpage>35</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Aurlien</given-names>
            <surname>Bellet</surname>
          </string-name>
          , Amaury Habrard, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Sebban</surname>
          </string-name>
          .
          <article-title>Learning good edit similarities with generalization guarantees</article-title>
          .
          <source>In Proceedings of the ECML/PKDD</source>
          <year>2011</year>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Mikhail</given-names>
            <surname>Bilenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Raymond J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <article-title>Adaptive duplicate detection using learnable string similarity measures</article-title>
          .
          <source>In KDD</source>
          , pages
          <volume>39</volume>
          {
          <fpage>48</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>