<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Time-Efficient Execution of Bounded Jaro-Winkler Distances</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kevin Dreßler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Leipzig AKSW Research Group Augustusplatz 10</institution>
          ,
          <addr-line>04103 Leipzig</addr-line>
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Over the last years, time-efficient approaches for the discovery of links between knowledge bases have been regarded as a key requirement towards implementing the idea of a Data Web. A considerable portion of the information contained available as RDF on the Web pertains to persons. Thus, efficient and effective measures for comparing names are central to facilitate the integration of information about persons on the Web of Data. The Jaro-Winkler measure has been developed especially for the purpose of comparing person names. Hence, we present a novel approach for the efficient comparison of sets of strings using this measure. We evaluate our approach on several datasets derived from DBpedia 3.9 and containing up to 105 strings and show that it scales linearly with the size of the data for large thresholds. We also evaluate our approach against SILK and show that we outperform it even on small datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The Linked Open Data Cloud (LOD Cloud) has developed to a compendium of more
than 2000 datasets over the last few years.1 Currently, data sets pertaining to more than
14 million persons have already been made available on the Linked Data Web.2 While
this number is impressive on its own, it is well known that the population of the planet
has surpassed 7 billion people. Hence, the Web of Data contains information on less
that 1% of the overall population of the planet (counting both the living and the dead).
The output of open-government movements,3 scientific conferences,4 health data5 and
similar endeavours yet promises to make massive amounts of data pertaining to
persons available in the near future. Dealing with this upcoming increase of the number of
person-related resources requires providing means to integrate these datasets with the
aim to facilitate statistical analysis, data mining, personlization, etc. However, while the
1 See http://stats.lod2.eu for an overview of the current state of the Cloud. Last
access: July 11th, 2014.
2 Data collected from http://stats.lod2.eu. Last access: July 11th, 2014.
3 See for example http://data.gov.uk/.
4 See for example http://data.semanticweb.org/
5 http://aksw.org/Projects/GHO
number of datasets on the Linked Data Web grows drastically, the number of links
between datasets still stagnates.6 Addressing this lack of links requires solving two main
problems: the quadratic time complexity of link discovery (efficiency) and the
automatic support of the detection of link specifications (effectiveness). In this paper, we
address the efficiency of the execution of bounded Jaro-Winkler measures,7 which are
known to be effective when comparing person names [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To this end, we derive
equations that allow discarding a large number of computations while executing bounded
Jaro-Winkler comparisons with high thresholds.
      </p>
      <p>The contributions of this paper are as follows:
1. We derive length- and range-based filters that allow reducing the number of strings
t that are compared with a string s .
2. We present a character-based filter that allows detecting whether two strings s and
t share enough resemblance to be similar according to the Jaro-Winkler measure.
3. We evaluate our approach w.r.t. to its runtime and its scalability with several
threshold settings and dataset sizes.</p>
      <p>The rest of this paper is structured as follows: In Section 2, we present the problem
we tackled as well as the formal notation necessary to understand this work. In the
subsequent Section 3, we present the three approaches we developed to reduce the runtime
of bounded Jaro-Winkler computations. We then evaluate our approach in Section 4.
Related work is presented in Section 5, where we focus on approaches that aim to
improve the time-efficiency of link discovery. We conclude in Section 6. The approach
presented herein is now an integral part of LIMES.8
2</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>In the following, we present some of the symbols and terms used within this work.
2.1</p>
      <sec id="sec-2-1">
        <title>Link Discovery</title>
        <p>
          In this work, we use link discovery as a hypernym for deduplication, record linkage,
entity resolution and similar terms used across literature. The formal specification of
link discovery adopted herein is tantamount to the definition proposed in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]: Given
a set S of source resources, a set T of target resources and a relation R, our goal is
to find the set M S T of pairs (s; t) such that R(s; t). If R is owl:sameAs,
then we are faced with a deduplication task. Given that the explicit computation of
M is usually a very complex endeavour, M is most commonly approximated by a set
M 0 = f(s; t; (s; t)) 2 S T R+ : (s; t) g, where is a (potentially
complex) similarity function and 2 [0; 1] is a similarity threshold. Given that this problem
is in O(n2), using na¨ıve algorithms to compare large S and T is most commonly
impracticable. Thus, time-efficient approaches for the computation of bounded measures
6 http://linklion.org
7 We use bounded measures in the same sense as [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], i.e., to mean that we are only interested
in pairs of strings whose similarity is greater than or equal to a given lower bound.
8 http://limes.sf.net
have been developed over the last years for measures such as the Levenshtein distance,
Minkowski distances, trigrams and many more [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>
          In this paper, we thus study the following problem: Given a threshold 2 [0; 1] and
two sets of strings S and T , compute the set M 0 = f(s; t; (s; t)) 2 S T R+ :
(s; t) g. Two categories of approaches can be considered to improve the runtime
of measures: Lossy approaches return a subset M 00 of M 0 which can be calculated
efficiently but for which there are no guarantees that M 00 = M 0. Lossless approaches
on the other hand ensure that their result set M 00 is exactly the same as M 0. In this
paper, we present a lossless approach. To the best of our knowledge, only one other
link discovery framework implements a lossless approach that has been designed to
exploit the bound defined by the threshold to ensure a more efficient computation of
the Jaro-Winkler distance, i.e., the SILK framework with the approach MultiBlock [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
We thus compare our approach with SILK 2.6.0 in the evaluation section of this paper.
2.2
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>The Jaro-Winkler Similarity</title>
        <p>Let be the set of all the strings that can be generated by using an alphabet A. The
Jaro measure dj : ! [0; 1] is a string similarity measure approach which was
developed originally for name comparison in the U.S. Census. This measure takes into
account the number of character matches m and the ratio of their transpositions t:
dj =
( 0
31 jsm1j + jsm2j + mm t
if m = 0
otherwise
Here two characters are considered to be a match if and only only if (1) they are the
same and (2) they are at most at a distance w = b max(js21j;js2j) c from each other. For
example, for s1 = "Spears" and s2 = "P ears", the second s of s1 matches the s of
s2 while the first s of s1 does not match the s of s2.</p>
        <p>
          The Jaro-Winkler measure [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] is an extension of the Jaro distance. This extension is
based on Winkler’s observation that typing errors occur most commonly in the middle
or at the end of a word, but very rarely in the beginning. Hence, it is legitimate to
put more emphasis on matching prefixes if the Jaro distance exceeds a certain ”boost
threshold” bt, originally set to 0:7.
        </p>
        <p>dw =
dj
dj + (`p(1</p>
        <p>if dj &lt; bt
dj )) otherwise
(1)
(2)
Here, ` denotes the length of the common prefix and p is a weighting factor. Winkler
uses p = 0:1 and ` 4. Note that `p must not be greater than 1.</p>
        <p>For the strings s1 = "DEM OCRACY "; s2 = "DEM OGARP HY " (with s2
being intentionally misspelled) we get the following output of the Jaro-Winkler
measure.</p>
        <p>– js1j = 9; js2j = 10
– w = 4
– m = 7
– t = 1
– dj = 13 jsm1j + jsm2j + mm t
– dw = dj + `p (1 dj ) = 0:867</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Improving the Runtime of Bounded Jaro-Winkler</title>
      <p>The main principle behind reducing the runtime of the computation of measures is to
reduce their reduction ratio. Here, we use a sequence of filters that allow discarding
similarity computations while being sure that they would have led to a similarity score
which would have been less than our threshold . To this end, we regard the problem as
that of finding filters that return an upper bound estimation e(s1; s2) dw(s1; s2) for
some properties of the input strings that can be computed in constant time. For a given
threshold , if e(s1; s2) , then we can safely ignore the input (s1; s2).
3.1</p>
      <sec id="sec-3-1">
        <title>Length-based filters</title>
        <p>In the following, we denoted the length of a string s with jsj. Our first filter is based on
the insight that large length differences are a guarantee for poor similarity. For example,
the strings "a" and "alpha" cannot have a Jaro-Winkler similarity of 1 by virtue of
their length difference. We can formalize this idea as follows: Let s1 and s2 be strings
with respective lengths js1j and js2j. Without loss of generality, we will assume that
js1j js2j. Moreover, let m be the number of matches across s1 and s2. Because
m js1j, we can substitute m with js1j and gain the following upper bound estimation
for dj (s1; s2):
dj =
Now the lower bound for the number t of transpositions is 0. Thus, we obtain the
following equation.</p>
        <p>dj 31 1 + jjss21jj + 1 32 + 3jjss12jj
The application of this approximation on Winkler’s extension is trivial:
dw = dj + ` p (1
dj )
2 + js1j + ` p
3 3js2j</p>
        <p>Consider the pair s1 = "bike" and s2 = "bicycle" and a threshold = 0:9.
Applying the estimation for Jaro we get dj 23 347 = 0:857. This exceeeds the boost
threshold, so we use equation 5 to compute e(s1; s2) = 0:885. Now we do not have to
actually compute dw(s1; s2), since e(s1; s2) &lt; .</p>
        <p>By using this approach we can decide in O(1)9 if a given pairs score is greater than
a given threshold, which saves us the much more expensive score computation for a big
number of pairs, provided that the input strings sufficiently vary in length.
9 In most programming languages, especially Java (which we used for our implementation), the
length of string is stored in a variable and can thus be accessed in constant time.
(3)
(4)
(5)
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Filtering ranges by length</title>
        <p>The approach described above can be reversed to limit the number of pairs that we are
going to be iterated over. To this end, we can construct a index : N ! 2 which maps
strings lengths l 2 N to all strings s with jsj = l. With the help of this index, we can
now determine the set of strings t that should be compared with the subset S(l) of S
that only contains strings of length l. We go about using this insight by computing the
upper and lower bound for the length of a string t that should be compared with a string
s. This is basically equivalent to asking what is the minimum length difference jjsj jtjj
so that e(s; t) is satisfied. We transpose equation 5 to the following for our lower
bound:</p>
        <sec id="sec-3-2-1">
          <title>Analogously, we can derive the following upper bound:</title>
          <p>For example, consider a list of strings S with equally distributed, distinct string
lengths (4; 7; 11; 18). Using Equation 6 and Equation 7 we obtain Table 1. Taking into
account the last column of the table, we will save a total of 38 comparisons.
jtj jsjmin jsjmax sizes in range
An even more fine-grained approach can be chosen to filter out computations. Let e :</p>
          <p>A ! N be the function with returns the number of occurrences of a given character
c in a string s. For the strings s1 and2, the number of maximum possible matches mmax
can be expressed as
mmax =</p>
          <p>X
c2s1
min(e(s1; c); e(s2; c))
m
Consequently, we can now substitute m for mmax in the Jaro distance computation:
dj (s1; s2) =
mmax
iff
(3</p>
          <p>1)js1jjs2j :
js1j + js2j
(10)</p>
          <p>For instance, let s1 = "astronaut"; s2 = "astrochimp". The retrieval of mmax
ist shown in Table 2.
The aim of our evaluation was to study how well our approach performs on real data.
We chose DBpedia 3.9 as a source of data for our experiments as it contains data
pertaining to 1.1 million persons and thus allows for both fine-grained evaluations and
scalability evaluations. All experiments where deduplication experiments, i.e., S = T .
We considered the list of all rdfs:label in DBpedia in our runtime evaluation and
scalability experiments. We also computed the runtime of our approach on up to 105
labels for our scalability experiments. All experiments were performed on a 2.5 GHz
Intel Core i5 machine with 16GB RAM running OS X 10.9.3.
In our first series of experiments, we evaluated the runtime of all filter combinations
against the na¨ıve approach on a small dataset containing 1000 labels from DBpedia.
The results of our evaluation are shown in Figure 4.1. This evaluation suggests that
all filters outperform the na¨ıve approach. Moreover, the combination of all filtersl lead
to the best overall runtime in most cases. Interestingly, the character-based filter leads
to a significant reduction of the number of comparisons (see Figure 2) by more than
2 orders of magnitude. However, the runtime improvement is not as substantial. This
result seems to indicate that the lookup in the character indexes is very time-demanding.
We will thus aim to improve our character indexing in future work. Overall, the results
on this dataset already shows that we outperform the na¨ıve approach by more than an
order of magnitude when is high. The runtimes on a larger sample of size 104 show
an even better improvement (see Figure 3). This suggests that the relative improvement
of our approach improves with the size of the problem.</p>
          <p>0:8
0:85</p>
          <p>0:9</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Threshold</title>
          <p>0:95
1
The aim of the scalability evaluation was to measure how well our approach deals with
datasets of growing size datasets. In our first set of experiments, we looked at the growth
of the runtime of our approach on datasets of growing sizes. Our results suggest that
our approach grows linearly with the number of labels contained in S and T (see
Figure 5). This suggests that the runtime of our approach can be easily predicted for large
datasets, which of importance when asking users to wait for the results of the
computation. The second series of scalability experiments looked at the runtime behaviour of
our approach on a large dataset with 105 labels. Our results suggest that the runtime of
our approach falls superlinearly with an increase of the threshold (see Figure 4). This
behaviour suggest that our approach is especially useful on clean datasets, where high
thresholds can be used for link discovery.</p>
          <p>0:8
0:85
0:95</p>
          <p>1
0:9</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Threshold</title>
          <p>na¨ıve
range (r)
length (l)
freq. (f)
r+l
r+f
r+l+f</p>
          <p>na¨ıve
r+l ( = 0:91)
r+l ( = 0:95)
r+l ( = 0:99)
r+l+f ( = 0:91)
r+l+f ( = 0:95)
r+l+f ( = 0:99)</p>
          <p>
            Fig. 5. Runtimes with multiple thresholds for growing input sizes
We compared our approach with SILK2.6.0. To this end, we retrieved all rdfs:label
of instances of subclasses of Person. We only compared with SILK on small datasets
(i.e., on classes with small numbers of instances) as the results on these small datasets
already showed that we outperform SILK consistently.10 Our results are shown in
Table 3. They suggest that the absolute difference in runtime grows with the size of the
datasets. Thus, we did not consider testing larger datasets against SILK as in the best
case, we were already 4.7 times faster than SILK (Architect dataset, = 0:95).
DBpedia Class Size OA(0:8) OA(0:9) OA(0:95) SILK(0:8) SILK(0:9) SILK(0:95)
Actors
Architect
Criminal
The work presented herein is related to record linkage, deduplication, link discovery
and the efficient computation of Hausdorff distances. An extensive amount of literature
has been published by the database community on record linkage (see [
            <xref ref-type="bibr" rid="ref11 ref6">11,6</xref>
            ] for
surveys). With regard to time complexity, time-efficient deduplication algorithms such as
PPJoin+ [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ], EDJoin [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ], PassJoin [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] and TrieJoin [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ] were developed over the
last years. Several of these were then integrated into the hybrid link discovery
framework LIMES [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. Moreover, dedicated time-efficient approaches were developed for
LD. For example, RDF-AI [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ] implements a five-step approach that comprises the
preprocessing, matching, fusion, interlink and post-processing of data sets. [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] presents
an approach based on the Cauchy-Schwarz that allows discarding a large number of
unnecessary computations. The approaches HYPPO [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] and HR3 [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] rely on space
tiling in spaces with measures that can be split into independent measures across the
dimensions of the problem at hand. Especially, HR3 was shown to be the first approach
that can achieve a relative reduction ratio r0 less or equal to any given relative
reduction ratio r &gt; 1. Standard blocking approaches were implemented in the first versions
of SILK and later replaced with MultiBlock [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], a lossless multi-dimensional blocking
technique. KnoFuss [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] also implements blocking techniques to achieve acceptable
runtimes. Further approaches can be found in [
            <xref ref-type="bibr" rid="ref21 ref22 ref25 ref4 ref7">25,4,21,22,7</xref>
            ].
          </p>
          <p>
            In addition to addressing the runtime of link discovery, several machine-learning
approaches have been developed to learn link specifications (also called linkage rules)
for link discovery. For example, machine-learning frameworks such as FEBRL [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]
and MARLIN [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] rely on models such as Support Vector Machines [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] and decision
10 We ran SILK with -Dthreads = 1 for the sake of fairness.
trees [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] to detect classifiers for record linkage. RAVEN [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] relies on active
learning to detect linear or Boolean classifiers. The EAGLE approach [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] combines active
learning and genetic programming to detect link specifications. KnoFuss [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] goes a
step further and presents an unsupervised approach based on genetic programming for
finding accurate link specifications. Other record deduplication approaches based on
active learning and genetic programming are presented in [
            <xref ref-type="bibr" rid="ref5 ref8">5,8</xref>
            ].
6
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we present a novel approach for the efficient execution of bounded
JaroWinkler computations. Our approach is based on three filters which allow discarding
a large number of comparisons. While our evaluation suggests that the filters are
complementary, the character-based filter seems not to contribute to a significant reduction
of the runtime once we deal with large datasets. We showed that our approach scales
linearly with the amount of data it is faced with. Moreover, we showed that our
approach can be make effective use of large thresholds by reducing the total runtime of
the approach considerably. We also compared our approach with the state-of-the-art
framework SILK 2.6.0 and showed that we outperform it on all datasets.
In future work, we will study the character-based filter in more detail and aim to
eradicate its exact performace bottleneck. Moreover, we will evaluate partitioning of datasets
and parallelization of filters to further improve the runtime of large datasets. Finally, we
will test whether our approach improves the accuracy of specification detection
algorithms such as EAGLE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Mikhail</given-names>
            <surname>Bilenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Raymond J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <article-title>Adaptive duplicate detection using learnable string similarity measures</article-title>
          .
          <source>In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <source>KDD '03</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          , New York, NY, USA,
          <year>2003</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Peter</given-names>
            <surname>Christen</surname>
          </string-name>
          .
          <article-title>Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface</article-title>
          .
          <source>In KDD</source>
          , pages
          <fpage>1065</fpage>
          -
          <lpage>1068</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Nello</given-names>
            <surname>Cristianini</surname>
          </string-name>
          and
          <string-name>
            <given-names>Elisa</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <article-title>Support vector machines</article-title>
          .
          <source>In Encyclopedia of Algorithms</source>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Cudre</surname>
          </string-name>
          ´-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, and Hermann de Meer.
          <article-title>idmesh: graph-based disambiguation of linked data</article-title>
          .
          <source>In WWW</source>
          , pages
          <fpage>591</fpage>
          -
          <lpage>600</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>J. De Freitas</surname>
            ,
            <given-names>G.L.</given-names>
          </string-name>
          <string-name>
            <surname>Pappa</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          da Silva, M.A. Gonc¸alves, E. Moura,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veloso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.H.F.</given-names>
            <surname>Laender</surname>
          </string-name>
          , and
          <string-name>
            <surname>M.G. de Carvalho</surname>
          </string-name>
          .
          <article-title>Active learning genetic programming for record deduplication</article-title>
          .
          <source>In Evolutionary Computation (CEC)</source>
          ,
          <source>2010 IEEE Congress on</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ahmed</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Elmagarmid</surname>
          </string-name>
          , Panagiotis G. Ipeirotis, and
          <string-name>
            <surname>Vassilios</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Verykios</surname>
          </string-name>
          .
          <article-title>Duplicate record detection: A survey</article-title>
          .
          <source>IEEE Trans. Knowl</source>
          . Data Eng.,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Je´roˆme Euzenat, Alfio Ferrara, Willem Robert van Hage, et al.
          <article-title>Results of the Ontology Alignment Evaluation Initiative 2011</article-title>
          .
          <string-name>
            <surname>In</surname>
            <given-names>OM</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Robert</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Learning expressive linkage rules using genetic programming</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Robert</given-names>
            <surname>Isele</surname>
          </string-name>
          , Anja Jentzsch, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Efficient Multidimensional Blocking for Link Discovery without losing Recall</article-title>
          . In WebDB,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jaro</surname>
          </string-name>
          .
          <article-title>Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>84</volume>
          (
          <issue>406</issue>
          ):
          <fpage>414</fpage>
          -
          <lpage>420</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hanna</surname>
          </string-name>
          <article-title>Ko¨pcke and Erhard Rahm. Frameworks for entity matching: A comparison</article-title>
          .
          <source>Data Knowl. Eng.</source>
          ,
          <volume>69</volume>
          (
          <issue>2</issue>
          ):
          <fpage>197</fpage>
          -
          <lpage>210</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Guoliang</surname>
            <given-names>Li</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Dong</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jiannan</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jianhua</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>Pass-join: a partition-based method for similarity joins</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>253</fpage>
          -
          <lpage>264</lpage>
          ,
          <year>November 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Axel-Cyrille Ngonga Ngomo. Orchid -</surname>
          </string-name>
          reduction
          <article-title>-ratio-optimal computation of geo-spatial distances for link discovery</article-title>
          .
          <source>In International Semantic Web Conference (1)</source>
          , pages
          <fpage>395</fpage>
          -
          <lpage>410</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Axel-Cyrille Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>A Time-Efficient Hybrid Approach to Link Discovery</article-title>
          .
          <source>In OM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Axel-Cyrille Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>Link discovery with guaranteed reduction ratio in affine spaces with minkowski measures</article-title>
          .
          <source>In International Semantic Web Conference (1)</source>
          , pages
          <fpage>378</fpage>
          -
          <lpage>393</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Axel-Cyrille Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>On link discovery using a hybrid approach</article-title>
          .
          <source>J. Data Semantics</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ):
          <fpage>203</fpage>
          -
          <lpage>217</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
          </string-name>
          Ngomo and
          <article-title>So¨ren Auer. LIMES - A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data</article-title>
          .
          <source>In IJCAI</source>
          , pages
          <fpage>2312</fpage>
          -
          <lpage>2317</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          , Jens Lehmann,
          <article-title>So¨ren Auer, and Konrad Ho¨ffner. Raven - active learning of link specifications</article-title>
          .
          <source>In OM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            Ngomo and
            <given-names>Klaus</given-names>
          </string-name>
          <string-name>
            <surname>Lyko</surname>
          </string-name>
          . Eagle:
          <article-title>Efficient active learning of link specifications using genetic programming</article-title>
          .
          <source>In ESWC</source>
          , pages
          <fpage>149</fpage>
          -
          <lpage>163</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Andriy</surname>
            <given-names>Nikolov</given-names>
          </string-name>
          , Mathieu d'Aquin,
          <string-name>
            <given-names>and Enrico</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>Unsupervised learning of link discovery configuration</article-title>
          .
          <source>In ESWC</source>
          , pages
          <fpage>119</fpage>
          -
          <lpage>133</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Andriy</surname>
            <given-names>Nikolov</given-names>
          </string-name>
          , Victoria Uren, Enrico Motta, and
          <string-name>
            <given-names>Anne</given-names>
            <surname>Roeck</surname>
          </string-name>
          .
          <article-title>Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution</article-title>
          .
          <source>In Proceedings of the 4th Asian Conference on The Semantic Web, ASWC '09</source>
          , pages
          <fpage>332</fpage>
          -
          <lpage>346</lpage>
          , Berlin, Heidelberg,
          <year>2009</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>George</surname>
            <given-names>Papadakis</given-names>
          </string-name>
          , Ekaterini Ioannou, Claudia Niedere´e, Themis Palpanas, and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Nejdl</surname>
          </string-name>
          .
          <article-title>Eliminating the redundancy in blocking-based entity resolution methods</article-title>
          .
          <source>In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11</source>
          , pages
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Safavian</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Landgrebe</surname>
          </string-name>
          .
          <article-title>A survey of decision tree classifier methodology</article-title>
          .
          <source>Systems, Man and Cybernetics</source>
          , IEEE Transactions on,
          <volume>21</volume>
          (
          <issue>3</issue>
          ):
          <fpage>660</fpage>
          -
          <lpage>674</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Francois</surname>
            <given-names>Scharffe</given-names>
          </string-name>
          , Yanbin Liu, and
          <string-name>
            <given-names>Chuguang</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Rdf-ai: an architecture for rdf datasets matching, fusion and interlink</article-title>
          .
          <source>In Proc. IJCAI 2009 workshop on Identity</source>
          , reference, and
          <article-title>knowledge representation (IR-KR)</article-title>
          ,
          <source>Pasadena (CA US)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>Dezhao</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Heflin</surname>
          </string-name>
          .
          <article-title>Automatically generating data linkages using a domainindependent candidate selection approach</article-title>
          .
          <source>In International Semantic Web Conference (1)</source>
          , pages
          <fpage>649</fpage>
          -
          <lpage>664</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Jiannan</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Guoliang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Jianhua</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>Trie-join: Efficient trie-based string similarity joins with edit-distance constraints</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1219</fpage>
          -
          <lpage>1230</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>William</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Winkler</surname>
          </string-name>
          .
          <article-title>String comparator metrics and enhanced decision rules in the fellegisunter model of record linkage</article-title>
          .
          <source>In Proceedings of the Section on Survey Research</source>
          , pages
          <fpage>354</fpage>
          -
          <lpage>359</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Chuan</surname>
            <given-names>Xiao</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xuemin</given-names>
            <surname>Lin</surname>
          </string-name>
          . Ed-Join:
          <article-title>an efficient algorithm for similarity joins with edit distance constraints</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>933</fpage>
          -
          <lpage>944</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Chuan</surname>
            <given-names>Xiao</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xuemin Lin</surname>
          </string-name>
          , and Jeffrey Xu Yu.
          <article-title>Efficient similarity joins for near duplicate detection</article-title>
          .
          <source>In WWW</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>140</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>