<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A High-Performance Approach to String Similarity using Most Frequent K Characters</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andre Valdestilhas</string-name>
          <email>valdestilhas@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Soru</string-name>
          <email>tsoru@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
          <email>ngonga@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AKSW/DICE, University of Leipzig</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The amount of available data has been growing significantly over the last decades. Thus, linking entries across heterogeneous data sources such as databases or knowledge bases, becomes an increasingly difficult problem, in particular w.r.t. the runtime of these tasks. Consequently, it is of utmost importance to provide time-efficient approaches for similarity joins in the Web of Data. While a number of scalable approaches have been developed for various measures, the Most Frequent k Characters (MFKC) measure has not been tackled in previous works. We hence present a sequence of filters that allow discarding comparisons when executing bounded similarity computations without losing recall. Therewith, we can reduce the runtime of bounded similarity computations by approximately 70%. Our experiments with a single-threaded, a parallel and a GPU implementation of our filters suggest that our approach scales well even when dealing with millions of potential comparisons.</p>
      </abstract>
      <kwd-group>
        <kwd>Similarity Search</kwd>
        <kwd>Blocking</kwd>
        <kwd>String Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        The problem of managing heterogeneity at both the semantic and syntactic levels among various information resources [12,10] is one of the most difficult problems of the information age. This is substantiated by most of the database research self-assessment reports, which acknowledge that the hard question of semantic heterogeneity, i.e., of handling variations in meaning or ambiguity in entity interpretation, remains open [10]. In knowledge bases, Ontology Matching (OM) solutions address the semantic heterogeneity problem in two steps: (1) matching entities to determine an alignment, i.e., a set of correspondences, and (2) interpreting an alignment according to application needs, such as data translation or query answering. Record Linkage (RL) and, more recently, Link Discovery (LD) solutions, on the other hand, aim to determine pairs of entries that abide by a given relation R. Throughout this paper, the expression "link discovery" means the discovery of typed relations that link instances from knowledge bases on the Web of Data; we never use it in the sense of graph theory. In both cases, string similarities are used to compute candidates for alignments. In addition to being central to RL and LD, these similarities also play a key role in several other tasks such as data translation, ontology merging and navigation on the Web of Data [10,2].
      </p>
      <p>One of the core tasks when developing time-efficient RL and LD solutions
hence lies in the development of time-efficient string similarities. In this paper, we
study the MFKC similarity function [8] and present an approach for improving
the performance of similarity joins. To this end, we develop a series of filters
which guarantee that particular pairs of resources do not abide by their respective
similarity threshold by virtue of their properties.</p>
      <p>
        The contributions of this paper are as follows:
1. We present two nested filters, (1) the First Frequency Filter and (2) the Hash Intersection Filter, which allow discarding candidates before calculating the actual similarity value, thus yielding a considerable performance gain.
2. We present the k similarity filter, which allows detecting whether two strings s and t are similar in fewer steps.
3. We evaluate our approach with respect to its runtime and its scalability with several threshold settings and dataset sizes.
4. We present several parallel implementations of our approach and show that they work well on problems where |Ds × Dt| ≥ 10^5 pairs.
      </p>
      <p>The rest of this paper is structured as follows: Section 2 presents related work, focusing on approaches that aim to improve the time-efficiency of the link discovery task. Section 3 presents our nested filters. Section 4 proves correctness and completeness, Section 5 presents the evaluation, and Section 6 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>State of the Art and Related Work</title>
      <p>
        Our approach can be considered an extension of the state-of-the-art algorithm introduced in [8], which describes a string-based distance function (SDF) based on string hashing [9,7]. The naive approach of MFKC [8] is a metric for string comparison built on a hash function, which takes a string and outputs its two most frequent characters along with their frequencies. This algorithm was used for text mining operations. The approach can be divided into two parts: (1) The hashing function is applied to both input strings, where the output is a string that contains the two most frequent characters; the first and third elements keep the characters, and the second and fourth elements keep the frequencies of these characters. (2) The hashes are compared, returning a real number between 0 and lim. By default lim = 10, since the probability of having ten occurrences of the two most frequent characters in common between two strings is low. An output of 10 indicates that there is no common character, and any value below 10 means that the strings share some common characters.
      </p>
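<p>To make the hashing step concrete, the following is a minimal Python sketch of most-frequent-k-characters hashing (our own illustration, not the implementation from [8]; the name mfkc_hash is ours, and ties between equally frequent characters are broken arbitrarily):</p>

```python
from collections import Counter

def mfkc_hash(s, k):
    """Most Frequent k Characters hash: the k most frequent characters
    of s together with their frequencies, in descending frequency order."""
    counts = Counter(s)
    # most_common(k) returns the k (character, frequency) pairs with the
    # highest counts; ties keep first-appearance order in CPython
    return dict(counts.most_common(k))

# Keep the two most frequent characters of "aabbbcc"
print(mfkc_hash("aabbbcc", 2))  # e.g. {'b': 3, 'a': 2}
```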
      <p>Our work is similar to the one presented in [13], which features a parallel processing framework for string similarity that uses filters to avoid unnecessary comparisons. Among the several types of string similarity, work has been done on measures such as the Levenshtein distance [6], a string distance function that calculates the minimum number of edit operations (i.e., delete, insert or update) needed to transform the first string into the second. The Jaccard index [5], also called the Jaccard coefficient, relies on bitwise operators, treating the strings at the bit level. REEDED [11] was the first approach for the time-efficient execution of weighted edit distances.</p>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>
        Let us call NaiveMFKC the function which computes the MFKC algorithm as described in [8]. This function takes three parameters, i.e., two strings s and t and an integer lim, and returns the sum of frequencies of the shared characters, where f(ci, s) is a function that returns the frequency of the character ci in the string s, with s ⊇ {c1, ..., cn}; e.g., f(a, "andrea") = 2, because the character a is found twice. The hashes h(s) and h(t) contain the characters and their frequencies. The output of the function is always positive, as shown in Equation (1):
      </p>
      <p>NaiveMFKC(s, t, lim) = Σ_{ci ∈ h(s) ∩ h(t)} (f(ci, s) + f(ci, t))   (1)</p>
      <p>Our work aims to reduce the runtime of the computation of the MFKC similarity function. Here, we use a sequence of filters which allow discarding similarity computations, thereby reducing the runtime. As input, the algorithm receives datasets Ds and Dt, an integer k representing the k most frequent characters, and a threshold θ ∈ [0, 1]. The similarity score of a pair of strings from the Cartesian product of Ds and Dt must be greater than or equal to the threshold for the pair to be considered a good pair, i.e., for a given threshold θ, if a pair of strings has similarity score less than the threshold, σ(s, t) &lt; θ, we can discard the computation of the MFKC score for this pair. Our final result is a set which contains the pairs having similarity score greater than or equal to the threshold, i.e., σ(s, t) ≥ θ.</p>
      <p>
        Our work studies the following problem: Given a threshold θ ∈ [0, 1] and two sets of strings Ds and Dt, compute the set M′ = {(s, t, σ(s, t)) ∈ Ds × Dt × ℝ+ : σ(s, t) ≥ θ}. Two categories of approaches can be considered to improve the runtime of measures: Lossy approaches return a subset M″ of M′ which can be calculated efficiently but for which there are no guarantees that M″ = M′. Lossless approaches, on the other hand, ensure that their result set M″ is exactly the same as M′. In this paper, we present a lossless approach that targets the MFKC algorithm. Equation (2) shows our definition of the string similarity function for MFKC, where s and t are strings such that s, t ∈ P, f(ci, s) is a function that returns the frequency of the character ci in the string s, with s ⊇ {c1, ..., cn}, k limits the number of elements that belong to the hashes, and the set h(s, k) ∩ h(t, k) is the intersection between the keys of the hashes h(s, k) and h(t, k) (i.e., the most frequent k characters). Two steps yield the similarity score:
1. First, we transform the strings s and t into two hashes using Most Frequent Character Hashing [8], according to the following example with k = 3:
s = aabbbcc → h(s, k) = {b = 3, a = 2, c = 2}
t = bbccddee → h(t, k) = {b = 2, c = 2, d = 2}
2. We calculate the sum of the character frequencies of the characters matching in the hashes h(s, k) and h(t, k), then normalize by dividing by |s| + |t|, resulting in a similarity score from 0 to 1 according to Equation (3); the resulting score should be greater than or equal to the threshold θ.
      </p>
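<p>The two steps above can be sketched in Python as follows (our own illustration under the definitions above; function names are ours):</p>

```python
from collections import Counter

def mfkc_hash(s, k):
    """Most frequent k characters of s with their frequencies."""
    return dict(Counter(s).most_common(k))

def similarity(s, t, k):
    """Sum of frequencies of the characters shared by the two hashes,
    normalized by |s| + |t|, giving a score in [0, 1]."""
    hs, ht = mfkc_hash(s, k), mfkc_hash(t, k)
    shared = set(hs) & set(ht)
    total = sum(hs[c] + ht[c] for c in shared)
    return total / (len(s) + len(t))

# The worked example from the text, with k = 3:
# h("aabbbcc", 3) = {b: 3, a: 2, c: 2}, h("bbccddee", 3) = {b: 2, c: 2, d: 2};
# shared characters b and c contribute (3+2) + (2+2) = 9, and |s| + |t| = 15
print(similarity("aabbbcc", "bbccddee", 3))  # 9/15 = 0.6
```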
      <p>σ(s, t, k, θ) = Σ_{ci ∈ h(s,k) ∩ h(t,k)} (f(ci, s) + f(ci, t)) / (|s| + |t|)   (3)</p>
      <p>
        In this section, the runtime of the MFKC defined in Equation (2) is improved using filters, where N is the output of the First Frequency Filter, L is the output of the Hash Intersection Filter, and A represents the output of the k similarity filter.
      </p>
      <p>
        First Frequency Filter. As specified in the definition of MFKC [8], this filter assumes that the hashes are already sorted in descending order of character frequency; therefore, the first element of each hash has the highest frequency.
      </p>
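<p>A minimal Python sketch of the filter's bound check follows (our own illustration; the function names are ours, and h1(s, k) denotes the highest character frequency in s):</p>

```python
from collections import Counter

def first_frequency_bound(s, t, k):
    """Upper bound on the MFKC similarity of (s, t): at most k shared
    characters contribute at most k * h1(s, k) from s and at most |t| from t."""
    h1 = Counter(s).most_common(1)[0][1]  # highest character frequency in s
    return (k * h1 + len(t)) / (len(s) + len(t))

def passes_first_frequency_filter(s, t, k, theta):
    # keep the pair only when the bound can still reach the threshold
    return first_frequency_bound(s, t, k) >= theta
```

For instance, with s = "abcdefgh" (all frequencies 1), t = "xy", k = 3 and θ = 0.9, the bound is (3 + 2) / 10 = 0.5, so the pair is discarded without computing its similarity.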
      <p>Theorem 1. σ(s, t) ≤ (k · h1(s, k) + |t|) / (|s| + |t|), where h1(s, k) denotes the highest frequency in h(s, k).   (4)</p>
      <p>
        Proof (Theorem 1). Let the intersection between the hashes h(t, k) and h(s, k) be a set of characters from c1 to cn, as in Equation (5):
      </p>
      <p>h(t, k) ∩ h(s, k) = {c1, ..., cn}   (5)</p>
      <p>
        According to the definition of the frequencies f(ci, t), we have Equation (6), where each ci appears f(ci, t) times:
      </p>
      <p>t ⊇ {c1, ..., c1, ..., cn, ..., cn}   (6)</p>
      <p>Therefore: f(c1, t) + ... + f(cn, t) ≤ |t|   (7)</p>
      <p>Also, as n ≤ k, because t ⊇ {c1, ..., cn}, and f(ci, s) ≤ h1(s, k) ∀ i = 1, ..., n, then: f(c1, s) + ... + f(cn, s) ≤ h1(s, k) + ... + h1(s, k) = n · h1(s, k) ≤ k · h1(s, k)   (8)</p>
      <p>
        Therefore, from Equation (7) and Equation (8), we obtain Equation (9): σ(s, t) ≤ (k · h1(s, k) + |t|) / (|s| + |t|)   (9)
      </p>
      <p>
        Hash Intersection Filter. In this filter, we check whether the intersection between the two hashes is the empty set; if so, the MFKC, represented by σ, returns a similarity score of 0 and we can avoid the computation of the similarity in this case. Consequently, the rule which the filter relies on is the following.
      </p>
      <p>⟨s, t⟩ ∈ L ⟹ ⟨s, t⟩ ∈ Ds × Dt ∧ |h(s) ∩ h(t)| &gt; 0   (11)</p>
      <p>We can also say that Equation (12) represents a valid implication:</p>
      <p>h(s) ∩ h(t) = ∅ ⟹ σ(s, t) = 0   (12)</p>
      <p>
        Equation (12) means that if the intersection between h(s, k) and h(t, k) is the empty set, then the similarity score will be 0. In that case no characters match, so there is no need to compute the similarity for this pair of strings.
      </p>
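<p>In code, this filter is a simple disjointness test on the hash keys (our own sketch; the function name is ours):</p>

```python
def hash_intersection_nonempty(hs, ht):
    """Filter L: if the key sets of the two hashes are disjoint,
    sigma(s, t) = 0 and the pair can be discarded without scoring."""
    return bool(set(hs) & set(ht))

# {b, a, c} and {b, c, d} share b and c, so the pair survives the filter
print(hash_intersection_nonempty({"b": 3, "a": 2, "c": 2}, {"b": 2, "c": 2, "d": 2}))  # True
```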
      <p>K Similarity Filter. For all the remaining pairs, the similarity score is calculated. The third filter then selects the pairs whose similarity score is greater than or equal to the threshold θ:</p>
      <p>⟨s, t⟩ ∈ A ⟺ ⟨s, t⟩ ∈ N ∧ σ(s, t) ≥ θ   (13)</p>
      <p>This filter provides a validation: we show that the score for the previous k is always lower than or equal to the score for the next k, according to Equation (14); moreover, in some cases the similarity threshold is reached before all elements of h(s, k) ∩ h(t, k) have been processed, thus saving computation in these cases.</p>
      <p>Here k is also used as an index of the similarity function σk(s, t), in order to obtain the similarity for every value from 1 to k and to show monotonicity.</p>
      <p>Therefore, the computation of the similarity score proceeds only until σk(s, t) ≥ θ.</p>
      <p>We now show that the similarity score for the previous k is always lower than or equal to that for the next k, for all k ∈ ℤ : k ≤ |s ∩ t|.</p>
      <p>σ_{k+1}(s, t) ≥ σ_k(s, t)   (14)</p>
      <p>We rewrite the equation for an arbitrary iteration, according to Equation (15):</p>
      <p>σ_k(s, t) = ( Σ_{i=1, ci ∈ h(s,k) ∩ h(t,k)}^{k} f(ci, s) + f(ci, t) ) / (|s| + |t|)   (15)</p>
      <p>Therefore, each added term is always greater than or equal to 0, since f(c_{k+1}, s) + f(c_{k+1}, t) ≥ 0. Thus, Equation (14) holds.</p>
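<p>The monotonicity of Equation (14) justifies an early exit: the running score can only grow, so as soon as it reaches θ the remaining shared characters need not be processed. A Python sketch (our own, under the assumptions above):</p>

```python
from collections import Counter

def similarity_reaches(s, t, k, theta):
    """Accumulate shared-character frequencies in descending-frequency order
    and stop as soon as the running score sigma_i(s, t) reaches theta,
    exploiting the monotonicity sigma_{i+1}(s, t) >= sigma_i(s, t)."""
    hs = dict(Counter(s).most_common(k))
    ht = dict(Counter(t).most_common(k))
    norm = len(s) + len(t)
    freq = 0
    for c in hs:              # hs is already frequency-ordered
        if c in ht:
            freq += hs[c] + ht[c]
            if freq / norm >= theta:
                return True   # early exit: remaining characters not needed
    return freq / norm >= theta
```

For s = "aabbbcc" and t = "bbccddee" with k = 3, the full score is 0.6, so the call returns True for θ = 0.5 (stopping after the second shared character) and False for θ = 0.7.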
      <p>
        Filter sequence. The sequence of filters consists of four steps: (1) we build the Cartesian product of the pairs of strings from the datasets Ds and Dt; (2) we discard pairs using the First Frequency Filter (N); (3) we discard pairs with no matching characters using the Hash Intersection Filter (L); and (4) with the k similarity filter (A), we process only similarities greater than or equal to the threshold θ; if the similarity function σ(s, t) reaches θ within the first k characters, we can stop the computation of the similarity for this pair, saving computation, and add the pair to the resulting dataset Dr. The procedure is also shown in Algorithm 1.
      </p>
      <p>Algorithm 1: MFKC Similarity Joins</p>
      <p>
1: GoodPairs = {⟨s1, t1⟩, ..., ⟨sn, tn⟩} | ⟨s, t⟩ ∈ Ds × Dt
2: hs = {e1, e2, ..., en} | ei = ⟨c, f(c, s)⟩
3: ht = {e1, e2, ..., en} | ei = ⟨c, f(c, t)⟩
4: i, freq ∈ ℕ
5: procedure MFKC(Ds, Dt, θ, k)
6:   for all s ∈ Ds do (in parallel)
7:     hs = h(s, k)
8:     for all t ∈ Dt do (in parallel)
9:       ht = h(t, k)
10:      if (h1(s, k) · k + |t|) / (|s| + |t|) &lt; θ then
11:        Next t ∈ Dt
12:      end if
13:      if |hs ∩ ht| = 0 then
14:        Next t ∈ Dt
15:      end if
16:      for all cfreq ∈ hs ∩ ht do
17:        if i &gt; k then
18:          Next t ∈ Dt
19:        end if
20:        freq = freq + cfreq
21:        σi = freq / (|s| + |t|)
22:        if σi ≥ θ then
23:          GoodPairs.add(s, t)
24:          Next t ∈ Dt
25:        end if
26:        i = i + 1
27:      end for
28:    end for
29:  end for
30: return GoodPairs
31: end procedure
      </p>
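<p>For illustration, a sequential Python sketch of Algorithm 1 follows (our own rendering, not the authors' implementation; the name mfkc_join is ours, and for brevity the k similarity filter is computed in one pass rather than with the early-exit loop):</p>

```python
from collections import Counter

def mfkc_join(Ds, Dt, k, theta):
    """Sketch of Algorithm 1: the Cartesian product of Ds and Dt is pruned
    by the First Frequency Filter (N) and the Hash Intersection Filter (L)
    before the k similarity filter (A) computes the actual score."""
    good_pairs = []
    for s in Ds:
        hs = dict(Counter(s).most_common(k))
        h1 = max(hs.values())            # highest frequency in h(s, k)
        for t in Dt:
            # Filter N: discard when the upper bound cannot reach theta
            bound = (k * h1 + len(t)) / (len(s) + len(t))
            if theta > bound:
                continue
            ht = dict(Counter(t).most_common(k))
            shared = set(hs) & set(ht)
            # Filter L: disjoint hashes imply a similarity of 0
            if not shared:
                continue
            # Filter A: keep the pair only when the score reaches theta
            score = sum(hs[c] + ht[c] for c in shared) / (len(s) + len(t))
            if score >= theta:
                good_pairs.append((s, t, score))
    return good_pairs
```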
    </sec>
    <sec id="sec-4">
      <title>Correctness and Completeness</title>
      <p>In this section, we prove formally that our MFKC is both correct and complete.</p>
      <p>Our MFKC consists of three nested filters, each of which creates a subset of pairs, i.e., A ⊆ L ⊆ N ⊆ Ds × Dt. For the sake of clarity, we name each filtering rule:</p>
      <p>R1 ⟺ (h1(s, k) · k + |t|) / (|s| + |t|) ≥ θ</p>
      <p>R2 ⟺ |h(s, k) ∩ h(t, k)| ≠ 0</p>
      <p>R3 ⟺ σ(s, t) ≥ θ</p>
      <p>Each subset of our MFKC can be redefined as N = {⟨s, t⟩ ∈ Ds × Dt : R1}, L = {⟨s, t⟩ ∈ Ds × Dt : R1 ∧ R2}, and A = {⟨s, t⟩ ∈ Ds × Dt : R1 ∧ R2 ∧ R3}. We then introduce A* as the set of pairs whose similarity score is greater than or equal to the threshold θ:</p>
      <p>A* = {⟨s, t⟩ ∈ Ds × Dt : σ(s, t) ≥ θ}</p>
      <p>
        Proof (Theorem 2). Proving Theorem 2 is equivalent to showing that A = A*. Let us consider all the pairs in A. While our MFKC's correctness follows directly from the definition of A, it is complete iff none of the pairs discarded by the filters actually belongs to A*. We assume that the hashes are sorted in descending order of character frequency, so that the first element of each hash has the highest frequency. Therefore, once (h1(s, k) · k + |t|) / (|s| + |t|) &lt; θ, the pair of strings s and t can be discarded without calculating the entire similarity: by Theorem 1, the failure of R1 implies σ(s, t) &lt; θ, which leads to R3 ⟹ R1. Thus, set A can be rewritten as A = {⟨s, t⟩ ∈ Ds × Dt : R2 ∧ R3}. Similarly, when h(s, k) ∩ h(t, k) = ∅, we have σ(s, t) = 0 &lt; θ, which leads to R3 ⟹ R2. Thus, set A can be rewritten as A = {⟨s, t⟩ ∈ Ds × Dt : R3} = A*.
      </p>
      <p>
        Time complexity. To calculate the time complexity of our MFKC, we first consider extracting the most frequent k characters from a string. The first step is to sort the string lexically; after this sort, the character with the highest number of occurrences can be found in linear time. The first string can be sorted in O(n log n) and the second in O(m log m) time, using classical sorting algorithms such as merge sort [3] and quicksort [4], which run in O(n log n). Thus, the total complexity is O(n log n) + O(m log m), giving O(n log n) as the upper bound in the worst case.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>
        The aim of our evaluation is to show that our work outperforms the naive approach and that the parallel implementation yields a performance gain on large datasets with more than 10^5 pairs. A considerable number of pairs reach the threshold before the last of the k most frequent characters is processed, which also demonstrates how much computation was avoided: instead of all k characters, we only need the n ≤ k most frequent characters necessary to reach the threshold θ. An example of the efficiency of each filter can be found in Figures 1 and 1(a), where 10,273,950 comparisons from DBpedia+LinkedGeoData were performed and the Performance Gain is PG = Recall(N) + Recall(L). The recall, depicting DBpedia-YAGO results, can be seen in Figure 2(c); YAGO was added to reinforce our evaluations, as it has been widely used in experiments pertaining to Link Discovery, and we also considered recall, precision and f-score. This evaluation presents results of experiments on data from DBpedia (http://wiki.dbpedia.org/) and LinkedGeoData (http://linkedgeodata.org/). We considered pairs of labels in the evaluation. We had two motivations for choosing these datasets: (1) they have been widely used in experiments pertaining to Link Discovery, and (2) the distributions of string sizes in these datasets are significantly different [1]. All runtime and scalability experiments were performed on an Intel Core i7 machine with 8 GB RAM and an NVIDIA NVS4200 video card, running MS Windows 10.
      </p>
      <sec id="sec-5-1">
        <title>Parallel implementation</title>
        <p>[Figure 1: Avoided pairs (%) for each filter combination.]</p>
        <p>Our algorithm contains parallel code sections with which we load-balance the data among the CPU/GPU cores when available. To illustrate this part of our idea: given two datasets S and T containing all the strings to be compared, we build the Cartesian product S × T, and each pair is processed separately in threads that are spread among the CPU/GPU cores. Thus, we process each comparison in parallel. The parallel implementation works better on large datasets with more than 10^5 pairs, where it was more than one time faster than the approach without parallelism and two times faster than the naive approach, as shown in Figures 3(a) to 3(c).</p>
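<p>Our actual implementation distributes the work across CPU/GPU cores; as a simplified, CPU-only sketch of the idea (our own illustration with assumed names, using a thread pool to score the pairs of the Cartesian product concurrently):</p>

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def mfkc_hash(s, k):
    return dict(Counter(s).most_common(k))

def score(pair, k=3):
    """Score one (s, t) pair: shared-character frequencies over |s| + |t|."""
    s, t = pair
    hs, ht = mfkc_hash(s, k), mfkc_hash(t, k)
    shared = set(hs) & set(ht)
    return s, t, sum(hs[c] + ht[c] for c in shared) / (len(s) + len(t))

def parallel_join(Ds, Dt, theta, workers=4):
    pairs = [(s, t) for s in Ds for t in Dt]   # Cartesian product
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score, pairs)       # one comparison per task
    return [(s, t, sc) for s, t, sc in results if sc >= theta]
```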
      </sec>
      <sec id="sec-5-2">
        <title>Runtime Evaluation</title>
        <p>The evaluation in Figures 3(a) and 3(b) shows that all filter setups outperform the naive approach, and that the runtime of the parallel approach does not change significantly with the size of the dataset, as Figure 3(c) shows. Experiments on varying k were also considered: as Figure 3(d) shows, the runtime varies with the value of k, indicating the influence of k, with values from 1 to 120 over 1,001,642 comparisons. The performance (runtime) was improved as shown in Figures 3(a) and 3(b), with a performance gain of 26.07% in recall as shown in Figure 1(a). The time complexity is based on two sorting processes, O(n log n) + O(m log m), resulting in O(n log n) as the upper bound in the worst case.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Scalability Evaluation</title>
        <p>In these experiments (see Figures 3(c), 3(e) and 3(f)), we looked at the growth of the runtime of our approach on datasets of growing sizes. The results show that the combination of filters (N + L + A) is the best option for large datasets. This result holds on both DBpedia and LinkedGeoData, so our approach can be used on large datasets and achieves acceptable runtimes. Figure 1 also shows the quantity of avoided pairs for each combination of filters, which consequently brings a performance gain. We further examined the runtime behavior on a large dataset with more than 10^6 labels, as shown in Figure 3(b). The results suggest that the runtime decreases as the threshold θ increases, one more point showing that our approach is useful on large datasets, where it can be used with high threshold values for the link discovery area. Regarding our parallel implementation, Figure 3(c) shows that our GPU implementation works better on large datasets with more than 10^5 pairs.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Comparison with existing approaches</title>
        <p>Our work outperforms the naive approach [8]; moreover, in order to make some important points, we compare our work not only with the state of the art but also with popular algorithms such as the Jaccard index [5]. As shown in Figures 3(c), 3(e) and 3(f), our approach outperforms not only the naive approach but also the Jaccard index. We show that the threshold θ and k have a significant influence on the runtime. The naive approach presents some points to consider: even though its authors report experiments with k = 7, the naive algorithm was designed only for k = 2, and there are cases where k = 2 is not enough to reach the expected similarity level. For example, with s = mystring1 and t = mystring2 and k limited to 2, we obtain σ2(s, t) = 0.2, a very low similarity, whereas with k = 8 the similarity is σ8(s, t) = 0.8; hence, limiting k to 2 can lose a good similarity match. Our work fixes these problems and also has a better runtime, as Figures 3(a), 3(b) and 3(c) show.</p>
        <p>An experiment with labels from DBpedia and YAGO (http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/) shows that the f-score indicates a significant potential for use as a string similarity, compared with the Jaccard index and Jaro-Winkler, as Figures 2(a) to 2(c) show.</p>
        <p>To summarize, the key features that make our approach outperform the naive approach are the following: we use more than two most frequent characters in our evaluation; our runtime for more than 10^7 comparisons is shorter (27,594 against 16,916 milliseconds); we have a similarity threshold, allowing us to discard comparisons and avoid extra processing; and we have a parallel implementation, making our approach scalable. The Jaccard index does not show significant changes when varying the threshold, while MFKC and Jaro-Winkler present a very similar increase of the f-score when varying the threshold.</p>
        <p>[Figure 3: Time in milliseconds against similarity threshold, number of comparisons, k and number of CPUs. (a) Runtime, 1,001,642 comparisons; (b) Runtime, 10,273,950 comparisons; (c) the parallel approach improves the performance for more than 10^5 comparisons; (d) Runtime for the k most frequent characters, with values of k from 1 to 120, over 1,001,642 comparisons, θ = 0.95; (e) CPU speedup of the algorithm (1,001,642 comparisons); (f) CPU speedup of the algorithm (10,273,950 comparisons). Compared setups: naive, N + L + A, L + A, N + A, A, parallel(N + L + A), Jaccard.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future work</title>
      <p>We presented an approach to reduce the computation runtime of similarity joins using the Most Frequent k Characters algorithm, with a sequence of filters that allow discarding pairs before computing their actual similarity, thus reducing the runtime of the computation. We proved that our approach is both correct and complete. The evaluation shows that all filter setups outperform the naive approach. Our parallel implementation works better on larger datasets with more than 10^5 pairs. It is also a key component for developing systems for Record Linkage and Link Discovery in knowledge bases. As future work, we plan to integrate it into link discovery applications for the validation of equivalence links. The source code is free and available online.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>K.</given-names>
            <surname>Dreßler</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          .
          <article-title>On the efficient execution of bounded jaro-winkler distances</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          , et al.
          <source>Ontology matching</source>
          , volume
          <volume>333</volume>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Goldstine</surname>
          </string-name>
          and J. von Neumann.
          <article-title>Planning and coding of problems for an electronic computing instrument</article-title>
          .
          <source>Institute for Advanced Study</source>
          ,
          <year>1948</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Hoare</surname>
          </string-name>
          . Quicksort. The
          <source>Computer Journal</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>1962</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P.</given-names>
            <surname>Jaccard</surname>
          </string-name>
          .
          <article-title>The distribution of the flora in the alpine zone</article-title>
          . New phytologist,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>37</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>1912</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>In Soviet physics doklady</source>
          , volume
          <volume>10</volume>
          , pages
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Rivest</surname>
          </string-name>
          .
          <article-title>The MD5 message-digest algorithm</article-title>
          .
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Seker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Altun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Ayan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Mert</surname>
          </string-name>
          .
          <article-title>A novel string distance function based on most frequent k characters</article-title>
          .
          <source>IJMLC</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <fpage>177</fpage>
          -
          <lpage>182</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Seker</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Mert</surname>
          </string-name>
          .
          <article-title>A novel feature hashing for text mining</article-title>
          .
          <source>Journal of Technical Science And Technologies</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>37</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>P.</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          .
          <article-title>Ontology matching: State of the art and future challenges</article-title>
          .
          <source>TKDE</source>
          ,
          <volume>25</volume>
          (
          <issue>1</issue>
          ):
          <fpage>158</fpage>
          -
          <lpage>176</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>T.</given-names>
            <surname>Soru</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>Rapid execution of weighted edit distances</article-title>
          .
          In
          <source>Proceedings of the Ontology Matching Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>A.</given-names>
            <surname>Valdestilhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arndt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          .
          <article-title>DBpediaSameAs: An approach to tackle heterogeneity in dbpedia identifiers</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>C.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Efficient string similarity join in multi-core and distributed systems</article-title>
          .
          <source>PLOS ONE</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <uri xlink:href="https://github.com/firmao/StringSimilarity/">https://github.com/firmao/StringSimilarity/</uri>
        </mixed-citation>
      </ref>
    </ref-list>
    <ack>
      <p>Acknowledgments to CNPq Brazil under grant No. 201536/2014-5, the H2020 projects SLIPO (GA no. 731581) and HOBBIT (GA no. 688227), the DFG project LinkingLOD (project no. NG 105/3-2), and the BMWI projects SAKE (project no. 01MD15006E) and GEISER (project no. 01MD16014).</p>
    </ack>
  </back>
</article>