<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A High-Performance Approach to String Similarity using Most Frequent K Characters</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andre</forename><surname>Valdestilhas</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">AKSW/DICE</orgName>
								<orgName type="institution">University of Leipzig</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tommaso</forename><surname>Soru</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">AKSW/DICE</orgName>
								<orgName type="institution">University of Leipzig</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
<persName><forename type="first">Axel-Cyrille</forename><surname>Ngonga Ngomo</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">AKSW/DICE</orgName>
								<orgName type="institution">University of Leipzig</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A High-Performance Approach to String Similarity using Most Frequent K Characters</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">6F46599C05C32CF9042007EF26A66812</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:46+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Similarity Search</term>
					<term>Blocking</term>
					<term>String Matching</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The amount of available data has been growing significantly over the last decades. Thus, linking entries across heterogeneous data sources, such as databases or knowledge bases, becomes an increasingly difficult problem, in particular w.r.t. the runtime of these tasks. Consequently, it is of utmost importance to provide time-efficient approaches for similarity joins in the Web of Data. While a number of scalable approaches have been developed for various measures, the Most Frequent k Characters (MFKC) measure has not been tackled in previous works. We hence present a sequence of filters that allow discarding comparisons when executing bounded similarity computations without losing recall. Therewith, we can reduce the runtime of bounded similarity computations by approximately 70%. Our experiments with a single-threaded, a parallel and a GPU implementation of our filters suggest that our approach scales well even when dealing with millions of potential comparisons.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The problem of managing heterogeneity at both the semantic and syntactic levels among various information resources <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b9">10]</ref> is one of the most difficult problems of the information age. This is substantiated by most of the database research self-assessment reports, which acknowledge that the hard question of semantic heterogeneity, that is, of handling variations in meaning or ambiguity in entity interpretation, remains open <ref type="bibr" target="#b9">[10]</ref>. In knowledge bases, Ontology Matching (OM) solutions address the semantic heterogeneity problem in two steps: (1) matching entities to determine an alignment, i.e., a set of correspondences, and (2) interpreting an alignment according to application needs, such as data translation or query answering. Record Linkage (RL) and, more recently, Link Discovery<ref type="foot" target="#foot_0">1</ref> (LD) solutions, on the other hand, aim to determine pairs of entries that abide by a given relation R. In both cases, string similarities are used to compute such correspondences. The rest of the paper is structured as follows: Section 2 presents related work, with a focus on approaches that aim to improve the time-efficiency of the link discovery task. Section 3 presents our nested filters, Section 4 proves their correctness and completeness, Section 5 presents the evaluation, and Section 6 concludes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">State of the Art and related work</head><p>Our approach can be considered an extension of the state-of-the-art algorithm introduced in <ref type="bibr" target="#b7">[8]</ref>, which describes a string-based distance function (SDF) based on string hashing <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b6">7]</ref>. The naive MFKC approach <ref type="bibr" target="#b7">[8]</ref> is a metric for string comparison built on a hash function that takes a string and outputs its two most frequent characters together with their frequencies. This algorithm was used for text mining operations. The approach can be divided into two parts: (1) the hashing function is applied to both input strings; its output is a string containing the two most frequent characters, where the first and third elements hold the characters and the second and fourth elements hold their frequencies.</p><p>(2) The hashes are compared, which returns a real number between 0 and lim. By default lim = 10, since the probability of having ten occurrences of the two most frequent characters in common between two strings is low. An output of 10 indicates that the strings have no character in common, and any value below 10 means that some characters are shared between the strings.</p><p>Our work is similar to the one presented in <ref type="bibr" target="#b12">[13]</ref>, which features a parallel processing framework for string similarity using filters to avoid unnecessary comparisons. Among the several types of string similarity, efficient approaches have been developed for measures such as the Levenshtein distance <ref type="bibr" target="#b5">[6]</ref>, a string distance function that calculates the minimum number of edit operations (i.e., deletions, insertions, or substitutions) needed to transform the first string into the second. The Jaccard Index <ref type="bibr" target="#b4">[5]</ref>, also called Jaccard coefficient, can be computed with bitwise operators, treating the strings at the bit level. REEDED <ref type="bibr" target="#b10">[11]</ref> was the first approach for the time-efficient execution of weighted edit distances.</p></div>
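To make the above description concrete, here is a short Python sketch (our illustration, not the implementation of [8]; the names mfk_hash and naive_mfkc are ours) of how the hashing and the naive comparison could be realised:

from collections import Counter

def mfk_hash(s, k=2):
    """Most Frequent K hashing: the k most frequent characters of s with their frequencies."""
    # Counter.most_common returns characters ordered by descending frequency.
    return dict(Counter(s).most_common(k))

def naive_mfkc(s, t, k=2, lim=10):
    """Naive MFKC distance: lim minus the summed frequencies of characters shared by both hashes."""
    hs, ht = mfk_hash(s, k), mfk_hash(t, k)
    common = sum(hs[c] + ht[c] for c in set(hs) & set(ht))
    return max(lim - common, 0)  # lim itself means "no frequent character in common"

For instance, naive_mfkc("night", "nacht") yields 8, because with k = 2 the two hashes share only the character n (frequency 1 in each string), while two strings with no frequent character in common yield 10.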
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Approach</head><p>Let us call NaiveMFKC the function which computes the MFKC algorithm as described in <ref type="bibr" target="#b7">[8]</ref>. This function takes three parameters, i.e., two strings s and t and an integer lim, and returns lim minus the sum of frequencies of the shared characters, where f(c_i, s) is a function that returns the frequency of the character c_i in the string s, with s ⊇ {c_1, ..., c_n}; for instance, f(a, "andrea") = 2, because the character a occurs twice. The hash functions h(s) and h(t) contain the characters and their frequencies. The output of the function is always positive, as shown in Equation <ref type="formula" target="#formula_0">(1)</ref>.</p><formula xml:id="formula_0">NaiveMFKC(s, t, lim) = lim - \sum_{c_i \in h(s) \cap h(t)} \big( f(c_i, s) + f(c_i, t) \big)<label>(1)</label></formula><p>Our work aims to reduce the runtime of the computation of the MFKC similarity function. To this end, we use a sequence of filters, which allow discarding similarity computations and thus reduce the runtime. As input, the algorithm receives datasets D_s and D_t, an integer k representing the number of most frequent characters, and a threshold θ ∈ [0, 1]. A pair of strings from the Cartesian product of D_s and D_t is considered a good pair only if its similarity score is greater than or equal to the threshold θ; i.e., for a given threshold θ, if a pair of strings has a similarity score below the threshold, σ(s, t) &lt; θ, we can discard the computation of the MFKC score for this pair. Our final result is the set of pairs with similarity score greater than or equal to the threshold, i.e., σ(s, t) ≥ θ.</p><p>Our work studies the following problem: Given a threshold θ ∈ [0, 1] and two sets of strings D_s and D_t, compute the set M = {(s, t, σ(s, t)) ∈ D_s × D_t × R^+ : σ(s, t) ≥ θ}. Two categories of approaches can be considered to improve the runtime of such measures: Lossy approaches return a subset M' of M which can be calculated efficiently but for which there is no guarantee that M' = M. Lossless approaches, on the other hand, ensure that their result set M' is exactly the same as M. In this paper, we present a lossless approach that targets the MFKC algorithm. Equation <ref type="formula" target="#formula_1">(2)</ref> shows our definition of the string similarity function σ for MFKC.</p><formula xml:id="formula_1">\sigma(s, t) = \frac{\sum_{c_i \in h(s,k) \cap h(t,k)} f(c_i, s) + f(c_i, t)}{|s| + |t|}<label>(2)</label></formula><p>where s and t are strings, such that s, t ∈ Σ*, f(c_i, s) is a function that returns the frequency of the character c_i in the string s, with s ⊇ {c_1, ..., c_n}; k limits the number of elements in the hashes; and h(s, k) ∩ h(t, k) denotes the intersection between the keys of the hashes h(s, k) and h(t, k) (i.e., the most frequent k characters). Two steps are needed to obtain the similarity score:</p><p>1. 
Firstly, we transform the strings s and t into two hashes using Most Frequent Character Hashing <ref type="bibr" target="#b7">[8]</ref>, as in the following example with k = 3:</p><formula xml:id="formula_2">s = aabbbcc \rightarrow h(s, k) = \{b = 3, a = 2, c = 2\}
t = bbccddee \rightarrow h(t, k) = \{b = 2, c = 2, d = 2\}</formula><p>2. We calculate the sum of the frequencies of the characters matching in the hashes h(s, k) and h(t, k), and then normalize by dividing by the sum of the lengths |s| and |t|, resulting in a similarity score between 0 and 1 according to Equation (<ref type="formula" target="#formula_3">3</ref>); the resulting score must be greater than or equal to the threshold θ.</p><formula xml:id="formula_3">\sigma(s, t, k, \theta) = \frac{\sum_{c_i \in h(s,k) \cap h(t,k)} f(c_i, s) + f(c_i, t)}{|s| + |t|} \geq \theta<label>(3)</label></formula></div>
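The two steps above amount to only a few lines of code; the following is a minimal Python sketch of Equation (2) applied to the worked example from the text (mfkc_similarity is an illustrative name, not the authors' code):

from collections import Counter

def mfk_hash(s, k):
    # k most frequent characters of s with their frequencies, in descending order of frequency
    return dict(Counter(s).most_common(k))

def mfkc_similarity(s, t, k):
    """sigma(s, t) as in Equation (2): summed frequencies of shared characters over |s| + |t|."""
    hs, ht = mfk_hash(s, k), mfk_hash(t, k)
    total = sum(hs[c] + ht[c] for c in set(hs) & set(ht))
    return total / (len(s) + len(t))

# Worked example with k = 3:
# h("aabbbcc", 3) = {b: 3, a: 2, c: 2}, h("bbccddee", 3) = {b: 2, c: 2, d: 2}
# the shared characters {b, c} contribute (3 + 2) + (2 + 2) = 9, hence 9 / (7 + 8) = 0.6
print(mfkc_similarity("aabbbcc", "bbccddee", 3))  # 0.6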
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Improving the Runtime</head><p>In this section, the runtime of the MFKC similarity defined in Equation (<ref type="formula" target="#formula_1">2</ref>) is improved using filters, where N denotes the output of the First Frequency Filter, L the output of the Hash Intersection Filter, and A the output of the K Similarity Filter.</p><p>First Frequency Filter As specified in the definition of MFKC <ref type="bibr" target="#b7">[8]</ref>, this filter assumes that the hashes are already sorted in descending order of character frequency, so that the first element of each hash has the highest frequency.</p><p>Theorem 1. The following bound holds:</p><formula xml:id="formula_4">\sigma(s, t) = \frac{\sum_{c_i \in h(s,k) \cap h(t,k)} f(c_i, s) + f(c_i, t)}{|s| + |t|} \leq \frac{h_1(s, k)\,k + |t|}{|s| + |t|}<label>(4)</label></formula><p>Consequently, (h_1(s, k)k + |t|)/(|s| + |t|) &lt; θ implies that σ(s, t) &lt; θ.</p><p>Proof (Theorem 1). Let the intersection between the hashes h(t, k) and h(s, k) be a set of characters from c_1 to c_n, as in Equation (5):</p><formula xml:id="formula_5">h(t, k) \cap h(s, k) = \{c_1, ..., c_n\}<label>(5)</label></formula><p>According to the definition of the frequencies f(c_i, t), we have Equation (<ref type="formula" target="#formula_6">6</ref>):</p><formula xml:id="formula_6">t \supseteq \{c_1, ..., c_1, ..., c_n, ..., c_n\}<label>(6)</label></formula><p>where each c_i appears f(c_i, t) times, therefore:</p><formula xml:id="formula_7">f(c_1, t) + ... + f(c_n, t) \leq |t|<label>(7)</label></formula><p>Also, since n ≤ k, because t ⊇ {c_1, ..., c_n}, and f(c_i, s) ≤ h_1(s, k) for all i = 1, ..., n, we obtain:</p><formula xml:id="formula_8">f(c_1, s) + ... + f(c_n, s) \leq h_1(s, k) + ... + h_1(s, k) = n\,h_1(s, k) \leq k\,h_1(s, k) (8)</formula><p>Therefore, from Equation <ref type="formula" target="#formula_7">(7)</ref> and Equation (<ref type="formula" target="#formula_8">8</ref>), we obtain Equation (<ref type="formula" target="#formula_9">9</ref>):</p><formula xml:id="formula_9">\frac{\sum_{c_i \in h(s,k) \cap h(t,k)} f(c_i, s) + f(c_i, t)}{|s| + |t|} = \frac{\sum_{i=1}^{n} f(c_i, t) + \sum_{i=1}^{n} f(c_i, s)}{|s| + |t|} \leq \frac{h_1(s, k)\,k + |t|}{|s| + |t|}<label>(9)</label></formula><p>Consequently, the rule on which the filter relies is the following:</p><formula xml:id="formula_10">\langle s, t \rangle \in N \Rightarrow \langle s, t \rangle \in D_s \times D_t \wedge \frac{h_1(s, k)\,k + |t|}{|s| + |t|} \geq \theta<label>(10)</label></formula><p>Hash Intersection Filter In this filter, we check whether the intersection between the two hashes is an empty set; if so, the MFKC similarity σ returns a score of 0 and the computation of the similarity can be avoided. The rule on which the filter relies is the following:</p><formula xml:id="formula_11">\langle s, t \rangle \in L \Rightarrow \langle s, t \rangle \in D_s \times D_t \wedge |h(s) \cap h(t)| &gt; 0<label>(11)</label></formula><p>Equation (<ref type="formula" target="#formula_12">12</ref>) also represents a valid implication:</p><formula xml:id="formula_12">h(s) \cap h(t) = \emptyset \Rightarrow \sigma(s, t) = 0<label>(12)</label></formula><p>Equation <ref type="formula" target="#formula_12">(12)</ref> states that if the intersection between h(s, k) and h(t, k) is an empty set, the similarity score is 0: no characters match, so there is no need to compute the similarity for this pair of strings.</p><p>K Similarity Filter For all remaining pairs, the similarity score is calculated. 
After that, the third filter selects the pairs whose similarity score is greater than or equal to the threshold θ.</p><formula xml:id="formula_13">\langle s, t \rangle \in A \Leftrightarrow \langle s, t \rangle \in N \wedge \sigma(s, t) \geq \theta<label>(13)</label></formula><p>This filter also provides a validation: we show that the similarity score for a given k is never larger than the score for the next k, according to Equation (<ref type="formula" target="#formula_14">14</ref>). In some cases the threshold is reached before all elements of h(s, k) ∩ h(t, k) have been processed, which saves computation in those cases.</p><p>Here k is also used as an index of the similarity function σ_k(s, t), in order to obtain the similarity for every value from 1 to k and to show monotonicity.</p><p>Therefore, the computation of the similarity score only proceeds until σ_k(s, t) ≥ θ.</p><p>We now show that the similarity score for a given k is never larger than that for the next k, for all k ∈ Z^+ with k ≤ |s ∩ t|:</p><formula xml:id="formula_14">\sigma_{k+1}(s, t) \geq \sigma_k(s, t)<label>(14)</label></formula><p>We rewrite the similarity as a partial sum over the first k matching characters, according to Equation (15):</p><formula xml:id="formula_15">\sigma_k(s, t) = \frac{\sum_{i=1}^{k} f(c_i, s) + f(c_i, t)}{|s| + |t|}, \quad c_i \in h(s,k) \cap h(t,k)<label>(15)</label></formula><p>Let k ∈ Z^+ be given; Equation (<ref type="formula" target="#formula_14">14</ref>) holds for k + 1 because:</p><formula xml:id="formula_16">\sum_{i=1}^{k} \big( f(c_i, s) + f(c_i, t) \big) + f(c_{k+1}, s) + f(c_{k+1}, t) \geq \sum_{i=1}^{k} \big( f(c_i, s) + f(c_i, t) \big)<label>(16)</label></formula><p>The added term satisfies f(c_{k+1}, s) + f(c_{k+1}, t) ≥ 0. Thus, Equation (14) holds.</p><p>Filter sequence The filter sequence consists of four steps: (1) we build the Cartesian product of the pairs of strings from the datasets D_s and D_t; (2) we discard pairs using the First Frequency Filter (N); (3) we discard pairs without matching characters using the Hash Intersection Filter (L); and (4) with the Most Frequent Character Filter (A) we only keep pairs whose similarity is greater than or equal to the threshold θ. If σ(s, t) ≥ θ is already reached within the first k characters, we can stop the computation of the similarity for this pair, saving computation, and add the pair to the resulting dataset D_r. The procedure is also shown in Algorithm 1, and a compact sketch of the whole pipeline is given below.</p></div>
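The following sketch (ours, assuming non-empty strings; the helper names are illustrative) chains the three filters over the Cartesian product in the spirit of Algorithm 1:

from collections import Counter
from itertools import product

def mfk_hash(s, k):
    return dict(Counter(s).most_common(k))

def mfkc_filtered_join(Ds, Dt, k, theta):
    """Filter sequence: First Frequency (N), Hash Intersection (L), K Similarity (A)."""
    result = []
    for s, t in product(Ds, Dt):
        hs, ht = mfk_hash(s, k), mfk_hash(t, k)
        h1 = next(iter(hs.values()))  # highest frequency, since the hash is sorted descending
        # First Frequency Filter (N): if the upper bound of Theorem 1 is below theta, discard.
        if (h1 * k + len(t)) / (len(s) + len(t)) < theta:
            continue
        shared = set(hs) & set(ht)
        # Hash Intersection Filter (L): an empty intersection implies similarity 0, discard.
        if not shared:
            continue
        # K Similarity Filter (A): accumulate frequencies, stopping early once theta is reached.
        score, denom = 0.0, len(s) + len(t)
        for c in sorted(shared, key=hs.get, reverse=True):
            score += (hs[c] + ht[c]) / denom
            if score >= theta:            # partial score already meets the threshold
                break
        if score >= theta:
            result.append((s, t, score))  # the partial score is a lower bound of sigma(s, t)
    return result

For example, mfkc_filtered_join(["aabbbcc"], ["bbccddee"], k=3, theta=0.5) keeps the single pair with score 0.6, while pairs rejected by the first two filters never reach the similarity computation.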
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Correctness and Completeness</head><p>In this section, we prove formally that our MFKC approach is both correct and complete.</p><p>-We say that an approach is correct if the output O it returns is such that O ⊆ R(D_s, D_t, σ, θ). -Approaches are said to be complete if their output O is a superset of R(D_s, D_t, σ, θ), i.e., O ⊇ R(D_s, D_t, σ, θ).</p><p>Our approach consists of three nested filters, each of which creates a subset of pairs, i.e., A ⊆ L ⊆ N ⊆ D_s × D_t. For clarity, we name each filtering rule:</p><formula xml:id="formula_17">R_1: \frac{h_1(s, k)\,k + |t|}{|s| + |t|} &lt; \theta \qquad R_2: |h(s, k) \cap h(t, k)| = 0 \qquad R_3: \sigma(s, t) \geq \theta</formula><p>Each subset can then be redefined as</p><formula xml:id="formula_18">N = \{\langle s, t \rangle \in D_s \times D_t : \neg R_1\}, \quad L = \{\langle s, t \rangle \in D_s \times D_t : \neg R_1 \wedge \neg R_2\}, \quad A = \{\langle s, t \rangle \in D_s \times D_t : \neg R_1 \wedge \neg R_2 \wedge R_3\}.</formula><p>We then introduce A^* as the set of pairs whose similarity score is greater than or equal to the threshold θ.</p><formula xml:id="formula_19">A^* = \{\langle s, t \rangle \in D_s \times D_t : \sigma(s, t) \geq \theta\} = \{\langle s, t \rangle \in D_s \times D_t : R_3\}<label>(17)</label></formula><p>Theorem 2. Our MFKC filtering algorithm is correct and complete.</p><p>Proof (Theorem 2). Proving Theorem 2 is equivalent to showing that A = A^*. Let us consider all the pairs in A. While the correctness of our approach follows directly from the definition of A, it is complete iff none of the pairs discarded by the filters actually belongs to A^*. The hashes are sorted in descending order of character frequency, so the first element of each hash has the highest frequency. Therefore, by Theorem 1, once we have</p><formula xml:id="formula_20">\frac{h_1(s, k)\,k + |t|}{|s| + |t|} &lt; \theta,</formula><p>the pair of strings s and t can be discarded without calculating the entire similarity. Whenever rule R_1 applies, we thus have σ(s, t) &lt; θ, which leads to R_3 ⇒ ¬R_1. Thus, set A can be rewritten as:</p><formula xml:id="formula_21">A = \{\langle s, t \rangle \in D_s \times D_t : \neg R_2 \wedge R_3\} (18)</formula><p>Given two strings s and t and their respective hashes h(s, k) and h(t, k), if the intersection between the characters of these two hashes is an empty set, then no characters match, which implies that s and t cannot have a similarity score greater than or equal to the threshold θ:</p><formula xml:id="formula_22">h(s, k) \cap h(t, k) = \emptyset \Rightarrow \sigma(s, t) = 0</formula><p>Whenever rule R_2 applies, we thus have σ(s, t) = 0 &lt; θ, which leads to R_3 ⇒ ¬R_2. Thus, set A can be rewritten as:</p><formula xml:id="formula_23">A = \{\langle s, t \rangle \in D_s \times D_t : R_3\} (19)</formula><p>which is the definition of A^* in Equation (<ref type="formula" target="#formula_19">17</ref>). Therefore, A = A^*.</p></div>
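As an informal illustration of A = A* (not part of the proof), one can check that the filtered join returns exactly the pairs found by a brute-force computation of σ. The snippet below assumes mfkc_similarity and mfkc_filtered_join from the earlier sketches and uses made-up sample data:

def brute_force_join(Ds, Dt, k, theta):
    # A*: all pairs whose full similarity meets the threshold, with no filtering at all.
    return {(s, t) for s in Ds for t in Dt if mfkc_similarity(s, t, k) >= theta}

Ds = ["aabbbcc", "leipzig", "dbpedia"]
Dt = ["bbccddee", "leipzig city", "linkedgeodata"]
filtered = {(s, t) for s, t, _ in mfkc_filtered_join(Ds, Dt, k=3, theta=0.5)}
assert filtered == brute_force_join(Ds, Dt, k=3, theta=0.5)  # lossless: the same set of pairs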
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Time complexity</head><p>To calculate the time complexity of our approach, we first consider the extraction of the most frequent k characters from a string. The first step is to sort the characters of the string; after this sort, the characters with the highest numbers of occurrences can be found in linear time. The first string can be sorted in O(n log n) time and the second string in O(m log m) time, using classical sorting algorithms such as merge sort <ref type="bibr" target="#b2">[3]</ref> or quicksort <ref type="bibr" target="#b3">[4]</ref>, which run in O(n log n). Thus, the total complexity is O(n log n) + O(m log m), resulting in O(n log n) as an upper bound in the worst case.</p></div>
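A small sketch of the sort-based counting described above (mfk_hash_sorted is a hypothetical name, not the authors' code): sorting the characters costs O(n log n), and one linear scan over the sorted characters then yields the frequencies.

def mfk_hash_sorted(s, k):
    chars = sorted(s)                    # O(n log n)
    freqs, i = [], 0
    while i < len(chars):                # single linear pass over the sorted characters
        j = i
        while j < len(chars) and chars[j] == chars[i]:
            j += 1
        freqs.append((chars[i], j - i))  # (character, frequency)
        i = j
    freqs.sort(key=lambda cf: cf[1], reverse=True)  # order by descending frequency
    return dict(freqs[:k])

print(mfk_hash_sorted("aabbbcc", 3))     # {'b': 3, 'a': 2, 'c': 2}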
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation</head><p>The aim of our evaluation is to show that our work outperforms the naive approach and that the parallel implementation yields a performance gain on large datasets with more than 10^5 pairs. A considerable number of pairs reach the threshold θ before the last of the k most frequent characters is processed, which also demonstrates how much computation is avoided: instead of all k characters, we only need the first n, where n is the position of the most frequent character at which the threshold θ is reached, thus saving the remaining k − n. An example of the efficiency of each filter can be found in Figures <ref type="figure" target="#fig_2">1 and 1</ref>(a), where 10,273,950 comparisons from DBpedia+LinkedGeoData were performed and the Performance Gain is PG = Recall(N) + Recall(L). The recall<ref type="foot" target="#foot_1">2</ref> can be seen in Figure <ref type="figure" target="#fig_4">2(c)</ref>. This evaluation shows the results of experiments on data from DBpedia<ref type="foot" target="#foot_2">3</ref> and LinkedGeoData<ref type="foot" target="#foot_3">4</ref>. We considered pairs of labels for the evaluation. We chose these datasets for two reasons: (1) they have been widely used in experiments pertaining to Link Discovery, and (2) the distributions of string sizes in these datasets are significantly different <ref type="bibr" target="#b0">[1]</ref>. All runtime and scalability experiments were performed on an Intel Core i7 machine with 8 GB RAM and an NVIDIA NVS4200 video card, running Microsoft Windows 10.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Parallel implementation</head><p>Our algorithm contains parallel code sections with which we load and balance the data among CPU/GPU cores when available. To illustrate this part of our idea: given two datasets S and T that contain all the strings to be compared, we build the Cartesian product S × T and process each pair separately in threads that are spread among the CPU/GPU cores, i.e., each comparison is processed in parallel (a minimal CPU-only sketch is given below). The parallel implementation works best on large datasets with more than 10^5 pairs, where it was more than one time faster than the approach without parallelism and two times faster than the naive approach, as shown in Figures <ref type="figure" target="#fig_8">3(a</ref>) to 3(c).</p></div>
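A minimal CPU-parallel sketch of this idea (our illustration, not the authors' CUDA/GPU code; worker count, chunk size and helper names are arbitrary choices): the Cartesian product S × T is split into chunks that are scored in parallel worker processes.

from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def mfkc_similarity(s, t, k):
    hs = dict(Counter(s).most_common(k))
    ht = dict(Counter(t).most_common(k))
    return sum(hs[c] + ht[c] for c in set(hs) & set(ht)) / (len(s) + len(t))

def score_chunk(args):
    chunk, k, theta = args
    # Keep only the pairs in this chunk whose similarity reaches the threshold.
    return [(s, t, sim) for s, t in chunk
            if (sim := mfkc_similarity(s, t, k)) >= theta]

def parallel_join(S, T, k, theta, workers=4, chunk_size=10_000):
    pairs = list(product(S, T))
    chunks = [pairs[i:i + chunk_size] for i in range(0, len(pairs), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_chunk, [(c, k, theta) for c in chunks])
    return [match for chunk_matches in results for match in chunk_matches]

# Note: on platforms that spawn worker processes (e.g. Windows), call parallel_join from
# within an `if __name__ == "__main__":` block.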
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Runtime Evaluation</head><p>The evaluation in Figures <ref type="figure" target="#fig_8">3(a</ref>) and 3(b) shows that all filter setups outperform the naive approach, and that the runtime of the parallel approach does not change significantly with the size of the dataset, as shown in Figure <ref type="figure" target="#fig_8">3(c)</ref>.</p><p>Experiments on the variation of k were also considered. As shown in Figure <ref type="figure" target="#fig_8">3</ref>(d), the runtime varies according to the value of k, indicating the influence of k for values from 1 to 120 over 1,001,642 comparisons. The performance (runtime) was improved as shown in Figures 3(a) and 3(b), with a performance gain in terms of recall of 26.07% as shown in Figure 1(a). The time complexity is based on two sorting processes, O(n log n) + O(m log m), resulting in O(n log n) as an upper bound in the worst case.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Scalability Evaluation</head><p>In these experiments (see Figures <ref type="figure" target="#fig_8">3(c</ref>), 3(e) and 3(f)), we looked at the growth of the runtime of our approach on datasets of growing sizes. The results show that the combination of filters (N + L + A) is the best option for large datasets. This result holds on both DBpedia and LinkedGeoData, so our approach can be used on large datasets and achieves acceptable runtimes. Figure <ref type="figure" target="#fig_2">1</ref> also shows the number of avoided pairs for each combination of filters, which consequently brings a performance gain. We also examined the runtime behavior on a large dataset with more than 10^6 labels, as shown in Figure <ref type="figure" target="#fig_8">3</ref>, which is one more point showing that our approach is useful on large datasets, where it can be used with high threshold values for link discovery. Regarding our parallel implementation, Figure <ref type="figure" target="#fig_8">3</ref>(c) shows that our GPU parallel implementation works better on large datasets with more than 10^5 pairs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Comparison with existing approaches</head><p>Our work outperforms the naive approach <ref type="bibr" target="#b7">[8]</ref>; to highlight some important points, we compare our work not only with the state of the art, but also with popular algorithms such as the Jaccard Index <ref type="bibr" target="#b4">[5]</ref>. As shown in Figures <ref type="figure" target="#fig_8">3(c</ref>), 3(e) and 3(f), our approach outperforms not only the naive approach but also the Jaccard Index. We show that the threshold θ and k have a significant influence on the runtime. The naive approach presents some points to consider. Even though its authors state that they performed experiments with k = 7, the naive algorithm was designed only for k = 2, and there are cases where k = 2 is not enough to obtain the expected similarity level: for s = mystring1 and t = mystring2 with k = 2 we obtain σ_2(s, t) = 0.2, a very low similarity, whereas with k = 8 the similarity is σ_8(s, t) = 0.8, showing that limiting k to 2 can miss good similarity cases. Our work fixes all these problems and also has a better runtime, as shown in Figure <ref type="figure" target="#fig_8">3</ref>(a), Figure <ref type="figure" target="#fig_8">3</ref>(b) and Figure <ref type="figure" target="#fig_8">3(c</ref>). An experiment with labels from DBpedia and Yago<ref type="foot" target="#foot_4">5</ref> shows that, in terms of f-score, our measure has significant potential to be used successfully as a string similarity, compared with the Jaccard Index and Jaro-Winkler, as Figures <ref type="figure" target="#fig_4">2(a</ref>) to 2(c) show.</p><p>To summarize, the key features that make our approach outperform the naive approach are the following: we use more than two most frequent characters in our evaluation; our runtime for more than 10^7 comparisons is shorter (27,594 against 16,916 milliseconds); we have a similarity threshold, allowing us to discard comparisons and avoid extra processing; and we provide a parallel implementation, making our approach scalable. Jaccard does not show significant changes when varying the threshold, while MFKC and Jaro-Winkler present a very similar increase in f-score as the threshold varies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Future work</head><p>We presented an approach to reduce the runtime of similarity joins using the Most Frequent k Characters algorithm, based on a sequence of filters that allow discarding pairs before computing their actual similarity. We proved that our approach is both correct and complete. The evaluation shows that all filter setups outperform the naive approach, and that our parallel implementation works best on larger datasets with more than 10^5 pairs. It is also key to developing systems for Record Linkage and Link Discovery in knowledge bases. As future work, we plan to integrate it into link discovery applications for the validation of equivalence links. The source code is free and available online<ref type="foot" target="#foot_5">6</ref>. Acknowledgments are available in the footnotes<ref type="foot" target="#foot_6">7</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Algorithm 1 MFKC Similarity Joins</head><label>1</label><figDesc>1: GoodPairs = {⟨s1, t1⟩, ..., ⟨sn, tn⟩} | s, t ∈ Σ* 2: hs = {e1, e2, ..., en} | ei = ⟨c, f(c, s)⟩ 3: ht = {e1, e2, ..., en} | ei = ⟨c, f(c, t)⟩ 4: i, freq ∈ N* 5: procedure MFKC(Ds, Dt, θ, k) 6: for all s ∈ Ds do (in parallel)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Avoided pairs and recall.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Precision, Recall and F-Measure.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>(d) Runtime for the k most frequent characters, with values of k from 1 to 120, over 1,001,642 comparisons, θ = 0.95.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>CPU Speedup of the algorithm (1,001,642 comparisons).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>(f) CPU Speedup of the algorithm (10,273,950 comparisons).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Run-time experiments results.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The expression "link discovery" in this paper means the discovery of typed relations that link instances from knowledge bases on the Web of Data. We never use it in the sense of graph theory.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Depicting DBpedia-YAGO results. YAGO was added to reinforce our evaluations, since this dataset has been widely used in experiments pertaining to Link Discovery. We also considered evaluations of recall, precision, and f-score.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://wiki.dbpedia.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://linkedgeodata.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/ research/yago-naga/yago/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://github.com/firmao/StringSimilarity/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">Acknowledgments to CNPq Brazil under grants No. 201536/2014-5 and H2020 projects SLIPO (GA no. 731581) and HOBBIT (GA no. 688227) as well as the DFG project LinkingLOD (project no. NG 105/3-2), the BMWI Projects SAKE (project no. 01MD15006E) and GEISER (project no. 01MD16014).</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">On the efficient execution of bounded jaro-winkler distances</title>
		<author>
			<persName><forename type="first">K</forename><surname>Dreßler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-C</forename><forename type="middle">N</forename><surname>Ngomo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Ontology matching</title>
		<author>
			<persName><forename type="first">J</forename><surname>Euzenat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shvaiko</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
			<publisher>Springer</publisher>
			<biblScope unit="volume">333</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Planning and coding of problems for an electronic computing instrument</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Goldstine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Neumann</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1948">1948</date>
		</imprint>
		<respStmt>
			<orgName>Institute for Advanced Study</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Quicksort</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Hoare</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Computer Journal</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="10" to="16" />
			<date type="published" when="1962">1962</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The distribution of the flora in the alpine zone</title>
		<author>
			<persName><forename type="first">P</forename><surname>Jaccard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">New phytologist</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="37" to="50" />
			<date type="published" when="1912">1912</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Binary codes capable of correcting deletions, insertions, and reversals</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">I</forename><surname>Levenshtein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Soviet physics doklady</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="707" to="710" />
			<date type="published" when="1966">1966</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">The MD5 message-digest algorithm</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rivest</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A novel string distance function based on most frequent k characters</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Seker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Altun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Ayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IJMLC</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="177" to="182" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A novel feature hashing for text mining</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Seker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Technical Science And Technologies</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="37" to="40" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Ontology matching: State of the art and future challenges</title>
		<author>
			<persName><forename type="first">P</forename><surname>Shvaiko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Euzenat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">TKDE</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="158" to="176" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Rapid execution of weighted edit distances</title>
		<author>
			<persName><forename type="first">T</forename><surname>Soru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-C. Ngonga</forename><surname>Ngomo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ontology Matching Workshop</title>
				<meeting>the Ontology Matching Workshop</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">DBpediaSameAs: An approach to tackle heterogeneity in dbpedia identifiers</title>
		<author>
			<persName><forename type="first">A</forename><surname>Valdestilhas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Arndt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kontokostas</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Efficient string similarity join in multi-core and distributed systems</title>
		<author>
			<persName><forename type="first">C</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PLOS ONE</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2017">03 2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
