<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sorted Neighborhood for Schema-free RDF Data</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity Resolution (ER) concerns identifying pairs of entities that refer to the same underlying entity. To avoid O(n2) pairwise comparison of n entities, blocking methods are used. Sorted Neighborhood is an established blocking method for Relational Databases. It has not been applied to schema-free Resource Description Framework (RDF) data sources widely prevalent in the Linked Data ecosystem. This paper presents a Sorted Neighborhood work ow that may be applied to schema-free RDF data. The work ow is modular and makes minimal assumptions about its inputs. Empirical evaluations of the proposed algorithm on ve real-world benchmarks demonstrate its utility compared to two state-of-the-art blocking baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Resolution</kwd>
        <kwd>Sorted Neighborhood</kwd>
        <kwd>Schema-free RDF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Entity Resolution (ER) is the problem of identifying pairs of entities across
databases that refer to the same underlying entity. An example is illustrated in
Figure 1. The problem goes by many names in the data mining and knowledge
discovery communities, examples being deduplication [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], record linkage [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], link
discovery [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and coreference resolution [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], to name just a few. Instances of
the problem have been documented for structured [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], semistructured [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], as
well as unstructured data models [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Given n entities and a boolean link speci cation function that determines
whether two entities are equivalent, a nave ER application would run in time
O(n2). Scalability indicates a two-step approach [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The rst step is called
blocking. Blocking uses a many-many clustering function called a blocking key to
assign entities in near-linear time into one or more blocks, where each block
represents a cluster [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In the second similarity step, only entities sharing a block
are paired and evaluated by the (expensive) link speci cation function [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In the Relational Database (RDB) community, a state-of-the-art blocking
algorithm is Sorted Neighborhood (SN) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The classic SN version accepts a
rigidly structured tabular database as input, and sorts the table by using the
blocking key as a sorting key. A window of constant size is then slid over the
records, and records sharing a window are paired and become candidates for the
second ER step. In several studies, the RDB version of SN was veri ed to have
excellent theoretical run-time guarantees and good empirical performance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To the best of our knowledge, an SN algorithm has never been proposed for
schema-free RDF data sources despite its success. Such data sources are common
on Linked Open Data (LOD), which has continued to grow since its inception in
2007 with just 12 datasets [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. According to a recently concluded study, LOD
currently contains over 1000 datasets and tens of billions of triples1.
      </p>
      <p>
        Typical blocking systems such as Silk and Limes in the Semantic Web
community assume that the link speci cation function is known a priori [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
Using the properties of the link speci cation function, these systems partition
the space of entities into blocks. In contrast, Sorted Neighborhood does not
assume any knowledge of the link speci cation function [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In practice, employing
an agnostic blocking method such as Sorted Neighborhood enables an ER
practitioner to decouple the two ER steps in terms of implementation and execution.
      </p>
      <p>This paper presents a Sorted Neighborhood work ow that processes
schemafree RDF data and accommodates a range of practical options (Section 3). The
described algorithm is evaluated on ve real-world test cases against two
established baselines (Section 4). Preliminary results show that the method is a
promising blocking procedure for schema-free RDF datasets.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Entity Resolution is an old problem [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and is methodologically surveyed by
Elmagarmid et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The blocking step is surveyed separately by Christen [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
who has also written a book synthesizing recent trends in ER research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We
note that traditionally, blocking has not been given as much importance as the
second ER step, but this changed in the mid-90's, with large databases
increasingly accessible over the Web [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Hernandez and Stolfo published the Sorted
Neighborhood work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and extended it to include parallel implementations [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Recent work has adapted the method to XML data, but under the assumption
that the XML data sources possess the same schema, possibly after a schema
1 http://linkeddata.org
matching step [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. To the best of our knowledge, the method has not been
adapted yet to schema-free semi-structured data.
      </p>
      <p>
        The advent of Linked Open Data has brought renewed focus on the
RDFbased version of Entity Resolution, which is commonly denoted as instance
matching or link discovery in the Linked Data community [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The fourth
Linked Data principle states that data must not exist in silos, but must be
interlinked to maximize value [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Linked Open Data (LOD) is currently known
to contain over eight million entities [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], many of which are believed to refer
to the same underlying entity. Silk and Limes are two popular ER frameworks
that have sought to address the issue of discovering these entity pairs [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Another example of a Semantic Web-based ER system is RDF-AI, which
offers customizable packages, similar to the proposed work ow [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The survey
by Schar e et al. provides more details on recently developed ER systems in
the Linked Data community [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. While these systems represent signi cant
advances, the blocking methods that they employ are typically not agnostic of
the link speci cation function. In contrast, the Sorted Neighborhood work ow
proposed in this paper is completely independent of the second ER step, and
operates on schema-free RDF data. Additionally, the proposed work ow o ers
options for both batch and online data access settings.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>The Work ow</title>
      <p>ated by an unknown link speci cation function. As earlier noted, an advantage
of Sorted Neighborhood is that it is agnostic of the link speci cation function.
3.1</p>
      <sec id="sec-3-1">
        <title>Blocking keys</title>
        <p>
          The quality of the work ow in Figure 2 depends directly on the quality of
blocking keys B1 and B2. There are several methods in the literature on both de ning
and learning appropriate blocking keys for schema-free data [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. One of the
earliest solutions for this problem was to use a simple token-based distance
measure (e.g. cosine similarity) to cluster entities, and to ignore all structural
information [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. This token-based algorithm, called Canopy Clustering [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], was
found to have good performance on many test cases, but has been outperformed
by more sophisticated learning algorithms in recent years [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          Two classes of viable learning algorithms (for blocking keys) have emerged as
state-of-the-art. The rst is the Attribute Clustering (AC) algorithm [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. AC is
an unsupervised procedure that groups properties (or attributes ) after
computing the overlap between the properties' object value-sets using a trigram-based
similarity score. Two entities share a block if they share tokens in any two
properties that were grouped together. The main advantage of AC is that it does not
require property alignments to be computed between the input RDF graphs, and
is simple and e ective to implement. Empirically, the learned blocking keys were
found to be competitive when evaluated by a variety of blocking methods [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
Note that the Sorted Neighborhood method was not proposed or considered in
that paper. In Section 4, we use a high-performing blocking method called block
purging from the original AC paper as a baseline for evaluating the proposed
Sorted Neighborhood work ow.
        </p>
        <p>
          The second class of learnable blocking keys, called the Disjunctive Normal
Form (DNF) blocking keys, are currently considered state-of-the-art in the
Relational Database community [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The learning procedure for these keys is
adaptive, in that training samples are used by a machine learning algorithm
to learn the key. In recent work, we extended DNF blocking keys to operate
on schema-free RDF data [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. A potential disadvantage of this method
is that tractably learning these keys requires some form of property alignment
between the schema-free datasets. Recent research has proposed some reliable
methods for automatic property alignment, making the class of DNF blocking
keys a promising unsupervised alternative to the AC class.
        </p>
        <p>
          In summary, a practitioner has several viable options for acquiring blocking
keys. In this paper, the only requirement that is imposed on the blocking key
is that it must not rely on the existence of a schema or type information. In
practice, this implies that the blocking key will be de ned either in terms of the
properties in the input RDF graphs or the subject URIs in the triples (or both).
An even simpler alternative is to use the Canopy Clustering algorithm and block
entities based on their degree of token overlap [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We consider this option as a
baseline in Section 4.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Projected Logical Property Table</title>
        <p>As mentioned in Section 3.1, the blocking keys B1;2 will be de ned2 in terms
of the properties in the RDF graphs G1;2, since schema information cannot be
assumed. Let the set of properties used in the construction of Bi be denoted as
the property set Pi of the blocking key Bi (i = 1; 2).</p>
        <p>Example 1. Consider Figure 1, and let the blocking keys B1 and B2 be given by
the respective expressions3 Tokens(subject) [ Tokens(d1:hasBrother) and
Tokens(subject) [ Tokens(d2:sibling), where subject indicates the subject URI of
the entity to which the blocking key is applied. The property sets P1 and P2 are
respectively fd1:hasBrotherg and fd2:siblingg. subject is not included in the
sets since it is not technically a property.</p>
        <p>
          Given B1 and B2, the property sets P1 and P2 are rst constructed. The
goal of extracting the property sets is to construct a projected logical property
table. A logical property table representation of an RDF graph Gi is de ned by
the schema fsubjectg [ Pi, where Pi is the set of all property URIs occurring
in Gi [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The subject column essentially serves as the key for the table. Two
special cases need to be borne in mind when populating a property table. First,
there may be subjects that do not have object values for a given property. For
example, in graph G1 in Figure 1, d1:Sam_Crax does not have an object value
for the property d1:hasBrother. For such cases, a reserved keyword (e.g. null ) is
inserted in the corresponding table cell4. The second special case is that an entity
may have mutliple object values for a given property. Again, an example can be
found in graph G1 in Figure 1, with d1:Joan_Crax_Bates having two object
values for the property d1:hasBrother. For such cases, a reserved delimiter
(e.g. ;) is used to collect multiple object values in a single table cell.
        </p>
        <p>A projected logical property table with respect to a property set P is a table
that contains only the subject column and the properties in P . In other words,
the columns P P are projected out from the schema of the logical property
table. As with the logical property table, the projected table also always has the
subject column as its key.</p>
        <p>
          The advantages of this tabular serialization (of an RDF graph) are
subsequently described. At present, we note that, if an RDF graph Gi is accessible
as a batch le (e.g. in N-triples format), devising a linear-time batch-processing
algorithm for scanning the triples and populating the table in two passes is
straightforward [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In a rst pass, the triples would be scanned and the
property set Pi would be populated, along with an index with the subject URIs as
keys. In the second pass, the rows of the initialized table would be populated.
The index ensures that updates and inserts into the table are near-constant.
2 B1;2 is shorthand for the phrase `B1 and B2'; similarly for other quantities.
3 Tokens is a function that tokenizes a string into a set of words, based on a given set
of delimiters. The set of tokens generated by the blocking key is precisely the set of
blocking key values (BKVs) assigned to that entity.
4 Located in the row of subject d1:Sam_Crax and column d1:hasBrother.
        </p>
        <p>If the graph is accessible through a SPARQL endpoint, an e cient procedure
is less straightforward. A nave option would be to use a SELECT * style query
to download the entire graph as a batch dump and then run the linear-time
algorithm mentioned above. In an online setting, bandwidth is a precious resource.
If jP j &lt;&lt; jPj, obtaining a full batch dump is clearly not e cient.</p>
        <p>Instead, we deploy the following query on the SPARQL endpoint to obtain
a table of loosely structured tuples :
SELECT ?entity ?o1 : : : ?od
WHEREf
fSELECT DISTINCT ?entity WHERE ?entity ?p ?o.</p>
        <p>FILTER (?p = &lt; p1 &gt; jj : : : jj ?p = &lt; pd &gt;gg
OPTIONAL f?entity &lt; p1 &gt; ?o1g
: : :
OPTIONAL f?entity &lt; pn &gt; ?odg
g</p>
        <p>Here, &lt; pi &gt; (for i 2 f1 : : : dg) is a placeholder for a property URI in P . In the
nested query, the triples are ltered by the properties in P . As earlier mentioned,
an RDF entity is not guaranteed to have an object value for a property in P .
The Optional clause (for each property) provides a safeguard for this possibility.
Example 2. Consider again the rst graph in Figure 1. Assuming the blocking
key Tokens(subject) [ Tokens(d1:hasBrother) de ned in the example earlier, the
loosely structured tuples generated by executing the query are shown in Figure
3 (a). Figure 3 (b) shows the projected logical property table serialization of
graph G1 given the property set P1 = fd1 : hasBrotherg.</p>
        <p>These tuples are denoted as loosely-structured for two reasons. First, the
table is not guaranteed to contain a column that serves as a key. Secondly, there
may be empty table cells. In order to impose structure on this table, therefore,
the table is serialized into a projected logical property table.</p>
        <p>
          We note that an advantage of employing the property table, instead of
inventing a serialization, is that it already has a physical implementation in triple
stores such as Jena [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The primary advantage of the physical table is that
it allows storing RDF les in back-end Relational Database infrastructure and
Algorithm 1 Modi ed Sorted Neighborhood
Input: Blocking keys B1;2, projected logical property tables T1;2, windowing
parameter w
Output: Candidate set of entity-entity pairs
1. Initialize to be an empty set
2. Apply B1 (B2) to each tuple in T1 (T2) to obtain multimaps 1 = f&lt; bkv; R &gt;g
( 2 = f&lt; bkv; S &gt;g), with bkv in each key-value set pair referring to a blocking
key value string and R (S), denoted as a block, being a set of subject URIs in
T1 (T2)
3. Join 1 and 2 on their keys to obtain joined map map = f&lt; bkv; &lt; R; S &gt;g,
where keySet [ map]= keySet [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] \ keySet [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and the larger of R and S in
each pair in map is truncated so that jRj = jSj
4. Sort map into a list L with columns BKV , subject1 and subject2, using the
keys in map (or the BKV column) as sorting keys
5. For each tuple &lt; bkv; s1; s2 &gt; in L, emit pairs &lt; s1; bkv &gt; and &lt; s2; bkv &gt; as
entity-BKV pairs (Figure 2)
6. Slide a window of size w over tuples in L (Figure 4)
7. Add entity-entity pair &lt; e1; e2 &gt; to , where e1 (e2) is a subject URI from
column subject1 (subject2) and e1 and e2 are in (not necessarily the same)
tuples that fall within a common window
8. return
signi cantly speeds up self-joins on properties for certain SPARQL queries. This
paper proposes yet another use-case for this table; namely, realizing the Sorted
Neighborhood algorithm for schema-free RDF data.
3.3
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Candidate Set Generation</title>
        <p>
          Given two projected logical property tables T1;2 derived from graphs G1;2,
the blocking keys B1;2, and the windowing parameter w, Algorithm 1 performs
Sorted Neighborhood (SN) on T1;2. Note that the original Sorted Neighborhood
cannot be used, since it assumes either a single dataset (adhering to a single
schema) or two datasets that have been individually cleansed of duplicates and
then merged into a single dataset. Also, the original SN method assumes that
each entity has at most one blocking key value [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In the heterogeneous space of
RDF datasets, these assumptions are too restrictive [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Algorithm 1 performs
the SN procedure without making these assumptions.
        </p>
        <p>
          First, the algorithm applies the blocking keys to each tuple in the projected
property tables and obtains two multimaps 1;2 of blocks indexed by a
blocking key value (BKV), with each block essentially being a set of subject URIs.
Considering the blocking key B1 =Tokens(subject) [ Tokens(d1:hasBrother) in
Example 1, one example of a generated block (on the rst dataset in Figure
1) would be fd1 : M ike Bates; d1 : J oan Crax Batesg, indexed by the BKV
Bates, since the two entities share a common token in their subject URIs. Note
that an entity may have multiple BKVs and may occur in several blocks. Next,
the algorithm joins 1 and 2 into a map map by using the keysets (the BKVs)
as the join keys. Thus, blocks in 1 that do not have a corresponding5 block in
2 are discarded; similarly for blocks in 2. Also, for any two blocks R and S
(from 1 and 2 respectively), we truncate the larger block by randomly
removing elements till jRj = jSj. In line 4, Algorithm 1 sorts map into a three-column
sorted list by using the BKVs as sorting keys. This is similar to the sorting step
in the original SN algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Finally, a window of size w is slid over the tuples
in L and two entities (from columns subject1 and subject2 respectively) sharing
a window are added to (Figure 4).
        </p>
        <p>Algorithm 1 (and also, the last part of the work ow) allows the user to make
operational decisions on whether the candidate set or the BKVs of entities (or
both) should be published as sets of triples using two specially de ned
properties :hasCandidate and :IsBKVOf with equivalence class and inverse function
semantics respectively. These triples can be published by third-party sources on
dedicated servers and be made accessible as business services, similar to
Relational Database cloud-based data matching6 services.
4
4.1</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Analysis</title>
      <sec id="sec-4-1">
        <title>Test Cases, Metrics and Setup</title>
        <p>The system is evaluated on ve publicly available benchmarks, four of which
(Persons 1, Persons 2, Film and Restaurants ) were published as part of the 2010
instance matching track of OAEI7, which is an annual Semantic Web initiative,
and one of which (IM-Similarity ) is a multilingual, crowdsourced dataset
describing books and was released as part of OAEI 2014. These datasets, summarized
in Table 1, span several domains and o er a variety of qualitative challenges.</p>
        <p>The blocking step8 is typically evaluated by computing e ciency and e
ectiveness metrics for the candidate set of entity-entity pairs generated by the
5 That is, indexed by the same blocking key value (BKV).
6 An example is the D&amp;B Business Insight service: https://datamarket.azure.com/
dataset/dnb/businessinsight.
7 Ontology Alignment Evaluation Initiative: http://oaei.ontologymatching.org/
2010/#instance
8 This includes both the blocking key learning and the subsequent blocking method
(such as the Sorted Neighborhood method presented in this paper).
blocking method. Let the number of matching (or ground-truth) entity pairs in
be denoted by the symbol m. Similarly, let denote the full set (the
Cartesian product) of entity pairs, and m denote the ground-truth set of matching
entity pairs. With this terminology, the e ciency metric, Reduction Ratio (RR),
and the e ectiveness metric, Pairs Completeness (PC), are de ned below:
RR =
1</p>
        <p>j j
j j
P C = j mj</p>
        <p>j mj
F -score =
2</p>
        <p>RR P C
(RR + P C)</p>
        <p>We capture the tradeo between RR and PC by computing their harmonic
mean or their F-score9:
Note that, for all three metrics (PC, RR and the F-score), higher values indicate
better performance.</p>
        <p>
          We discovered the blocking key for each dataset in the ve test cases by
employing the trigrams-based Attribute Clustering (AC) algorithm that was
described brie y in Section 3.1; complete details are provided in the original
paper by Papadakis et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In general, the learning algorithm leads to a
blocking key that considers tokens of multiple properties (or `attributes') and
the name of the entity (its subject URI) as blocking key values. Brief examples
were earlier provided. Once learned, Algorithm 1 is executed with w varied from
2 to 50. For each value of w, PC, RR and F-score values are recorded, with
values corresponding to the highest-achieved F-score reported.
4.2
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Baselines and Implementation</title>
        <p>
          Two baseline blocking methods are used to gauge the performance of Algorithm
1 on the ve test cases. The rst baseline is the block purging method that was
9 This is di erent from the F-score employed in Information Retrieval, where it is
typically the harmonic mean of precision and recall.
(1)
(2)
(3)
used with the AC blocking key in the original paper along with a variety of other
blocking methods, many of which made assumptions that are not applicable to
this work [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Block purging was the least restrictive in terms of assumptions,
simple to implement and execute (but with excellent empirical performance),
and similar to the proposed method, involved a single parameter maxP airs. For
these reasons, we employ it as a baseline to test the proposed system. Similar
to lines 1-3 of Algorithm 1, block purging rst generates a map of joined blocks
map (but without truncating larger blocks). Rather than employing a sliding
window, the method discards an entry &lt; bkv; &lt; R; S &gt;&gt; from map if jRjjSj &gt;
maxP airs. Among surviving entries, all entity-entity pairs in R S are added
to the candidate set .
        </p>
        <p>
          We also employed the classic Canopy Clustering (CC) blocking method as
a second baseline [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. CC is a generic clustering method that relies on an
inexpensive similarity function such as cosine similarity, and is known to perform
well on tabular datasets [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Given a single threshold parameter thresh = [0; 1],
the algorithm operates by locating for every tuple t in the rst dataset, all
tuples s in the second dataset such that the tuple pair &lt; t; s &gt; has similarity at
least thresh, and adds &lt; t; s &gt; to the candidate set . Note that CC does not
rely on a blocking key, but uses all tokens in the tuples to compute the cosine
similarity. We execute CC on the full10 logical property table representations
of the input RDF datasets. Similar to w in Algorithm 1 and maxP airs in the
block purging method, thresh is varied and only the highest-achieved F-score
values are reported. For a xed value of the thresholds, note that the algorithms
are deterministic and only need to be run once. Finally, we restricted parameter
tuning (per test case) to a maximum of fty iterations. In terms of run-time,
all algorithms (for a given setting of the parameters) ran within a minute per
dataset, making them roughly equal in that respect.
        </p>
        <p>Finally, all programs were implemented serially in Java on a 32-bit Ubuntu
virtual machine with 3385 MB of RAM and a 2.40 GHz Intel 4700MQ i7
processor. We have released datasets and ground-truth les on a project website11.
4.3</p>
      </sec>
      <sec id="sec-4-3">
        <title>Results and Discussion</title>
        <p>
          Table 2 shows the highest F-scores obtained by the proposed method and the
block purging baseline, along with corresponding PC and RR values. The results
show that, on three of the ve test cases, the modi ed SN procedure outperforms
the block purging algorithm. On the RR metric, SN outperforms block purging
by over 6% and on the F-score metric by over 3.5%. On the PC metric, block
purging does better by about 1.7%. Overall, the results show that the modi ed
SN algorithm is a promising blocking candidate. We note that, on the RR metric
in particular, the SN algorithm is more stable, with block purging achieving less
than 75% on at least two datasets. As earlier authors have noted [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], even small
10 Since there is no blocking key (and no property set), projection does not take place.
11 https://sites.google.com/a/utexas.edu/mayank-kejriwal/projects/
sorted-neighborhood
variations in RR can have a large impact on the run-time of a full ER work ow,
since RR is quadratic in the number of input entities (Eq. 1). Thus, SN may be
a more reliable method than block purging under resource-constrained settings.
        </p>
        <p>Although not tabulated here due to space constraints, the average (on the ve
test cases) highest F-score and corresponding PC and RR results obtained by the
Canopy Clustering (CC) baseline are 84.74%, 80.39% and 96.58% respectively.
Both the block purging and proposed method outperform CC on both PC and
F-score by a large margin, demonstrating that on schema-free RDF data, simple
token-based blocking techniques are rarely su cient.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we proposed a Sorted Neighborhood work ow for schema-free
RDF data, implemented as a sequence of relatively independent modules. All
work ow steps can be operationalized using existing Semantic Web technology.
Evaluations on ve test cases show that the proposed algorithm is competitive
with established blocking techniques for schema-free RDF data.</p>
      <p>Future work will aim to deploy the system as a Linked Data service in a
distributed paradigm, and to evaluate it on large-scale datasets. Given certain
assumptions about the input datasets, a promising theoretical avenue is to
automatically determine the optimal value of w for the inputs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kamath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <article-title>Adaptive blocking: Learning to scale up record linkage</article-title>
          .
          <source>In Data Mining</source>
          ,
          <year>2006</year>
          . ICDM'
          <fpage>06</fpage>
          . Sixth International Conference on, pages
          <volume>87</volume>
          {
          <fpage>96</fpage>
          . IEEE,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data-the story so far</article-title>
          .
          <source>International journal on semantic web and information systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):1{
          <fpage>22</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          .
          <article-title>Further topics and research directions</article-title>
          .
          <source>In Data Matching</source>
          , pages
          <volume>209</volume>
          {
          <fpage>228</fpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          .
          <article-title>A survey of indexing techniques for scalable record linkage and deduplication. Knowledge and Data Engineering</article-title>
          , IEEE Transactions on,
          <volume>24</volume>
          (
          <issue>9</issue>
          ):
          <volume>1537</volume>
          {
          <fpage>1555</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>A. K. Elmagarmid</surname>
            ,
            <given-names>P. G.</given-names>
          </string-name>
          <string-name>
            <surname>Ipeirotis</surname>
            , and
            <given-names>V. S.</given-names>
          </string-name>
          <string-name>
            <surname>Verykios</surname>
          </string-name>
          .
          <article-title>Duplicate record detection: A survey. Knowledge and Data Engineering</article-title>
          , IEEE Transactions on,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):1{
          <fpage>16</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Stolfo</surname>
          </string-name>
          .
          <article-title>The merge/purge problem for large databases</article-title>
          .
          <source>In ACM SIGMOD Record</source>
          , volume
          <volume>24</volume>
          , pages
          <fpage>127</fpage>
          {
          <fpage>138</fpage>
          . ACM,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Stolfo</surname>
          </string-name>
          .
          <article-title>Real-world data is dirty: Data cleansing and the merge/purge problem</article-title>
          .
          <source>Data mining and knowledge discovery</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):9{
          <fpage>37</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>E cient multidimensional blocking for link discovery without losing recall</article-title>
          .
          <source>In WebDB</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          .
          <article-title>An unsupervised algorithm for learning blocking schemes</article-title>
          .
          <source>In Data Mining</source>
          ,
          <year>2013</year>
          . ICDM'
          <fpage>13</fpage>
          . Thirteenth International Conference on. IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          .
          <article-title>A two-step blocking scheme learner for scalable link discovery</article-title>
          .
          <source>In Ontology Matching Workshop. ISWC'14. Thirteenth International Semantic Web Conference</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          .
          <article-title>A dnf blocking scheme learner for heterogeneous datasets</article-title>
          .
          <source>arXiv preprint arXiv:1501.01694</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>A. McCallum</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Nigam</surname>
            , and
            <given-names>L. H.</given-names>
          </string-name>
          <string-name>
            <surname>Ungar</surname>
          </string-name>
          .
          <article-title>E cient clustering of high-dimensional data sets with application to reference matching</article-title>
          .
          <source>In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <volume>169</volume>
          {
          <fpage>178</fpage>
          . ACM,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>J. F. McCarthy</surname>
            and
            <given-names>W. G.</given-names>
          </string-name>
          <string-name>
            <surname>Lehnert</surname>
          </string-name>
          .
          <article-title>Using decision trees for coreference resolution</article-title>
          .
          <source>arXiv preprint cmp-lg/9505043</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. H.
          <string-name>
            <surname>Newcombe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kennedy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Axford</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>James</surname>
          </string-name>
          .
          <article-title>Automatic linkage of vital records</article-title>
          .
          <year>1959</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>A</surname>
            .-
            <given-names>C. N.</given-names>
          </string-name>
          <string-name>
            <surname>Ngomo</surname>
          </string-name>
          .
          <article-title>A time-e cient hybrid approach to link discovery</article-title>
          .
          <source>Ontology Matching, page 1</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. G. Papadakis,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ioannou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Niederee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Fankhauser</surname>
          </string-name>
          .
          <article-title>E cient entity resolution for large heterogeneous information spaces</article-title>
          .
          <source>In Proceedings of the fourth ACM international conference on Web search and data mining</source>
          , pages
          <volume>535</volume>
          {
          <fpage>544</fpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>S.</given-names>
            <surname>Puhlmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Xml duplicate detection using sorted neighborhoods</article-title>
          .
          <source>In Advances in Database Technology-EDBT</source>
          <year>2006</year>
          , pages
          <fpage>773</fpage>
          {
          <fpage>791</fpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. F. Schar e,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          , et al.
          <article-title>Data linking for the semantic web</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <volume>46</volume>
          {
          <fpage>76</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. F. Schar e, Y. Liu, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Rdf-ai: an architecture for rdf datasets matching, fusion and interlink</article-title>
          .
          <source>In Proc. IJCAI 2009 workshop on Identity</source>
          , reference, and
          <article-title>knowledge representation (IR-KR)</article-title>
          ,
          <source>Pasadena (CA US)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>M. Schmachtenberg</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Bizer</surname>
            , and
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Adoption of the linked data best practices in di erent topical domains</article-title>
          .
          <source>In The Semantic Web{ISWC</source>
          <year>2014</year>
          , pages
          <fpage>245</fpage>
          {
          <fpage>260</fpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sayers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kuno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , et al.
          <article-title>E cient rdf storage and retrieval in jena2</article-title>
          .
          <source>In SWDB</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>131</fpage>
          {
          <fpage>150</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>