Sorted Neighborhood for Schema-free RDF Data

                    Mayank Kejriwal and Daniel P. Miranker

                           University of Texas at Austin
                       {kejriwal,miranker}@cs.utexas.edu



       Abstract. Entity Resolution (ER) concerns identifying pairs of enti-
       ties that refer to the same underlying entity. To avoid O(n²) pairwise
       comparison of n entities, blocking methods are used. Sorted Neighbor-
       hood is an established blocking method for Relational Databases. It has
       not been applied to schema-free Resource Description Framework (RDF)
       data sources widely prevalent in the Linked Data ecosystem. This pa-
       per presents a Sorted Neighborhood workflow that may be applied to
       schema-free RDF data. The workflow is modular and makes minimal
       assumptions about its inputs. Empirical evaluations of the proposed al-
       gorithm on five real-world benchmarks demonstrate its utility compared
       to two state-of-the-art blocking baselines.

       Keywords: Entity Resolution, Sorted Neighborhood, Schema-free RDF


1    Introduction

Entity Resolution (ER) is the problem of identifying pairs of entities across
databases that refer to the same underlying entity. An example is illustrated in
Figure 1. The problem goes by many names in the data mining and knowledge
discovery communities, examples being deduplication [5], record linkage [4], link
discovery [15] and coreference resolution [13], to name just a few. Instances of
the problem have been documented for structured [5], semistructured [17], as
well as unstructured data models [13].
    Given n entities and a boolean link specification function that determines
whether two entities are equivalent, a naïve ER application would run in time
O(n²). Scalability concerns motivate a two-step approach [5]. The first step is called block-
ing. Blocking uses a many-many clustering function called a blocking key to
assign entities in near-linear time into one or more blocks, where each block rep-
resents a cluster [4]. In the second similarity step, only entities sharing a block
are paired and evaluated by the (expensive) link specification function [15], [8].
    In the Relational Database (RDB) community, a state-of-the-art blocking
algorithm is Sorted Neighborhood (SN) [6]. The classic SN version accepts a
rigidly structured tabular database as input, and sorts the table by using the
blocking key as a sorting key. A window of constant size is then slid over the
records, and records sharing a window are paired and become candidates for the
second ER step. In several studies, the RDB version of SN was verified to have
excellent theoretical run-time guarantees and good empirical performance [6], [4].




Fig. 1. An instance of the Entity Resolution (ER) problem on two RDF graphs. Various
entities between the two graphs need to be interlinked using owl:sameAs edges


To the best of our knowledge, an SN algorithm has never been proposed for
schema-free RDF data sources despite its success. Such data sources are common
on Linked Open Data (LOD), which has continued to grow since its inception in
2007 with just 12 datasets [20]. According to a recently concluded study, LOD
currently contains over 1000 datasets and tens of billions of triples1 .
     Typical blocking systems such as Silk and Limes in the Semantic Web com-
munity assume that the link specification function is known a priori [8], [15].
Using the properties of the link specification function, these systems partition
the space of entities into blocks. In contrast, Sorted Neighborhood does not as-
sume any knowledge of the link specification function [6]. In practice, employing
an agnostic blocking method such as Sorted Neighborhood enables an ER prac-
titioner to decouple the two ER steps in terms of implementation and execution.
     This paper presents a Sorted Neighborhood workflow that processes schema-
free RDF data and accommodates a range of practical options (Section 3). The
described algorithm is evaluated on five real-world test cases against two es-
tablished baselines (Section 4). Preliminary results show that the method is a
promising blocking procedure for schema-free RDF datasets.


2     Related Work

Entity Resolution is an old problem [14], and is methodologically surveyed by
Elmagarmid et al. [5]. The blocking step is surveyed separately by Christen [4],
who has also written a book synthesizing recent trends in ER research [3]. We
note that traditionally, blocking has not been given as much importance as the
second ER step, but this changed in the mid-1990s, with large databases increas-
ingly accessible over the Web [4]. Hernández and Stolfo published the Sorted
Neighborhood work [6], and extended it to include parallel implementations [7].
Recent work has adapted the method to XML data, but under the assumption
that the XML data sources possess the same schema, possibly after a schema
matching step [17]. To the best of our knowledge, the method has not been
adapted yet to schema-free semi-structured data.
1
    http://linkeddata.org


    Fig. 2. A modular Sorted Neighborhood workflow for schema-free RDF graphs

    The advent of Linked Open Data has brought renewed focus on the RDF-
based version of Entity Resolution, which is commonly denoted as instance
matching or link discovery in the Linked Data community [18], [15]. The fourth
Linked Data principle states that data must not exist in silos, but must be in-
terlinked to maximize value [2]. Linked Open Data (LOD) is currently known
to contain over eight million entities [20], many of which are believed to refer
to the same underlying entity. Silk and Limes are two popular ER frameworks
that have sought to address the issue of discovering these entity pairs [15], [8].
Another example of a Semantic Web-based ER system is RDF-AI, which of-
fers customizable packages, similar to the proposed workflow [19]. The survey
by Scharffe et al. provides more details on recently developed ER systems in
the Linked Data community [18]. While these systems represent significant ad-
vances, the blocking methods that they employ are typically not agnostic of
the link specification function. In contrast, the Sorted Neighborhood workflow
proposed in this paper is completely independent of the second ER step, and
operates on schema-free RDF data. Additionally, the proposed workflow offers
options for both batch and online data access settings.


3    The Workflow

Figure 2 illustrates a Sorted Neighborhood workflow for schema-free RDF data.
At a high level, the workflow accepts two inputs, namely the pair of RDF graphs
G1 and G2 containing entities that need to be interlinked, as well as correspond-
ing blocking keys B1 and B2 . The final output is a set of entity-entity pairs
(denoted as the candidate set), which is piped to the second ER step and evalu-
ated by an unknown link specification function. As earlier noted, an advantage
of Sorted Neighborhood is that it is agnostic of the link specification function.


3.1   Blocking keys

The quality of the workflow in Figure 2 depends directly on the quality of block-
ing keys B1 and B2 . There are several methods in the literature on both defining
and learning appropriate blocking keys for schema-free data [9], [16]. One of the
earliest solutions for this problem was to use a simple token-based distance
measure (e.g. cosine similarity) to cluster entities, and to ignore all structural
information [12]. This token-based algorithm, called Canopy Clustering [12], was
found to have good performance on many test cases, but has been outperformed
by more sophisticated learning algorithms in recent years [1], [9].
     Two classes of viable learning algorithms (for blocking keys) have emerged as
state-of-the-art. The first is the Attribute Clustering (AC) algorithm [16]. AC is
an unsupervised procedure that groups properties (or attributes) after comput-
ing the overlap between the properties’ object value-sets using a trigram-based
similarity score. Two entities share a block if they share tokens in any two prop-
erties that were grouped together. The main advantage of AC is that it does not
require property alignments to be computed between the input RDF graphs, and
is simple to implement. Empirically, the learned blocking keys were
found to be competitive when evaluated by a variety of blocking methods [16].
Note that the Sorted Neighborhood method was not proposed or considered in
that paper. In Section 4, we use a high-performing blocking method called block
purging from the original AC paper as a baseline for evaluating the proposed
Sorted Neighborhood workflow.
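
To make this concrete, the following is a minimal Java sketch of one way the
trigram-based overlap between two properties' object value-sets could be computed.
This is our reading of the AC description, not the original code; the method names,
the use of Jaccard overlap over trigram sets, and the idea of a grouping threshold
are assumptions, and the exact measure used in [16] may differ.

import java.util.*;

// Sketch (assumptions noted above): properties whose object value-sets have
// trigram similarity above some threshold would be grouped into one cluster.
static double propertySimilarity(Set<String> values1, Set<String> values2) {
    Set<String> t1 = trigrams(values1), t2 = trigrams(values2);
    Set<String> inter = new HashSet<>(t1);
    inter.retainAll(t2);                       // shared trigrams
    Set<String> union = new HashSet<>(t1);
    union.addAll(t2);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
}

static Set<String> trigrams(Set<String> values) {
    Set<String> grams = new HashSet<>();
    for (String v : values)
        for (int i = 0; i + 3 <= v.length(); i++)
            grams.add(v.substring(i, i + 3));  // character trigrams
    return grams;
}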
     The second class of learnable blocking keys, called the Disjunctive Normal
Form (DNF) blocking keys, are currently considered state-of-the-art in the Re-
lational Database community [1], [9]. The learning procedure for these keys is
adaptive, in that training samples are used by a machine learning algorithm
to learn the key. In recent work, we extended DNF blocking keys to operate
on schema-free RDF data [11], [10]. A potential disadvantage of this method
is that tractably learning these keys requires some form of property alignment
between the schema-free datasets. Recent research has proposed some reliable
methods for automatic property alignment, making the class of DNF blocking
keys a promising unsupervised alternative to the AC class.
     In summary, a practitioner has several viable options for acquiring blocking
keys. In this paper, the only requirement that is imposed on the blocking key
is that it must not rely on the existence of a schema or type information. In
practice, this implies that the blocking key will be defined either in terms of the
properties in the input RDF graphs or the subject URIs in the triples (or both).
An even simpler alternative is to use the Canopy Clustering algorithm and block
entities based on their degree of token overlap [12]. We consider this option as a
baseline in Section 4.

3.2   Projected Logical Property Table

As mentioned in Section 3.1, the blocking keys B1,2 will be defined2 in terms
of the properties in the RDF graphs G1,2 , since schema information cannot be
assumed. Let the set of properties used in the construction of Bi be denoted as
the property set Pi of the blocking key Bi (i = 1, 2).

Example 1. Consider Figure 1, and let the blocking keys B1 and B2 be given by
the respective expressions3 Tokens(subject) ∪ Tokens(d1:hasBrother) and To-
kens(subject) ∪ Tokens(d2:sibling), where subject indicates the subject URI of
the entity to which the blocking key is applied. The property sets P1 and P2 are
respectively {d1:hasBrother} and {d2:sibling}. subject is not included in the
sets since it is not technically a property.
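
For illustration, a blocking key like B1 can be realized with a simple tokenizer.
The following Java sketch is hypothetical; in particular, the delimiter set is an
assumption that the paper does not fix.

import java.util.*;

// Sketch of B1 = Tokens(subject) ∪ Tokens(d1:hasBrother); the returned set is
// exactly the set of BKVs assigned to the entity (see footnote 3).
static Set<String> applyB1(String subjectUri, List<String> hasBrotherValues) {
    Set<String> bkvs = tokens(subjectUri);
    for (String v : hasBrotherValues)
        bkvs.addAll(tokens(v));
    return bkvs;
}

static Set<String> tokens(String s) {
    // Assumed delimiters: URI separators, underscores, and whitespace.
    return new HashSet<>(Arrays.asList(s.split("[/#_:\\s;,]+")));
}

On d1:Joan_Crax_Bates, for example, the subject URI alone already contributes
the BKVs {d1, Joan, Crax, Bates}.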

    Given B1 and B2 , the property sets P1 and P2 are first constructed. The
goal of extracting the property sets is to construct a projected logical property
table. A logical property table representation of an RDF graph Gi is defined by
the schema {subject} ∪ 𝒫i , where 𝒫i is the set of all property URIs occurring
in Gi [11]. The subject column essentially serves as the key for the table. Two
special cases need to be borne in mind when populating a property table. First,
there may be subjects that do not have object values for a given property. For
example, in graph G1 in Figure 1, d1:Sam_Crax does not have an object value
for the property d1:hasBrother. For such cases, a reserved keyword (e.g. null ) is
inserted in the corresponding table cell4 . The second special case is that an entity
may have multiple object values for a given property. Again, an example can be
found in graph G1 in Figure 1, with d1:Joan_Crax_Bates having two object
values for the property d1:hasBrother. For such cases, a reserved delimiter
(e.g. ;) is used to collect multiple object values in a single table cell.
    A projected logical property table with respect to a property set P is a table
that contains only the subject column and the properties in P . In other words,
the columns in 𝒫 − P are projected out from the schema of the logical property
table. As with the logical property table, the projected table also always has the
subject column as its key.
    The advantages of this tabular serialization (of an RDF graph) are subse-
quently described. At present, we note that, if an RDF graph Gi is accessible
as a batch file (e.g. in N-triples format), devising a linear-time batch-processing
algorithm for scanning the triples and populating the table in two passes is
straightforward [11]. In a first pass, the triples would be scanned and the prop-
erty set Pi would be populated, along with an index with the subject URIs as
keys. In the second pass, the rows of the initialized table would be populated.
The index ensures that updates and inserts into the table are near-constant.
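
A minimal Java sketch of this two-pass construction follows. It assumes the triples
have already been parsed into (subject, property, object) string arrays; the method
name and the representation of a table row as a Map are illustrative choices.

import java.util.*;

// Two-pass construction of the projected logical property table.
static Map<String, Map<String, String>> buildProjectedTable(
        List<String[]> triples, Set<String> propertySet) {
    // Pass 1: index subject URIs, initializing one (empty) row per entity.
    Map<String, Map<String, String>> table = new HashMap<>();
    for (String[] t : triples)
        table.putIfAbsent(t[0], new HashMap<>());
    // Pass 2: fill cells for properties in the property set; ';' collects
    // multiple object values, and absent cells are later read back as null.
    for (String[] t : triples) {
        if (!propertySet.contains(t[1])) continue;       // projection
        table.get(t[0]).merge(t[1], t[2], (a, b) -> a + ";" + b);
    }
    return table;
}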
2
  B1,2 is shorthand for the phrase ‘B1 and B2 ’; similarly for other quantities.
3
  Tokens is a function that tokenizes a string into a set of words, based on a given set
  of delimiters. The set of tokens generated by the blocking key is precisely the set of
  blocking key values (BKVs) assigned to that entity.
4
  Located in the row of subject d1:Sam_Crax and column d1:hasBrother.




Fig. 3. (a) is the table of loosely structured tuples obtained when the given SPARQL
query is executed on graph G1 in Figure 1, assuming property set P1 =
{d1:hasBrother}. (b) is the projected logical property table serialization of G1 w.r.t. P1


     If the graph is accessible through a SPARQL endpoint, an efficient procedure
is less straightforward. A naïve option would be to use a SELECT * style query
to download the entire graph as a batch dump and then run the linear-time algo-
rithm mentioned above. In an online setting, bandwidth is a precious resource.
If |P| ≪ |𝒫|, obtaining a full batch dump is clearly not efficient.
     Instead, we deploy the following query on the SPARQL endpoint to obtain
a table of loosely structured tuples:
SELECT ?entity ?o1 . . . ?od
WHERE {
  {SELECT DISTINCT ?entity WHERE {?entity ?p ?o .
   FILTER (?p = <p1> || . . . || ?p = <pd>)}}
  OPTIONAL {?entity <p1> ?o1}
  . . .
  OPTIONAL {?entity <pd> ?od}
}
     Here, <pi> (for i ∈ {1 . . . d}) is a placeholder for a property URI in P . In the
nested query, the triples are filtered by the properties in P . As earlier mentioned,
an RDF entity is not guaranteed to have an object value for a property in P .
The OPTIONAL clause (for each property) provides a safeguard for this possibility.
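
As a sketch of how the query might be issued programmatically, the following uses
the ARQ API of Apache Jena (which the paper already cites for property tables
[21]). The endpoint URL, method name and result handling are placeholders; the
paper itself does not prescribe a client library.

import org.apache.jena.query.*;

// Issue the query above against a SPARQL endpoint and iterate over the
// loosely structured tuples. 'queryString' is the instantiated query, with
// <p1> ... <pd> filled in from the property set P.
static void fetchLooseTuples(String endpoint, String queryString) {
    Query query = QueryFactory.create(queryString);
    try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query)) {
        ResultSet results = qexec.execSelect();
        while (results.hasNext()) {
            QuerySolution row = results.next();
            // row.get("entity"), row.get("o1"), ... ; OPTIONAL values may be null.
        }
    }
}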
Example 2. Consider again the first graph in Figure 1. Assuming the blocking
key Tokens(subject) ∪ Tokens(d1:hasBrother) defined in the example earlier, the
loosely structured tuples generated by executing the query are shown in Figure
3 (a). Figure 3 (b) shows the projected logical property table serialization of
graph G1 given the property set P1 = {d1:hasBrother}.
    These tuples are denoted as loosely-structured for two reasons. First, the
table is not guaranteed to contain a column that serves as a key. Second, there
may be empty table cells. In order to impose structure on this table, therefore,
the table is serialized into a projected logical property table.
    We note that an advantage of employing the property table, instead of in-
venting a serialization, is that it already has a physical implementation in triple
stores such as Jena [21]. The primary advantage of the physical table is that
it allows storing RDF files in back-end Relational Database infrastructure and
significantly speeds up self-joins on properties for certain SPARQL queries. This
paper proposes yet another use-case for this table; namely, realizing the Sorted
Neighborhood algorithm for schema-free RDF data.


Algorithm 1 Modified Sorted Neighborhood
Input: Blocking keys B1,2 , projected logical property tables T1,2 , windowing param-
  eter w
Output: Candidate set Γ of entity-entity pairs

   1. Initialize Γ to be an empty set
   2. Apply B1 (B2 ) to each tuple in T1 (T2 ) to obtain multimaps Π1 = {< bkv, R >}
      (Π2 = {< bkv, S >}), with bkv in each key-value set pair referring to a blocking
      key value string and R (S), denoted as a block, being a set of subject URIs in
      T1 (T2 )
   3. Join Π1 and Π2 on their keys to obtain joined map Πmap = {< bkv, < R, S >>},
      where keySet[Πmap ] = keySet[Π1 ] ∩ keySet[Π2 ], and the larger of R and S in
      each pair in Πmap is truncated so that |R| = |S|
   4. Sort Πmap into a list L with columns BKV , subject1 and subject2 , using the
      keys in Πmap (or the BKV column) as sorting keys
   5. For each tuple < bkv, s1 , s2 > in L, emit pairs < s1 , bkv > and < s2 , bkv > as
      entity-BKV pairs (Figure 2)
   6. Slide a window of size w over tuples in L (Figure 4)
   7. Add entity-entity pair < e1 , e2 > to Γ , where e1 (e2 ) is a subject URI from
      column subject1 (subject2 ) and e1 and e2 are in (not necessarily the same)
      tuples that fall within a common window
   8. return Γ


3.3   Candidate Set Generation

   Given two projected logical property tables T1,2 derived from graphs G1,2 ,
the blocking keys B1,2 , and the windowing parameter w, Algorithm 1 performs
Sorted Neighborhood (SN) on T1,2 . Note that the original Sorted Neighborhood
cannot be used, since it assumes either a single dataset (adhering to a single
schema) or two datasets that have been individually cleansed of duplicates and
then merged into a single dataset. Also, the original SN method assumes that
each entity has at most one blocking key value [6]. In the heterogeneous space of
RDF datasets, these assumptions are too restrictive [16]. Algorithm 1 performs
the SN procedure without making these assumptions.
   First, the algorithm applies the blocking keys to each tuple in the projected
property tables and obtains two multimaps Π1,2 of blocks indexed by a block-
ing key value (BKV), with each block essentially being a set of subject URIs.
Considering the blocking key B1 =Tokens(subject) ∪ Tokens(d1:hasBrother) in
Example 1, one example of a generated block (on the first dataset in Figure
1) would be {d1:Mike_Bates, d1:Joan_Crax_Bates}, indexed by the BKV
Bates, since the two entities share a common token in their subject URIs. Note
that an entity may have multiple BKVs and may occur in several blocks. Next,
the algorithm joins Π1 and Π2 into a map Πmap by using the keysets (the BKVs)
as the join keys. Thus, blocks in Π1 that do not have a corresponding5 block in
Π2 are discarded; similarly for blocks in Π2 . Also, for any two blocks R and S
(from Π1 and Π2 respectively), we truncate the larger block by randomly remov-
ing elements until |R| = |S|. In line 4, Algorithm 1 sorts Πmap into a three-column
sorted list by using the BKVs as sorting keys. This is similar to the sorting step
in the original SN algorithm [6]. Finally, a window of size w is slid over the tuples
in L and two entities (from columns subject1 and subject2 respectively) sharing
a window are added to Γ (Figure 4).


Fig. 4. The window sliding procedure on the sorted three-column list L, with w = 2
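
A compact Java sketch of steps 3-7 of Algorithm 1 follows, assuming the multimaps
Π1 and Π2 from step 2 have already been built (modeled here as maps from a BKV
string to a set of subject URIs). Tie-breaking among equal BKVs and the random
choice of which elements to truncate are left unspecified, as in the paper.

import java.util.*;

// Steps 3-7: join on shared BKVs, truncate the larger block, sort by BKV,
// then slide a window of size w over the three-column list L.
static Set<List<String>> sortedNeighborhood(Map<String, Set<String>> pi1,
                                            Map<String, Set<String>> pi2, int w) {
    List<String[]> L = new ArrayList<>();
    for (String bkv : new TreeSet<>(pi1.keySet())) {  // sorted join keys
        Set<String> s = pi2.get(bkv);
        if (s == null) continue;                      // BKV not shared: discard
        Iterator<String> ri = pi1.get(bkv).iterator(), si = s.iterator();
        while (ri.hasNext() && si.hasNext())          // implicit truncation to |R| = |S|
            L.add(new String[]{bkv, ri.next(), si.next()});
    }
    Set<List<String>> gamma = new HashSet<>();
    for (int i = 0; i < L.size(); i++)                // slide window of size w
        for (int j = i; j < Math.min(i + w, L.size()); j++) {
            gamma.add(Arrays.asList(L.get(i)[1], L.get(j)[2]));
            gamma.add(Arrays.asList(L.get(j)[1], L.get(i)[2]));
        }
    return gamma;
}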
    Algorithm 1 (and also, the last part of the workflow) allows the user to make
operational decisions on whether the candidate set or the BKVs of entities (or
both) should be published as sets of triples using two specially defined proper-
ties :hasCandidate and :IsBKVOf with equivalence class and inverse function
semantics respectively. These triples can be published by third-party sources on
dedicated servers and be made accessible as business services, similar to Rela-
tional Database cloud-based data matching6 services.
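
For instance, the published triples might look as follows (Turtle syntax). The
namespace and the d2: entity name are hypothetical, and the BKV is modeled here
as a resource, since RDF does not permit literals in the subject position.

@prefix :   <http://example.org/er#> .
@prefix d1: <http://example.org/d1#> .
@prefix d2: <http://example.org/d2#> .

d1:Joan_Crax_Bates :hasCandidate d2:Joan_Bates .           # one candidate pair
:bkv_Bates :IsBKVOf d1:Joan_Crax_Bates , d2:Joan_Bates .   # the BKV 'Bates'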


4     Experimental Analysis

4.1   Test Cases, Metrics and Setup

The system is evaluated on five publicly available benchmarks, four of which
(Persons 1, Persons 2, Film and Restaurants) were published as part of the 2010
instance matching track of OAEI7 , which is an annual Semantic Web initiative,
and one of which (IM-Similarity) is a multilingual, crowdsourced dataset describ-
ing books and was released as part of OAEI 2014. These datasets, summarized
in Table 1, span several domains and offer a variety of qualitative challenges.
5
  That is, indexed by the same blocking key value (BKV).
6
  An example is the D&B Business Insight service: https://datamarket.azure.com/
  dataset/dnb/businessinsight.
7
  Ontology Alignment Evaluation Initiative: http://oaei.ontologymatching.org/
  2010/#instance
8
  This includes both the blocking key learning and the subsequent blocking method
  (such as the Sorted Neighborhood method presented in this paper).

Table 1. Details of evaluation benchmarks. Each benchmark is a pair of RDF files;
where applicable, entries are given as (first dataset)/(second dataset) or (first
dataset) × (second dataset)

ID Name          Triples       Entity pairs                Matching entities
1  Persons 1     9000/7000     2000 × 1000 = 2 million     500
2  Persons 2     10,800/5600   2400 × 800 ≈ 1.92 million   400
3  Restaurants   1130/7520     339 × 2256 = 764,784        89
4  IM-Similarity 2204/2184     181 × 180 = 32,580          496
5  Film          9995/8979     1549 × 519 = 803,931        412

    The blocking step8 is typically evaluated by computing efficiency and effec-
tiveness metrics for the candidate set Γ of entity-entity pairs generated by the
blocking method. Let the number of matching (or ground-truth) entity pairs in
Γ be denoted by the symbol Γm . Similarly, let Ω denote the full set (the Carte-
sian product) of entity pairs, and Ωm denote the ground-truth set of matching
entity pairs. With this terminology, the efficiency metric, Reduction Ratio (RR),
and the effectiveness metric, Pairs Completeness (PC), are defined below:

                                                |Γ |
                                   RR = 1 −                                        (1)
                                                |Ω|

                                             |Γm |
                                     PC =                                         (2)
                                             |Ωm |
  We capture the tradeoff between RR and PC by computing their harmonic
mean, referred to here as their F-score9 :
                                           2 × RR × P C
                              F -score =                                          (3)
                                            (RR + P C)

Note that, for all three metrics (PC, RR and the F-score), higher values indicate
better performance.
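
For illustration, consider Persons 1 (Table 1), for which |Ω| = 2,000,000 and
|Ωm | = 500. The RR of 99.26% reported later in Table 2 implies |Γ | =
(1 − 0.9926) × 2,000,000 = 14,800 candidate pairs (back-calculated from the
rounded percentage, and therefore approximate), while PC = 100% implies
|Γm | = 500; the F-score is then 2 × 0.9926 × 1.0/(0.9926 + 1.0) ≈ 99.63%, as
reported in Table 2.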
   We discovered the blocking key for each dataset in the five test cases by
employing the trigram-based Attribute Clustering (AC) algorithm that was
described briefly in Section 3.1; complete details are provided in the original
paper by Papadakis et al. [16]. In general, the learning algorithm leads to a
blocking key that considers tokens of multiple properties (or ‘attributes’) and
the name of the entity (its subject URI) as blocking key values. Brief examples
were earlier provided. Once the key is learned, Algorithm 1 is executed with w varied from
2 to 50. For each value of w, PC, RR and F-score values are recorded, with
values corresponding to the highest-achieved F-score reported.

9
  This is different from the F-score employed in Information Retrieval, where it is
  typically the harmonic mean of precision and recall.


4.2    Baselines and Implementation

Two baseline blocking methods are used to gauge the performance of Algorithm
1 on the five test cases. The first baseline is the block purging method that was
used with the AC blocking key in the original paper along with a variety of other
blocking methods, many of which made assumptions that are not applicable to
this work [16]. Block purging was the least restrictive in terms of assumptions,
simple to implement and execute (yet with excellent empirical performance),
and, similar to the proposed method, involved a single parameter maxPairs. For
these reasons, we employ it as a baseline to test the proposed system. Similar
to lines 1-3 of Algorithm 1, block purging first generates a map of joined blocks
Πmap (but without truncating larger blocks). Rather than employing a sliding
window, the method discards an entry < bkv, < R, S >> from Πmap if |R| · |S| >
maxPairs. Among surviving entries, all entity-entity pairs in R × S are added
to the candidate set Γ .
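
A minimal Java sketch of this baseline, in the same style as the earlier Algorithm 1
sketch, is given below; it is our reconstruction from the description above, not the
original code of [16].

import java.util.*;

// Block purging: drop joined entries whose blocks would generate more than
// maxPairs comparisons; keep all pairs from the surviving entries.
static Set<List<String>> blockPurging(Map<String, Set<String>> pi1,
                                      Map<String, Set<String>> pi2, long maxPairs) {
    Set<List<String>> gamma = new HashSet<>();
    for (Map.Entry<String, Set<String>> e : pi1.entrySet()) {
        Set<String> r = e.getValue(), s = pi2.get(e.getKey());
        if (s == null) continue;                              // join on shared BKVs
        if ((long) r.size() * s.size() > maxPairs) continue;  // purge oversized entry
        for (String e1 : r)
            for (String e2 : s)
                gamma.add(Arrays.asList(e1, e2));             // all of R × S
    }
    return gamma;
}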
    We also employed the classic Canopy Clustering (CC) blocking method as
a second baseline [12]. CC is a generic clustering method that relies on an in-
expensive similarity function such as cosine similarity, and is known to perform
well on tabular datasets [4]. Given a single threshold parameter thresh ∈ [0, 1],
the algorithm operates by locating for every tuple t in the first dataset, all tu-
ples s in the second dataset such that the tuple pair < t, s > has similarity at
least thresh, and adds < t, s > to the candidate set Γ . Note that CC does not
rely on a blocking key, but uses all tokens in the tuples to compute the cosine
similarity. We execute CC on the full10 logical property table representations
of the input RDF datasets. Similar to w in Algorithm 1 and maxPairs in the
block purging method, thresh is varied and only the highest-achieved F-score
values are reported. For a fixed value of the thresholds, note that the algorithms
are deterministic and only need to be run once. Finally, we restricted parameter
tuning (per test case) to a maximum of fifty iterations. In terms of run-time,
all algorithms (for a given setting of the parameters) ran within a minute per
dataset, making them roughly equal in that respect.
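
For reference, one common set-based form of the cosine similarity that CC relies
on is sketched below; the exact variant used in [12] (e.g. with TF-IDF weighting)
may differ.

import java.util.*;

// Binary (set-based) cosine similarity over the token sets of two tuples.
static double cosine(Set<String> a, Set<String> b) {
    if (a.isEmpty() || b.isEmpty()) return 0.0;
    Set<String> common = new HashSet<>(a);
    common.retainAll(b);                                 // shared tokens
    return common.size() / Math.sqrt((double) a.size() * b.size());
}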
    Finally, all programs were implemented serially in Java on a 32-bit Ubuntu
virtual machine with 3385 MB of RAM and a 2.40 GHz Intel Core i7-4700MQ
processor. We have released datasets and ground-truth files on a project website11 .


10
   Since there is no blocking key (and no property set), projection does not take place.
11
   https://sites.google.com/a/utexas.edu/mayank-kejriwal/projects/
   sorted-neighborhood


4.3    Results and Discussion

Table 2. Comparative results of Algorithm 1 and the block purging baseline. Both
methods used the same blocking key, learned through Attribute Clustering [16]. Count
tabulates the number of test cases on which a method performs equal/better/worse

                      Sorted Neighborhood                   Block Purging
    Test Case
                  PC         RR        F-score       PC         RR        F-score
1   Persons 1     100.00%    99.26%    99.63%        100.00%    98.86%    99.43%
2   Persons 2     91.25%     89.77%    90.50%        99.75%     99.02%    99.38%
3   Restaurants   100.00%    99.41%    99.70%        100.00%    99.57%    99.79%
4   IM-Similarity 100.00%    84.73%    91.73%        100.00%    62.79%    77.14%
5   Film          97.33%     92.10%    94.64%        97.33%     73.09%    83.49%
    Average       97.72%     93.05%    95.24%        99.42%     86.67%    91.85%
    Count         4/0/1      0/3/2     0/3/2         4/1/0      0/2/3     0/2/3

Table 2 shows the highest F-scores obtained by the proposed method and the
block purging baseline, along with corresponding PC and RR values. The results
show that, on three of the five test cases, the modified SN procedure outperforms
the block purging algorithm. On the RR metric, SN outperforms block purging
by over 6% and on the F-score metric by over 3.5%. On the PC metric, block
purging does better by about 1.7%. Overall, the results show that the modified
SN algorithm is a promising blocking candidate. We note that, on the RR metric
in particular, the SN algorithm is more stable, with block purging achieving less
than 75% on at least two datasets. As earlier authors have noted [4], even small
variations in RR can have a large impact on the run-time of a full ER workflow,
since RR is measured against |Ω|, which is quadratic in the number of input entities (Eq. 1). Thus, SN may be
a more reliable method than block purging under resource-constrained settings.
    Although not tabulated here due to space constraints, the average (on the five
test cases) highest F-score and corresponding PC and RR results obtained by the
Canopy Clustering (CC) baseline are 84.74%, 80.39% and 96.58% respectively.
Both the block purging and proposed method outperform CC on both PC and
F-score by a large margin, demonstrating that on schema-free RDF data, simple
token-based blocking techniques are rarely sufficient.


5   Conclusion and Future Work

In this paper, we proposed a Sorted Neighborhood workflow for schema-free
RDF data, implemented as a sequence of relatively independent modules. All
workflow steps can be operationalized using existing Semantic Web technology.
Evaluations on five test cases show that the proposed algorithm is competitive
with established blocking techniques for schema-free RDF data.
    Future work will aim to deploy the system as a Linked Data service in a
distributed paradigm, and to evaluate it on large-scale datasets. Given certain
assumptions about the input datasets, a promising theoretical avenue is to au-
tomatically determine the optimal value of w for the inputs.


References
 1. M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale
    up record linkage. In Data Mining, 2006. ICDM’06. Sixth International Conference
    on, pages 87–96. IEEE, 2006.
 2. C. Bizer, T. Heath, and T. Berners-Lee. Linked data-the story so far. International
    journal on semantic web and information systems, 5(3):1–22, 2009.
 3. P. Christen. Further topics and research directions. In Data Matching, pages
    209–228. Springer, 2012.
 4. P. Christen. A survey of indexing techniques for scalable record linkage and dedu-
    plication. Knowledge and Data Engineering, IEEE Transactions on, 24(9):1537–
    1555, 2012.
 5. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection:
    A survey. Knowledge and Data Engineering, IEEE Transactions on, 19(1):1–16,
    2007.
 6. M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases.
    In ACM SIGMOD Record, volume 24, pages 127–138. ACM, 1995.
 7. M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and
    the merge/purge problem. Data mining and knowledge discovery, 2(1):9–37, 1998.
 8. R. Isele, A. Jentzsch, and C. Bizer. Efficient multidimensional blocking for link
    discovery without losing recall. In WebDB, 2011.
 9. M. Kejriwal and D. P. Miranker. An unsupervised algorithm for learning blocking
    schemes. In Data Mining, 2013. ICDM’13. Thirteenth International Conference
    on. IEEE, 2013.
10. M. Kejriwal and D. P. Miranker. A two-step blocking scheme learner for scalable
    link discovery. In Ontology Matching Workshop. ISWC’14. Thirteenth Interna-
    tional Semantic Web Conference, 2014.
11. M. Kejriwal and D. P. Miranker. A dnf blocking scheme learner for heterogeneous
    datasets. arXiv preprint arXiv:1501.01694, 2015.
12. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional
    data sets with application to reference matching. In Proceedings of the sixth ACM
    SIGKDD international conference on Knowledge discovery and data mining, pages
    169–178. ACM, 2000.
13. J. F. McCarthy and W. G. Lehnert. Using decision trees for coreference resolution.
    arXiv preprint cmp-lg/9505043, 1995.
14. H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital
    records. Science, 130:954–959, 1959.
15. A.-C. N. Ngomo. A time-efficient hybrid approach to link discovery. Ontology
    Matching, page 1, 2011.
16. G. Papadakis, E. Ioannou, C. Niederée, and P. Fankhauser. Efficient entity resolu-
    tion for large heterogeneous information spaces. In Proceedings of the fourth ACM
    international conference on Web search and data mining, pages 535–544. ACM,
    2011.
17. S. Puhlmann, M. Weis, and F. Naumann. Xml duplicate detection using sorted
    neighborhoods. In Advances in Database Technology-EDBT 2006, pages 773–791.
    Springer, 2006.
18. F. Scharffe, A. Ferrara, A. Nikolov, et al. Data linking for the semantic web. In-
    ternational Journal on Semantic Web and Information Systems, 7(3):46–76, 2011.
19. F. Scharffe, Y. Liu, and C. Zhou. Rdf-ai: an architecture for rdf datasets matching,
    fusion and interlink. In Proc. IJCAI 2009 workshop on Identity, reference, and
    knowledge representation (IR-KR), Pasadena (CA US), 2009.
20. M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the linked data best
    practices in different topical domains. In The Semantic Web–ISWC 2014, pages
    245–260. Springer, 2014.
21. K. Wilkinson, C. Sayers, H. A. Kuno, D. Reynolds, et al. Efficient rdf storage and
    retrieval in jena2. In SWDB, volume 3, pages 131–150, 2003.