<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Two-step Blocking Scheme Learner for Scalable Link Discovery</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A two-step procedure for learning a link-discovery blocking scheme is presented. Link discovery is the problem of linking entities between two or more datasets. Identifying owl:sameAs links is an important, special case. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n²) comparisons by clustering entities into blocks, and limiting the evaluation of link specifications to entity pairs within blocks. Current link-discovery blocking methods use blocking schemes tailored for owl:sameAs links or that rely on assumptions about the underlying link specifications. The presented framework learns blocking schemes for arbitrary link specifications. The first step of the algorithm is unsupervised and performs dataset mapping between a pair of dataset collections. The second, supervised step learns blocking schemes on structurally heterogeneous dataset pairs. Application to RDF is accomplished by representing the RDF dataset in property table form. The method is empirically evaluated on four real-world test collections ranging over various domains and tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Heterogeneous Blocking</kwd>
        <kwd>Instance Matching</kwd>
        <kwd>Link Discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the advent of Linked Data, discovering links between entities has emerged
as an active area of research [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Given a link specification, a naive approach
would discover links by conducting O(n²) comparisons on the set of n
entities. In the Entity Resolution (ER) community, a preprocessing technique called
blocking mitigates full pairwise comparisons by clustering entities into blocks.
Only entities within blocks are paired and compared [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Blocking is critical in
data integration systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Blocking methods require a blocking scheme to cluster entities. Advanced
methods have been proposed to use a given blocking scheme effectively; relatively
few works address the learning of blocking schemes. Even within ER, blocking
scheme learners (BSLs) have met with practical success only recently [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In
the Semantic Web, the problem has received attention in the form of scalably
discovering owl:sameAs links [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. ER is an important, but special, case of the link discovery
problem, where the underlying link specification can be arbitrary. Such
specifications can be learned, but brute-force application would still be O(n²). Current
link discovery systems aim to be efficient by using token-based pre-clustering or
metric space assumptions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Learning link-discovery blocking schemes for arbitrary underlying links
remains unaddressed. Because the link can be arbitrary, a training corpus is
required. Consider the example in Figure 1. Given a small number of such
examples, the proposed BSL adaptively learns a scheme that covers true positives
while reducing the full quadratic cost, without relying on the formal link
specification itself. Note that the learned blocking scheme is different from a learned
link specification. In this paper, we exclusively address blocking.</p>
      <p>In the Big Data era, scalability, automation and heterogeneity are essential
components of systems and hence practical requirements for real-world link
discovery. Scalability is addressed by blocking, but current work assumes that
the dataset pairs between which entities are to be linked are provided. In other
words, datasets A and B are input to the pipeline, and entities in A need to be
linked to entities in B. Investigations in some important real-world domains show
that pairs of dataset collections also need to undergo linking. Each collection is
a set of datasets. An example is government data. Recent government efforts
have led to the release of public data as batches of files, as one of our real-world
test sets demonstrates. Thus, two scalability issues are identified: one at the
collection level, and one at the dataset level. That is, datasets in one collection
first need to be mapped to datasets in the second collection, after which a
blocking scheme is learned (and later, applied) on each mapped dataset pair.</p>
      <p>Automation implies that human intervention needs to be kept to a minimum.
In practice, this means that methods need to rely on fewer training examples.
Finally, a heterogeneity issue arises if datasets in the collections are in two
different data models, such as RDF and tabular.</p>
      <p>
        The proposed BSL addresses these challenges in a combined setting. It takes
as input two collections of datasets, with each collection an arbitrary mix of
RDF and tabular datasets. The first step of the BSL performs unsupervised
dataset mapping by relying on document similarity and efficient solutions to the
Hungarian algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Each chosen dataset pair is input to the second step,
which learns a link-discovery blocking scheme given a constant-size training
corpus that does not need to grow with the dataset. (If the two datasets in a
pair are the same dataset, the problem is commonly denoted deduplication.)
RDF datasets are reconciled with tabular datasets by representing them as
property tables [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The problem
thus reduces to learning schemes on tabular datasets with different schemas.
Elmagarmid et al. refer to this as the structural heterogeneity problem [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. To
robustly deal with small, constant training sets, the BSL uses bagging [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. To
the best of our knowledge, this is the first paper that uses bagging, dataset
mapping and property tables to address automation, scalability and heterogeneity,
respectively, in the link-discovery blocking context.
      </p>
      <p>The rest of this paper is structured as follows. Section 2 describes the related
work in this area. Section 3 describes the property table representation and
the BSL in detail. Section 4 describes the experimental results, and the paper
concludes in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Link Discovery has been researched actively since the original Silk paper [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ],
which currently uses genetic programming to learn links [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Since we discuss the
learning of blocking schemes, most link specification learners are compatible, not
competitive, with the proposed system. A full pipeline that can perform data
fusion is RDF-AI [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. We note that active learning techniques have been proposed
to address the automation issue [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]; in this paper, we show a complementary
solution using bagging.
      </p>
      <p>
        Blocking has been extensively studied in the record linkage community [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
with a comprehensive survey by Christen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Initial BSLs were supervised [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
Recently, an unsupervised feature-selection-based BSL was proposed by us, but it
assumed only owl:sameAs links and did not use bagging [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Token-based
clustering has also been applied to the problem [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], together with Locality Sensitive
Hashing (LSH) techniques [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, LSH is usually applicable only to select
distance measures like Jaccard or cosine. As such, it is more popularly applied
to ontology matching [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Other Semantic Web efforts for owl:sameAs include
the approach by Song and He in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Ma et al. proposed a system based on type
semantics exclusively for ER on two individual datasets [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Multiple techniques for using blocking schemes have been investigated in
the Semantic Web community [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. An example (used in the Silk framework) is
MultiBlock [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Another effort that assumes metric spaces is LIMES [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Finally,
an advanced survey of bagging can be found in the work by Verikas et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Algorithm</title>
      <p>
        In this section, the two steps of the overall BSL in Figure 3 are described.
Note that dataset mapping is an optional step (albeit empirically advantageous
when applied, as Section 4 will show), while the second step (the core learner)
is essential. As a preliminary, the property table representation is also
summarized.</p>
      <p>Property tables were first proposed as physical data structures to efficiently
implement triple stores [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. This is the first application using them for link-discovery
blocking. Figure 2 shows an example. Property tables reduce the problem of
linking RDF and tabular datasets (and also RDF-RDF linkage) to tabular
structural heterogeneity, that is, tables with different schemas. The rest of the
paper assumes the property table representation of RDF.
      </p>
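      <p>As a hypothetical sketch of this representation (the triples and property names below are invented for illustration), an RDF graph can be pivoted into a property table by grouping triples on their subject:</p>

```python
from collections import defaultdict

def to_property_table(triples):
    """Pivot (subject, property, object) triples into a property table:
    one row per subject, one column per property; missing values are empty."""
    rows = defaultdict(dict)
    properties = set()
    for s, p, o in triples:
        rows[s][p] = o
        properties.add(p)
    columns = sorted(properties)
    header = ["subject"] + columns
    table = [[s] + [rows[s].get(p, "") for p in columns] for s in sorted(rows)]
    return header, table

# Invented example triples (not from the paper's datasets)
triples = [
    ("ex:case1", "ex:country", "Colombia"),
    ("ex:case1", "ex:year", "2005"),
    ("ex:case2", "ex:country", "Venezuela"),
]
header, table = to_property_table(triples)
```

      <p>The resulting table has one schema per RDF dataset, which is exactly the structurally heterogeneous tabular setting the learner targets.</p>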
      <sec id="sec-3-1">
        <title>Dataset Mapping</title>
        <p>The pseudocode for dataset mapping is given in Algorithm 1. The inputs to
the algorithm are the dataset collections R and S and a boolean confidence
function that is subsequently described. We define a dataset collection as a set
of independently released datasets. Without loss of generality, assume that
|R| ≤ |S|. The output desired is a confident mapping M ⊆ R × S.</p>
        <p>Algorithm 1 represents each dataset in each collection as a term frequency
(TF) vector. The TF vector for each dataset is constructed by assigning a unique
position to each distinct token and recording the count of that token in the
dataset. Each TF vector is normalized by dividing each element by the total
count of the respective token in the corresponding collection. A normalized TF
vector is different from a TF-IDF vector. (The TF-IDF vector is constructed by
dividing each element of a TF vector by the number of datasets in the
corresponding collection in which the token occurred at least once, rather than by
the total count of the token in that collection.)</p>
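        <p>A minimal sketch of this construction (assuming each dataset is already tokenized; the token lists below are illustrative only):</p>

```python
from collections import Counter

def normalized_tf_vectors(collection):
    """Build one TF vector (token -> count) per dataset, then divide each
    count by the total count of that token across the whole collection."""
    tf = [Counter(tokens) for tokens in collection]
    totals = Counter()
    for vec in tf:
        totals.update(vec)
    return [{tok: cnt / totals[tok] for tok, cnt in vec.items()} for vec in tf]

# Illustrative tokenized collection of two datasets
R = [["budget", "health", "2009"], ["budget", "tax"]]
vecs = normalized_tf_vectors(R)
```

        <p>A token shared by many datasets in the collection thus contributes less to each dataset's vector, dampening collection-wide boilerplate tokens.</p>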
        <p>
          A matrix Q is initialized with Q[i][j] containing the dot product of the
normalized TF vectors Ri and Sj. Once the matrix is constructed, the max-weight
Hungarian algorithm is invoked on the matrix, which has at least as many columns
as rows by the assumption above [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The algorithm must assign each row to some
column such that the sum of the corresponding matrix entries is maximized. The
problem is equivalent to maximum-weight bipartite graph matching.
        </p>
        <p>The confidence of each mapping is evaluated by C. As an example of a
confidence function, suppose the function returns True for a returned mapping (i, j)
iff Q[i][j] is the dominating score, that is, greater than every score in its
constituent row and column. Intuitively, this means that the mapping is not only
the best possible, but also non-conflicting. In some sense, this assumes an
aggressive strategy against false positives. Other strategies (for example, using a
threshold to map multiple datasets to each other) can be formulated for other
requirements; we leave these for future work.</p>
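        <p>The matrix construction, assignment and dominating check can be sketched as follows (a brute-force stand-in for the Hungarian algorithm, practical only for tiny collections; the vectors are invented for illustration):</p>

```python
from itertools import permutations

def dot(u, v):
    """Dot product of two sparse vectors stored as token -> weight maps."""
    return sum(u[t] * v.get(t, 0.0) for t in u)

def confident_mapping(R_vecs, S_vecs):
    """Build Q[i][j] = dot(Ri, Sj), solve the max-weight assignment
    (brute force here; the Hungarian algorithm does this in cubic time),
    then keep only 'dominating' mappings: Q[i][j] must exceed every
    other entry in its row and its column."""
    n, m = len(R_vecs), len(S_vecs)  # assume n does not exceed m
    Q = [[dot(r, s) for s in S_vecs] for r in R_vecs]
    best = max(permutations(range(m), n),
               key=lambda cols: sum(Q[i][cols[i]] for i in range(n)))
    M = []
    for i, j in enumerate(best):
        row_ok = all(Q[i][j] > Q[i][k] for k in range(m) if k != j)
        col_ok = all(Q[i][j] > Q[r][j] for r in range(n) if r != i)
        if row_ok and col_ok:
            M.append((i, j))
    return M

# Illustrative vectors: R0 matches S1, R1 matches S0; S2 matches nothing
R_vecs = [{"a": 1.0}, {"b": 1.0}]
S_vecs = [{"b": 1.0}, {"a": 1.0}, {"c": 1.0}]
M = confident_mapping(R_vecs, S_vecs)
```

        <p>The dominating check discards an assignment when a competing row or column entry ties or beats it, which is precisely the aggressive stance against false positives described above.</p>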
        <p>
          Assuming the dominating strategy a priori, the Hungarian algorithm can be
modified to terminate in linear time (in R and S); otherwise it is cubic in the
collection size [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Empirically, a reasonable confidence strategy leads to
savings if the total number of records is far greater than the number of datasets.
For large collections, preferable strategies should have theoretical guarantees,
like the dominating strategy. We observed dataset mapping to achieve
near-instantaneous runtime, even with a standard Hungarian implementation.
        </p>
        <p>Intuitively, dataset mapping is expected to be a well-performing heuristic
because constituent datasets are independently released. Algorithms like LIMES,
Canopy Clustering and unsupervised methods benefit because they can cluster
entities in isolated dataset pairs, rather than all the entities in the collection.
With correct mapping, both quality and scalability are expected to improve.
Experimentally, the gains are demonstrated in Section 4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Heterogeneous Blocking Scheme Learner</title>
        <p>
          Given two structurally heterogeneous tabular sources, the goal of the second step
is supervised learning of a heterogeneous blocking scheme. In earlier works, DNF
blocking schemes were found to be fairly representative and were learned using
set-covering algorithms [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. In recent work, we showed that a feature selection
technique outperformed state-of-the-art DNF BSLs [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. To summarize the work
briefly: given training sets D and N containing duplicate and non-duplicate
tuple pairs respectively, each pair is first converted into a vector with O(m1m2)
binary features, where m1 and m2 are the numbers of attributes in datasets R1
and R2 respectively. Thus, two sets FD and FN containing labeled feature vectors
are obtained. A set of features must now be chosen such that a minimum fraction
of positives is covered, with no individual feature covering more than a given
fraction of negatives. For further details on these parameters and the feature
conversion and selection process, we refer the reader to the original work [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>Note that the originally proposed algorithm had no concept of bagging and
assumed the training set was representative enough.</p>
        <sec id="sec-3-2-1">
          <title>Algorithm 1 Perform dataset mapping.</title>
          <p>Input: Dataset collections R and S, boolean confidence function C
Output: A mapping M between R and S
1. for all datasets Ri ∈ R do</p>
          <p>TiR := term frequency vector of terms in Ri
2. end for
3. for all datasets Si ∈ S do</p>
          <p>
            TiS := term frequency vector of terms in Si
4. end for
5. Construct vectors TR and TS, with jth element equal to the total count of
token j in the corresponding collection</p>
          <p>Algorithm 2 shows how bagging can be incorporated, to work with small
training sets. (Small training sets affect learning-algorithm quality, but
compensate well for the O(m1m2) feature-vector dimensionality; bagging allows a
controlled compromise between scalability and quality.) Two bagging parameters
are now also input: one specifies the number of bagging iterations and the other
the sampling rate for bagging. In each bagging iteration, a fraction of the overall
training sample is chosen to undergo feature selection by calling
FisherDisjunctive (Algorithm 3 in [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]). A feature is technically a specific blocking predicate
(SBP), e.g. CommonToken(Last Name, Name). Intuitively, the SBP implies that
two entities (from different datasets) with a common token in their respective
Last Name and Name field values share a block. Features (or SBPs) chosen in
each bagging iteration are added to B0. The final DNF blocking scheme B is
heterogeneous precisely because it accommodates two different schemas, as the
SBP example above shows.
          </p>
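          <p>The bagging wrapper can be sketched as follows (a hypothetical simplification: select_features stands in for FisherDisjunctive, whose details are in the cited work, and SBPs are represented as plain strings):</p>

```python
import random

def learn_blocking_scheme(FD, FN, iterations, rate, select_features):
    """Bagging wrapper in the spirit of Algorithm 2: in each iteration,
    sample the labeled feature vectors with replacement and union the
    features (SBPs) chosen by the underlying selector."""
    scheme = set()
    for _ in range(iterations):
        fd = [random.choice(FD) for _ in range(int(rate * len(FD)))]
        fn = [random.choice(FN) for _ in range(int(rate * len(FN)))]
        scheme |= set(select_features(fd, fn))
    return scheme  # disjunction of the selected SBPs

# Illustrative run with a stub selector that ignores its samples
scheme = learn_blocking_scheme([1] * 5, [0] * 5, 3, 0.8,
                               lambda fd, fn: ["CommonToken(Last Name, Name)"])
```

          <p>Because each iteration sees a different resample, SBPs that only cover idiosyncrasies of one sample tend to be outvoted by SBPs selected across many iterations.</p>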
          <p>
            In some cases, training examples might not be available at all. If the task
is learning owl:sameAs links, the automatic training set generator in our
original work can be used to generate noisy training samples [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. The rest of the
procedure remains the same. We evaluate this scenario in one of our test suites.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Algorithm 2 Learn Link-Specific Blocking Scheme</title>
          <p>Input: Positive feature-vector set FD, negative feature-vector set FN, a
coverage parameter, a pruning parameter, the number of bagging iterations, and
a sampling rate
Output: Blocking scheme B
1. Initialize B0 := {}
2. for all bagging iterations do</p>
          <p>Randomly sample (with replacement) jFDj and jFN j vectors from FD and
FN , insert into new sets F N0 and F D0 respectively</p>
          <p>B0 := B0 ∪ FisherDisjunctive(FD0, FN0, coverage, pruning)
3. end for
4. Output the disjunction of elements in B0 as B</p>
          <p>In this section, the algorithm is experimentally evaluated. Datasets, metrics
and the baseline are first described, followed by a set of results and a discussion.
The algorithm is evaluated on four real-world dataset collections over three
different domains, described in Table 1. In the first test set, the first collection
consists of a single RDF dataset describing court cases decided in Colombia,
along with various properties of those cases. The second collection has two RDF
datasets, only one of which is relevant for linkage. This dataset describes article
numbers (as literal object values) in the Colombia constitution
(constituteproject.org). The task is to predict links between the cases in the first
collection and the articles in the second collection used to decide the case. An
example was shown in Figure 1. The second test set is similar, but for Venezuela.
The third test set consists of ten US government estimated budget datasets from
2009 to 2013, released separately by the Treasury department and the Joint
Committee on Taxation. All datasets in this collection were published in tabular
form, but the two collections are structurally heterogeneous. The goal is to link
entities (describing a particular budget allocation) that share the same budget
function (such as health) in the same year. These three test cases are proper
dataset collections, given that at least one collection contains more than one
dataset. Other such collections can also be observed on the respective websites
(e.g. http://www.pewstates.org/research/reports/ for the third test case).</p>
          <p>The fourth test set contains collections derived from the video games domain
and differs from the other three test sets in three important respects. First, the
dataset mapping step is not applicable, since each collection only contains one
dataset. Note that the datasets in this test set are large compared with the other
test sets. Second, the first dataset is RDF and was queried from DBpedia
(dbpedia.org), while the second dataset is tabular and comes from a popular
charting website (vgchartz.com). Finally, the noisy training set generator can be
used, given that the link is owl:sameAs. We have collected all publicly available
test cases on a single portal
(https://sites.google.com/a/utexas.edu/mayank-kejriwal/datasets), along with
other implementation details such as the features (or SBPs) used in the
experiments.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Metrics</title>
        <p>
          We adopted two metrics, Pairs Completeness (PC) and Reduction Ratio (RR),
from the blocking literature [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. PC measures recall, or effectiveness, in the
blocking setting; specifically, it is the ratio of the number of true positives that
fall within blocks to the total number of true positives. RR is the percentage of
comparisons that are avoided compared to the full quadratic cost, and represents
efficiency. For example, an RR of 99 percent means that the blocking scheme
has reduced the complexity of the full pairwise task by that amount. Note that
the optimal RR for 100 percent PC for the datasets in Table 1 can be calculated
with the formula 1 − C5/C4, where Cp stands for Column p. Following
previous research, PC-RR graphs are used to quantify the effectiveness-efficiency
tradeoff [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
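        <p>Both metrics are straightforward to compute once the candidate pairs produced by blocking are known (a minimal sketch; the pair sets and dataset sizes are illustrative):</p>

```python
def pairs_completeness(candidate_pairs, true_pairs):
    """PC: fraction of true positive pairs retained by blocking."""
    return len(true_pairs & candidate_pairs) / len(true_pairs)

def reduction_ratio(candidate_pairs, n1, n2):
    """RR: fraction of the full n1*n2 pairwise comparisons avoided."""
    return 1.0 - len(candidate_pairs) / (n1 * n2)

# Illustrative numbers: 2 true links, blocking emits 2 candidate pairs
true_links = {(1, 1), (2, 2)}
candidates = {(1, 1), (1, 2)}
pc = pairs_completeness(candidates, true_links)  # 0.5
rr = reduction_ratio(candidates, 10, 10)         # 0.98
```

        <p>Tightening the blocking scheme raises RR but risks lowering PC, which is exactly the tradeoff the PC-RR graphs visualize.</p>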
      </sec>
      <sec id="sec-3-4">
        <title>Baseline</title>
        <p>
          As earlier stated, many popular link discovery systems like Silk [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] use
token-based pre-matching to reduce complexity. Canopy Clustering, originally
proposed by McCallum et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], represents such methods since it is token-based,
makes few assumptions and has been shown to be experimentally robust [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
It was also used as the baseline in another competitive system [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]; hence, we
use it as our baseline. In the best-performing implementation, documented by
Baxter et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], the algorithm
randomly chooses a seed entity from one dataset to represent a cluster, and all
entities in the second dataset with TFIDF scores above a threshold are placed
in that cluster. Clusters may overlap.
        </p>
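        <p>A simplified sketch of the baseline (for determinism, every entity of the first dataset seeds a cluster rather than randomly chosen seeds, and Jaccard token overlap stands in for the TFIDF scoring):</p>

```python
def jaccard(a, b):
    """Token-overlap similarity between two whitespace-tokenized strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def canopy_clusters(seed_entities, other_entities, similarity, threshold):
    """Canopy sketch: each entity of the first dataset seeds a cluster;
    every entity of the second dataset scoring above the threshold joins
    that cluster. Clusters may overlap."""
    clusters = []
    for seed in seed_entities:
        members = [e for e in other_entities if similarity(seed, e) > threshold]
        clusters.append((seed, members))
    return clusters

clusters = canopy_clusters(["red car"], ["red bus", "blue bus"], jaccard, 0.25)
```

        <p>Only pairs falling in a common canopy are later compared, which is how the method trades a cheap similarity pass for the expensive full pairwise comparison.</p>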
      </sec>
      <sec id="sec-3-5">
        <title>Methodology</title>
        <p>The dataset mapping step was applied on the first three test sets in Table 1. It is
not applicable to the fourth test set. The dominating strategy introduced earlier
was used as the confidence function in Algorithm 1. For the second step, note
that the baseline chooses seeds randomly. We compensated by conducting ten
trials for each run of the algorithm, and averaging PC and RR. The threshold
was tuned and set to a low value of 0.0005 to maximize baseline recall.</p>
        <p>
          The parameters in Algorithm 2 were set to values that were found after some
initial tuning (on a subset of the first test collection). The size of the initial
training set was set to 300 each for both duplicates (|FD|) and non-duplicates
(|FN|); the four remaining parameters were set to 10, 30, 0.8 and 0.2. In
additional experiments, we varied each of these parameters by 50%. There was no
significant difference from the results we subsequently show with these
parameter settings. Future work will investigate automatic parameter tuning. In our
earlier work, the feature selection was also found robust to varying parameters [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Note that the training set is
kept constant, regardless of dataset growth.
        </p>
        <p>
          To evaluate the learned blocking scheme in a practical blocking method, one
additional parameter, maxBucketPairs, is used to enable a technique called block
purging [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The technique discards blocks that have more than maxBucketPairs
candidate pairs, since the cost of processing these blocks is greater than the
expected gain. Block purging was also used in the baseline, for consistency.
Varying maxBucketPairs over values typically ranging from 1,000 to 100,000
effectively varies RR. Data points showing PC at different values of RR are obtained
and plotted. All experiments were run on an Intel Core 2 Duo machine with 3
GB of memory and a 2.4 GHz clock speed. All code was implemented in Java.
        </p>
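        <p>Block purging itself is simple (a sketch; the block keys and budget are illustrative, and each block is modeled as a pair of entity lists drawn from the two datasets):</p>

```python
def purge_blocks(blocks, max_bucket_pairs):
    """Discard any block whose candidate-pair count exceeds the budget.
    A block holding a entities from one dataset and b from the other
    generates a*b candidate pairs."""
    kept = {}
    for key, (left, right) in blocks.items():
        if len(left) * len(right) > max_bucket_pairs:
            continue  # too costly to process relative to expected gain
        kept[key] = (left, right)
    return kept

# Illustrative blocks: a small, useful block and an oversized stop-word block
blocks = {
    "health": ([1, 2], [3, 4]),                      # 4 candidate pairs
    "the": (list(range(200)), list(range(200))),     # 40,000 candidate pairs
}
kept = purge_blocks(blocks, 1000)
```

        <p>Lowering the budget purges more blocks, raising RR at a possible cost to PC, which is how sweeping maxBucketPairs produces the PC-RR data points.</p>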
      </sec>
      <sec id="sec-3-6">
        <title>Results and Discussion</title>
        <p>With the dominating strategy, dataset mapping yielded perfect mappings for
all three test sets. We ran some additional experiments, including using more
government test data (from years 2003-2013 instead of 2009-2013) and Constitute
and Case Law data from other countries, and the mappings were still perfect.
It would seem, therefore, that the normalized TF measures and the dominating
strategy are suited to the problem, at least on the tested domains.</p>
        <p>Figure 4 shows the PC-RR tradeoff results of the learned blocking scheme
for Canopy Clustering and the proposed method, both with and without dataset
mapping (recall that dataset mapping was designed to be compatible with other
algorithms as well, including Canopy Clustering), on the Colombia and Venezuela
collections. The gains of dataset mapping are readily apparent for CC in both
cases. The proposed method, Hetero, outperforms the non-dataset-mapping
version of CC, but the gap narrows considerably when dataset mapping is employed.
This shows that, in cases where training sets are not readily available, an
off-the-shelf dataset mapping algorithm can boost performance. The dataset
mapping gains for Hetero are not significant on Venezuela, mainly because the
algorithm performs well on this dataset even without mapping. On CC, however,
the gains are again apparent. Note that Venezuela ((b) in Figure 4) represents
some of the challenges of doing link discovery versus just ER. The proposed BSL
was able to overcome these challenges by employing bagging and feature
selection.</p>
        <p>Figure 5a shows the results on the five-pair government budget data. For this
collection, the gains of dataset mapping are amplified. This is because there are
more datasets in the collection, so the dataset mappings are particularly useful.
We ran further experiments on the full government data (from years 2003-2013)
and confirmed this. This time, Hetero also shows noticeable gains, with the
curve shifting to the right when dataset mapping is employed. Figure 5b shows
the results for learning blocking schemes for ER. The BSL is able to significantly
outperform CC, regardless of whether it is completely unsupervised or provided
with a perfectly labeled training set.</p>
        <p>Finally, we repeated the experiments above, but without bagging. The highest
f-scores (given by 2·RR·PC/(PC+RR)) of PC and RR declined in all cases by at
least 5%, with 95% statistical significance using Student's distribution. Otherwise,
the graphical trends were similar. We do not repeat the figures here.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Future Work and Conclusion</title>
      <p>In this paper, a link-discovery blocking scheme learner was proposed. The first
step of the method operates in an unsupervised fashion and performs dataset
mapping by employing document-level similarity measures. It is compatible with
existing clustering and blocking algorithms, with experimental savings
demonstrated on two such methods. The second step is a heterogeneous BSL that uses
techniques like bagging to achieve robust performance, even as the training sets
remain constant and the datasets grow in size.</p>
      <p>
        Future work will evaluate the dataset mapping step and the accompanying
confidence strategies more extensively, and develop parameter tuning techniques
for the learner itself. Another important aspect is investigating the scalability of
the learner; in particular, we are developing techniques for `pruning' property
tables so that the learner can efficiently scale by learning schemes in a reduced
feature space. We believe that this provides an excellent opportunity for
cross-fertilizing ongoing scalability efforts in the ontology matching community [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Acknowledgments. The authors would like to thank Juan Sequeda for
providing the Constitute and Case Law datasets.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baxter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Churches</surname>
          </string-name>
          .
          <article-title>A comparison of fast blocking methods for record linkage</article-title>
          .
          <source>In ACM SIGKDD</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>25</fpage>
          -
          <fpage>27</fpage>
          .
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kamath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <article-title>Adaptive blocking: Learning to scale up record linkage</article-title>
          .
          <source>In Data Mining</source>
          ,
          <year>2006</year>
          . ICDM'
          <fpage>06</fpage>
          . Sixth International Conference on, pages
          <volume>87</volume>
          -
          <fpage>96</fpage>
          . IEEE,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          .
          <article-title>A survey of indexing techniques for scalable record linkage and deduplication. Knowledge and Data Engineering</article-title>
          , IEEE Transactions on,
          <volume>24</volume>
          (
          <issue>9</issue>
          ):
          <volume>1537</volume>
          -
          <fpage>1555</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fokoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Ward</surname>
          </string-name>
          .
          <article-title>Instance-based matching of large ontologies using locality-sensitive hashing</article-title>
          .
          <source>In The Semantic Web – ISWC</source>
          <year>2012</year>
          , pages
          <fpage>49</fpage>
          –
          <lpage>64</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Verykios</surname>
          </string-name>
          .
          <article-title>Duplicate record detection: A survey</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>16</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          , et al.
          <source>Ontology matching</source>
          , volume
          <volume>18</volume>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Learning expressive linkage rules using genetic programming</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1638</fpage>
          –
          <lpage>1649</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Efficient multidimensional blocking for link discovery without losing recall</article-title>
          .
          <source>In WebDB</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>R.</given-names>
            <surname>Jonker</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Volgenant</surname>
          </string-name>
          .
          <article-title>Improving the Hungarian assignment algorithm</article-title>
          .
          <source>Operations Research Letters</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <fpage>171</fpage>
          –
          <lpage>175</lpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          .
          <article-title>An unsupervised algorithm for learning blocking schemes</article-title>
          .
          <source>In Data Mining, 2013. ICDM'13. Thirteenth International Conference on</source>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>H.-s.</given-names>
            <surname>Kim</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>HARRA: fast iterative hashed record linkage for large-scale data collections</article-title>
          .
          <source>In Proceedings of the 13th International Conference on Extending Database Technology</source>
          , pages
          <fpage>525</fpage>
          –
          <lpage>536</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Bicer</surname>
          </string-name>
          .
          <article-title>Typifier: Inferring the type semantics of structured data</article-title>
          .
          <source>In Data Engineering (ICDE), 2013 IEEE 29th International Conference on</source>
          , pages
          <fpage>206</fpage>
          –
          <lpage>217</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Ungar</surname>
          </string-name>
          .
          <article-title>Efficient clustering of high-dimensional data sets with application to reference matching</article-title>
          .
          <source>In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>169</fpage>
          –
          <lpage>178</lpage>
          . ACM,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>M.</given-names>
            <surname>Michelson</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          .
          <article-title>Learning blocking schemes for record linkage</article-title>
          .
          <source>In Proceedings of the National Conference on Artificial Intelligence</source>
          , volume
          <volume>21</volume>
          , page
          <fpage>440</fpage>
          . AAAI Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          .
          <article-title>A time-efficient hybrid approach to link discovery</article-title>
          .
          <source>Ontology Matching</source>
          , page
          <fpage>1</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Lyko</surname>
          </string-name>
          .
          <article-title>EAGLE: Efficient active learning of link specifications using genetic programming</article-title>
          .
          <source>In The Semantic Web: Research and Applications</source>
          , pages
          <fpage>149</fpage>
          –
          <lpage>163</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ioannou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Niederee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Fankhauser</surname>
          </string-name>
          .
          <article-title>Efficient entity resolution for large heterogeneous information spaces</article-title>
          .
          <source>In Proceedings of the fourth ACM international conference on Web search and data mining</source>
          , pages
          <fpage>535</fpage>
          –
          <lpage>544</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>F.</given-names>
            <surname>Scharffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>RDF-AI: an architecture for RDF datasets matching, fusion and interlink</article-title>
          .
          <source>In Proc. IJCAI 2009 workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena (CA US)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heflin</surname>
          </string-name>
          .
          <article-title>Automatically generating data linkages using a domain-independent candidate selection approach</article-title>
          .
          <source>In The Semantic Web – ISWC</source>
          <year>2011</year>
          , pages
          <fpage>649</fpage>
          –
          <lpage>664</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>A.</given-names>
            <surname>Verikas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelzinis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Bacauskiene</surname>
          </string-name>
          .
          <article-title>Mining data with random forests: A survey and results of new tests</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ):
          <fpage>330</fpage>
          –
          <lpage>349</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>J.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaedke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          .
          <article-title>Discovering and maintaining links on the web of data</article-title>
          .
          <source>In The Semantic Web – ISWC</source>
          <year>2009</year>
          , pages
          <fpage>650</fpage>
          –
          <lpage>665</lpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sayers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kuno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , et al.
          <article-title>Efficient RDF storage and retrieval in Jena2</article-title>
          .
          <source>In SWDB</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>131</fpage>
          –
          <lpage>150</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>