<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Linking Heterogeneous Dataset Collections</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Link discovery is the problem of linking entities between two or more datasets, based on some (possibly unknown) specification. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n²) comparisons by clustering entities into blocks and limiting the evaluation of link specifications to entity pairs within blocks. Current link-discovery blocking methods explicitly assume that two RDF datasets are provided as input and need to be linked. In this paper, we assume instead that two heterogeneous dataset collections, comprising arbitrary numbers of RDF and tabular datasets, are provided as input. We show that data model heterogeneity can be addressed by representing RDF datasets as property tables. We also propose an unsupervised technique called dataset mapping that maps datasets from one collection to the other and is shown to be compatible with existing clustering methods. Dataset mapping is empirically evaluated on three real-world test collections ranging over government and constitutional domains, and is shown to improve two established baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Heterogeneous Blocking</kwd>
        <kwd>Instance Matching</kwd>
        <kwd>Link Discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the advent of Linked Data, discovering links between entities has emerged
as an active research area [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Given a link specification, a naive approach would
discover links by conducting O(n²) comparisons on the set of n entities. In the
Entity Resolution (ER) community, a preprocessing technique called blocking
mitigates full pairwise comparison by clustering entities into blocks. Only
entities within blocks are paired and compared. ER is critical in data integration
systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the Semantic Web, the problem has received attention as that of scalably
discovering owl:sameAs links between RDF datasets [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>In the Big Data era, scalability and heterogeneity are essential properties
of systems and hence practical requirements for real-world link discovery.
Scalability is addressed by blocking, but current work assumes that the dataset pairs
between which entities are to be linked are provided. In other words, datasets
A and B are input to the pipeline, and entities in A need to be linked to
entities in B. Investigations in some important real-world domains show that pairs
of dataset collections also need to undergo linking. Each collection is a set of
datasets. An example is government data. Recent government efforts have led to
the release of public data as batches of files, both across related domains and time, as
one of our real-world test sets demonstrates. Thus, there are (at least) two
scalability issues: at the collection level and at the dataset level. That is, datasets
in one collection first need to be mapped to datasets in the second collection,
after which a blocking scheme is learned and applied on each mapped pair. The
problem of blocking two collections is exacerbated by data model heterogeneity,
where some datasets are RDF and the others are tabular.</p>
      <p>We note that data model heterogeneity has larger implications, since it also
applies in the standard case where two datasets are provided, but one is RDF
and the other tabular. In recent years, the growth of both Linked Open Data
and the Deep Web has been extensively documented. Datasets in the former
are in RDF, while datasets in the latter are typically relational. Because of
data model heterogeneity, the two communities have adopted different techniques
for performing link discovery (typically called record linkage in the relational
community). There is therefore a clear motivation for addressing this particular
type of heterogeneity, since it would enable significant cross-fertilization between
the two communities. We show an example of this empirically.</p>
      <p>
        The intuition behind our proposed solution to data model heterogeneity is
to represent an RDF dataset as an information-preserving table, rather than as a set of
triples or a directed graph. The literature shows that such a table has previously
been proposed as a physical data structure, for efficient implementation of triple
stores [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. An example of this table, called a property table, is shown in Figure 1.
We note that this is the first application of property tables as logical data
structures in the link-discovery context. The table is information-preserving because
the original set of triples can be reconstructed from it.
      </p>
      <p>Note that the property table builds a schema (in the form of a set of
properties) for the RDF file, regardless of whether it has accompanying RDFS or OWL
metadata. Thus, it applies to arbitrary files on Linked Open Data. Secondly,
numerous techniques in relational data integration can handle datasets with
different schemas (called structural heterogeneity). By representing RDF datasets
in the input collections as property tables, data model heterogeneity is reduced
to structural heterogeneity in the tabular domain.</p>
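As an illustration of the property-table idea, the following sketch (a minimal Python illustration, not the paper's implementation; the triple tuples and list-valued cells are assumptions) pivots a set of triples into a table with one row per subject and one column per property, and reconstructs the triples, demonstrating the information-preservation claim:

```python
from collections import defaultdict

def property_table(triples):
    """Pivot (subject, property, object) triples into a property table:
    one row per subject, one column per property. Multi-valued
    properties are kept as lists so no triple is lost."""
    rows = defaultdict(lambda: defaultdict(list))
    properties = set()
    for s, p, o in triples:
        rows[s][p].append(o)
        properties.add(p)
    header = ["subject"] + sorted(properties)
    table = [[s] + [rows[s].get(p, []) for p in sorted(properties)]
             for s in sorted(rows)]
    return header, table

def to_triples(header, table):
    """Reconstruct the original triples from the table
    (empty cells contribute nothing)."""
    return {(row[0], p, o)
            for row in table
            for p, cell in zip(header[1:], row[1:])
            for o in cell}
```

Because the round trip `to_triples(*property_table(T)) == T` holds, the table is a lossless (logical) representation of the RDF dataset.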
      <p>Figure 2 shows the overall framework of link discovery. The first step,
proposed in this paper for collections, is called the dataset mapping step. It takes
two collections A and B of heterogeneous datasets as input and produces a set
of mappings between datasets. Let such a mapping be (a, b), where a ∈ A and b ∈ B.
For each such mapping, the subsequent blocking process is invoked. Blocking has
been extensively researched, with even the least expensive blocking methods
having complexity O(n), where n is the total number of entities in the input datasets.
Blocking generates a candidate set of entity pairs whose size is much smaller than O(n²). Thus,
blocking provides complexity improvements over brute-force linkage. To
understand the savings of dataset mapping, assume that each collection contains q
datasets, and each dataset contains n entities. Without dataset mapping, any
blocking method would be at least O(qn). With mapping, there would be q
instances of complexity O(n) each. Since the candidate-set size depends heavily on n, the savings carry
over to the final quadratic process (but cannot be quantified without
assumptions about the blocking process). We empirically demonstrate these gains.
An added benefit is that there is now scope for parallelization.</p>
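As a concrete (if simplified) illustration of how blocking avoids quadratic comparison, a standard token-blocking scheme (not the learned schemes evaluated in this paper; the entity format is an assumption) can be sketched as:

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(entities):
    """Map each entity (id, text) to blocks keyed by its tokens:
    a one-to-many blocking scheme, built in a single O(n) pass."""
    blocks = defaultdict(set)
    for eid, text in entities:
        for token in set(text.lower().split()):
            blocks[token].add(eid)
    return blocks

def candidate_pairs(blocks):
    """Pair entities only within blocks; the resulting candidate set
    is typically far smaller than the n*(n-1)/2 brute-force pairs."""
    return {tuple(sorted(pair))
            for block in blocks.values() if len(block) > 1
            for pair in combinations(block, 2)}
```

For example, entities sharing no token are never paired, which is where the savings over brute-force linkage come from.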
      <p>The mapping process itself relies on document similarity measures developed
in the information retrieval community, by representing each dataset as a bag of
tokens. Intuitively, mapped datasets should have relatively high document
similarity to each other. Empirically, we found a tailored version of cosine similarity
to work best. Many packages exist for efficiently computing it. Computing
similarities between all pairs of datasets, we obtain a |A| × |B| matrix. A straightforward
approach would use a threshold to output many-many mappings, or a bipartite
graph matcher to output one-one mappings. The former requires a parameter
specification, while the latter is cubic (O(q³)). Therefore, we opted for a
dominating strategy, which can be computed in the same time it takes to build the
matrix. Namely, a mapping (a, b) is chosen if the score in the cell of (a, b)
dominates, that is, it is the highest in its constituent row and column. This has
the advantage of being conservative against false positives. The method applies
even when |A| ≠ |B|. In our experiments, we used cosine document similarity
combined with the dominating strategy.</p>
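The dominating strategy can be sketched as follows (a minimal Python illustration; plain whitespace tokenization and unweighted cosine stand in for the tailored cosine variant, which is an assumption on our part):

```python
import math
from collections import Counter

def cosine(bag_a, bag_b):
    """Cosine similarity between two bags (multisets) of tokens."""
    dot = sum(bag_a[t] * bag_b[t] for t in bag_a.keys() & bag_b.keys())
    norm = (math.sqrt(sum(v * v for v in bag_a.values()))
            * math.sqrt(sum(v * v for v in bag_b.values())))
    return dot / norm if norm else 0.0

def dataset_mapping(collection_a, collection_b):
    """Emit (a, b) iff sim(a, b) is the highest score in both its row
    and its column of the |A| x |B| matrix; applies even when
    |A| != |B|, and needs no threshold parameter."""
    bags_a = {a: Counter(doc.lower().split()) for a, doc in collection_a.items()}
    bags_b = {b: Counter(doc.lower().split()) for b, doc in collection_b.items()}
    sim = {(a, b): cosine(ba, bb)
           for a, ba in bags_a.items() for b, bb in bags_b.items()}
    mappings = []
    for (a, b), score in sim.items():
        row_max = max(sim[(a, b2)] for b2 in bags_b)
        col_max = max(sim[(a2, b)] for a2 in bags_a)
        if score > 0 and score == row_max == col_max:
            mappings.append((a, b))
    return mappings
```

Note the conservatism: a dataset whose best match is beaten elsewhere in its column produces no mapping at all, rather than a false positive.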
      <p>
        Experiments: Some results are demonstrated in Figure 3. We use three
real-world test cases. The first two test cases (a and b in the figure) comprise RDF
dataset collections describing court cases decided in Colombia and Venezuela
respectively, along with Constitution articles. The third test set consists of ten US
government budget dataset collections from 2009 to 2013 (http://www.pewstates.org/research/reports/). Other such
collections can also be observed on the same website, providing motivation for dataset
mapping. We have released the publicly available datasets with
ground-truths on a single page (https://sites.google.com/a/utexas.edu/mayank-kejriwal/datasets).
We used two popular methods as baselines: a state-of-the-art
unsupervised clustering method called Canopy Clustering (CC in the figure) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as
well as an extended feature-selection-based blocking method (Hetero in the figure)
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The gains produced by dataset mapping are particularly large on CC. More
importantly, we found that the dataset mapping algorithm was able to deduce
the correct mappings without introducing false positives or negatives, and with
run-time negligible compared to the subsequent blocking procedures.
      </p>
      <p>Future Work: We continue to investigate dataset mapping, including other
document similarity measures, task domains, and mapping strategies. We are
also investigating supervised versions of the problem, particularly in cases where
token overlap is low. Finally, we are investigating the property table further.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          .
          <article-title>Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection</article-title>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Learning expressive linkage rules using genetic programming</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1638</fpage>
          –
          <lpage>1649</lpage>
          ,
          <year>2012</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          .
          <article-title>An unsupervised algorithm for learning blocking schemes</article-title>
          .
          <source>In Data Mining (ICDM), 2013 Thirteenth International Conference on</source>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Ungar</surname>
          </string-name>
          .
          <article-title>E cient clustering of high-dimensional data sets with application to reference matching</article-title>
          .
          <source>In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>169</fpage>
          –
          <lpage>178</lpage>
          . ACM,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>F.</given-names>
            <surname>Scharffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>RDF-AI: an architecture for RDF datasets matching, fusion and interlink</article-title>
          .
          <source>In Proc. IJCAI 2009 workshop on Identity, Reference, and Knowledge Representation (IR-KR)</source>
          , Pasadena (CA, US),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sayers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kuno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , et al.
          <article-title>Efficient RDF storage and retrieval in Jena2</article-title>
          .
          <source>In SWDB</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>131</fpage>
          –
          <lpage>150</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>