<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Linking Heterogeneous Dataset Collections</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Link discovery is the problem of linking entities between two or more datasets, based on some (possibly unknown) specification. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n²) comparisons by clustering entities into blocks and limiting the evaluation of link specifications to entity pairs within blocks. Current link-discovery blocking methods explicitly assume that two RDF datasets are provided as input and need to be linked. In this paper, we assume instead that two heterogeneous dataset collections, comprising arbitrary numbers of RDF and tabular datasets, are provided as input. We show that data model heterogeneity can be addressed by representing RDF datasets as property tables. We also propose an unsupervised technique called dataset mapping that maps datasets from one collection to the other and is shown to be compatible with existing clustering methods. Dataset mapping is empirically evaluated on three real-world test collections ranging over government and constitutional domains, and is shown to improve two established baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Heterogeneous Blocking</kwd>
        <kwd>Instance Matching</kwd>
        <kwd>Link Discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the advent of Linked Data, discovering links between entities has emerged
as an active research area [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Given a link specification, a naive approach would
discover links by conducting O(n²) comparisons on the set of n entities. In the
Entity Resolution (ER) community, a preprocessing technique called blocking
mitigates full pairwise comparison by clustering entities into blocks. Only
entities within blocks are paired and compared. ER is critical in data integration
systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the Semantic Web, the problem has received attention as that of scalably
discovering owl:sameAs links between RDF datasets [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>In the Big Data era, scalability and heterogeneity are essential properties
of systems and hence practical requirements for real-world link discovery.
Scalability is addressed by blocking, but current work assumes that the dataset pairs
between which entities are to be linked are provided. In other words, datasets
A and B are input to the pipeline, and entities in A need to be linked to
entities in B. Investigations in some important real-world domains show that pairs
of dataset collections also need to undergo linking. Each collection is a set of
datasets. An example is government data. Recent government efforts have led to
the release of public data as batches of files, both across related domains and time, as
one of our real-world test sets demonstrates. Thus, there are (at least) two
scalability issues: at the collection level and at the dataset level. That is, datasets
in one collection first need to be mapped to datasets in the second collection,
after which a blocking scheme is learned and applied on each mapped pair. The
problem of blocking two collections is exacerbated by data model heterogeneity,
where some datasets are RDF and the others are tabular.</p>
      <p>We note that data model heterogeneity has larger implications, since it also
applies in the standard case where two datasets are provided, but one is RDF
and the other tabular. In recent years, the growth of both Linked Open Data
and the Deep Web has been extensively documented. Datasets in the former
are in RDF, while datasets in the latter are typically relational. Because of
data model heterogeneity, the two communities have adopted different techniques
for performing link discovery (typically called record linkage in the relational
community). There is therefore a clear motivation for addressing this particular
type of heterogeneity, since it would enable significant cross-fertilization between
the two communities. We show an example of this empirically.</p>
      <p>
        The intuition behind our proposed solution to data model heterogeneity is
to represent an RDF dataset as an information-preserving table, rather than as a set of
triples or a directed graph. The literature shows that such a table has previously
been proposed as a physical data structure, for efficient implementation of triple
stores [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. An example of this table, called a property table, is shown in Figure 1.
We note that this is the first application of property tables as logical data
structures in the link-discovery context. The table is information-preserving because
the original set of triples can be reconstructed from it.
      </p>
      <p>Note that the property table builds a schema (in the form of a set of
properties) for the RDF file, regardless of whether it has accompanying RDFS or OWL
metadata. Thus, it applies to arbitrary files on Linked Open Data. Secondly,
numerous techniques in relational data integration can handle datasets with
different schemas (called structural heterogeneity). By representing RDF datasets
in the input collections as property tables, data model heterogeneity is reduced
to structural heterogeneity in the tabular domain.</p>
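As an illustration of the property-table idea, the following sketch (a minimal Python illustration, not the paper's implementation; the triple tuples and list-valued cells are assumptions) pivots a set of triples into a table with one row per subject and one column per property, and reconstructs the triples, demonstrating the information-preservation claim:

```python
from collections import defaultdict

def property_table(triples):
    """Pivot (subject, property, object) triples into a property table:
    one row per subject, one column per property. Multi-valued
    properties are kept as lists so no triple is lost."""
    rows = defaultdict(lambda: defaultdict(list))
    properties = set()
    for s, p, o in triples:
        rows[s][p].append(o)
        properties.add(p)
    header = ["subject"] + sorted(properties)
    table = [[s] + [rows[s].get(p, []) for p in sorted(properties)]
             for s in sorted(rows)]
    return header, table

def to_triples(header, table):
    """Reconstruct the original triples from the table
    (empty cells contribute nothing)."""
    return {(row[0], p, o)
            for row in table
            for p, cell in zip(header[1:], row[1:])
            for o in cell}
```

Because the round trip `to_triples(*property_table(T)) == T` holds, the table is a lossless (logical) representation of the RDF dataset.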
      <p>Figure 2 shows the overall framework of link discovery. The first step,
proposed in this paper for collections, is called the dataset mapping step. It takes
two collections A and B of heterogeneous datasets as input and produces a set
of mappings between datasets. Let such a mapping be (a, b), where a ∈ A and b ∈ B.
For each such mapping, the subsequent blocking process is invoked. Blocking has
been extensively researched, with even the least expensive blocking methods
having complexity O(n), where n is the total number of entities in the input datasets.
Blocking generates a candidate set of entity pairs whose size is much smaller than O(n²). Thus,
blocking provides complexity improvements over brute-force linkage. To
understand the savings of dataset mapping, assume that each collection contains q
datasets, and each dataset contains n entities. Without dataset mapping, any
blocking method would be at least O(qn). With mapping, there would be q
instances of complexity O(n) each. Since the candidate-set size depends heavily on n, the savings carry
over to the final quadratic process (but cannot be quantified without
assumptions about the blocking process). We empirically demonstrate these gains.
An added benefit is that there is now scope for parallelization.</p>
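As a concrete (if simplified) illustration of how blocking avoids quadratic comparison, a standard token-blocking scheme (not the learned schemes evaluated in this paper; the entity format is an assumption) can be sketched as:

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(entities):
    """Map each entity (id, text) to blocks keyed by its tokens:
    a one-to-many blocking scheme, built in a single O(n) pass."""
    blocks = defaultdict(set)
    for eid, text in entities:
        for token in set(text.lower().split()):
            blocks[token].add(eid)
    return blocks

def candidate_pairs(blocks):
    """Pair entities only within blocks; the resulting candidate set
    is typically far smaller than the n*(n-1)/2 brute-force pairs."""
    return {tuple(sorted(pair))
            for block in blocks.values() if len(block) > 1
            for pair in combinations(block, 2)}
```

For example, entities sharing no token are never paired, which is where the savings over brute-force linkage come from.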
      <p>The mapping process itself relies on document similarity measures developed
in the information retrieval community, by representing each dataset as a bag of
tokens. Intuitively, mapped datasets should have relatively high document
similarity to each other. Empirically, we found a tailored version of cosine similarity
to work best. Many packages exist for efficiently computing it. Computing
similarities between all pairs of datasets, we obtain a |A| × |B| matrix. A straightforward
approach would use a threshold to output many-many mappings, or a bipartite
graph matcher to output one-one mappings. The former requires a parameter
specification, while the latter is cubic (O(q³)). Therefore, we opted for a
dominating strategy, which can be computed in the same time it takes to build the
matrix. Namely, a mapping (a, b) is chosen if the score in the cell of (a, b)
dominates, that is, it is the highest in its constituent row and column. This has
the advantage of being conservative against false positives. The method applies
even when |A| ≠ |B|. In our experiments, we used cosine document similarity
combined with the dominating strategy.</p>
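The dominating strategy can be sketched as follows (a minimal Python illustration; plain whitespace tokenization and unweighted cosine stand in for the tailored cosine variant, which is an assumption on our part):

```python
import math
from collections import Counter

def cosine(bag_a, bag_b):
    """Cosine similarity between two bags (multisets) of tokens."""
    dot = sum(bag_a[t] * bag_b[t] for t in bag_a.keys() & bag_b.keys())
    norm = (math.sqrt(sum(v * v for v in bag_a.values()))
            * math.sqrt(sum(v * v for v in bag_b.values())))
    return dot / norm if norm else 0.0

def dataset_mapping(collection_a, collection_b):
    """Emit (a, b) iff sim(a, b) is the highest score in both its row
    and its column of the |A| x |B| matrix; applies even when
    |A| != |B|, and needs no threshold parameter."""
    bags_a = {a: Counter(doc.lower().split()) for a, doc in collection_a.items()}
    bags_b = {b: Counter(doc.lower().split()) for b, doc in collection_b.items()}
    sim = {(a, b): cosine(ba, bb)
           for a, ba in bags_a.items() for b, bb in bags_b.items()}
    mappings = []
    for (a, b), score in sim.items():
        row_max = max(sim[(a, b2)] for b2 in bags_b)
        col_max = max(sim[(a2, b)] for a2 in bags_a)
        if score > 0 and score == row_max == col_max:
            mappings.append((a, b))
    return mappings
```

Note the conservatism: a dataset whose best match is beaten elsewhere in its column produces no mapping at all, rather than a false positive.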
      <p>
        Experiments: Some results are demonstrated in Figure 3. We use three
real-world test cases. The first two test cases (a and b in the figure) comprise RDF
dataset collections describing court cases decided in Colombia and Venezuela
respectively, along with Constitution articles. The third test set consists of ten US
government budget dataset collections from 2009 to 2013 (http://www.pewstates.org/research/reports/). Other such
collections can also be observed on the same website, providing motivation for dataset
mapping. We have released the publicly available datasets with
ground-truths on a single page (https://sites.google.com/a/utexas.edu/mayank-kejriwal/datasets).
We used two popular methods as baselines: a state-of-the-art
unsupervised clustering method called Canopy Clustering (CC in the figure) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as
well as an extended feature-selection-based blocking method (Hetero in the figure)
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The gains produced by dataset mapping are particularly large on CC. More
importantly, we found that the dataset mapping algorithm was able to deduce
the correct mappings without introducing false positives or negatives, and with
run-time negligible compared to the subsequent blocking procedures.
      </p>
      <p>Future Work: We continue to investigate dataset mapping, including other
document similarity measures, task domains, and mapping strategies. We are
also investigating supervised versions of the problem, particularly in cases where
token overlap is low. Finally, we are investigating the property table further.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          .
          <article-title>Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection</article-title>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Learning expressive linkage rules using genetic programming</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1638</fpage>
          –
          <lpage>1649</lpage>
          ,
          <year>2012</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          .
          <article-title>An unsupervised algorithm for learning blocking schemes</article-title>
          .
          <source>In Data Mining (ICDM), 2013 Thirteenth International Conference on</source>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Ungar</surname>
          </string-name>
          .
          <article-title>E cient clustering of high-dimensional data sets with application to reference matching</article-title>
          .
          <source>In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>169</fpage>
          –
          <lpage>178</lpage>
          . ACM,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>F.</given-names>
            <surname>Scharffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>RDF-AI: an architecture for RDF datasets matching, fusion and interlink</article-title>
          .
          <source>In Proc. IJCAI 2009 workshop on Identity, Reference, and Knowledge Representation (IR-KR)</source>
          , Pasadena (CA, US),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sayers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kuno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , et al.
          <article-title>Efficient RDF storage and retrieval in Jena2</article-title>
          .
          <source>In SWDB</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>131</fpage>
          –
          <lpage>150</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>