On Linking Heterogeneous Dataset Collections

Mayank Kejriwal and Daniel P. Miranker
University of Texas at Austin
{kejriwal,miranker}@cs.utexas.edu

Abstract. Link discovery is the problem of linking entities between two or more datasets, based on some (possibly unknown) specification. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n²) comparisons by clustering entities into blocks and limiting the evaluation of link specifications to entity pairs within blocks. Current link-discovery blocking methods explicitly assume that two RDF datasets are provided as input and need to be linked. In this paper, we assume instead that two heterogeneous dataset collections, comprising arbitrary numbers of RDF and tabular datasets, are provided as input. We show that data model heterogeneity can be addressed by representing RDF datasets as property tables. We also propose an unsupervised technique called dataset mapping that maps datasets from one collection to the other, and show that it is compatible with existing clustering methods. Dataset mapping is empirically evaluated on three real-world test collections ranging over government and constitutional domains, and is shown to improve two established baselines.

Keywords: Heterogeneous Blocking, Instance Matching, Link Discovery

With the advent of Linked Data, discovering links between entities has emerged as an active research area [2]. Given a link specification, a naive approach would discover links by conducting O(n²) comparisons on the set of n entities. In the Entity Resolution (ER) community, a preprocessing technique called blocking mitigates full pairwise comparisons by clustering entities into blocks. Only entities within blocks are paired and compared. ER is critical in data integration systems [1]. In the Semantic Web, the problem has received attention as scalably discovering owl:sameAs links between RDF datasets [5].

In the Big Data era, scalability and heterogeneity are essential concerns for systems and hence practical requirements for real-world link discovery. Scalability is addressed by blocking, but current work assumes that the dataset pairs between which entities are to be linked are provided. In other words, datasets A and B are input to the pipeline, and entities in A need to be linked to entities in B. Investigations in some important real-world domains show that pairs of dataset collections also need to undergo linking. Each collection is a set of datasets. An example is government data. Recent government efforts have led to the release of public data as batches of files, both across related domains and over time, as one of our real-world test sets demonstrates. Thus, there are (at least) two scalability issues: one at the collection level and one at the dataset level. That is, datasets in one collection first need to be mapped to datasets in the second collection, after which a blocking scheme is learned and applied to each mapped pair. The problem of blocking two collections is exacerbated by data model heterogeneity, where some datasets are RDF and the others are tabular. We note that data model heterogeneity has larger implications, since it also applies in the standard case where two datasets are provided, but one is RDF and the other tabular.

Fig. 1. The property table representation. For subjects that don't have a property, the reserved keyword null is entered. ; is a reserved delimiter that allows each field value to have set semantics.
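To make the property table representation of Figure 1 concrete, the following minimal sketch (ours, not the authors' implementation) builds such a table from a set of (subject, property, object) triples, filling in the reserved null keyword for absent properties and joining multi-valued properties with the reserved ; delimiter. The function name and the toy triples are illustrative assumptions.

```python
from collections import defaultdict

def build_property_table(triples):
    """Build a property table from (subject, property, object) triples.

    Each subject becomes a row and each distinct property a column.
    Multi-valued properties are joined with the reserved ';' delimiter
    (set semantics), and absent properties get the reserved keyword
    'null', as in Figure 1. The table is information-preserving: the
    original triples can be reconstructed by splitting each non-null
    cell on ';'.
    """
    values = defaultdict(lambda: defaultdict(list))
    properties = set()
    for s, p, o in triples:
        values[s][p].append(o)
        properties.add(p)
    columns = sorted(properties)
    table = [["subject"] + columns]
    for s in sorted(values):
        row = [s] + [";".join(values[s][p]) if p in values[s] else "null"
                     for p in columns]
        table.append(row)
    return table

# Illustrative triples: one subject has a multi-valued property,
# the other is missing a property entirely.
triples = [
    (":case1", ":court", "Supreme Court"),
    (":case1", ":cites", ":article5"),
    (":case1", ":cites", ":article9"),
    (":case2", ":court", "Appeals Court"),
]
for row in build_property_table(triples):
    print(row)
```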
In recent years, the growth of both Linked Open Data and the Deep Web has been extensively documented. Datasets in the former are in RDF, while datasets in the latter are typically relational. Because of this data model heterogeneity, the two communities have adopted different techniques for performing link discovery (typically called record linkage in the relational community). There is therefore a clear motivation for addressing this particular type of heterogeneity, since doing so would enable significant cross-fertilization between the two communities. We will show an example of this empirically.

The intuition behind our proposed solution to data model heterogeneity is to represent an RDF dataset as an information-preserving table, rather than as a set of triples or a directed graph. The literature shows that such a table has previously been proposed as a physical data structure for the efficient implementation of triple stores [6]. An example of this table, called a property table, is shown in Figure 1. We note that this is the first application of property tables as logical data structures in the link-discovery context. The table is information-preserving because the original set of triples can be reconstructed from it.

Note, first, that the property table builds a schema (in the form of a set of properties) for the RDF file, regardless of whether the file has accompanying RDFS or OWL metadata. Thus, it applies to arbitrary files on Linked Open Data. Second, numerous techniques in relational data integration can handle datasets with different schemas (so-called structural heterogeneity). By representing the RDF datasets in the input collections as property tables, data model heterogeneity is reduced to structural heterogeneity in the tabular domain.

Figure 2 shows the overall link-discovery framework. The first step, proposed in this paper for collections, is the dataset mapping step. It takes two collections A and B of heterogeneous datasets as input and produces a set of mappings between datasets.

Fig. 2. The overall link-discovery framework. Dataset mapping is our contribution.

Let such a mapping be (a, b), where a ∈ A and b ∈ B. For each such mapping, the subsequent blocking process is invoked. Blocking has been extensively researched, with even the least expensive blocking methods having complexity O(n), where n is the total number of entities in the input datasets. Blocking generates a candidate set of entity pairs, Γ, with |Γ| ≪ n². Thus, blocking provides complexity improvements over brute-force linkage. To understand the savings of dataset mapping, assume that each collection contains q datasets and each dataset contains n entities. Without dataset mapping, any blocking method would be at least O(qn). With mapping, there would be q instances of complexity O(n) each. Since Γ depends heavily on n, the savings carry over to the final quadratic process (though they cannot be quantified without assumptions about the blocking process). We empirically demonstrate these gains. An added benefit is that there is now scope for parallelization.

The mapping process itself relies on document similarity measures developed in the information retrieval community, representing each dataset as a bag of tokens. Intuitively, mapped datasets should have relatively high document similarity to each other. Empirically, we found a tailored version of cosine similarity to work best. Many packages exist for computing it efficiently.
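Since the tailored cosine measure is not fully specified here, the following sketch shows the plain variant over raw token counts, with each dataset (property table or relational table) flattened into a bag of tokens; the function names and toy rows are our own illustrative assumptions.

```python
import math
from collections import Counter

def token_bag(rows):
    """Flatten a dataset (rows of string fields) into a bag (multiset) of tokens."""
    bag = Counter()
    for row in rows:
        for field in row:
            bag.update(field.lower().split())
    return bag

def cosine_similarity(bag_a, bag_b):
    """Plain cosine similarity between two token-count vectors."""
    dot = sum(count * bag_b[token] for token, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two toy datasets that share most of their vocabulary.
a = token_bag([["Supreme Court", "case 2010"], ["Supreme Court", "case 2011"]])
b = token_bag([["supreme court", "ruling 2010"]])
print(round(cosine_similarity(a, b), 3))  # 0.668
```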
Computing similarities between all pairs of datasets yields an |A|×|B| matrix. A straightforward approach would use a threshold to output many-many mappings, or a bipartite graph matcher to output one-one mappings. The former requires a parameter specification, while the latter is cubic (O(q³)). We therefore opted for a dominating strategy, which can be computed in the same time it takes to build the matrix. Namely, a mapping (a, b) is chosen if the score in cell (a, b) dominates, that is, it is the highest in both its row and its column (see the illustrative sketch following the Future Work paragraph). This has the advantage of being conservative against false positives. The method applies even when |A| ≠ |B|. In our experiments, we used cosine document similarity combined with the dominating strategy.

Experiments: Some results are shown in Figure 3. We use three real-world test cases. The first two test cases (a and b in the figure) comprise RDF dataset collections describing court cases decided in Colombia and Venezuela, respectively, along with Constitution articles. The third test set consists of ten US government budget dataset collections from 2009 to 2013 (http://www.pewstates.org/research/reports/). Other such collections can also be observed on the same website, providing further motivation for dataset mapping. We have publicly released the datasets, with ground-truths, on a single page (https://sites.google.com/a/utexas.edu/mayank-kejriwal/datasets).

Fig. 3. RR (Reduction Ratio) quantifies efficiency and is given by 1 − |Γ|/|Ω|, where Ω is the set of all O(n²) entity pairs, while PC (Pairs Completeness) measures the recall of Γ with respect to the ground-truth set Ωm (⊆ Ω). SS indicates whether dataset mapping (equivalently denoted Source Selection) was used.

We used two popular methods as baselines: a state-of-the-art unsupervised clustering method called Canopy Clustering (CC in the figure) [4], and an extended feature-selection based blocking method (Hetero in the figure) [3]. The gains produced by dataset mapping are particularly large on CC. More importantly, we found that the dataset mapping algorithm was able to deduce the correct mappings without introducing false positives or negatives, and with run-time negligible compared to the subsequent blocking procedures.

Future Work: We continue to investigate dataset mapping, including other document similarity measures, task domains and mapping strategies. We are also investigating supervised versions of the problem, particularly in cases where token overlap is low. Finally, we are investigating the property table further.
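The sketch below, referenced earlier, illustrates the dominating strategy over a toy similarity matrix. It is our own rendering under assumed names, not the authors' code; row and column maxima are collected in linear passes over the |A|×|B| matrix, consistent with the claim that the strategy costs no more than building the matrix itself.

```python
def dominating_mappings(sim):
    """Select cells (i, j) whose score is the maximum of both row i and column j.

    'sim' is an |A| x |B| similarity matrix (list of lists). The strategy
    is conservative against false positives, needs no threshold parameter,
    and applies even when |A| != |B|.
    """
    row_max = [max(row) for row in sim]
    col_max = [max(col) for col in zip(*sim)]
    return [(i, j)
            for i, row in enumerate(sim)
            for j, score in enumerate(row)
            if score == row_max[i] == col_max[j]]

# Toy 2x3 matrix: dataset a0 maps to b1 and a1 maps to b2;
# no mapping is emitted for the leftover column b0.
sim = [[0.2, 0.9, 0.1],
       [0.3, 0.4, 0.8]]
print(dominating_mappings(sim))  # [(0, 1), (1, 2)]
```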
References

1. P. Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.
2. R. Isele and C. Bizer. Learning expressive linkage rules using genetic programming. Proceedings of the VLDB Endowment, 5(11):1638–1649, 2012.
3. M. Kejriwal and D. P. Miranker. An unsupervised algorithm for learning blocking schemes. In Data Mining (ICDM), 2013 IEEE Thirteenth International Conference on. IEEE, 2013.
4. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178. ACM, 2000.
5. F. Scharffe, Y. Liu, and C. Zhou. RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In Proc. IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena (CA US), 2009.
6. K. Wilkinson, C. Sayers, H. A. Kuno, D. Reynolds, et al. Efficient RDF storage and retrieval in Jena2. In SWDB, volume 3, pages 131–150, 2003.