=Paper=
{{Paper
|id=Vol-1486/paper_34
|storemode=property
|title=ScSLINT: Time and Memory Efficient Interlinking Framework for Linked Data
|pdfUrl=https://ceur-ws.org/Vol-1486/paper_34.pdf
|volume=Vol-1486
|dblpUrl=https://dblp.org/rec/conf/semweb/NguyenI15
}}
==ScSLINT: Time and Memory Efficient Interlinking Framework for Linked Data==
ScSLINT: Time and Memory Efficient Interlinking Framework for Linked Data

Khai Nguyen and Ryutaro Ichise
The Graduate University for Advanced Studies, Japan
National Institute of Informatics, Japan
{nhkhai,ichise}@nii.ac.jp

Abstract. Data interlinking is the problem of detecting instances that belong to different repositories but co-refer to the same topic. The large scale of linked data poses a challenge to current interlinking algorithms. ScSLINT is an extension of SLINT+ [4] whose focus is scalability. ScSLINT also includes several modified features for the resolution sub-steps. The impressive performance of ScSLINT is validated by an experiment on very large datasets.

1 Introduction

The goal of data interlinking is to detect all instances that co-refer to the same objects in two repositories, the source and the target. The interlinking problem on web-based data, such as linked data, is considered more challenging than on other types of data because of heterogeneity and scalability issues. Linked data interlinking is a well-studied problem [1] because of its importance in data integration and its indispensable role in linked data publication.

Among the many proposed solutions, SILK [6] is known as a pioneering linked data interlinking framework. Unfortunately, SILK is not well optimized for large-scale datasets. LIMES [2] is a state-of-the-art framework that focuses on speeding up the interlinking process by reducing the number of real comparisons using the characteristics of metric spaces. However, on really large datasets, LIMES still shows limited scalability. Following the success of SLINT+ [4, 3], we develop its new release, ScSLINT, as a highly scalable interlinking framework. Compared to SLINT+, ScSLINT also comes with more advanced built-in features.

2 The ScSLINT

The architecture of ScSLINT is depicted in Fig. 1. R_source and R_target are the input repositories. We describe each component in the order in which it appears in the interlinking process.

[Fig. 1. The architecture of ScSLINT: property alignment generator, similarity function generator, candidate generator, configuration creator, similarity aggregator, and co-reference filter.]

2.1 Property alignment generator

This component creates the property alignments between R_source and R_target. A property alignment is expected to describe the same attribute. ScSLINT considers only the properties satisfying the requirements of coverage, discriminability [5], and type compatibility (string, decimal, date, URI). An overlap measure on the tokens of the values described by the properties is used as the confidence of each alignment. The computation is as follows:

\[ \mathrm{conf}([p_{source}, p_{target}]) = \frac{|O_{p_{source}} \cap O_{p_{target}}|}{|O_{p_{source}}|} \tag{1} \]
\[ O_{p_k} = \{\, E(o) \mid x \in R_k,\ \langle s, p_k, o \rangle \in x \,\} \]

where <s, p, o> stands for an RDF triple and E is a preprocessing function that extracts the tokens or normalizes the data format (e.g., date, time, and decimal) of the given RDF objects. ScSLINT is more scalable than SLINT+ because it does not need to consider all elements of O_{p_target} as SLINT+ does.
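To make Eq. (1) concrete, the following is a minimal C++ sketch of the confidence computation, assuming the token sets O_p have already been built by applying the preprocessing function E to the RDF objects of each property; the type alias and function name are illustrative only and are not part of the actual ScSLINT code.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Illustrative alias: the set O_p of preprocessed tokens collected for one property.
using TokenSet = std::unordered_set<std::string>;

// Eq. (1): conf([p_source, p_target]) = |O_psource ∩ O_ptarget| / |O_psource|.
// Only the source-side set is enumerated; the target-side set is probed by hash
// lookups, so O_ptarget never has to be scanned element by element.
double alignmentConfidence(const TokenSet& sourceTokens,
                           const TokenSet& targetTokens) {
  if (sourceTokens.empty()) return 0.0;
  std::size_t overlap = 0;
  for (const std::string& token : sourceTokens) {
    if (targetTokens.count(token) > 0) {
      ++overlap;  // token appears in both O_psource and O_ptarget
    }
  }
  return static_cast<double>(overlap) / static_cast<double>(sourceTokens.size());
}
```

In ScSLINT the confidence is computed only for properties that already satisfy the coverage, discriminability, and type compatibility requirements described above.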
2.2 Similarity function generator

This component uses the property alignments generated previously to create a list of initial similarity functions. A similarity function is specified by two pieces of information: a property alignment [p_source, p_target] and a similarity measure S. For two instances x ∈ R_source and y ∈ R_target, a similarity function calculates the similarity S(value(x, p_source), value(y, p_target)), where value(a, p) extracts all RDF objects of the triples declared by p. Compared to SLINT+, this new module enables users to install any new similarity measure into ScSLINT.

2.3 Candidate generator

The mission of the candidate generator is to reduce the huge number of pairwise alignments between instances, at |R_source| × |R_target| pairs. This component detects the candidates of potentially co-referent instances. SLINT+ computes the rough similarity of instances using a weighted matrix structure, which is not scalable and is inaccurate on ambiguous data. We recommend token-based prefix blocking without weighting and currently install it in ScSLINT as the default.

2.4 Configuration creator

A configuration contains the parameters for the subsequent components. It describes the similarity functions, the similarity aggregator, and the co-reference filter. If no intervention in this component (e.g., by users or by configuration learning algorithms) is declared, a default configuration is constructed by taking all generated similarity functions and picking the most frequently used similarity aggregator and co-reference filter.

2.5 Similarity aggregator

In this component, the similarity functions are executed and their results are accumulated into one final matching score. Currently, for computing the matching score of two instances x and y, ScSLINT supports the following equation (a short illustrative sketch of this aggregation follows at the end of Sect. 2):

\[ \mathrm{score}_{F_{sim}}(x, y) = \frac{1}{\mathrm{valid}(U_{F_{sim}}(x, y))} \sum_{v \in U_{F_{sim}}(x, y)} v^{k} \times \mathrm{weight}(y) \tag{2} \]
\[ U_{F_{sim}}(x, y) = \{\, \mathrm{sim}(x, y) \mid \mathrm{sim}(x, y) \ge \sigma_{sim},\ \mathrm{sim} \in F_{sim} \,\} \]

where F_sim is the set of similarity functions specified by the configuration, and k ∈ {1, 2} controls the transformation of each similarity v. weight is a function weighting the target instance, defined as \(\log_{\max_{t \in R_{target}} \mathrm{size}(t)} \mathrm{size}(y)\), where size(y) counts the number of RDF triples of y. valid returns the number of elements in U_{F_sim}(x, y), and σ_sim is the acceptance threshold of the respective similarity function sim. This is the most expensive component in the interlinking process. However, since the candidates can be read and processed independently, ScSLINT applies a parallel processing technique to this component.

2.6 Co-reference filter

This component uses the matching scores of the candidates to construct the final co-references. In this step, an acceptance threshold δ is applied to the matching scores in order to remove candidates with low similarity. In addition, for highly ambiguous data, filtering is recommended in order to obtain high-quality co-references. ScSLINT currently supports stable filtering, which applies the idea of the stable marriage problem [4].

Implementation note. In order to optimize the memory load, ScSLINT uses pre-indexed structures of the input RDF repositories. The complexity of the indexing algorithm is small and linear in the size of the repository. Note that these indexes are created only once and are reusable. ScSLINT is developed in C++. The source code of this framework is available at http://ri-www.nii.ac.jp/ScSLINT.
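The following minimal C++ sketch illustrates how the score of Eq. (2) could be computed for a single candidate pair; the struct, the helper, and all names are hypothetical and only approximate an aggregator following Eq. (2), not the actual ScSLINT implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical input: for one candidate pair (x, y), each configured similarity
// function sim in F_sim contributes its value sim(x, y) and its threshold sigma_sim.
struct SimilarityValue {
  double value;  // sim(x, y)
  double sigma;  // acceptance threshold sigma_sim of this similarity function
};

// weight(y) = log_{max_{t in R_target} size(t)} size(y), computed by change of base,
// where size(.) counts the number of RDF triples of an instance.
double targetWeight(std::size_t sizeY, std::size_t maxTargetSize) {
  return std::log(static_cast<double>(sizeY)) /
         std::log(static_cast<double>(maxTargetSize));
}

// Eq. (2): average v^k * weight(y) over the valid values v in U_{F_sim}(x, y),
// i.e., the similarity values that pass their own acceptance thresholds.
double matchingScore(const std::vector<SimilarityValue>& sims, int k, double weight) {
  double sum = 0.0;
  std::size_t valid = 0;  // valid(U_{F_sim}(x, y))
  for (const SimilarityValue& s : sims) {
    if (s.value >= s.sigma) {  // membership test for U_{F_sim}(x, y)
      sum += std::pow(s.value, k) * weight;
      ++valid;
    }
  }
  return valid > 0 ? sum / static_cast<double>(valid) : 0.0;
}
```

With k = 1 this reduces to the linear aggregator used in the experiment of Sect. 3, and because each candidate pair is scored independently, such calls can be distributed across threads as described in Sect. 2.5.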
3 Performance

We evaluate ScSLINT on a computer equipped with one Intel Core i7 4770K CPU and 8 GB of memory. We enable multi-threading for the similarity aggregator. We test ScSLINT on seven very large real datasets. The source repositories are three subsets of NYTimes, whose domains are locations (nyt.loc), organizations (nyt.org), and people (nyt.peo). The target repositories are DBpedia (db), Freebase (fr), and Geonames (gn). We use the default token blocking, the linear similarity aggregator (k = 1), and two complex similarity measures for strings, TF-IDF Cosine and Levenshtein. We use reverse distance for numeric values and exact matching for other data types. Using this configuration, on average 90.57% of the actual co-references are successfully detected.

Table 1 reports the complexity of these datasets and the execution time of each component of ScSLINT. In this table, columns 2 to 4 describe the parameters that affect the complexity of the interlinking process. Column 2 is |R_source| × |R_target|, which reflects the complexity of the property alignment generator (Comp. 1) and the candidate generator (Comp. 3). The product of columns 3 and 4 is the number of comparisons, which directly determines the complexity of the similarity aggregator (Comp. 5) and the co-reference filter (Comp. 6).

Table 1. Complexity (columns 2 to 4) and processing time in seconds (columns 5 to 8).

Dataset    | Size (×10^9) | Candidates (×10^6) | Similarity functions | Comp. 1 | Comp. 3 | Comp. 5 | Comp. 6
nyt.loc-gn |  32.69       |  32.2              | 12                   | 37      | 7       | 70      | 3
nyt.loc-db |  16.06       |  38.2              | 25                   | 43      | 8       | 268     | 6
nyt.org-db |  25.47       |  61.7              | 17                   | 46      | 11      | 404     | 6
nyt.peo-db |  41.66       |  46.9              | 22                   | 46      | 11      | 251     | 6
nyt.loc-fr | 154.97       | 222.7              | 23                   | 14      | 111     | 641     | 29
nyt.org-fr | 245.70       | 357.4              | 16                   | 14      | 268     | 1023    | 46
nyt.peo-fr | 401.89       | 620.1              | 18                   | 15      | 507     | 1578    | 78

In our experiment, while ScSLINT achieves a very impressive speed, LIMES and SILK fail to finish the interlinking task on each dataset even when given ten times as long, at which point we terminate them. The longest interlinking time is about 36 minutes, recorded on nyt.peo-fr, which requires 11.16 × 10^9 comparisons. On a dataset that is about 2000 times smaller than nyt.peo-fr, the required times for LIMES and SILK are almost 30 minutes and 9.5 hours, respectively [2].

ScSLINT can process large data in a very short time on an ordinary personal computer; that is, the framework is suitable even for huge-scale datasets. ScSLINT can benefit further from being combined with other big data processing frameworks or from being deployed on a more powerful machine. Also, more advanced candidate generators, similarity metrics, aggregation strategies, and learning algorithms can easily be implemented in ScSLINT.

References

[1] Ferrara, A., Nikolov, A., Scharffe, F.: Data linking for the semantic web. International Journal of Semantic Web and Information Systems 7(3), 46–76 (2011)
[2] Ngomo, A.C.N., Auer, S.: LIMES: A time-efficient approach for large-scale link discovery on the web of data. In: 22nd IJCAI, pp. 2312–2317 (2011)
[3] Nguyen, K., Ichise, R.: SLINT+ results for OAEI 2013 instance matching. In: 8th ISWC Workshop on Ontology Matching, pp. 177–183 (2013)
[4] Nguyen, K., Ichise, R., Le, B.: Interlinking linked data sources using a domain-independent system. In: 2nd JIST, LNCS, vol. 7774, pp. 113–128. Springer (2013)
[5] Song, D., Heflin, J.: Automatically generating data linkages using a domain-independent candidate selection approach. In: 10th ISWC, LNCS, vol. 7031, pp. 649–664. Springer (2011)
[6] Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and maintaining links on the web of data. In: 8th ISWC, LNCS, vol. 5823, pp. 650–665. Springer (2009)