-

RecLAK: Analysis and Recommendation of Interlinking Datasets

Giseli Rabello Lopes

grlopes@inf.puc-rio.br 1

Bernardo Pereira Nunes

bnunes@inf.puc-rio.br 1

Luiz André P. Paes Leme

lapaesleme@ic.uff.br 0

Marco A. Casanova

casanova@inf.puc-rio.br 1 0 Computer Science Institute , UFF, Niterói/RJ , Brazil 1 Departament of Informatics , PUC-Rio, Rio de Janeiro/RJ , Brazil

This paper presents the RecLAK , a Web application developed for the LAK Challenge 2014. RecLAK focuses on the analysis of the LAK dataset metadata and provides recommendations of potential candidate datasets to be interlinked with the LAK dataset. RecLAK follows an approach to generate recommendations based on Bayesian classi ers and on Social Networks Analysis measures. Furthermore, RecLAK generates graph visualizations that explore the LAK dataset over other datasets in the Linked Open Data cloud. The results of the experiments contribute to the understanding and improvement of the LAK dataset. Furthermore, it can also help researchers of the elds covered by LAK dataset, such as learning analytics and educational data mining.

Q1. For a dataset d, published in the LOD, is it interesting for the publisher of d to try to link it to lak ? Q2. For a dataset d, published in the LOD, is it interesting for the lak administrator to try to link his dataset to d? In more detail, let t and di be two datasets. A link from t to di is a triple of the form (s; p; o) such that s is de ned in t and o is de ned in di. We say that t is linked to di, or that di is linked from t, i there is at least a link from t to di. We also say that di is relevant for t i there is at least a resource de ned in di that can be linked from a resource de ned in t. Questions Q1 and Q2 are special cases of the dataset interlinking recommendation problem posed as follows: Given a nite set of datasets D and a dataset t, compute a rank score for each dataset di 2 D such that the rank score of di increases with the chances of di being relevant for t.

In this paper, we rst introduce two rank score functions to address the dataset interlinking recommendation problem. Then, we apply the functions to answer question Q2. The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 brie y describes the recommendation approaches. Section 4 shows the result analysis of the metadata exploration and the generated recommendations. Finally, Section 5 presents some nal remarks.

2. RELATED WORK

In this paper, we use an extended version [ 5 ] of previous work [ 3, 4 ], that introduced the rank score functions based on the Bayesian and the Social Network approaches. The extended version also explores di erent sets of features related to the metadata of the datasets, such as properties, classes and vocabularies, to compute the rank score functions. Nikolov et al. [ 9, 10 ] propose an approach to identify relevant datasets for interlinking applying keywords searches and ontology matching techniques. Kuznetsov [ 2 ] describes a linking system, which is responsible for discovering relevant datasets for a given dataset and for creating instance level linkage. When compared with these approaches, the rank score functions applied in this paper use only metadata and are, therefore, much simpler to compute and yet achieve a good performance [ 5 ].

Loscio et al. [ 6 ] and Wagner et al. [ 15 ] propose techniques to nd relevant datasets for user queries. The rst approach is based on information quality criteria of correctness, schema completeness and data completeness while the second one is based on the overlapping of sets of instances of datasets. Oliveira et al. [ 13 ] use application queries and user feedback to discover relevant datasets. These papers aim at recommending datasets with respect to user queries, which is a problem close, but not identical to the problem discussed in this paper.

Nunes et al. [ 11, 12 ] performed several analysis on lak but their focus was mainly in the dataset content. They also proposed other datasets to be interlinked with lak considering their links with DBpedia. By contrast, this paper focuses on analyzing the metadata for creating rankings of candidate datasets to be interlinked with lak using di erent recommendation techniques.

3. RECOMMENDATION APPROACHES 3.1 Bayesian ranking

A rank score function, inspired on conditional probabilities, that induces the ranking of the datasets in D (from the largest to the smallest score), can be de ned as follows: pa(t; di) jDj

X dj2St\Pdi Pdj 1 score(di; t) = ! X log(P (FjjDi)) + log(P (Di)) (1) j=1::n Based on the maximum likelihood estimate of the probabilities [ 8 ] in a training set of datasets, the above probabilities can be estimated as follows: P (FjjDi) =

count(fj; di) Pn j=1 count(fj; di) ; P (Di) =

count(di) Pm i=1 count(di) where count(fj; di) is the number of datasets in the training set that have feature fj and that are linked to di, and count(di) is the number of datasets in the training set that are linked to di, disregarding the feature set.

For the score function computation, some auxiliary functions help to avoid computing log(0) replacing this value by c, which is a constant small enough to penalize the datasets di that do not have datasets with features Fj linked to them or that do not have links from other datasets [ 5 ]. Thus, the idea is that, if the set of features of t is very often correlated with datasets that are linked to di and t is not already linked to di, then it is recommended to try to link t to di.

3.2 Social Network-based ranking

We propose to analyze the dataset interlinking recommendation problem in much the same way as the link prediction problem in Social Networks [ 7 ]. Analogously, the Linked Data network for D is a directed graph such that the nodes are the datasets in D and there is an edge between datasets u and v in D i there is a link from u to v. To obtain more accurate results, we combine two measures, Preferential Attachment (pa) and Resource Allocation (ra), into a single score [ 5 ], de ned as follows: score(t; di) = ra(t; di) + (2) pa(t; di) = jPdi j ; ra(t; di) = where Pdi is the popularity set of a dataset di 2 D, that is, the set of all datasets in D that have links to di, and St is the similarity set of a dataset t, that is, the set of all datasets in D that have features in common with t. The combined score induces the ranking of the datasets in D (from the largest to the smallest score) and gives priority to the ra score; the pa score, normalized by the total number of datasets to be ranked (jDj), will play a role when there is a tie or when the ra value is zero.

4. RESULT ANALYSIS 4.1 Data used in the experiments

We selected a subset of the datasets indexed by the DataHub, using the Learning Analytics and Knowledge dataset [ 14 ] as the target of the recommendation. From the DataHub catalog, we managed to obtain 295 datasets with at least one lak eu-parliament-media eu-institutions b3kat istat-immigration lobid-resources oecd-linked-data world-bank-linked-data

the-eurostat-linked-data eprtr

norwegian-geo-divisions educationalprograms_sisvu ontos-news-portal grrp red-uno-internacional-santil ana global-hunger-index-2011

euskadi-farmacias eurostat-rdf open-data-risp dblp-deusto-gnoss rechtspraak interactivemaps-gnoss eea-rod sandrart-net national-diet-library-authorities rkb-explorer-kaunas rkb-explorer-darmstadt rkb-explorer-jisc rkb-explorer-nsf rkb-explorer-ibm rkb-explorer-risks enakting-energy

museosespania-gnoss proyectoapadrina lobid-organisaftaiormnsers-markets-geographpirco-sdpaetac-tsu-naitnedd--tsretantdess-gnoss nextweb-gnoss rkb-explorer-epsrc knoesis-linked-sensor-data gnoss nytimes-linked-open-data geonames-semantic-web uk-legislation-api rkb-explorer-ieee rkb-explorer-pisa rkb-erxkpbl-oerxepr-lolaraesr-southampton

rkb-explorer-roma rkb-explorer-wiki telegraphis rkb-explorer-newcastle rkb-explorer-deploy rkb-explorer-eprints rkb-explorer-kistirkb-explorer-irit rkb-explorer-ft rkb-explorer-rae2001 rkb-explorer-resex rkb-explorer-citeserkebr-explorer-dblp

rkb-explorer-acm rkb-explorer-eurerckobm-explorer-lisbon rkb-explorer-curriculum rkb-explorer-ulm

rkb-explorer-budapest rkb-explorer-cordis environment-agency-bathing-water-quality miguiadeviajes-gnoss garnicaplywood fao-linked-data

event-media aegp-spanish-textile-and-clothing-association deustoentrepreneurship

green-competitiveness-gnoss ordnance-survey-linked-data open-data-euskadi

museums-in-italy rkb-explorer-ecs

biographical-directory-of-the-united-states-congress southampton-ecs-eprints rkb-explorer-italy rkb-explorer-dotac rkb-explorer-courseware ecs my-experiment feature (class, property or vocabulary). Among the datasets with links de ned, there are 139 datasets with 697 known links. Figure 1 presents a graph representing the datasets and their known links. In this graph, the size of a dataset node is proportional to the number of datasets linked to it (in-degree).

The number of distinct features between classes and properties was 11,868. The number of relations between datasets and classes or properties was 16,750, where 6,447 were references to classes and 10,303 were references to properties. For the details on how we extracted metadata from the DataHub catalog, see [ 5 ].

4.2 LAK features

As features of lak, we used a selected set of classes and properties obtained from the lak and from the LinkedUp project Web site. We ltered out, from 51 initial features, those that were not related to the content of the dataset and that are used in many datasets, such as owl:sameAs, rdf:Property, rdfs:Resource, among others. The core of the selected set comes from the SWC ontology7 (Semantic Web Conference), which describes academic conferences and establishes a convention on how to use classes and properties from other ontologies, mostly FOAF (Friend of a Friend ), for people and organizations, and SWRC (Semantic Web for Research Communities), for papers. It also includes metadata from other ontologies, such as SIOC (SemanticallyInterlinked Online Communities) and DC (Dublin Core). The selected lak features added to 37, where 31 of them are shared by other datasets in our set of data. A preview of the RecLAK interface showing the selected lak classes is presented in Figure 2.

4.3 Datasets with LAK features

The set of datasets (represented by their id in DataHub) that have at least one feature in common with lak consists 7http://data.semanticweb.org/ns/swc/ontology of 132 datasets, with 376 associations between datasets and lak features. Figure 3 presents a graph representing the datasets and their associated lak features. In this graph, the size of a feature node is proportional to the number of datasets having it.

Among the lak features, the most popular are from DC: dc:title, shared by 60 datasets, and dc:creator, with 56 datasets references, and from FOAF: foaf:name and foaf:homepage with, respectively, 41 and 36 other datasets beyond lak referring to them. The least popular features are metadata directly from SWC and SWRC ontologies (some of them used by only 1 dataset other than lak ). The datasets with more than 5 features shared with lak are shown in Table 1. The more expressive result is obtained by the rkb-explorer-webconf dataset which shares 31 features with lak. This was the most correlated dataset with the selected classes and properties of lak. The rkb-explorerwebconf is a semantic repository that publishes RDF linked data and co-reference information from the RKB Explorer initiative. This dataset includes information about authors and publications in several conferences, such as ESWC.

4.4 Dataset Interlinking recommendations

Using the score functions, brie y described in Section 3, we generated recommendations for lak. A preview of the RecLAK interface presenting the recommendations for LAK is presented in Figure 4.

The top 10 recommendations generated by each of the two approaches (Bayesian and Social Network-based rankings) and the respective score values estimated for each recommended dataset are presented in Table 2. The top 10 ranked datasets for each approach will be brie y described below. Bayesian ranking. The topmost-ranked is a generic dataset with concepts from the Semantic Web community. Dataset #2 is a well-known lexical database of English. Datasets from #3 to #6 positions of the Bayesian ranking presented tied scores. Dataset #3 is a dataset with concepts from tags generated by human annotators. Dataset #4 describes people, research groups and publications of the members of the Computer Science Department at the University of She eld. Dataset #5 is maintained by the chamber of deputies in Italy, which is working to publish quality linked data in several domains, including research. Dataset #6 describes the DBLP digital library, which provides bibliographic information on major computer science journals and proceedings. dblp also indexes the papers published in the LAK and EDM conferences. Dataset #7 is the Geonames dataset, which contains information about geographical locations. Dataset #8 contains information about languages, words, characters, and other human language-related entities to the Linked Data Web and Semantic Web. lexvo has links to WordNet and thesauris. Dataset #9 is a Linked Data version of the Association for Computing Machinery (ACM) digital library. Finally, dataset #10 is a dataset of the Library of Congress Subject Headings (LCSH), which catalogs materials stored by the Library of Congress and other libraries around the United States.

Social Network-based ranking. Since, there is some overlap between the top 10 recommendations of Social Network-based (SN-based) and Bayesian ranking, we will comment the top 10 datasets ranked only by the SN-based approach. Dataset #2 publishes the news vocabularies used by The New York Times as Linked Open Data. It covers data and resources about people, locations and organizations. Dataset #3 covers topics related to innovation, technology, business and education. Dataset #6 has links catalogued in the DataHub for other bibliographic datasets such as Citeseer, DBLP, ACM, IEEE and EPrints. Dataset #7 was created with the objective of being capable of networking the wide range of resources and information held by libraries and other cultural institutions in German-speaking countries. This dataset uses established vocabularies, such as FOAF. Dataset #9 describes e-prints and has links catalogued in the DataHub for other bibliographic datasets such as Citeseer, DBLP, ACM and IEEE. Dataset #10 is also a Linked Data version of publications information of the DBLP digital library, similar to sweto-dblp.

Discussion. Based on the top 10 rankings of both approaches, we identi ed three main groups of candidate datasets that were recommended to be interlinked with lak : generic: semanticweb-org, w3c-wordnet, tags2conrkb-explorer-acm

rkb-explorer-roma rkb-explorer-italy osm-semantic-network

rkb-explorer-lisbon sweto-dblp secold acorn-sat ifpri-linked-open-data-global-hunger-index

miguiadeviajes-gnoss rkb-explorer-newcatostxlecast-toxrefdb rkb-explorer-ibm rkb-explorer-cordis

twc-healthdata rkb-explorer-irit vivo-indiana-university rkb-explorer-deploy rkb-explorer-deepblue rkb-explorer-ieee rkb-explorer-ulm debian-package-tracking-nsoysbtelmprizes event-media open-food-facts foaf:name swrc:Proceedings aksworg greek-legal-entities foaf:homseoupthaamgpeton-ac-uk-prdocfisle-sheffieldkrystian-pietruszka swc:hasRelatedDocument swc:ConferenceEvent foaf:membfoearf:mbox_sha1seuumskadi-farmacias foaf:Person moviinesst-aanrgceen-htiunba-us-congressional-committees southampton-ac-uk-apps rkb-explorer-webscience lod2 southampton-ac-uk-phonebook foaf:Organization swrc:series cablegate twc-dbaetan-egfoicviaries-of-the-european-commission transparency-linked-data sparql-endpoint-staintsutsance-hub-people foaf:lastName

semanticweb-org foaf:firstName national-diet-library-authoeruit-ipeasrliament-media kdata eurostat-rdf national-diet-library-subject-headings instance-hub-organizations vivo-weil -cornel -medical-col ege vivo qualitywebdata-org arrayexpress_e-mtab-104 vivo-scripps-research-institute enipedia

southampton-ac-uk-services taxonconcept sandrart-net

southampton-ac-uk-org instance-hub-us-federal-agencies vivo-ponce delicious, geonames-semantic-web, lexvo, nytimeslinked-open-data, rkb-explorer-wiki with smaller popularity and having at least one feature of lak. bibliographic: dcs-she edl, linked-open-camera, sweto-dblp, rkb-explorer-acm, lcsh, dnb-gemeinsamenormdatei, rkb-explorer-eprints, rkb-explorer-dblp educational area: gnoss.

The top 10 recommendations of the rankings di er in some aspects. Considering the groups identi ed above, the Bayesian ranking contains a higher number of generic datasets, while the Social Network-based ranking contains a higher number of bibliographic datasets. This probably happens because Bayesian ranking prioritizes recommendations for lak of datasets linked from the larger number of other datasets having the larger number of lak features. On the other hand, the Social Network-based ranking prioritizes the datasets pointed by the larger number of other datasets The results also indicate that the selection of the feature set is very important because it directly in uences the generated rankings and can lead to recommendations of datasets which are more as well as less generic. In our experiments with lak, we ltered out some generic features (e.g., owl:sameAs), but included DC and FOAF elements. Thus, we expected that both generic and speci c datasets from our set of datasets were recommended. As the metadata used to triplify lak were not using classes and properties speci cally related to the application domain, this characteristic was not evidenced in the recommendation results.

5. CONCLUSIONS

This paper presented a detailed analysis, based on Bayesian classi ers and on Social Network Analysis techniques, to address the dataset interlinking recommendation problem for lak, using only metadata. Thus, the rank score functions are potentially useful to reduce the cost of dataset interlinking. For more information, including the full set of data used in the experiments, graphical visualizations and detailed results, we refer to the RecLAK Web application, avaliable at http://www.inf.puc-rio.br/~grlopes/RecLAK.

6. ACKNOWLEDGMENTS

This work was partly funded by CNPq, under grants 160326/2012-5, 301497/2006-0 and 57128/2009-9, and by FAPERJ, under grants E-26/170028/2008 and E26/103.070/2011.

[1]

Berners-Lee . Linked Data . In Design Issues. W3C , July 2006 .

[2]

K. A.

Kuznetsov . Scienti c data integration system in the linked open data space . Programming and Computer Software , 39 ( 1 ): 43 { 48 , Jan . 2013 .

[3]

L. A. P. P.

Leme ,

G. R.

Lopes ,

B. P.

Nunes ,

M. A.

Casanova , and

Dietze . Identifying candidate datasets for data interlinking . In ICWE'13 , pages 354 { 366 , 2013 .

[4]

G. R.

Lopes ,

L. A. P. P.

Leme ,

B. P.

Nunes ,

M. A.

Casanova , and

Dietze . Recommending tripleset interlinking through a social network approach . In WISE'13 , pages 149 { 161 , 2013 .

[5]

G. R.

Lopes ,

L. A. P.

Paes ,

B. P.

Nunes ,

M. A.

Casanova , and

Dietze . Comparing recommendation approaches for dataset interlinking . Technical report , Department of Informatics, PUC-Rio, 2013 .

[6]

B. F.

Loscio ,

Batista , and

Souza . Using information quality for the identi cation of relevant web data sources . In IIWAS'12 , pages 36 { 44 , New York, NY, USA, 2012 . ACM.

[7]

Lu , C.-H. Jin , and T. Zhou . Similarity index based on local paths for link prediction of complex networks . Physical Review E , 80 ( 4 ): 046122 , 2009 .

[8]

C. D.

Manning and

Schu tze. Foundations of Statistical Natural Language Processing . MIT Press, 2002 .

[9]

Nikolov and M. d'Aquin. Identifying Relevant Sources for Data Linking using a Semantic Web Index . In LDOW'11 , 2011 .

[10]

Nikolov , M. d'Aquin,

and E.

Motta . What Should I Link to? Identifying Relevant Sources and Classes for Data Linking . In JIST'12 , pages 284 { 299 . Springer Berlin Heidelberg, 2012 .

[11]

B. P.

Nunes ,

Fetahu , and

M. A.

Casanova . Cite4me: Semantic retrieval and analysis of scienti c publications . In LAK (Data Challenge) , volume 974 of CEUR Workshop Proceedings. CEUR-WS.org , 2013 .

[12]

B. P.

Nunes ,

Fetahu ,

Dietze , and

M. A.

Casanova . Cite4me: A semantic search and retrieval web application for scienti c publications . In ISWC (Posters & Demos) , volume 1035 of CEUR Workshop Proceedings , pages 25 { 28 . CEUR-WS.org, 2013 .

[13]

H. R. d.

Oliveira ,

A. T.

Tavares , and

B. F.

Loscio . Feedback-based data set recommendation for building linked data applications . In I-SEMANTICS'12 , pages 49 { 55 , 2012 .

[14]

Taibi and

Dietze . Fostering analytics on learning analytics research: the lak dataset . In LAK (Data Challenge) , volume 974 of CEUR Workshop Proceedings. CEUR-WS.org , 2013 .

[15]

Wagner ,

Haase ,

Rettinger , and

Lamm . Discovering related data sources in data-portals . In SemStats workshop, ISWC' 13 , 2013 .