=Paper= {{Paper |id=Vol-1137/lakdatachallenge2014_submission_2 |storemode=property |title=RecLAK: Analysis and Recommendation of Interlinking Datasets |pdfUrl=https://ceur-ws.org/Vol-1137/lakdatachallenge2014_submission_2.pdf |volume=Vol-1137 |dblpUrl=https://dblp.org/rec/conf/lak/LopesLNC14 }} ==RecLAK: Analysis and Recommendation of Interlinking Datasets== https://ceur-ws.org/Vol-1137/lakdatachallenge2014_submission_2.pdf
                RecLAK: Analysis and Recommendation of
                         Interlinking Datasets

                    Giseli Rabello Lopes                                 Luiz André P. Paes Leme
             Department of Informatics, PUC-Rio                         Computer Science Institute, UFF
                  Rio de Janeiro/RJ, Brazil                                   Niterói/RJ, Brazil
                   grlopes@inf.puc-rio.br                                  lapaesleme@ic.uff.br
                  Bernardo Pereira Nunes                                    Marco A. Casanova
             Department of Informatics, PUC-Rio                      Department of Informatics, PUC-Rio
                  Rio de Janeiro/RJ, Brazil                               Rio de Janeiro/RJ, Brazil
                   bnunes@inf.puc-rio.br                                 casanova@inf.puc-rio.br

ABSTRACT
This paper presents RecLAK, a Web application developed for the LAK Challenge 2014. RecLAK focuses on the analysis of the LAK dataset metadata and provides recommendations of potential candidate datasets to be interlinked with the LAK dataset. RecLAK follows an approach to generate recommendations based on Bayesian classifiers and on Social Network Analysis measures. Furthermore, RecLAK generates graph visualizations that explore the LAK dataset in relation to other datasets in the Linked Open Data (LOD) cloud. The results of the experiments contribute to the understanding and improvement of the LAK dataset. They can also help researchers in the fields covered by the LAK dataset, such as learning analytics and educational data mining.

1.   INTRODUCTION
The effort of publishing Linked Data has been accompanied by the creation of catalogs of Linked Data datasets, such as the DataHub¹, to make data findable and reusable. However, although extensive lists of open datasets are available in these catalogs, most data publishers typically link their datasets only to popular ones, such as DBpedia², Freebase³ and Geonames⁴. Although the linkage to popular datasets allows the exploration of external resources, it fails to cover more specialized data.

As a practical example of this scenario, we may highlight the LinkedUp project⁵, an initiative that aims at providing educational organizations and institutions with a collection of open data available on the Web. One of the datasets covered by the LinkedUp project is the Learning Analytics and Knowledge (LAK) dataset. The LAK dataset, referred to as lak, provides access to structured full text and metadata from key research publications in the field of learning analytics and educational data mining⁶. lak is regularly updated with data, for instance, from the LAK (Learning Analytics and Knowledge) and EDM (Educational Data Mining) conference series. According to the DataHub metadata, lak was not linked to other datasets, except DBpedia. However, an exploratory search in the DataHub revealed related datasets that lak could be linked to, such as other bibliographic datasets.

This scenario is very common. Most of the published datasets are still waiting to be linked and, therefore, they do not fulfill the requirements to be considered 5-star [1] and fail to take advantage of other data. Basically, as argued in [10], the linkage to popular datasets is favoured for two main reasons: the difficulty of finding related open datasets; and the strenuous task of discovering instance mappings between different datasets.

In this sense, lak will be explored as a case study. The recommendation challenge associated with the interlinking of lak in the LOD can be posed by considering two main questions:

Q1. For a dataset d, published in the LOD, is it interesting
    for the publisher of d to try to link it to lak?

Q2. For a dataset d, published in the LOD, is it interesting
    for the lak administrator to try to link lak to d?

In more detail, let t and d_i be two datasets. A link from t to d_i is a triple of the form (s, p, o) such that s is defined in t and o is defined in d_i. We say that t is linked to d_i, or that d_i is linked from t, iff there is at least one link from t to d_i. We also say that d_i is relevant for t iff there is at least one resource defined in d_i that can be linked from a resource defined in t.

¹ http://datahub.io/
² http://dbpedia.org
³ http://www.freebase.com
⁴ http://www.geonames.org/
⁵ http://linkedup-project.eu
⁶ http://lak.linkededucation.org
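The definitions of a link and of "linked to" given above can be illustrated with a minimal Python sketch. It assumes a hypothetical lookup table, defined_in, that maps each resource identifier to the dataset that defines it; the paper does not say how this is determined in practice, and the resource identifiers and function names below are made up for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (s, p, o)


def links_from(triples: Iterable[Triple], defined_in: Dict[str, str],
               t: str, d: str) -> List[Triple]:
    """Return the triples (s, p, o) such that s is defined in t and o in d."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if defined_in.get(s) == t and defined_in.get(o) == d
    ]


def is_linked(triples: Iterable[Triple], defined_in: Dict[str, str],
              t: str, d: str) -> bool:
    """t is linked to d iff there is at least one link from t to d."""
    return len(links_from(triples, defined_in, t, d)) > 0


# Hypothetical toy data: the identifiers below are illustrative only.
defined_in = {
    "ex:lak/paper1": "lak",
    "ex:lak/author1": "lak",
    "ex:dblp/rec1": "dblp",
}
triples = [
    ("ex:lak/paper1", "dc:creator", "ex:lak/author1"),  # internal to lak
    ("ex:lak/paper1", "owl:sameAs", "ex:dblp/rec1"),    # a link from lak to dblp
]

print(is_linked(triples, defined_in, "lak", "dblp"))
print(is_linked(triples, defined_in, "dblp", "lak"))
```

Note that the sketch only covers existing links. Relevance, as defined above, concerns resources that can be linked, which is exactly what the rank scores try to estimate.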
Questions Q1 and Q2 are special cases of the dataset interlinking recommendation problem, posed as follows:

     Given a finite set of datasets D and a dataset
     t, compute a rank score for each dataset d_i ∈ D
     such that the rank score of d_i increases with the
     chances of d_i being relevant for t.

In this paper, we first introduce two rank score functions to address the dataset interlinking recommendation problem. Then, we apply the functions to answer question Q2.

The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 briefly describes the recommendation approaches. Section 4 shows the result analysis of the metadata exploration and the generated recommendations. Finally, Section 5 presents some final remarks.

2.   RELATED WORK
In this paper, we use an extended version [5] of previous work [3, 4], which introduced the rank score functions based on the Bayesian and the Social Network approaches. The extended version also explores different sets of features related to the metadata of the datasets, such as properties, classes and vocabularies, to compute the rank score functions.

Nikolov et al. [9, 10] propose an approach to identify relevant datasets for interlinking by applying keyword searches and ontology matching techniques. Kuznetsov [2] describes a linking system, which is responsible for discovering relevant datasets for a given dataset and for creating instance-level links. When compared with these approaches, the rank score functions applied in this paper use only metadata and are, therefore, much simpler to compute and yet achieve a good performance [5].

Lóscio et al. [6] and Wagner et al. [15] propose techniques to find relevant datasets for user queries. The first approach is based on information quality criteria of correctness, schema completeness and data completeness, while the second one is based on the overlap between the sets of instances of datasets.

Oliveira et al. [13] use application queries and user feedback to discover relevant datasets. These papers aim at recommending datasets with respect to user queries, which is a problem close, but not identical, to the problem discussed in this paper.

Nunes et al. [11, 12] performed several analyses of lak, but their focus was mainly on the dataset content. They also proposed other datasets to be interlinked with lak considering their links with DBpedia. By contrast, this paper focuses on analyzing the metadata for creating rankings of candidate datasets to be interlinked with lak using different recommendation techniques.

3. RECOMMENDATION APPROACHES
3.1 Bayesian ranking
A rank score function, inspired by conditional probabilities, that induces the ranking of the datasets in D (from the largest to the smallest score), can be defined as follows:

\[ \mathrm{score}(d_i, t) = \Big( \sum_{j=1}^{n} \log P(F_j \mid D_i) \Big) + \log P(D_i) \tag{1} \]

Based on the maximum likelihood estimate of the probabilities [8] in a training set of datasets, the above probabilities can be estimated as follows:

\[ P(F_j \mid D_i) = \frac{\mathrm{count}(f_j, d_i)}{\sum_{j=1}^{n} \mathrm{count}(f_j, d_i)} \; ; \qquad P(D_i) = \frac{\mathrm{count}(d_i)}{\sum_{i=1}^{m} \mathrm{count}(d_i)} \]

where count(f_j, d_i) is the number of datasets in the training set that have feature f_j and that are linked to d_i, and count(d_i) is the number of datasets in the training set that are linked to d_i, disregarding the feature set.

For the score function computation, some auxiliary functions help to avoid computing log(0) by replacing this value by c, which is a constant small enough to penalize the datasets d_i that do not have datasets with features F_j linked to them or that do not have links from other datasets [5]. Thus, the idea is that, if the set of features of t is very often correlated with datasets that are linked to d_i and t is not already linked to d_i, then it is recommended to try to link t to d_i.

3.2 Social Network-based ranking
We propose to analyze the dataset interlinking recommendation problem in much the same way as the link prediction problem in Social Networks [7]. Analogously, the Linked Data network for D is a directed graph such that the nodes are the datasets in D and there is an edge from dataset u to dataset v in D iff there is a link from u to v. To obtain more accurate results, we combine two measures, Preferential Attachment (pa) and Resource Allocation (ra), into a single score [5], defined as follows:

\[ \mathrm{score}(t, d_i) = \mathrm{ra}(t, d_i) + \frac{\mathrm{pa}(t, d_i)}{|D|} \tag{2} \]

\[ \mathrm{pa}(t, d_i) = |P_{d_i}| \; ; \qquad \mathrm{ra}(t, d_i) = \sum_{d_j \in S_t \cap P_{d_i}} \frac{1}{|P_{d_j}|} \]

where P_{d_i} is the popularity set of a dataset d_i ∈ D, that is, the set of all datasets in D that have links to d_i, and S_t is the similarity set of a dataset t, that is, the set of all datasets in D that have features in common with t.

The combined score induces the ranking of the datasets in D (from the largest to the smallest score) and gives priority to the ra score; the pa score, normalized by the total number of datasets to be ranked (|D|), will play a role when there is a tie or when the ra value is zero.

4. RESULT ANALYSIS
4.1 Data used in the experiments
We selected a subset of the datasets indexed by the DataHub, using the Learning Analytics and Knowledge dataset [14] as the target of the recommendation. From the DataHub catalog, we managed to obtain 295 datasets with at least one
Figure 1: The datasets and their links.
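A graph such as the one in Figure 1 is an instance of the Linked Data network on which the Social Network-based score of Section 3.2 (Equation 2) operates. The following Python sketch illustrates that score on a toy network. All dataset names, links and features are hypothetical, and the function names are ours. The sketch assumes that the denominator in ra is the size of the popularity set of d_j, that t is excluded from its own similarity set, and that a dataset d_j with an empty popularity set contributes nothing to ra; the source does not state how these cases are handled.

```python
from typing import Dict, List, Set


def popularity_set(d: str, datasets: List[str],
                   links: Dict[str, Set[str]]) -> Set[str]:
    """P_d: the datasets in D that have links to d."""
    return {u for u in datasets if d in links.get(u, set())}


def similarity_set(t: str, datasets: List[str],
                   features: Dict[str, Set[str]]) -> Set[str]:
    """S_t: the datasets in D sharing at least one feature with t."""
    ft = features.get(t, set())
    return {d for d in datasets if d != t and features.get(d, set()) & ft}


def sn_score(t: str, d_i: str, datasets: List[str],
             links: Dict[str, Set[str]],
             features: Dict[str, Set[str]]) -> float:
    """score(t, d_i) = ra(t, d_i) + pa(t, d_i) / |D|."""
    p_di = popularity_set(d_i, datasets, links)
    pa = len(p_di)
    ra = 0.0
    for d_j in similarity_set(t, datasets, features) & p_di:
        size = len(popularity_set(d_j, datasets, links))
        if size > 0:  # assumption: skip d_j when no dataset links to it
            ra += 1.0 / size
    return ra + pa / len(datasets)


def rank(t: str, datasets: List[str], links: Dict[str, Set[str]],
         features: Dict[str, Set[str]]) -> List[str]:
    """Rank the candidates in D from the largest to the smallest score."""
    candidates = [d for d in datasets if d != t]
    return sorted(candidates,
                  key=lambda d: sn_score(t, d, datasets, links, features),
                  reverse=True)


# Hypothetical toy network: "u -> v" means dataset u has a link to dataset v.
datasets = ["a", "b", "c", "x", "y"]
links = {"a": {"x", "y"}, "b": {"x"}, "c": {"y", "a"}, "x": {"a"}, "y": {"b"}}
features = {"t": {"f1", "f2"}, "a": {"f1"}, "b": {"f2", "f3"},
            "c": {"f4"}, "x": {"f1"}, "y": {"f5"}}

for d in rank("t", datasets, links, features):
    print(d, round(sn_score("t", d, datasets, links, features), 3))
```

In this toy network, pa(t, d_i) is simply the in-degree of d_i, that is, the quantity that determines the node size in Figure 1, while ra rewards candidates that are linked from datasets similar to t.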


feature (class, property or vocabulary). Among the datasets with links defined, there are 139 datasets with 697 known links. Figure 1 presents a graph representing the datasets and their known links. In this graph, the size of a dataset node is proportional to the number of datasets linked to it (in-degree).

The number of distinct features, counting classes and properties, was 11,868. The number of relations between datasets and classes or properties was 16,750, of which 6,447 were references to classes and 10,303 were references to properties. For details on how we extracted metadata from the DataHub catalog, see [5].

4.2   LAK features
As features of lak, we used a selected set of classes and properties obtained from lak and from the LinkedUp project Web site. We filtered out, from 51 initial features, those that were not related to the content of the dataset and that are used in many datasets, such as owl:sameAs, rdf:Property, rdfs:Resource, among others. The core of the selected set comes from the SWC ontology⁷ (Semantic Web Conference), which describes academic conferences and establishes a convention on how to use classes and properties from other ontologies, mostly FOAF (Friend of a Friend), for people and organizations, and SWRC (Semantic Web for Research Communities), for papers. It also includes metadata from other ontologies, such as SIOC (Semantically-Interlinked Online Communities) and DC (Dublin Core). The selected lak features totaled 37, of which 31 are shared by other datasets in our set of data. A preview of the RecLAK interface showing the selected lak classes is presented in Figure 2.

4.3   Datasets with LAK features
The set of datasets (represented by their id in DataHub) that have at least one feature in common with lak consists

⁷ http://data.semanticweb.org/ns/swc/ontology
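To make the role of the features in the Bayesian ranking of Section 3.1 concrete, the following Python sketch evaluates Equation (1) with the maximum likelihood estimates on a toy training set. It is an illustration under stated assumptions, not the RecLAK implementation: the training data, the dataset and feature names, and the value chosen for the constant c that replaces log(0) are hypothetical, and the normalizing sum in P(F_j | D_i) is assumed to run over all features observed in the training set.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

LOG_ZERO = -1000.0  # hypothetical value for the constant c replacing log(0)


def safe_log(p: float, c: float = LOG_ZERO) -> float:
    """Return log(p), or the penalty c when p is zero."""
    return math.log(p) if p > 0 else c


def bayesian_scores(training: List[Tuple[Set[str], Set[str]]],
                    target_features: Set[str],
                    candidates: Iterable[str],
                    c: float = LOG_ZERO) -> Dict[str, float]:
    """Score each candidate d_i for a target t described by target_features.

    training holds one (features, linked_to) pair per training dataset.
    """
    count_d = Counter()    # count(d_i): training datasets linked to d_i
    count_fd = Counter()   # count(f_j, d_i): those that also have feature f_j
    vocabulary = set()     # assumption: all features seen in the training set
    for feats, linked_to in training:
        vocabulary |= feats
        for d in linked_to:
            count_d[d] += 1
            for f in feats:
                count_fd[(f, d)] += 1

    total = sum(count_d.values())
    scores = {}
    for d in candidates:
        denom = sum(count_fd[(f, d)] for f in vocabulary)
        prior = count_d[d] / total if total else 0.0
        score = safe_log(prior, c)                      # log P(D_i)
        for f in target_features:
            likelihood = count_fd[(f, d)] / denom if denom else 0.0
            score += safe_log(likelihood, c)            # log P(F_j | D_i)
        scores[d] = score
    return scores


# Hypothetical toy training set: (features of a dataset, datasets it links to).
training = [
    ({"f1", "f2"}, {"d1"}),
    ({"f2"}, {"d1", "d2"}),
    ({"f3"}, {"d2"}),
]
scores = bayesian_scores(training, {"f1", "f2"}, ["d1", "d2", "d3"])
for d in sorted(scores, key=scores.get, reverse=True):
    print(d, scores[d])
```

In this toy example, a candidate that is never linked from a training dataset with one of the target features is pushed down the ranking by the penalty c, which mirrors the intent described for the auxiliary functions in Section 3.1.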
Figure 2: Preview of the RecLAK interface showing the selected
lak classes.

Table 1: Top 10 datasets sharing features with lak.
  Dataset id                   # shared features
  rkb-explorer-webconf                31
  linked-open-vocabularies-lov         8
  krystian-pietruszka                  7
  aksworg                              7
  dcs-sheffield                        6
  southampton-ac-uk-profile            6
  jamendo-dbtune                       6
  sudocfr                              6
  rkb-explorer-webscience              6
  msc                                  6

of 132 datasets, with 376 associations between datasets and
lak features. Figure 3 presents a graph representing the
datasets and their associated lak features. In this graph,
the size of a feature node is proportional to the number of
datasets having it.

Among the lak features, the most popular are from DC
(dc:title, shared by 60 datasets, and dc:creator, referenced
by 56 datasets) and from FOAF (foaf:name and foaf:homepage,
referenced by, respectively, 41 and 36 datasets other than
lak). The least popular features are metadata taken directly
from the SWC and SWRC ontologies (some of them used by only
1 dataset other than lak).

The datasets with more than 5 features shared with lak are
shown in Table 1. The most expressive result is obtained by
the rkb-explorer-webconf dataset, which shares 31 features
with lak, making it the dataset most correlated with the
selected classes and properties of lak. rkb-explorer-webconf
is a semantic repository that publishes RDF linked data and
co-reference information from the RKB Explorer initiative.
This dataset includes information about authors and
publications in several conferences, such as ESWC.

Figure 3: The datasets and their associated lak features.

4.4   Dataset Interlinking recommendations
Using the score functions briefly described in Section 3,
we generated recommendations for lak. A preview of the
RecLAK interface presenting the recommendations for LAK
is shown in Figure 4.

The top 10 recommendations generated by each of the two
approaches (Bayesian and Social Network-based rankings)
and the respective score values estimated for each
recommended dataset are presented in Table 2. The top 10
ranked datasets for each approach are briefly described below.

Bayesian ranking. The topmost-ranked dataset is a generic
dataset with concepts from the Semantic Web community.
Dataset #2 is a well-known lexical database of English.
Datasets #3 to #6 of the Bayesian ranking have tied scores.
Dataset #3 is a dataset with concepts from tags generated by
human annotators. Dataset #4 describes people, research
groups and publications of the members of the Computer
Science Department at the University of Sheffield. Dataset #5
is maintained by the Chamber of Deputies in Italy, which is
working to publish quality linked data in several domains,
including research. Dataset #6 describes the DBLP digital
library, which provides bibliographic information on major
computer science journals and proceedings. DBLP also indexes
the papers published in the LAK and EDM conferences.
Dataset #7 is the Geonames dataset, which contains
information about geographical locations. Dataset #8 (lexvo)
brings information about languages, words, characters, and
other human language-related entities to the Linked Data Web
and Semantic Web. lexvo has links to WordNet and thesauri.
Dataset #9 is a Linked Data version of the Association for
Computing Machinery (ACM) digital library. Finally, dataset
#10 is a dataset of the Library of Congress Subject Headings
(LCSH), which catalogs materials stored by the Library of
Congress and other libraries around the United States.

Social Network-based ranking. Since there is some overlap
between the top 10 recommendations of the Social
Network-based (SN-based) and Bayesian rankings, we comment
only on the top 10 datasets that appear solely in the
SN-based ranking. Dataset #2 publishes the news vocabularies
used by The New York Times as Linked Open Data. It covers
data and resources about people, locations and organizations.
Dataset #3 covers topics related to innovation, technology,
business and education. Dataset #6 has links catalogued in
the DataHub to other bibliographic datasets such as Citeseer,
DBLP, ACM, IEEE and EPrints. Dataset #7 was created with the
objective of networking the wide range of resources and
information held by libraries and other cultural institutions
in German-speaking countries. This dataset uses established
vocabularies, such as FOAF. Dataset #9 describes e-prints and
has links catalogued in the DataHub to other bibliographic
datasets such as Citeseer, DBLP, ACM and IEEE. Dataset #10 is
also a Linked Data version of the publication information of
the DBLP digital library, similar to sweto-dblp.

Discussion. Based on the top 10 rankings of both approaches,
we identified three main groups of candidate datasets that
were recommended to be interlinked with lak:

   • generic: semanticweb-org, w3c-wordnet,
     tags2con-delicious, geonames-semantic-web, lexvo,
     nytimes-linked-open-data, rkb-explorer-wiki

   • bibliographic: dcs-sheffield, linked-open-camera,
     sweto-dblp, rkb-explorer-acm, lcsh,
     dnb-gemeinsame-normdatei, rkb-explorer-eprints,
     rkb-explorer-dblp

   • educational area: gnoss.

The top 10 recommendations of the rankings differ in
some aspects. Considering the groups identified above,
the Bayesian ranking contains a higher number of generic
datasets, while the Social Network-based ranking contains
a higher number of bibliographic datasets. This probably
happens because the Bayesian ranking prioritizes, as
recommendations for lak, datasets that are linked from a
larger number of other datasets sharing a larger number of
lak features. The Social Network-based ranking, on the other
hand, prioritizes datasets that are pointed to by a larger
number of other datasets with smaller popularity and having
at least one feature of lak.

The results also indicate that the selection of the feature
set is very important because it directly influences the
generated rankings and can lead to recommendations of
datasets that are more or less generic. In our experiments
with lak, we filtered out some generic features (e.g.,
owl:sameAs), but included DC and FOAF elements. Thus, we
expected that both generic and specific datasets from our
set of datasets would be recommended. As the metadata used
to triplify lak did not use classes and properties
specifically related to the application domain, this
characteristic was not evidenced in the recommendation
results.

5.   CONCLUSIONS
This paper presented a detailed analysis, based on Bayesian
classifiers and on Social Network Analysis techniques, to
address the dataset interlinking recommendation problem for
lak, using only metadata. Thus, the rank score functions are
Table 2: Top 10 ranked recommendations for lak.
   #  Bayesian ranking           score∗     #  SN-based ranking           score
   1  semanticweb-org          -162.025     1  geonames-semantic-web     13.738
   2  w3c-wordnet              -162.236     2  nytimes-linked-open-data   3.558
   3  tags2con-delicious       -163.025     3  gnoss                      3.051
   4  dcs-sheffield            -163.025     4  lcsh                       3.017
   5  linked-open-camera       -163.025     5  rkb-explorer-acm           2.430
   6  sweto-dblp               -163.025     6  rkb-explorer-wiki          2.408
   7  geonames-semantic-web   -3281.339     7  dnb-gemeinsame-normdatei   2.020
   8  lexvo                   -4107.754     8  lexvo                      2.017
   9  rkb-explorer-acm        -4114.493     9  rkb-explorer-eprints       1.632
  10  lcsh                    -4273.558    10  rkb-explorer-dblp          1.466
∗ Estimated using log2, c=-170 and considering only lak features shared with at least one dataset.
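
The overlap between the two rankings in Table 2 can be checked
with a short script. The sketch below is only illustrative and is
not part of RecLAK: the dataset ids are copied from Table 2, while
the helper names (rank_overlap, exclusive_ranks) and the use of
Jaccard similarity as a summary measure are our own assumptions.

```python
# Top 10 dataset ids per approach, in rank order, copied from Table 2.
BAYESIAN_TOP10 = [
    "semanticweb-org", "w3c-wordnet", "tags2con-delicious", "dcs-sheffield",
    "linked-open-camera", "sweto-dblp", "geonames-semantic-web", "lexvo",
    "rkb-explorer-acm", "lcsh",
]
SN_TOP10 = [
    "geonames-semantic-web", "nytimes-linked-open-data", "gnoss", "lcsh",
    "rkb-explorer-acm", "rkb-explorer-wiki", "dnb-gemeinsame-normdatei",
    "lexvo", "rkb-explorer-eprints", "rkb-explorer-dblp",
]


def rank_overlap(first, second):
    """Return (id, rank in first, rank in second) for ids in both lists."""
    rank_in_second = {ds: r for r, ds in enumerate(second, start=1)}
    return [
        (ds, r, rank_in_second[ds])
        for r, ds in enumerate(first, start=1)
        if ds in rank_in_second
    ]


def exclusive_ranks(ranking, other):
    """Return the 1-based ranks of entries of `ranking` absent from `other`."""
    other_ids = set(other)
    return [r for r, ds in enumerate(ranking, start=1) if ds not in other_ids]


shared = rank_overlap(BAYESIAN_TOP10, SN_TOP10)
sn_only_ranks = exclusive_ranks(SN_TOP10, BAYESIAN_TOP10)
# Jaccard similarity of the two top-10 sets (hypothetical summary measure).
jaccard = len(shared) / len(set(BAYESIAN_TOP10) | set(SN_TOP10))

for ds, bayes_rank, sn_rank in shared:
    print(f"{ds}: Bayesian #{bayes_rank}, SN-based #{sn_rank}")
print("SN-based only ranks:", sn_only_ranks)
print("Jaccard similarity:", jaccard)
```

With the ids of Table 2, four datasets appear in both top 10 lists
(geonames-semantic-web, lexvo, rkb-explorer-acm and lcsh), and the
datasets found only in the SN-based ranking occupy positions #2,
#3, #6, #7, #9 and #10, which are the ones described in the text.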


Figure 4: Preview of the RecLAK recommendation interface.

potentially useful to reduce the cost of dataset interlinking.
For more information, including the full set of data used
in the experiments, graphical visualizations and detailed
results, we refer the reader to the RecLAK Web application,
available at http://www.inf.puc-rio.br/~grlopes/RecLAK.

6.   ACKNOWLEDGMENTS
This work was partly funded by CNPq, under grants
160326/2012-5, 301497/2006-0 and 57128/2009-9, and
by FAPERJ, under grants E-26/170028/2008 and
E-26/103.070/2011.

7.   REFERENCES
 [1] T. Berners-Lee. Linked Data. In Design Issues. W3C,
     July 2006.
 [2] K. A. Kuznetsov. Scientific data integration system in
     the linked open data space. Programming and
     Computer Software, 39(1):43–48, Jan. 2013.
 [3] L. A. P. P. Leme, G. R. Lopes, B. P. Nunes, M. A.
     Casanova, and S. Dietze. Identifying candidate
     datasets for data interlinking. In ICWE’13, pages
     354–366, 2013.
 [4] G. R. Lopes, L. A. P. P. Leme, B. P. Nunes, M. A.
     Casanova, and S. Dietze. Recommending tripleset
     interlinking through a social network approach. In
     WISE’13, pages 149–161, 2013.
 [5] G. R. Lopes, L. A. P. P. Leme, B. P. Nunes, M. A.
     Casanova, and S. Dietze. Comparing recommendation
     approaches for dataset interlinking. Technical report,
     Department of Informatics, PUC-Rio, 2013.
 [6] B. F. Lóscio, M. Batista, and D. Souza. Using
     information quality for the identification of relevant
     web data sources. In IIWAS’12, pages 36–44, New
     York, NY, USA, 2012. ACM.
 [7] L. Lü, C.-H. Jin, and T. Zhou. Similarity index based
     on local paths for link prediction of complex networks.
     Physical Review E, 80(4):046122, 2009.
 [8] C. D. Manning and H. Schütze. Foundations of
     Statistical Natural Language Processing. MIT Press,
     2002.
 [9] A. Nikolov and M. d’Aquin. Identifying Relevant
     Sources for Data Linking using a Semantic Web Index.
     In LDOW’11, 2011.
[10] A. Nikolov, M. d’Aquin, and E. Motta. What Should I
     Link to? Identifying Relevant Sources and Classes for
     Data Linking. In JIST’12, pages 284–299. Springer
     Berlin Heidelberg, 2012.
[11] B. P. Nunes, B. Fetahu, and M. A. Casanova.
     Cite4me: Semantic retrieval and analysis of scientific
     publications. In LAK (Data Challenge), volume 974 of
     CEUR Workshop Proceedings. CEUR-WS.org, 2013.
[12] B. P. Nunes, B. Fetahu, S. Dietze, and M. A.
     Casanova. Cite4me: A semantic search and retrieval
     web application for scientific publications. In ISWC
     (Posters & Demos), volume 1035 of CEUR Workshop
     Proceedings, pages 25–28. CEUR-WS.org, 2013.
[13] H. R. d. Oliveira, A. T. Tavares, and B. F. Lóscio.
     Feedback-based data set recommendation for building
     linked data applications. In I-SEMANTICS’12, pages
     49–55, 2012.
[14] D. Taibi and S. Dietze. Fostering analytics on learning
     analytics research: the lak dataset. In LAK (Data
     Challenge), volume 974 of CEUR Workshop
     Proceedings. CEUR-WS.org, 2013.
[15] A. Wagner, P. Haase, A. Rettinger, and H. Lamm.
     Discovering related data sources in data-portals. In
     SemStats workshop, ISWC’13, 2013.