=Paper= {{Paper |id=Vol-538/paper-21 |storemode=property |title=DING! Dataset Ranking using Formal Descriptions |pdfUrl=https://ceur-ws.org/Vol-538/ldow2009_paper21.pdf |volume=Vol-538 |dblpUrl=https://dblp.org/rec/conf/www/ToupikovUDHT09 }} ==DING! Dataset Ranking using Formal Descriptions== https://ceur-ws.org/Vol-538/ldow2009_paper21.pdf
          DING! Dataset Ranking using Formal Descriptions

                                             Nickolai Toupikov, Jürgen Umbrich,
                                            Renaud Delbru, Michael Hausenblas,
                                                    Giovanni Tummarello
                                            DERI, National University of Ireland, Galway,
                                               IDA Business Park, Galway, Ireland
                                                firstname.lastname@deri.org

ABSTRACT                                                                Our thesis at hand now is that, based on a formal (high-
Considering that thousands if not millions of linked datasets        level) description of a dataset’s content and interlinking pro-
will be published soon, we motivate in this paper the need           vided by voiD, Semantic Web clients can effectively rank
for an efficient and effective way to rank interlinked datasets      datasets using well-known strategies such as PageRank [4]
based on formal descriptions of their characteristics. We            or HITS [5] in a very efficient way. Without such high-
propose DING (from Dataset RankING) as a new approach                level descriptions, the client would have to “crawl” a large
to rank linked datasets using information provided by the            number of documents in order to analyze and derive precise
voiD vocabulary. DING is a domain-independent link anal-             statistics about the content of a dataset, hereby requiring
ysis that measures the popularity of datasets by considering         an excessive amount of time and resources.
the cardinality and types of the relationships. We propose              The rest of the paper is structured as follows: the next
also a methodology to automatically assign weights to link           section discusses exitsing approaches. Then, we lay out
types. We evaluate the proposed ranking algorithm against            the foundations regarding the formal description of linked
other well known ones, such as PageRank or HITS, using               datasets in sec. 3 and render our proposal in detail (sec. 4).
synthetic voiD descriptions. Early results show that DING            Further, in sec. 5, we report on early findings when com-
performs better than the standard Web ranking algorithms.            paring our approach to widely used ones such as PageRank
                                                                     or HITS. We conclude in sec. 6 by discussing the proposed
                                                                     ranking methodology and point out possible future steps.
1.   MOTIVATION
   Following Marshall and Shipman [1], we understand linked
datasets in terms of the distributed database perspective.
                                                                     2.   EXISTING WORKS
The primary targeted consumers are expected to be ma-                   Link analysis has proven to be effective for query indepen-
chines; a fair degree of automation needs to be guaranteed           dent quality web search. PageRanks [4] and HITS [5] have
in order to enable new types of Web applications. While              been successfully applied to measure the importance of web
nowadays the number of datasets—published in accordance              pages by analysing their link structure. These two algo-
to the linked data principles [2]—is somewhat limited, this          rithms consider only one type of links, i.e. hyperlinks, but
is expected to change soon. Considering thousands if not             has been shown to improve the effectiveness of web search
millions of linked datasets1 , one can expect to get lost, soon      engines [6, 7].
when trying to identify appropriate datasets for a certain              When working on a finer granularity level - such as en-
task.                                                                tity level - with more heterogeneous links, the previous ap-
   Two issues come to mind when talking about selecting              proaches are no longer applicable. In such condition, by
(possibly many) datasets: efficiency and effectiveness.              assuming that links are equivalent, the analysis of entity
While the former basically refers to how fast certain datasets       relationships does not provide accurate results since links
can be identified, the latter focuses on the relevancy, that         of different types can have various impact on the ranking
is how well the dataset fulfills the stated requirements in a        computation.
certain context (the domain of the Web application). When               Recent works [8, 9] have extended PageRank to consider
faced with a list of potential candidates, one usually wants         different types of relations between entities or objects. Pop-
to rank them according to certain criteria in order to select        Rank [10], a domain independent object-level link analysis,
the most relevant ones.                                              proposes a machine learning approach to automatically as-
                                                                     signs a “popularity propagation factor” to each type of rela-
1
 A simple estimation might support this argument: take for           tions. ObjectRank [11] goes further by applying authority-
example relational databases such as MySQL found in nearly           based ranking to keyword search in databases where various
every modern Web application, or the manifold reposito-              objects are connected with semantic relations.
ries in the software development domain (for example, CVS               The Swoogle search engine [12] was the first one to pro-
or SVN) or registries (LDAP, OPACs, etc.)—each of them,
once on the Web of Data (using out-of-the-box linked data            pose, OntoRank, an adaptation of PageRank for Semantic
publishing tools such as Triplify [3]) represents at least one       Web resources. In their work, they compute popularity of
dataset.                                                             resources based on three levels of granularity: documents,
                                                                     terms and RDF graphs. In [13], a link analysis is applied
Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain                              at query time for computing the popularity of resources
.                                                                    and contexts (which can be seen as documents or datasets).
Their approach differentiates two levels of link analysis, re-        • The interlinking of a dataset in voiD, that is, its
sources and context graphs, and the different relationships             outgoing and incoming links, is represented using the
between them.                                                           void:linkPredicate property. We identify two poten-
   In this paper, we are studying how to improve search re-             tial dimensions that might be exploited for ranking:
sults by ranking datasets according to their popularity. Our
approach is based on link analysis between datasets by us-                 – regarding the semantics of the links (such as rdfs-
ing the information provided by the voiD descriptions. We                    :seeAlso vs. foaf:knows) and
consider the types of relationships but also the cardinality of            – on a quantitative level, that is regarding the num-
link sets. We propose also an automatic weighting scheme                     ber of interlinking triples.
to find appropriate weights for relation types.
                                                                      • The kind of and number of used vocabularies in a
                                                                        dataset can be seen from the void:vocabulary prop-
3.    DESCRIBING DATASETS                                               erty value.
   In order to realise our vision of a semantic ranking, we
build upon a formal description of the datasets and their             • Other voiD characteristics such as void:uriRegexPa-
interlinking. Only recently the Vocabulary of Interlinked               ttern or the technical features of a dataset (such as
Datasets (voiD) [14] has been released; voiD is an RDFS                 available serialisations) via void:TechnicalFeature)
vocabulary for describing linked datasets. A dataset in voiD            can not directly be used for ranking, though perfectly
is “a collection of data, published and maintained by a sin-            for filtering (as in case with categorisation).
gle provider, available as RDF, and accessible, for example,
                                                                     The following example in Fig. 1 may help highlight our
through dereferenceable HTTP URIs or a SPARQL end-
                                                                   thinking:
point”. Interlinking in voiD is modeled utilising a so called
linksets. A linkset in voiD is “a subset of a dataset, used
for storing triples to express the interlinking relationship be-
tween datasets; in each interlinking triple, the subject is a
resource hosted in one dataset and the object is a resource
hosted in another dataset”.
   Given that such voiD descriptions are published alongside
with the datasets, they can be collected via pings, by crawl-
ing, or simply follow-your-nose by a semantic indexer such
as Sindice [15] or the Yahoo! Search Monkey [16]. We as-
sume such a collection of voiD descriptions in the following.
We note further that, as voiD being metadata about linked
data, is RDF-grounded, we can use all current RDF tools
and libraries to process, store and visualise it. Further, it
is perfectly possible to go from the meta-level to the meta-
meta-level, that is having a voiD description about voiD
descriptions.
                                                                   Figure 1: Exemplary collection of four voiD descrip-
4.    DING—DATASET RANKING                                         tions.
  Our proposal for a semantic ranking of RDF datasets is
called DING (from Dataset RankING) and is based on voiD
descriptions of the datasets.
                                                                   4.2   DING! Implementation
                                                                      In this section, we present how we adapted the weighted
4.1     Exploiting voiD’s characteristics                          PageRank algorithm in order to perform the Dataset rank-
  Based on the voiD guide [17] we will review the relevant         ing based on their interconnection. We then explain how it
features of voiD in the following and discuss their suitability    is possible to assign automatically a weight to a certain link
with respect to dataset selection and ranking.                     type.
                                                                      PageRank is a ranking system that originates from Web
     • The size of the dataset, that is, for example the
                                                                   Search Engines using a random walk algorithm. The Rank-
       number of triples or the number of distinct subjects
                                                                   ing system evaluates the probability of finding the web surfer
       can be used for ranking. In voiD this is a void:statItem
                                                                   on any given page. This algorithm is based on the assump-
       property along with one of five predefined dimensions
                                                                   tion that when someone publishes a resource on the web, he
       such as void:numberOfTriples or void:numberOfDocu-
                                                                   will do his best to link the published resource — be it a web
       ments. We have argued in [18] recently that the sheer
                                                                   page, or in our case — a dataset — to the most relevant
       numbers of triples is likely not a good measure for its
                                                                   and trustworthy resources availiable on the web. Hence the
       value.
                                                                   relevancy is assumed to be related to a high degree of inlinks
     • Categorisation of datasets in voiD is done using dc-        from other web resources. And from a probabilistic point of
       terms:subject along with DBpedia [19] resources. This       view — the more inlinks a dataset has, the most likely the
       can be used in a first step to massively decrease the       ’random surfer’ will be lead to it in his journey.
       search space. It acts as a sort of lexicon allowing to         The original PageRank r(Pi ) of a web page i is given by
       lookup a category and find related datasets. As a sec-                                      X r(Pj )
       ond step, DING can be used to rank the list of datasets                          r(Pi ) =                              (1)
       matching a certain category.                                                              P ∈B
                                                                                                         |Pj |
                                                                                                j   Pi
1    : DS1 a void : Dataset ;                                                         probability of going to DS4 - since the predicate and num-
2          foaf : homepage < http :// example . org / cats / > ;                      ber of links associated to L1→3 are not the same as the ones
3          dcterms : subject
                 < http :// dbpedia . org / r e s o u r c e / Cats > ;
                                                                                      associated to L1→4 .
4          void : subset : DS1toDS3 ;                                                   The goal will hence be to define a weight function w(Li→j ).
5          void : subset : DS1toDS4 .                                                 The weight will then be normalized in order to generate the
6
                                                                                      transition probability pi→j as follows.
7    : DS2 a void : Dataset ;
8          foaf : homepage < http :// petfood . example . org / > ;
                                                                                                                      w(Li→j )
9          dcterms : subject                                                                            pi→j = P                                  (2)
                 < http :// dbpedia . org / r e s o u r c e / Cats > ;                                              k∈O(i) w(Li→k )
10         dcterms : subject
                 < http :// dbpedia . org / r e s o u r c e / P e t _ f o o d s > ;   The first approach is simply to define w(Li→j ) = n(Li→j ) .
11         void : subset : DS2toDS1 .                                                                                   600
12
                                                                                      In the case of Fig. 1, p1→3 = 2000+600   ' 0.23 and p1→4 =
                                                                                         2000
13   : DS1toDS3 a void : Linkset ;
                                                                                      2000+600
                                                                                                ' 0.77. However, this definition does not take into
14              void : sub jectsTa rget : DS1 ;
15              void : objectsTarget : DS3 ;
                                                                                      account the nature of the link, and the likelihood that the
16              void : linkPredicate foaf : interest ;                                user may well chose foaf:interest above dc:author to browse
17              void : statItem [                                                     into another dataset. As a result, additional weights can be
18                   rdf : value 600;                                                 assigned based on the nature of the predicate involved in
19                   scovo : dimension void : n um be rO f Tr ip l es ;
20              ] .                                                                   the link.
21                                                                                      The values assigned can be either statically predefined, or
22   : DS2toDS1 a void : Linkset ;                                                    computed dynamically, given the accumulated voiD infor-
23              void : target : DS1 ;
24              void : target : DS2 ;                                                 mation. We present our approach, based on TF-IDF a well
25              void : linkPredicate owl : sameAs ;                                   known algorithm when it comes to weight the relevance of
26              void : statItem [                                                     a term(in our case - the predicate), given its frequency in a
27                   rdf : value 10000;
28                   scovo : dimension void : n um be rO f Tr ip l es ;               data collection. Hence, the weight, given by TF-IDF would
29              ] .                                                                   be
                                                                                                                        n(Li→j )
                                                                                                    T F (Li→j ) =                                 (3)
          Listing 1: An exemplary voiD description.                                                                 maxk∈O(i) n(Li→k )

                                                                                                                                N
                                                                                                IDF (s(Li→j )) = log                              (4)
 Where BPi is the set of pages linking to Pi and |Pj | is the                                                          1 + f req(s(Li→j ))
 total number of pages linked by Pj . Hence, |P1j | is in fact                        Where f req(s(Li→j )) is the frequency of occurrence of linksets
 the probability for the random surfer to choose to go from Pj                        using the predicate of Li→j in the collection’s datasets. Fi-
 to Pi out of all pages linked by Pj . This probability referred                      nally, we define w as
 to as pj→i , can be modified in order to provide a weighting
 of the ”importance” of the hyperlink.                                                          w(Li→j ) = T F (Li→j ) × IDF (s(Li→j ))           (5)
    The parallel from web documents to voiD descriptions
 is done in a naive way. The web pages are now datasets,                              5.    EXPERIMENTS AND EARLY FINDINGS
 and the hyperlinks correspond to linksets joining the dataset                           In order to verify our thesis that formal descriptions of
 they belong to another:                                                              linked datasets help yielding better results for the ranking
       • Pi corresponds to an element Di define by void:dataset.                      of the datasets, we have set up an evaluation framework that
                                                                                      executes various ranking algorithms on a synthetic voiD de-
       • A hyperlink form in the page Pi pointing to the page Pj                      scription2 (see Fig. 2). It is composed of 15 artificial dataset
         will correspond to a void:linkset element connecting                         descriptions interlinked using 8 different predicates and par-
         Di and Dj , defined as void:subset of a dataset Di .                         titioned into two clouds (datasets 1 to 9 and 10 to 15). The
         The linkset will be referred to as Li→j . We also define                     experiment used several ranking algorithms to estimate the
         n(Li→j ) as the number of relations in the linkset, and                      generic relevancy of every artificial dataset within the syn-
         s(Li→j ) as the predicate declared in the linkset. For                       thetic cloud.
         example, in Fig. 1 n(L1→3 ) = 600 and s(L1→3 ) =
         "foaf:interest". L is the set of linksets defined in                         5.1    The setup
         the entire data collection.                                                     For the evaluation we use the Java Universal Network/-
                                                                                      Graph Framework (JUNG)3 to compare the DING algo-
       • Similarly to the set BPi of pages linking to Pi , we                         rithm with other established and well known ranking al-
         define O(i) = {j|∃Li→j ∈ L} as the indices of datasets                       gorithms. Further, we use a naive link-sum rank function
         linked from Di .                                                             (DRank ) as a baseline to discuss the results. Three out
                                                                                      of the four ranking algorithms are also extended with the
       • pi→j can be modified according to the information
                                                                                      DING link weight function. In detail we evaluate and com-
         available about the linkest Li→j , such as s(Li→j ) or
                                                                                      pare the following ranking algorithms:
         n(Li→j ), as well as general statistics over L.
                                                                                      2
                                                                                        The full benchmark data is available at http://sw.deri.
   Like in the web page link analysis, the links between                              org/2009/02/DING/example-void-collection.ttl. Un-
 datasets deserve a deeper analysis in order to obtain a finer                        fortunately no real-world voiD cloud was readily availiable
 ranking. For example in Fig. 1 the probability of the user                           for the experiment at the time of writing.
                                                                                      3
 going from DS1 to DS3 is likely to be different from the                               http://jung.sourceforge.net/
         Rank                    1               2              3               4              5              6
         STD PageRank        DS1 (0.20)      DS3 (0.15)     DS4 (0.14)     DS11 (0.12)    DS13 (0.072)   DS10 (0.056)
         DING PageRank       DS4 (0.18)      DS1 (0.14)     DS11 (0.12)    DS13 (0.091)   DS3 (0.081)    DS10 (0.074)
         DRank               DS10 (0.35)     DS13 (0.16)    DS12 (0.16)    DS11 (0.12)    DS15 (0.11)    DS14 (0.044)
         HITS                DS1 (0.43)      DS4 (0.28)     DS2 (0.11)     DS3 (0.094)    DS11 (0.022)   DS10 (0.017)


     Table 1: Evaluation results: the top 6 datasets for each ranking algorithm with their normalized score


     • DRank: A baseline ranking algorithm using a naive
       approach. The datasets are ranked according to the
       number of links they have with other datasets.

     • PageRank Google’s page rank algorithm [4].

     • DING PageRank modification of the PageRank Al-
       gorithm as described in Sec 4.2

     • HITS Another well known ranking algorithm is HITS
       [5]. For each data set in the voiD graph a “hubs-and-
       authorities” importance measure is calculated.

5.2     Results
   Table 1 lists the results of the evaluation. The naive rank-
ing approach, DRank, completely leaves out the first cloud
for having much less links in its linksets than the second
cloud. We see that standard PageRank and HITS algo-
rithms do not take into account the nature of the links and
rank DS1 first. Although DS1 is indeed heavily linked by
other datasets, it is mostly inlinked by ”weak” links like
owl:sameAs or rdfs:seeAlso. The information-theory view            Figure 2: Visualisation of the synthetical dataset
defined in tfidf suggests that these links — being the most        network
common ones — do not hold as much information content
as less common ones, and are therefore less significant. For
example while looking for information about an article, the        Acknowledgements
user will get more precise information following dcterms:-         Our work has partly been supported by the European Com-
author than a generic property such as rdfs:seeAlso, and           mission under Grant No. 217031, FP7/ICT-2007.1.2, project
is hence more likely to follow the former. As a result a           “Domain Driven Design and Mashup Oriented Development
dataset linked by uncommon links will likely be more signif-       based on Open Source Java Metaframework for Pragmatic,
icant than one linked by common ones - and should have a           Reliable and Secure Web Development” (Romulus)5 , by the
higher voiD ranking.                                               European FP7 project Okkam - Enabling a Web of Enti-
   Another advantage of PageRank that makes it very rel-           ties (contract no. ICT-215032), and by Science Foundation
evant for the Linked Data approach is that it gives a low          Ireland under Grant No. SFI/02/CE1/I131.
ranking to datasets that do not have inlinks. The value of
a dataset within the cloud is dependent on how well it is
linked by other datasets.                                          7.     REFERENCES
                                                                       [1] C.C. Marshall and F.M. Shipman. Which Semantic
6.    CONCLUSION                                                           Web? In HYPERTEXT ’03: Proceedings of the
   We have presented DING, a new approach to rank linked                   fourteenth ACM conference on Hypertext and
datasets based on voiD descriptions. Though one might ob-                  hypermedia, pages 57–66, New York, NY, USA, 2003.
ject that currently there are not many voiD descriptions                   ACM.
available4 we argued that this is very likely to change soon.          [2] T. Berners-Lee. Linked Data.
Further, the infrastructure to collect voiD descriptions is in             http://www.w3.org/DesignIssues/LinkedData.html,
place (voiD being RDF, the requirements to do so are min-                  2007.
imal).                                                                 [3] S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and
   We have motivated the need for a efficient and effective                D. Aumueller. Triplify - Lightweight Linked Data
way to rank datasets based on their characteristics (content-              Publication from Relational Databases. In
wise and with respect to the interlinking). Finally we have                International World Wide Web Conference (WWW
shown how DING performs in relation to existing ranking                    09), Madrid, Spain, page upcoming, 2009.
algorithms and discussed the results.                                  [4] Lawrence Page, Sergey Brin, Rajeev Motwani, and
                                                                           Terry Winograd. The pagerank citation ranking:
4
  Indeed one finds voiD descriptions at time of writing, al-
                                                                   5
ready; see for example http://void.rkbexplorer.com/.                   http://www.ict-romulus.eu/
     Bringing order to the web. Technical Report 1999-66,          group, 2009.
     November 1999.                                           [18] Michael Hausenblas, Wolfgang Halb, Yves Raimond,
 [5] Jon M. Kleinberg. Authoritative sources in a                  and Tom Heath. What is the Size of the Semantic
     hyperlinked environment. J. ACM, 46(5):604–632,               Web. In Proceedings of I-Semantics 2008, Graz,
     1999.                                                         Austria, 2008.
 [6] Sergey Brin and Lawrence Page. The anatomy of a          [19] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann,
     large-scale hypertextual web search engine. Comput.           R. Cyganiak, and Z. G. Ives. DBpedia: A Nucleus for
     Netw. ISDN Syst., 30(1-7):107–117, 1998.                      a Web of Open Data. In The Semantic Web, 6th
 [7] Soumen Chakrabarti, Byron E. Dom, David Gibson,               International Semantic Web Conference, 2nd Asian
     Jon Kleinberg, Ravi Kumar, Prabhakar Raghavan,                Semantic Web Conference, ISWC 2007 + ASWC
     Sridhar Rajagopalan, and Andrew Tomkins. Mining               2007, pages 722–735, 2007.
     the web’s link structure. Computer, 32(8):60–67, 1999.
 [8] Wenpu Xing and Ali Ghorbani. Weighted pagerank
     algorithm. In CNSR ’04: Proceedings of the Second
     Annual Conference on Communication Networks and
     Services Research, volume 0, pages 305–314,
     Washington, DC, USA, 2004. IEEE Computer Society.
 [9] Ricardo Baeza-Yates and Emilio Davis. Web page
     ranking using link attributes. In WWW Alt. ’04:
     Proceedings of the 13th international World Wide Web
     conference on Alternate track papers & posters, pages
     328–329, New York, NY, USA, 2004. ACM.
[10] Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and
     Wei-Ying Ma. Object-level ranking: bringing order to
     Web objects. In Proceedings of the 14th international
     conference on World Wide Web - WWW 05 WWW
     05, page 567. ACM, 2005.
[11] Andrey Balmin, Vagelis Hristidis, and Yannis
     Papakonstantinou. Objectrank: authority-based
     keyword search in databases. In VLDB ’04:
     Proceedings of the Thirtieth international conference
     on Very large data bases, pages 564–575. VLDB
     Endowment, 2004.
[12] Li Ding, Rong Pan, Timothy W. Finin, Anupam Joshi,
     Yun Peng, and Pranam Kolari. Finding and ranking
     knowledge on the semantic web. In International
     Semantic Web Conference, pages 156–170, 2005.
[13] Aidan Hogan, Andreas Harth, and Stefan Decker.
     Reconrank: A scalable ranking method for semantic
     web data with context. In Proceedings of Second
     International Workshop on Scalable Semantic Web
     Knowledge Base Systems (SSWS 2006), Athens, GA,
     USA., 11 2006.
[14] K. Alexander, R. Cyganiak, M. Hausenblas, and
     J. Zhao. Describing Linked Datasets - On the Design
     and Usage of voiD, the ’Vocabulary of Interlinked
     Datasets’. In WWW 2009 Workshop: Linked Data on
     the Web (LDOW2009), Madrid, Spain, 2009.
[15] G. Tummarello, R. Delbru, and E. Oren. Sindice.com:
     Weaving the Open Linked Data. In The Semantic
     Web, 6th International Semantic Web Conference,
     2nd Asian Semantic Web Conference, ISWC 2007 +
     ASWC 2007, pages 552–565, 2007.
[16] P. Mika. Microsearch: An Interface for Semantic
     Search. In Semantic Search, International Workshop
     located at the 5th European Semamntic Web
     Conference (ESWC 2008), volume 334 of CEUR
     Workshop Proceedings, pages 79–88. CEUR-WS.org,
     2008.
[17] K. Alexander, R. Cyganiak, M. Hausenblas, and
     J. Zhao. voiD guide—Using the Vocabulary of
     Interlinked Datasets. Community Draft, voiD working