Motivation

How much navigable is the Web of Linked Data?*

Valeria Fionda

Enrico Malizia

0 0 DIMES, University of Calabria , Italy 1 Department of Mathematics, University of Calabria , Italy

2010

Millions of RDF links connect data providers on the Web of Linked Data. Here, navigability is a key issue. This poster provides a preliminary quantitative analysis of this fundamental feature. Linked Data are published on the Web following the Linked Data principles [2]. One of them states that RDF links must be used to allow clients to navigate the Web of Linked Data from dataset to dataset. In particular, RDF links allow: (i) data publishers to connect the data they provide to data already on the Web; and (ii) clients to discover new knowledge by traversing links and retrieving data. Hence, navigability is a key feature. The Web of Linked Data is growing rapidly and both the set of data providers and the structure of RDF links continuously evolve. Is this growth taking place preserving the basic navigability principle? In this poster we try to answer this question by analyzing the pay-level domain (PLD) networks extracted from the last three Billion Triple Challenge datasets. In addition, we also analyze the sameAs network obtained by considering only owl:sameAs links. Some recent works analyzed sameAs networks [1, 3] to provide some statistics on the deployment status and use of owl:sameAs links [3] and to evaluate their quality [1]. However, to the best of our knowledge, this is the rst attempt to use the PLD and sameAs networks to perform a quantitative analysis of the navigability of the Web of Linked Data. Navigability indices. We model the Web of Linked Data as a directed graph G = hV; Ei where V = fv1; : : : ; vng is the set of vertices and E V V is the set of edges. The vertices of G represent the pay-level domains that identify data publishers in the Web of Linked Data. The edges of G represent directed links between di erent pay-level domains (i.e., there are no loops) and are ordered *V. Fionda's work was supported by the European Commission, the European Social Fund and the Calabria region. E. Malizia's work was supported by the ERC grant 246858 (DIADEM), while he was visiting the Department of Computer Science of the University of Oxford.

Motivation

pairs of vertices (vi; vj), where vi is the source vertex and vj is the target one. Intuitively, an edge (vi, vj) models the fact that there is at least one URI having PLD vi by dereferencing which an RDF link to a URI having PLD vj is obtained.

We denote by vi ;G vj the existence of a path from vi to vj in G (otherwise we write vi 6;G vj). For a graph G = hV; Ei, G = hV; E i is the closure of G where (vi; vj) 2 E if and only if vi 6= vj and vi ;G vj. We de ne the reachability matrix RG 2 f0; 1gn n such that RG[i; j] = 1 if and only if (vi; vj) 2 E (i.e., vi ;G vj). Moreover, we de ne the distance matrix DG 2 Nn n such that DG[i; j] is the length of the shortest path between vi and vj (DG[i; j] = 1 if vi 6;G vj). When G is understood we simply write vi ; vj, R[i; j], and D[i; j].

To evaluate the navigability of the Web of Linked Data we use two indices. The rst one is the reachability index (G), corresponding to the edge density of G . The reachability index is the probability that between any two given vertices of G there exists a path. In particular, (G) = n(n1 1) Pvi;vj2V;vi6=vj R[i; j] = jE j1) . This index takes into account only the reachability between vertices and n(n implies that (G1) = (G2) for any pair of graphs G1=hV; E1i and G2=hV; E2i such that G1 = G2, even if E1 E2 (or E2 E1).

To take into account di erences in graph topologies, we use the e ciency index e(G) [4]. This index exploits the distance matrix D to weight the reachability by the (inverse of the) length of the shortest path between vertices. Given a graph G, e(G) = n(n1 1) Pvi;vj2V;vi6=vj DR[[ii;;jj]] , where DR[[ii;;jj]] = 0 when vi 6; vj. It can be shown that for any graph G, e(G) (G), and given two graphs G1 = hV; E1i and G2 = hV; E2i such that E1 E2 then (G1) (G2), while e(G1) < e(G2). The index e( ) has been used in literature to measure how e ciently small-world networks exchange information [4].

Intuitively, the closer (G) is to 1, the more G is similar to a strongly connected graph; on the other hand, the closer e(G) is to 1, the more G is similar to a complete graph. Note that, e combines information on reachability and information about the distances between pairs of vertices and it is not simply the arithmetic mean of the inverse of the shortest paths lengths.

Datasets. To perform our analysis we used the Billion Triple Challenge (BTC) datasets of 2010, 201 1 and 2012 3. Unfortunately, the BTC dataset for 2013 was not crawled and a dataset for 2014 was not available to the date of submission. We decided to use the pay-level domain (PLD) granularity to build our networks, where the PLD of a URI is a sub-domain (generally one level below) of a generic public top-level domain, for which users usually pay for. PLD domains in the Web of Linked Data are often in one-to-one correspondence with Linked Open Data datasets. We extracted the PLD network from each BTC dataset by considering each RDF quad and adding an edge between the PLD of the context URI and the PLD of the subject and object. In particular, we extracted two PLD networks: the rst one (denoted by All) considers all types of links and the second one (denoted by SA) considers only owl:sameAs links.

Since BTC datasets are obtained by crawling the LOD cloud starting from a set of seed URIs, we also extracted from each network the largest internal connected component obtained by ignoring the PLDs \on the border" (i.e., those without any outgoing edge) that are probably those where the crawling stopped. We denote by All-I and SA-I are internal subnetworks extracted from All and SA, respectively. Fig. 1 shows the SA-I network of the BTC 2012 dataset. 3

Evaluation

Table 1 reports our results. The table shows that and e on the complete network, for both All and SA, decrease in 2011 with respect to 2010 and are still decresing for the All network even in 2012. Moreover, the values obtained for both and e are very small. For example, (All) = 4:899 10 4 for the BTC 2012 dataset means that given a random pair of PLDs from the All network the probability that they are connected by a path is less than 0:5 . Translated to the Web, this means that starting from a given a URI and following RDF links only a very small portion of the Web of Linked Data can be reached. However, an explanation for such a behavior could be that the BTC datasets are obtained by crawling the Web and it is reasonable to think that PLDs at the \border" of the network are those at which the crawling process stopped. If this is the case, some links can actually be missing and our indices can be biased. In general, a decrease over time of and e on the full Web of Linked Data highlight a decrease in its navigability. Nevertheless, since in our case the BTC datasets are used as representative samples of the full Web of Linked Data, this decrease can be related to the fact that in 2012 and 2011 the crawler retrieved triples spanning more data providers than in 2010 and a large portion of them is on the All SA All-I SA-I border. Indeed, in general, both and e decrease if the proportion of the nodes on the border increase with respect to the total size of the network.

For this reason we decided to analyze the internal largest connected components All-I and SA-I. As for All-I, on one hand it can be observed that the di erence in the values of both indices between 201 1 and 2012 is negligible. There is, on the other hand, a big increase in both indices for the 2011 network with respect to the 2010 one. It is evident, from these results, that the Web of Linked Data gained a lot in navigability from 2010 to 2011, according to the All-I sample, while the navigability remained almost unchanged in 2012. A similar trend can be identi ed also on the SA-I network, apart from a noticeable decrease in the navigability of the 2012 network compared to the 2011 one. Our results show that, for example, in 2012 given a random pair of PLDs from the All-I network the probability that they are connected by a path is greater than 86% with an e ciency of 0:326. It is worth to point out that, in a distributed environment as the Web, e ciency is a fundamental property that, besides reachability, is related to the number of dereferences that must be performed to move from a source PLD to a target one. Roughly speaking, lower values of e ciency for the same value of reachability translate in more tra c on the network. 4

Conclusions

Navigability is a key feature of the Web of Linked Data. We introduced some indices to quantitatively measure the connectivity of Linked Data providers and the e ciency of their connections. The results obtained show that, as hoped, the navigability of the Web of Linked Data is increasing with its growth. However, in order not to be biased toward a certain interpretation, it is important to stress that the results obtained could have been in uenced by the crawling strategy used to build the BTC datasets used in our analysis. We plan to perform our analysis in the following years to monitor and hopefully con rm this trend.

Bartolomeo and

Salsano . A spectrometry of linked data . In LDOW , 2012 .

2. T. Berners-Lee. Linked data design issues , 2006 .

Ding ,

Shinavier ,

Shangguan , and

McGuinness . SameAs Networks and Beyond: Analyzing Deployment Status and Implications of owl:sameAs in Linked Data . In ISWC , 2010 .

Latora and

Marchiori . E cient behavior of small-world networks . Phys. Rev. Lett. , 87 ( 19 ): 198701 , 2001 .