<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How much navigable is the Web of Linked Data?*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valeria Fionda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Malizia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIMES, University of Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics, University of Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <abstract>
        <p>Millions of RDF links connect data providers on the Web of Linked Data. Here, navigability is a key issue. This poster provides a preliminary quantitative analysis of this fundamental feature. Linked Data are published on the Web following the Linked Data principles [2]. One of them states that RDF links must be used to allow clients to navigate the Web of Linked Data from dataset to dataset. In particular, RDF links allow: (i) data publishers to connect the data they provide to data already on the Web; and (ii) clients to discover new knowledge by traversing links and retrieving data. Hence, navigability is a key feature. The Web of Linked Data is growing rapidly and both the set of data providers and the structure of RDF links continuously evolve. Is this growth taking place preserving the basic navigability principle? In this poster we try to answer this question by analyzing the pay-level domain (PLD) networks extracted from the last three Billion Triple Challenge datasets. In addition, we also analyze the sameAs network obtained by considering only owl:sameAs links. Some recent works analyzed sameAs networks [1, 3] to provide some statistics on the deployment status and use of owl:sameAs links [3] and to evaluate their quality [1]. However, to the best of our knowledge, this is the rst attempt to use the PLD and sameAs networks to perform a quantitative analysis of the navigability of the Web of Linked Data. Navigability indices. We model the Web of Linked Data as a directed graph G = hV; Ei where V = fv1; : : : ; vng is the set of vertices and E V V is the set of edges. The vertices of G represent the pay-level domains that identify data publishers in the Web of Linked Data. The edges of G represent directed links between di erent pay-level domains (i.e., there are no loops) and are ordered *V. Fionda's work was supported by the European Commission, the European Social Fund and the Calabria region. E. Malizia's work was supported by the ERC grant 246858 (DIADEM), while he was visiting the Department of Computer Science of the University of Oxford.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>pairs of vertices (vi; vj), where vi is the source vertex and vj is the target one.
Intuitively, an edge (vi, vj) models the fact that there is at least one URI having
PLD vi by dereferencing which an RDF link to a URI having PLD vj is obtained.</p>
      <p>We denote by vi ;G vj the existence of a path from vi to vj in G (otherwise
we write vi 6;G vj). For a graph G = hV; Ei, G = hV; E i is the closure of G
where (vi; vj) 2 E if and only if vi 6= vj and vi ;G vj. We de ne the reachability
matrix RG 2 f0; 1gn n such that RG[i; j] = 1 if and only if (vi; vj) 2 E (i.e.,
vi ;G vj). Moreover, we de ne the distance matrix DG 2 Nn n such that
DG[i; j] is the length of the shortest path between vi and vj (DG[i; j] = 1 if
vi 6;G vj). When G is understood we simply write vi ; vj, R[i; j], and D[i; j].</p>
      <p>To evaluate the navigability of the Web of Linked Data we use two indices.
The rst one is the reachability index (G), corresponding to the edge density of
G . The reachability index is the probability that between any two given vertices
of G there exists a path. In particular, (G) = n(n1 1) Pvi;vj2V;vi6=vj R[i; j] =
jE j1) . This index takes into account only the reachability between vertices and
n(n
implies that (G1) = (G2) for any pair of graphs G1=hV; E1i and G2=hV; E2i
such that G1 = G2, even if E1 E2 (or E2 E1).</p>
      <p>To take into account di erences in graph topologies, we use the e ciency
index e(G) [4]. This index exploits the distance matrix D to weight the
reachability by the (inverse of the) length of the shortest path between vertices. Given
a graph G, e(G) = n(n1 1) Pvi;vj2V;vi6=vj DR[[ii;;jj]] , where DR[[ii;;jj]] = 0 when vi 6; vj.
It can be shown that for any graph G, e(G) (G), and given two graphs
G1 = hV; E1i and G2 = hV; E2i such that E1 E2 then (G1) (G2), while
e(G1) &lt; e(G2). The index e( ) has been used in literature to measure how
e ciently small-world networks exchange information [4].</p>
      <p>Intuitively, the closer (G) is to 1, the more G is similar to a strongly
connected graph; on the other hand, the closer e(G) is to 1, the more G is similar
to a complete graph. Note that, e combines information on reachability and
information about the distances between pairs of vertices and it is not simply the
arithmetic mean of the inverse of the shortest paths lengths.</p>
      <p>
        Datasets. To perform our analysis we used the Billion Triple Challenge (BTC)
datasets of 2010, 201
        <xref ref-type="bibr" rid="ref1">1 and 2012</xref>
        3. Unfortunately, the BTC dataset for 2013 was
not crawled and a dataset for 2014 was not available to the date of submission.
We decided to use the pay-level domain (PLD) granularity to build our networks,
where the PLD of a URI is a sub-domain (generally one level below) of a generic
public top-level domain, for which users usually pay for. PLD domains in the Web
of Linked Data are often in one-to-one correspondence with Linked Open Data
datasets. We extracted the PLD network from each BTC dataset by considering
each RDF quad and adding an edge between the PLD of the context URI and the
PLD of the subject and object. In particular, we extracted two PLD networks:
the rst one (denoted by All) considers all types of links and the second one
(denoted by SA) considers only owl:sameAs links.
      </p>
      <p>Since BTC datasets are obtained by crawling the LOD cloud starting from
a set of seed URIs, we also extracted from each network the largest internal
connected component obtained by ignoring the PLDs \on the border" (i.e., those
without any outgoing edge) that are probably those where the crawling stopped.
We denote by All-I and SA-I are internal subnetworks extracted from All and
SA, respectively. Fig. 1 shows the SA-I network of the BTC 2012 dataset.
3</p>
    </sec>
    <sec id="sec-2">
      <title>Evaluation</title>
      <p>Table 1 reports our results. The table shows that and e on the complete
network, for both All and SA, decrease in 2011 with respect to 2010 and are
still decresing for the All network even in 2012. Moreover, the values obtained
for both and e are very small. For example, (All) = 4:899 10 4 for the BTC
2012 dataset means that given a random pair of PLDs from the All network the
probability that they are connected by a path is less than 0:5 . Translated to
the Web, this means that starting from a given a URI and following RDF links
only a very small portion of the Web of Linked Data can be reached. However,
an explanation for such a behavior could be that the BTC datasets are obtained
by crawling the Web and it is reasonable to think that PLDs at the \border" of
the network are those at which the crawling process stopped. If this is the case,
some links can actually be missing and our indices can be biased. In general,
a decrease over time of and e on the full Web of Linked Data highlight a
decrease in its navigability. Nevertheless, since in our case the BTC datasets
are used as representative samples of the full Web of Linked Data, this decrease
can be related to the fact that in 2012 and 2011 the crawler retrieved triples
spanning more data providers than in 2010 and a large portion of them is on the
All
SA
All-I
SA-I
border. Indeed, in general, both and e decrease if the proportion of the nodes
on the border increase with respect to the total size of the network.</p>
      <p>
        For this reason we decided to analyze the internal largest connected
components All-I and SA-I. As for All-I, on one hand it can be observed that the
di erence in the values of both indices between 201
        <xref ref-type="bibr" rid="ref1">1 and 2012</xref>
        is negligible. There
is, on the other hand, a big increase in both indices for the 2011 network with
respect to the 2010 one. It is evident, from these results, that the Web of Linked
Data gained a lot in navigability from 2010 to 2011, according to the All-I
sample, while the navigability remained almost unchanged in 2012. A similar trend
can be identi ed also on the SA-I network, apart from a noticeable decrease in
the navigability of the 2012 network compared to the 2011 one. Our results show
that, for example, in 2012 given a random pair of PLDs from the All-I network
the probability that they are connected by a path is greater than 86% with an
e ciency of 0:326. It is worth to point out that, in a distributed environment
as the Web, e ciency is a fundamental property that, besides reachability, is
related to the number of dereferences that must be performed to move from a
source PLD to a target one. Roughly speaking, lower values of e ciency for the
same value of reachability translate in more tra c on the network.
4
      </p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>Navigability is a key feature of the Web of Linked Data. We introduced some
indices to quantitatively measure the connectivity of Linked Data providers and
the e ciency of their connections. The results obtained show that, as hoped, the
navigability of the Web of Linked Data is increasing with its growth. However, in
order not to be biased toward a certain interpretation, it is important to stress
that the results obtained could have been in uenced by the crawling strategy
used to build the BTC datasets used in our analysis. We plan to perform our
analysis in the following years to monitor and hopefully con rm this trend.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>G.</given-names>
            <surname>Bartolomeo</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Salsano</surname>
          </string-name>
          .
          <article-title>A spectrometry of linked data</article-title>
          .
          <source>In LDOW</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>T.</surname>
          </string-name>
          Berners-Lee.
          <article-title>Linked data design issues</article-title>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shinavier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shangguan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          .
          <article-title>SameAs Networks and Beyond: Analyzing Deployment Status and Implications of owl:sameAs in Linked Data</article-title>
          .
          <source>In ISWC</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>V.</given-names>
            <surname>Latora</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Marchiori</surname>
          </string-name>
          .
          <article-title>E cient behavior of small-world networks</article-title>
          .
          <source>Phys. Rev. Lett.</source>
          ,
          <volume>87</volume>
          (
          <issue>19</issue>
          ):
          <fpage>198701</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>