<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analyses of RDF Triples in Sample Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jakub Starka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Svoboda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irena Mlynkova</string-name>
          <email>mlynkova@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Economics</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>XML and Web Engineering Research Group, Faculty of Mathematics and Physics, Charles University in Prague, Malostranské náměstí 25</institution>
          ,
          <addr-line>118 00 Prague 1</addr-line>
          ,
          <country>Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Linked Data principles, supported especially by RDF triples, have recently emerged to enrich the Web of Documents with the Web of Data. However, each application that processes RDF triples has to deal with their distribution, dynamics and scale. A good understanding of the structural and other features of such data therefore gives us a better chance of designing these applications efficiently, especially with respect to data storage, indexing and querying. The aim of this paper is to propose characteristics that appropriately capture and describe such features of RDF triples, and to provide experimental results over a few selected real-world RDF datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Linked Data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is not a particular standard; it is a set of common practices
and general rules by which we can contribute to the Web of Data, which emerged
recently to enrich the traditional Web of Documents. So, what are these rules?
First of all, each real-world entity should be assigned a unique URL identifier;
these identifiers should be dereferenceable by HTTP to obtain information about
these entities; and, finally, these entity representations should be interlinked
to form a global Linked Data cloud.
      </p>
      <p>
        Although there are other ways to follow the mentioned
Linked Data principles, the most promising one is the RDF standard [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It
assumes data modelled as triples with three components: subject, predicate and
object. These triples can also be viewed as graphs, where vertices correspond to
subjects and objects, while labelled edges represent the triples themselves.
      </p>
      <p>
        One of our ongoing research efforts should result in a proposal of a new
querying system dealing with large amounts of distributed and dynamic RDF
data, issues we previously identified as open problems of the existing approaches
to storing, indexing and querying RDF triples [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It is apparent
that, knowing the structural and other features of the data we want
to process, we are able to manage such data more efficiently.
      </p>
      <p>In fact, this idea determines the aim of this paper: we propose a set of
characteristics of RDF triples and provide experimental results over several
selected datasets. These characteristics capture features of individual triple
components, of the triples themselves and also structural features of RDF graphs, while
the performed experiments attempt to outline the nature of real-world RDF data.</p>
      <p>Outline. First of all, in Section 2 we explain the motivation for this paper.
Section 3 provides the basic theoretical background and definitions of the proposed RDF
characteristics, while Section 4 presents the results of the performed experiments. In
Section 5 we briefly discuss the related work, and, finally, Section 6 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>Motivation</title>
      <p>
        If we knew the characteristics of the data we want to process, we would have a better
chance of proposing algorithms and data structures that are more efficient
with respect to our expectations. This idea justifies the aim of
this paper: having understood the RDF triples we want to store, index and query,
we can, hopefully, achieve better results. Moreover, some
approaches require a sort of configuration (e.g. Structure Index by Tran
and Ladwig [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or Summary Index by Harth et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). But how can we provide
required parameters, if we do not know enough about data or queries?
      </p>
      <p>
        Therefore, we have proposed several characteristics we find interesting to
study. First of all, the majority of indexing approaches (e.g. Hexastore by Weiss
et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or BitMat Index by Atre et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) propose to store the components of
RDF triples and the triples themselves separately (even using fairly different
structures) in order to reduce space requirements. Knowledge of the string features of
these component values could support this practice.
      </p>
      <p>
        The second group of characteristics worth studying is related to query
evaluation and, in particular, to access patterns to individual triple components.
In the case of full-text querying, we usually do not care which particular triple
component should match the queried value, but in the case of structural querying
with languages like SPARQL [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we need suitable indices that allow us to efficiently access
particular components according to the given query. These indices can be
built, for example, on nested lists (Hexastore [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]) or B+-trees (RDF-3X by
Neumann and Weikum [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
      </p>
      <p>
        Finally, we can even attempt to study more complex characteristics based on
the structure of RDF graphs. When using SPARQL with queries based on graph
patterns, we often need to perform operations similar to traditional joins in relational
databases, with the difference that we are working with RDF triples, i.e.
graph data. This joining can be supported by appropriate indices as well,
for example by precomputed paths (RDF-3X [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) or stars (Structure Index [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]).
      </p>
      <p>It is apparent that this paper cannot encompass all possible features of RDF
data that influence how they can be processed. So, as we will see in the
following section, we have proposed at least several of them (those we consider
the most important with respect to our research intent) and attempted to
compute them over selected real-world datasets.</p>
    </sec>
    <sec id="sec-3">
      <title>Analyses</title>
      <p>Having described our motivation, we can move on to the core part of this
paper. First, we provide some essential definitions that give the basic
theoretical background needed to correctly
introduce the characteristics of RDF triples and datasets we want to study.</p>
      <sec id="sec-3-1">
        <title>Basic Definitions</title>
        <p>RDF triples are composed of three components: a subject, a predicate and
an object. Besides literal values, the main building blocks for the components of these
triples are URI (uniform resource identifier) references, as
expected by the RDF standard. However, we assume that these references are
always automatically translated to full URIs.</p>
        <p>Thus, we can introduce U as the domain of all possible URI values, i.e.
identifiers of resources. Analogously, assume that B is the domain of blank nodes and
L the domain of literals. We do not need to study the content of these domains;
we only use them to restrict the allowed values of individual triple components.</p>
        <p>Definition 1 (RDF Triple). We say that t = (s, p, o) is an RDF triple (or
just a triple), if s ∈ U ∪ B is a subject, p ∈ U is a predicate, and o ∈ U ∪ B ∪ L
is an object. We say that t is a data triple if o ∈ L.</p>
        <p>All values (we call them terms) from the domains U, B and L are treated as
ordinary strings. This allows us to get a deeper insight into the internal structure
of URIs, which generally conform to the scheme SchemeName : HierarchicalPart [ ? Query ]
[ # Fragment ] (we came across and studied only URLs, thus we could
make this simplification). First of all, for any term x, length(x) denotes the
length of x, i.e. the number of symbols it is composed of.</p>
        <p>Now, we describe how to split URI terms into two parts. Assume that x ∈ U
and p is the position of the last # symbol in x. Then we define prefix(x) as the
substring of x before p and suffix(x) as the substring after p. If there is no
Fragment part, then we analogously use the last occurrence of the / symbol in the
hierarchical part instead. This approach should capture the way URI terms
are usually used and designed by creators of data documents and ontologies.</p>
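        <p>The splitting rule can be illustrated by a minimal Python sketch. The function name split_uri is hypothetical, and two simplifications are assumed: the last / is searched for in the whole term rather than only in the hierarchical part, and a term without any separator keeps the whole string as its prefix.</p>
        <preformat><![CDATA[
```python
def split_uri(uri):
    """Split a URI term into (prefix, suffix).

    The split position is the last '#'; if the term has no fragment,
    the last '/' is used instead. The separator itself belongs to
    neither part.
    """
    pos = uri.rfind("#")
    if pos == -1:
        pos = uri.rfind("/")
    if pos == -1:
        # Assumption of this sketch: no separator means an empty suffix.
        return uri, ""
    return uri[:pos], uri[pos + 1:]


prefix, suffix = split_uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
print(prefix, suffix)
```
]]></preformat>
        <p>Terms that share a vocabulary or a data source then share the same prefix, which is what makes storing prefixes separately attractive.</p>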
        <p>Sets of RDF triples are commonly modelled as RDF graphs.</p>
        <p>Definition 2 (RDF Graph). Given a set of triples T, we define G = (V, T)
to be an RDF graph (or just a graph) as follows:
- V is a set of graph vertices, where V = { x | ∃ t ∈ T, t = (s, p, o) such that
x = s or x = o }, and
- T, as a set of directed graph edges, corresponds to the underlying set of triples.</p>
        <p>Although we use the term graph, RDF graphs are in fact directed multigraphs,
since there can be more than one edge between the same pair of vertices. Next, given a vertex
v ∈ V and an edge e = (s, p, o) ∈ T, we say that e is an ingoing edge to v if
v = o, and that e is an outgoing edge from v if v = s.</p>
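        <p>As an illustration of the graph view, the following Python sketch derives the vertex set and the ingoing and outgoing edges from a collection of triples. Triples are assumed to be plain 3-tuples of strings, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def vertices(triples):
    """Vertex set V: every term used as a subject or as an object."""
    return {x for (s, p, o) in triples for x in (s, o)}


def ingoing_edges(triples, v):
    """Edges (triples) whose object is v."""
    return [t for t in triples if t[2] == v]


def outgoing_edges(triples, v):
    """Edges (triples) whose subject is v."""
    return [t for t in triples if t[0] == v]


T = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:mbox", "ex:bob"),   # second edge between the same vertices
    ("ex:bob", "foaf:name", "Bob"),
]
print(sorted(vertices(T)))
print(len(ingoing_edges(T, "ex:bob")), len(outgoing_edges(T, "ex:bob")))
```
]]></preformat>
        <p>The two edges from ex:alice to ex:bob show why the structure is a multigraph: the edges differ only in their predicate.</p>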
      </sec>
      <sec id="sec-3-2">
        <title>Proposed Characteristics</title>
        <p>Following the discussed motivation, we are now able to propose several
characteristics that may be useful to know about the RDF data we want to store, query
or process in a different way.</p>
        <p>Term Features. The first group of proposed characteristics is connected with
features of individual terms in triples. First of all, the majority of existing
approaches for indexing and storing RDF data attempt to find methods of
reducing the space required to store the triples. For this purpose we can exploit the
fact that terms often repeat, or at least that their substrings often repeat, across
triples in a dataset.</p>
        <p>In other words, we can inspect the lengths of particular terms, either with respect
to their type (U, B and L domains), or altogether. Next, we can split terms
according to our definition of their prefix and suffix parts, exploring one suitable
way of finding shared substrings.</p>
        <p>Triple Features. Now, we focus on characteristics of triple components and
their categorisation. Suppose that we have a set of triples T. Given a particular
term x (regardless of its type), we may be interested in how many triples contain this
term at a particular component (subject, predicate or object). In other words,
given a suitable term x, we can define Projection_{s=x}(T) = { t | t ∈ T, t = (s, p,
o) and s = x } as a subject projection (or just S projection), i.e. the
set of all triples in T whose subject is equal to x. Analogously,
we can define Projection_{p=x}(T) and Projection_{o=x}(T) as the P projection and the O
projection respectively. If we model T as a graph, S and O projections correspond
to sets of outgoing and ingoing edges respectively.</p>
        <p>Moreover, this idea extends directly to projections on two
components at once. Therefore, we can define the SP projection, PO projection
and SO projection analogously. For example, Projection_{s=x,p=y}(T) = { t | t ∈ T,
t = (s, p, o), s = x and p = y } for two suitable terms x and y. In particular,
the SP projection is directly connected with the issue of multivalued properties
of RDF triples, which cause problems in relational databases.</p>
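        <p>Single and double projections can be sketched with one Python helper. The name projection and its keyword arguments are hypothetical; a component left as None is treated as unconstrained, and triples are assumed to be 3-tuples of strings.</p>
        <preformat><![CDATA[
```python
def projection(triples, s=None, p=None, o=None):
    """Return all triples matching the fixed components.

    Fixing one component gives an S, P or O projection; fixing two
    gives an SP, PO or SO projection.
    """
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


T = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:knows", "ex:carol"),
]
# The SP projection below has two triples, i.e. foaf:knows is a
# multivalued property of ex:alice.
print(len(projection(T, s="ex:alice", p="foaf:knows")))
```
]]></preformat>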
        <p>Star Patterns. Let G = (V, T) be a graph and v ∈ V a vertex. We define a
graph star to be a set of edges S_v = S_v^in ∪ S_v^out, where S_v^in = { e | e ∈ T, e =
(s, p, o) and v = o } is an ingoing star around v composed of the ingoing edges to
v, and, analogously, S_v^out = { e | e ∈ T, e = (s, p, o) and v = s } is an outgoing
star around v.</p>
        <p>Next, we define the signature sig(S_v) of a star S_v (whether full, ingoing or
outgoing) to be the set of all predicates involved in the star; in other words,
sig(S_v) = { x | t ∈ S_v, t = (s, p, o) and x = p }.</p>
        <p>Given a graph G, we can split its vertices V into disjoint sets according to
star signatures. This means that two vertices v_1, v_2 ∈ V belong to the same set
if sig(S_{v_1}) = sig(S_{v_2}). Since this classification is an equivalence relation over
V, we call these sets star classes. Analogously, we could introduce
ingoing/outgoing star classes considering only ingoing/outgoing edges respectively.</p>
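        <p>The grouping of vertices into star classes can be sketched in Python as follows. The function names and the direction parameter ("both", "in" or "out") are hypothetical, and the quadratic scan over the triples is chosen for brevity, not efficiency.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def star_signature(triples, v, direction="both"):
    """Set of predicates on the edges of the star around vertex v."""
    signature = set()
    for s, p, o in triples:
        if direction in ("both", "out") and s == v:
            signature.add(p)
        if direction in ("both", "in") and o == v:
            signature.add(p)
    return frozenset(signature)


def star_classes(triples, direction="both"):
    """Map each star signature to the set of vertices that share it."""
    all_vertices = {x for (s, p, o) in triples for x in (s, o)}
    classes = defaultdict(set)
    for v in all_vertices:
        classes[star_signature(triples, v, direction)].add(v)
    return dict(classes)


T = [
    ("a", "knows", "b"),
    ("c", "knows", "d"),
    ("a", "name", "Al"),
    ("c", "name", "Cy"),
]
for signature, members in star_classes(T, "out").items():
    print(sorted(signature), sorted(members))
```
]]></preformat>
        <p>In this toy example, the vertices a and c fall into one outgoing star class because both have exactly the outgoing predicates knows and name, while all remaining vertices share the empty outgoing signature.</p>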
        <p>
          Star classes and their sizes can describe the uniformity of graph vertices; thus,
we can base additional characteristics on the notion of stars. This
idea is connected with (and inspired by) Tran et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and their Structure Index.
        </p>
        <p>
Path Patterns. Let G = (V, T) be a graph for a set of triples T and v_S, v_T ∈ V
two vertices. We say that a sequence of edges P_{v_S,v_T} = ⟨e_1, ..., e_n⟩ with length
n ∈ N_0 is a directed path from the source vertex v_S to the target vertex v_T if
the following conditions hold:
- First, ∀ k ∈ N, 1 ≤ k ≤ n: e_k = (s_k, p_k, o_k) and e_k ∈ T.
- If n &gt; 0, then s_1 = v_S and o_n = v_T. If n = 0, then necessarily v_S = v_T.
- Next, ∀ k ∈ N, 1 ≤ k &lt; n: o_k = s_{k+1}, i.e. edges follow each other.
- ¬∃ j, k ∈ N, 1 ≤ j &lt; k ≤ n: s_j = s_k or o_j = o_k or s_j = o_j; in other words,
vertices do not repeat.
        </p>
        <p>Given a particular path P_{v_S,v_T}, we can define its signature as the sequence of
predicates of its edges, i.e. sig(P_{v_S,v_T}) = ⟨p_1, ..., p_n⟩.</p>
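        <p>A minimal Python sketch of counting paths by signature follows. It is a plain depth-first enumeration under the assumption that no vertex may occur twice on a path (which also excludes self-loops and cycles); the function name is hypothetical and this is not the computation used in the experiments.</p>
        <preformat><![CDATA[
```python
from collections import Counter, defaultdict


def path_signature_counts(triples, length):
    """Count directed paths with exactly `length` edges (length >= 1),
    grouped by signature, i.e. by their sequence of predicates."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))

    counts = Counter()

    def extend(vertex, visited, signature):
        if len(signature) == length:
            counts[signature] += 1
            return
        for p, o in outgoing.get(vertex, ()):
            if o not in visited:  # vertices must not repeat
                extend(o, visited | {o}, signature + (p,))

    for start in outgoing:
        extend(start, frozenset([start]), ())
    return counts


T = [
    ("a", "p", "b"), ("b", "q", "c"),
    ("x", "p", "y"), ("y", "q", "z"),
    ("c", "r", "a"),
]
print(path_signature_counts(T, 2).most_common(1))
```
]]></preformat>
        <p>The most frequent signature returned by most_common(1) corresponds to the largest class of paths sharing one signature.</p>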
        <p>Directed paths can serve as another characteristic that is closely related to
the process of evaluating queries based on SPARQL graph patterns.</p>
        <p>Features Summary. The following listing provides a simplified overview of all
characteristics of RDF triples we have proposed in this paper:
- Term lengths: lengths of U and L terms viewed as strings.
- Term prefixes: lengths of prefixes and suffixes of U terms.
- Data triples: ratio of data and other triples in datasets.
- Triple projections: cardinalities of S, P, O and SP, PO, SO projections.
- Star patterns: sizes of graph, ingoing and outgoing star classes.
- Path patterns: path occurrences according to their signatures.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In this section, we first describe the publicly available datasets we have chosen for
our experiments, then we outline the basics of the implementation and, finally, we
present results over these datasets together with some general observations.</p>
      <sec id="sec-4-1">
        <title>Datasets Selection</title>
        <p>The selection of appropriate datasets is probably one of the most important
issues of any experiment. The first option would be to download a
representative sample of RDF triples from the entire Linked Data cloud. However, with
respect to the planned usage of our querying framework, we finally decided
to perform the experiments over a few selected datasets only. They come from
different sources, cover different thematic areas and contain several million
triples. Although we cannot omit DBPedia as one of the most important Linked
Data sources, we also selected other interesting ones. The datasets
are listed in the following summary, including the abbreviations we will use in
the further text:
- ACM (ACM publications<sup>3</sup>): ACM proceedings dataset with author and
publication information.
- DBCS (Czech DBPedia<sup>4</sup>): information extracted from Czech Wikipedia
infoboxes. This dataset contains less clean data, which is a common
situation in sources that are automatically derived from non-structured data.
- DBEN (English DBPedia<sup>5</sup>): information about persons (records like date
and place of birth etc.) extracted from English and German Wikipedia,
represented using the FOAF vocabulary.
- GO (Gene Ontology<sup>6</sup>): one of the datasets of the Bio2RDF project describing
publicly available DNA sequences.
- MDB (Movie Database<sup>7</sup>): database containing triples about actors, movies
and their relationships.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Implementation Basics</title>
        <p>We downloaded dumps of all the previously described datasets in one of these
formats: RDF/XML<sup>8</sup>, N-Triples<sup>9</sup> or Notation 3<sup>10</sup>. Then we parsed these dumps
using scripts<sup>11</sup> implemented in Java and Python.</p>
        <p>After the necessary data cleaning (some datasets contained syntax errors), we
stored all obtained triples in a MySQL database using Percona Server 5.5<sup>12</sup>
running on the Debian operating system.</p>
        <p>Since we wanted to compute the proposed characteristics efficiently,
we based the database schema on three tables: the first
table contains all URI prefixes, the second one full URI values, and, finally, the
third one contains the triples themselves. However, instead of URI terms, the third table stores
references to the second table, and instead of literals it contains their MD5 hash
values together with their original lengths. The simplified schema is shown in Figure 1.
The majority of the proposed characteristics was computed using MySQL scripts<sup>13</sup>.
The most interesting observations, together with detailed
experiment results, are described in the following text.</p>
        <p>3 http://acm.rkbexplorer.com/models/acm-proceedings.rdf
4 http://downloads.dbpedia.org/3.7-i18n/cs/infobox_properties_cs.nt.bz2
5 http://downloads.dbpedia.org/3.7/en/persondata_en.nt.bz2
6 http://s4.semanticscience.org/bio2rdf_download/rdf/genbank
7 http://queens.db.toronto.edu/~oktie/linkedmdb/linkedmdb-latest-dump.nt
8 http://www.w3.org/TR/rdf-syntax-grammar/
9 http://www.w3.org/TR/rdf-testcases/
10 http://www.w3.org/TeamSubmission/n3/
11 http://ksi.mff.cuni.cz/~starka/ld_parsers.zip
12 http://www.percona.com/software/percona-server/</p>
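        <p>To make the three-table layout more concrete, the following Python sketch mimics it with in-memory dictionaries. It is only an illustration under several assumptions: the class and function names are hypothetical, blank nodes are ignored, literal length is counted in characters, and the actual MySQL schema from Figure 1 may differ.</p>
        <preformat><![CDATA[
```python
import hashlib


def split_uri(uri):
    """Split at the last '#', or at the last '/' if there is no '#'."""
    pos = uri.rfind("#")
    if pos == -1:
        pos = uri.rfind("/")
    return (uri, "") if pos == -1 else (uri[:pos], uri[pos + 1:])


def encode_literal(value):
    """Replace a literal by its MD5 hex digest and its original length."""
    return hashlib.md5(value.encode("utf-8")).hexdigest(), len(value)


class TripleStore:
    """In-memory stand-in for the prefix, URI and triple tables."""

    def __init__(self):
        self.prefixes = {}  # prefix string -> prefix id
        self.uris = {}      # full URI -> (URI id, prefix id)
        self.triples = []   # (subject id, predicate id, object)

    def _uri_id(self, uri):
        if uri not in self.uris:
            prefix, _ = split_uri(uri)
            prefix_id = self.prefixes.setdefault(prefix, len(self.prefixes) + 1)
            self.uris[uri] = (len(self.uris) + 1, prefix_id)
        return self.uris[uri][0]

    def add(self, s, p, o, object_is_literal=False):
        obj = encode_literal(o) if object_is_literal else self._uri_id(o)
        self.triples.append((self._uri_id(s), self._uri_id(p), obj))


store = TripleStore()
store.add("http://ex.org/p/1", "http://ex.org/v#name", "Al", object_is_literal=True)
store.add("http://ex.org/p/2", "http://ex.org/v#name", "Al", object_is_literal=True)
print(len(store.prefixes), len(store.uris), len(store.triples))
```
]]></preformat>
        <p>Storing only a fixed-size hash and a length for each literal keeps the triple table compact while still allowing equality tests and length statistics.</p>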
        <p>General Characteristics. First, we present the basic characteristics of the
data: the numbers of unique prefixes, URIs and triples. These results show the
diversity of triples within particular datasets (see Table 1).</p>
        <p>We can see that there are only 11 unique prefixes in the ACM dataset, whereas
there are 10,157 prefixes in the DBCS dataset. These numbers suggest that the ACM
dataset is relatively closed (it contains mainly entities within its own domain,
i.e. publications and their authors), while DBCS contains many dirty triples, i.e.
triples where the object component is recognised as a URI but is not a part
of DBPedia (and probably not a part of the Linked Data cloud either).</p>
        <p>The results also show the average lengths of URIs. Although there are no
extreme values, we can see that the ACM dataset differs. This is because the URIs (in
all datasets) often contain artificial and often automatically generated identifiers
combined with entity types and/or human-readable names.</p>
        <p>13 http://ksi.mff.cuni.cz/~starka/ld_mysql.zip</p>
        <p>The detailed distribution of both URI and literal term lengths with respect
to the selected datasets can be seen in Figure 2. Since each dataset has a different
total number of triples, we normalised the computed lengths by the total number
of terms in each dataset.</p>
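        <p>The normalisation can be sketched in Python as a relative frequency of each term length. The function name is hypothetical and the terms are assumed to be plain strings.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def length_distribution(terms):
    """Map each term length to the fraction of terms having that length."""
    counts = Counter(len(term) for term in terms)
    total = len(terms)
    return {length: n / total for length, n in sorted(counts.items())}


print(length_distribution(["a", "bb", "cc", "dddd"]))
```
]]></preformat>
        <p>Because the fractions of one dataset sum to one, distributions of datasets with very different sizes can be drawn in the same plot.</p>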
        <p>Figure 2. Distribution of term lengths: (a) Literals, (b) URIs.</p>
        <p>As we can see, all datasets except ACM use URIs around 40 characters long.
The ACM dataset differs because its identifiers are padded with numeric values,
so that its URIs all have the same length. On the other hand, the lengths of
literals mostly range from 1 to 20 characters. This is caused by the usage of
common values, i.e. person names, dates, numbers, etc. Only the ACM dataset
contains textual literals, in particular keywords and concatenated lists of authors.</p>
        <p>Triple Projections. Assume that T represents a particular dataset, θ is a
comparison operator over N (e.g. = or &gt;) and c ∈ {s, p, o} stands for a particular
triple component. Then we can define size_{θz,c} = |{ x : |Projection_{c=x}(T)| θ z }| as a
shortcut for the number of terms x whose projections Projection_{c=x}(T) on
the component c have exactly z triples in the case of =, more than z
triples in the case of &gt; (and analogously for the other comparison operators).</p>
        <p>We can also define size_{θz,c_1,c_2} = |{ (x_1, x_2) : |Projection_{c_1=x_1,c_2=x_2}(T)| θ z }|
for double projections with c_1, c_2 ∈ {s, p, o} and c_1 ≠ c_2, as expected (θ denotes the comparison operator).</p>
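        <p>Both size notions can be computed with one short Python function. This is a sketch under assumptions: the names size and theta are hypothetical, the comparison operator is passed as a function from the standard operator module, and only terms (or pairs of terms) that occur in at least one triple are counted.</p>
        <preformat><![CDATA[
```python
import operator
from collections import Counter

COMPONENT_INDEX = {"s": 0, "p": 1, "o": 2}


def size(triples, components, theta, z):
    """Number of terms (or term pairs) whose projection cardinality n
    satisfies theta(n, z).

    `components` is "s", "p" or "o" for a single projection, or a
    two-letter string such as "sp" for a double projection.
    """
    cardinalities = Counter(
        tuple(t[COMPONENT_INDEX[c]] for c in components) for t in triples
    )
    return sum(1 for n in cardinalities.values() if theta(n, z))


T = [
    ("a", "p", "b"),
    ("a", "p", "c"),
    ("a", "q", "b"),
    ("d", "p", "b"),
]
print(size(T, "o", operator.eq, 1), size(T, "sp", operator.gt, 1))
```
]]></preformat>
        <p>Grouping the triples once and counting group sizes avoids materialising every projection separately.</p>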
        <p>These two notions help us to present interesting features of the triple
projection characteristics. In other words, we study the distribution of terms (or pairs
of terms in the case of double projections) according to their significance inside the given
datasets. For example, with the condition θ z set to = 1 and inspecting the
object components, we are interested in terms x such that there exists exactly one
triple in T with x at component o. Then size_{=1,o} gives us the number of such
x in T. Several projection results are presented in Table 2.</p>
        <p>The results show that there are usually only a few unique predicates
used in the triples. In DBCS, there are over 270 triples for each predicate, which
is the lowest ratio among all datasets. In the other datasets, there are thousands
of triples per predicate. For subjects, the average number varies from 2 to 20.</p>
        <p>Table 2. Projection results. Rows: Unique subjects; Unique predicates; Unique objects; size_{=1,o}; size_{&gt;1,o}; size_{=1,s,p}; size_{&gt;1,s,p}; size_{=1,p,o}; size_{&gt;1,p,o}; size_{=1,s,o}; size_{&gt;1,s,o}.</p>
        <p>In the second and third parts of the table, we show projections for O and for SP,
PO, SO respectively. In each case we split the entire space into two disjoint parts:
classes with size equal to 1 and classes with greater size. It is interesting that in
most cases the projections contain exactly one triple. We can also say that a
typical dataset contains only a very limited number of predicates. Subjects are
mostly used more than once, but they do not form large hubs.</p>
        <p>Star Patterns. Assume that T is the set of triples of a particular dataset; then we can
split the vertices V of the corresponding graph G = (V, T) into star classes according
to the signatures of their star patterns, as we already know. Figure 3 depicts the
distribution of star classes according to their sizes, separately for ingoing and
outgoing stars. In other words, e.g. for ingoing star patterns, the horizontal axis
represents the different possible sizes of signatures (different numbers of predicates
on ingoing edges) and the vertical axis represents the overall number of ingoing
star classes having the given size.</p>
        <p>The values are normalised in the same way as in the term length
characteristic, i.e. normalised by the total number of distinct star signatures in the
particular dataset.</p>
        <p>Figure 3. Distribution of star classes according to their sizes: (a) Outgoing, (b) Ingoing.</p>
        <p>We can see that most of the unique outgoing stars have a size (i.e. number
of outgoing predicates) from 10 to 30. Similarly, most of the ingoing stars have
a size from 10 to 30, except in the ACM dataset, where sizes are distributed
uniformly. Moreover, for all datasets except DBCS, the first 10% of star signatures
cover more than 80% of triples.</p>
        <p>Path Patterns. Similarly to star patterns, we also computed the path pattern
characteristics. In particular, we considered paths of lengths 2 and 3,
since longer paths were beyond our computational capabilities. For each path length
we determined the number of unique path signatures and the overall number of
paths conforming to them, as we can see in Table 3.</p>
        <p>Moreover, we also studied another aspect: given a particular number of
the most frequent path signatures, how many paths conform to these signatures?
The number of paths with the most frequent signature is presented in the
mentioned table, while the entire dependency is depicted in Figure 4.</p>
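        <p>Table 3. Path pattern results for path lengths 2 and 3 (row labels: Unique signatures, Number of paths, Greatest class; surviving column labels: GO, MDB). The extracted values, each group given as unique signatures / number of paths / greatest class:
- 7 / 3,382,538 / 1,026,874
- 33,394 / 1,191,731 / 27,786
- 0 / 0 / 0
- 67,107 / 1,428,871 / 15,531
- 14 / 178 / 38
- 0 / 0 / 0
- 55 / 1,300,120 / 247,477
- 275 / 2,470,993 / 248,633
- 206 / 26,863,416 / 2,754,908
- 664 / 15,804,941 / 550,887</p>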
        <p>Finally, according to the computed results, the ratio between unique path
signatures and all paths themselves is relatively low. In other words, for a
particular frequent signature there are many paths conforming to it, which can
be exploited in indexing techniques dealing with precomputed paths.</p>
        <p>Figure 4. Numbers of paths conforming to the most frequent path signatures: (a) ACM, (b) DBCS, (c) DBEN, (d) GO, (e) MDB.</p>
        <p>Related Work. Although there exist several works about analyses of semantic documents and
Linked Data, there are still open questions that could be discussed.</p>
        <p>
          We start this overview of the related work with one of our previous
papers [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], where we proposed a system for automatic document acquisition and
analysis. Although we primarily focused on structural characteristics of XML
documents, some basic ideas and insight into the complexity of exported datasets
can be applied also in the context of Linked Data.
        </p>
        <p>
          Ding et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] described the analysis of more than 1.5 million FOAF
documents. In particular, they inspected the usage of the FOAF namespace, host
names and particular properties, as well as the relationships of a person in a
group and other components of a social network. In general, this work describes
several interesting characteristics, but its impact and context are very limited.
        </p>
        <p>
          Both the previous works assumed analyses at the document level, whereas
Rodriguez [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] looked at datasets from the Linked Data cloud in a more complex
way and computed some basic characteristics between them.
        </p>
        <p>
          The general statistics of the Linked Data cloud are described in Bizer et
al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The authors aimed at characteristics and link statistics between selected
datasets. These datasets were divided by different thematic domains, for which
several ingoing and outgoing statistics were computed. Provenance, licensing
and dataset-level metadata published together with these datasets were also
considered.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we focused on several characteristics of publicly available Linked
Data datasets. The results show that although the datasets come from different
areas and are published by different methods and institutions, some of their
characteristics are similar and, thus, the knowledge of these characteristics can be harnessed
to make the management of RDF data more efficient.</p>
      <p>We considered only a small sample of the Linked Data cloud as well as only a
limited set of proposed characteristics, dealing primarily with RDF triple
components and structure. Nevertheless, we hope that some
observations presented in this paper can be generalised, further extended and
appropriately exploited. In our future work, we plan to enrich these characteristics
and also to encompass a wider set of datasets and triples.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atre</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaoji</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Matrix "Bit" loaded: A Scalable Lightweight Join Query Processor for RDF Data</article-title>
          .
          <source>In: Proc. of the 19th Int. Conf. on World Wide Web</source>
          . pp.
          <volume>41</volume>
          –
          <fpage>50</fpage>
          . WWW '10, ACM, NY, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
          </string-name>
          , R.:
          <source>State of the LOD Cloud (March</source>
          <year>2011</year>
          ), http://www4.wiwiss.fu-berlin.de/lodcloud/state/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <volume>1</volume>
          –
          <fpage>22</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>How the Semantic Web is Being Used: An Analysis of FOAF Documents</article-title>
          .
          <source>In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05) - Track 4 - Volume 04</source>
          . pp.
          <fpage>113</fpage>
          –
          <lpage>122</lpage>
          . HICSS '05, IEEE Computer Society, Washington, DC, USA (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karnstedt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>K.U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Data Summaries for On-demand Queries over Linked Data</article-title>
          .
          <source>In: Proc. of the 19th Int. Conf. on World Wide Web</source>
          . pp.
          <fpage>411</fpage>
          –
          <lpage>420</lpage>
          . WWW '10, ACM, NY, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Manola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <source>RDF Primer</source>
          (
          <year>2004</year>
          ), http://www.w3.org/TR/rdf-primer/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>RDF-3X: A RISC-style Engine for RDF</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          <volume>1</volume>
          ,
          <fpage>647</fpage>
          –
          <lpage>659</lpage>
          (August
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SPARQL Query Language for RDF</article-title>
          (
          <year>2008</year>
          ), http://www.w3.org/TR/rdf-sparql-query/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>A Graph Analysis of the Linked Data Cloud</article-title>
          .
          <source>CoRR</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Starka</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svoboda</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sochna</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schejbal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mlynkova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bednarek</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Analyzer – A Complex System for Data Analysis</article-title>
          .
          <source>The Computer Journal</source>
          (
          <year>2011</year>
          ),
          Advance Access published October 13, 2011
          , DOI: 10.1093/comjnl/bxr103
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Svoboda</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mlynkova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Linked Data Indexing Methods: A Survey</article-title>
          .
          <source>In: On the Move to Meaningful Internet Systems: OTM 2011 Workshops</source>
          . pp.
          <fpage>474</fpage>
          –
          <lpage>483</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ladwig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Structure Index for RDF Data</article-title>
          .
          <source>In: Workshop on Semantic Data Management (SemData@VLDB)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karras</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Hexastore: Sextuple Indexing for Semantic Web Data Management</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          <volume>1</volume>
          ,
          <fpage>1008</fpage>
          –
          <lpage>1019</lpage>
          (August
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>