<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Analyses of RDF Triples in Sample Datasets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jakub</forename><surname>Stárka</surname></persName>
							<email>starka@ksi.mff.cuni.cz</email>
							<affiliation key="aff0">
								<orgName type="department">XML and Web Engineering Research Group Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University in Prague</orgName>
								<address>
									<addrLine>Malostranské náměstí 25</addrLine>
									<postCode>118 00</postCode>
									<settlement>Prague 1</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Economics</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Svoboda</surname></persName>
							<email>svoboda@ksi.mff.cuni.cz</email>
							<affiliation key="aff0">
								<orgName type="department">XML and Web Engineering Research Group Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University in Prague</orgName>
								<address>
									<addrLine>Malostranské náměstí 25</addrLine>
									<postCode>118 00</postCode>
									<settlement>Prague 1</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Economics</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Irena</forename><surname>Mlýnková</surname></persName>
							<email>mlynkova@ksi.mff.cuni.cz</email>
							<affiliation key="aff0">
								<orgName type="department">XML and Web Engineering Research Group Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University in Prague</orgName>
								<address>
									<addrLine>Malostranské náměstí 25</addrLine>
									<postCode>118 00</postCode>
									<settlement>Prague 1</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Analyses of RDF Triples in Sample Datasets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">25A6012EA0A07F18D311E0C2AED9734A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:05+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Linked Data principles, supported especially by RDF triples, have recently emerged to enrich the Web of Documents with the Web of Data. However, each application that processes RDF triples has to deal with their distribution, dynamics and scale. If we understand the structural and other features of such data, we have a better chance of designing these applications efficiently, especially with respect to data storage, indexing and querying. The aim of this paper is to propose characteristics that appropriately capture and describe such features of RDF triples, and to provide experimental results over a few selected real-world RDF datasets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Linked Data <ref type="bibr" target="#b2">[3]</ref> is not a particular standard; it is a set of common practices and general rules by which we can contribute to the Web of Data, which emerged recently to enrich the traditional Web of Documents. So, what are these rules? First of all, each real-world entity should be assigned a unique URL identifier; these identifiers should be dereferenceable by HTTP to obtain information about these entities; and, finally, these entity representations should be interlinked to form a global Linked Data cloud.</p><p>Although there are other ways to follow the Linked Data principles, the most promising is the RDF standard <ref type="bibr" target="#b5">[6]</ref>. It assumes data modelled as triples with three components: subject, predicate and object. These triples can also be viewed as graphs, where vertices correspond to subjects and objects, while labelled edges represent the triples themselves.</p><p>One of our ongoing research efforts should result in a proposal of a new querying system dealing with large amounts of distributed and dynamic RDF data, issues we previously identified as open problems of the existing approaches to storing, indexing and querying RDF triples <ref type="bibr" target="#b10">[11]</ref>. Knowing the structural and other features of the data we want to process allows us to manage such data more efficiently.</p><p>This idea determines the aim of this paper: we propose a set of characteristics of RDF triples and provide experimental results over several selected datasets. 
These characteristics capture features of individual triple components, of the triples themselves, and of the structure of RDF graphs, while the performed experiments attempt to outline the nature of real-world RDF data.</p><p>Outline. In Section 2 we explain the motivation for this paper. Section 3 provides basic theoretical background and definitions of the proposed RDF characteristics, while Section 4 presents the results of the performed experiments. In Section 5 we briefly discuss related work, and, finally, Section 6 concludes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Motivation</head><p>If we know the characteristics of the data we want to process, we have a better chance of proposing algorithms and data structures that are efficient with respect to our expectations. This idea justifies the aim of this paper: having understood the RDF triples we want to store, index and query, we can hopefully achieve better results. Moreover, some approaches require a form of configuration (e.g. Structure Index by Tran and Ladwig <ref type="bibr" target="#b11">[12]</ref> or Summary Index by Harth et al. <ref type="bibr" target="#b4">[5]</ref>). But how can we provide the required parameters if we do not know enough about the data or the queries?</p><p>Therefore, we have proposed several characteristics we find interesting to study. First of all, the majority of indexing approaches (e.g. Hexastore by Weiss et al. <ref type="bibr" target="#b12">[13]</ref> or BitMat Index by Atre et al. <ref type="bibr" target="#b0">[1]</ref>) propose to store the components of RDF triples and the triples themselves separately (even using fairly different structures) in order to reduce space requirements. Knowledge of the string features of these component values could support this practice.</p><p>The second group of characteristics worth studying is related to query evaluation and, in particular, to access patterns to individual triple components. In full-text querying, we usually do not care which particular triple component matches the queried value, but in structural querying such as SPARQL <ref type="bibr" target="#b7">[8]</ref>, we need suitable indices that allow us to efficiently access particular components according to the given query. 
These indices can be built, for example, on nested lists (Hexastore <ref type="bibr" target="#b12">[13]</ref>) or B+-trees (RDF-3X by Neumann and Weikum <ref type="bibr" target="#b6">[7]</ref>).</p><p>Finally, we can attempt to study more complex characteristics based on the structure of RDF graphs. When using SPARQL with queries based on graph patterns, we often need operations similar to traditional joins in relational databases, with the difference that we are working with RDF triples, i.e. graph data. These joins can be supported by appropriate indices as well, for example precomputed paths (RDF-3X <ref type="bibr" target="#b6">[7]</ref>) or stars (Structure Index <ref type="bibr" target="#b11">[12]</ref>).</p><p>This paper cannot encompass all features of RDF data that influence how they can be processed. As we will see in the following section, we have therefore proposed several of them (those we consider the most important with respect to our research intent) and computed them over selected real-world datasets.</p><p>Having described our motivation, we can move on to the core part of this paper. First, we provide the essential definitions and theoretical background needed to correctly introduce the characteristics of RDF triples and datasets we want to study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Basic Definitions</head><p>RDF triples are composed of three components: a subject, a predicate and an object. Besides literal values, the main building block for components of these triples is the URI (uniform resource identifier) reference, as expected by the RDF standard. However, we assume that these references are always automatically translated to full URIs.</p><p>Thus, we can introduce U as the domain of all possible URI values, i.e. identifiers of resources. Analogously, assume that B is the domain of blank nodes and L the domain of literals. We do not need to study the content of these domains; we only use them to restrict the allowed values of individual triple components.</p><p>Definition 1 (RDF Triple). We say that t = (s, p, o) is an RDF triple (or just a triple),</p><formula xml:id="formula_0">if s ∈ U ∪ B is a subject, p ∈ U is a predicate, and o ∈ U ∪ B ∪ L is an object. We say that t is a data triple if o ∈ L.</formula><p>All values (we call them terms) from domains U, B and L are seen as ordinary strings. This allows us to get deeper insight into the internal structure of URIs, which generally conform to the scheme SchemeName : HierarchicalPart [ ? Query ] [ # Fragment ] (we came across and studied only URLs, thus we could make this simplification). First of all, for any term x, length(x) denotes the length of x, i.e. the number of symbols it is composed of. Now, we describe how to split URI terms into two parts. Assume that x ∈ U and p is the position of the last # symbol in x. Then we define prefix(x) as the substring of x before p and suffix(x) as the substring after p. If there is no Fragment part, then we analogously use the last occurrence of the / symbol in the hierarchical part instead. 
This approach should capture the way URI terms are usually used and designed by creators of data documents and ontologies.</p><p>Sets of RDF triples are commonly modelled as RDF graphs.</p><p>Definition 2 (RDF Graph). Given a set of triples T, we define G = (V, T) to be an RDF graph (or just a graph) as follows:</p><p>- V is a set of graph vertices, where</p><formula xml:id="formula_1">V = { x | ∃ t ∈ T, t = (s, p, o) such that x = s or x = o },</formula><p>and - T, as a set of directed graph edges, corresponds to the underlying set of triples.</p><p>Although we use the term graph, RDF graphs are in fact directed multigraphs, since there can be several edges between the same pair of vertices. Next, given a vertex v ∈ V and an edge e = (s, p, o) ∈ T, we say that e is an ingoing edge to v if v = o, and that e is an outgoing edge from v if v = s.</p></div>
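<div xmlns="http://www.tei-c.org/ns/1.0"><p>The definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names Triple, split_uri and vertices are hypothetical, terms are plain strings, and the split simply looks for the last # and, failing that, the last / in the whole string, which ignores the corner case of a / inside a Query part.</p>
```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str  # subject: URI or blank node
    p: str  # predicate: URI
    o: str  # object: URI, blank node or literal


def split_uri(x: str) -> tuple[str, str]:
    """Split a URI term at the last '#', or at the last '/' if there is no '#'."""
    pos = x.rfind("#")
    if pos == -1:
        pos = x.rfind("/")
    if pos == -1:
        # Assumption of this sketch: without a separator the whole term is the prefix.
        return x, ""
    return x[:pos], x[pos + 1:]


def vertices(triples: set[Triple]) -> set[str]:
    """V = all terms that occur as the subject or the object of some triple."""
    return {t.s for t in triples} | {t.o for t in triples}


sample = {
    Triple("http://example.org/person/42",
           "http://xmlns.com/foaf/0.1/name",
           "Alice"),
    Triple("http://example.org/person/42",
           "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
           "http://xmlns.com/foaf/0.1/Person"),
}
```
<p>With the sample set, the predicate of the second triple is split at its # symbol, while the first predicate, which has no fragment, is split at its last / symbol. Predicates do not become vertices unless they also occur as a subject or an object.</p></div>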
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Proposed Characteristics</head><p>According to the discussed motivation, we are now able to propose several characteristics that may be useful to know about RDF data we want to store, query or process in a different way.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Term Features</head><p>The first group of proposed characteristics is connected with features of individual terms in triples. The majority of existing approaches for indexing and storing RDF data attempt to reduce the space required to store the triples. For this aim we can exploit the fact that terms, or at least their substrings, often repeat across triples in a dataset.</p><p>In other words, we can inspect the lengths of particular terms, either with respect to their type (U, B and L domains), or altogether. Next, we can split terms according to our definition of their prefix and suffix parts, exploring one suitable way of finding shared substrings.</p><p>Triple Features Now, we focus on characteristics of triple components and their categorisation. Suppose that we have a set of triples T. Given a particular term x (regardless of its type), we may be interested in how many triples contain this term at a particular component (subject, predicate or object). In other words, given a suitable term x, we can define Projection_{s=x}(T) = { t | t ∈ T, t = (s, p, o) and s = x } as a subject projection (or just S projection) corresponding to the set of all triples in T having the given fixed subject value equal to x. Analogously, we can define Projection_{p=x}(T) and Projection_{o=x}(T) as the P projection and O projection respectively. If we model T as a graph, S and O projections correspond to the sets of outgoing and ingoing edges of x respectively.</p><p>Moreover, this idea extends directly to projections on two components at once. Therefore, we can define the SP projection, PO projection and SO projection analogously. For example, Projection_{s=x,p=y}(T) = { t | t ∈ T, t = (s, p, o), s = x and p = y } for two suitable terms x and y. 
In particular, the SP projection is directly connected with the issue of multivalue properties of RDF triples, which cause problems in relational databases.</p><p>Star Patterns Let G = (V, T) be a graph and v ∈ V a vertex. We define a graph star to be a set of edges</p><formula xml:id="formula_2">S_v = S_v^in ∪ S_v^out,</formula><p>where S_v^in = { e | e ∈ T, e = (s, p, o) and v = o } is an ingoing star around v composed of ingoing edges to v, and, analogously, S_v^out = { e | e ∈ T, e = (s, p, o) and v = s } is an outgoing star around v. Next, we define the signature sig(S_v) of a star S_v (whether full, ingoing or outgoing) to be the set of all predicates involved in the given star; in other words,</p><formula xml:id="formula_3">sig(S_v) = { x | t ∈ S_v, t = (s, p, o) and x = p }.</formula><p>Given a graph G, we can split its vertices V into disjoint sets according to star signatures. This means that two vertices v_1, v_2 ∈ V belong to the same set if sig(S_{v_1}) = sig(S_{v_2}). Since this classification is an equivalence relation over V, we call these sets star classes. Analogously, we can introduce ingoing/outgoing star classes by considering only ingoing/outgoing edges respectively.</p><p>Star classes and their sizes can describe the uniformity of graph vertices; thus, we can base additional characteristics on the notion of stars. This idea is related to, and inspired by, the Structure Index of Tran and Ladwig <ref type="bibr" target="#b11">[12]</ref>.</p><p>Path Patterns Let G = (V, T) be a graph for a set of triples T and v_S, v_T ∈ V two vertices. We say that a sequence of edges P_{v_S,v_T} = e_1, ..., e_n with length n ∈ N_0 is a directed path from the source vertex v_S to the target vertex v_T, if the following conditions hold:</p><formula xml:id="formula_4">- First, ∀ k ∈ N, 1 ≤ k ≤ n: e_k = (s_k, p_k, o_k) and e_k ∈ T. - If n &gt; 0, then s_1 = v_S and o_n = v_T. If n = 0, then necessarily v_S = v_T. 
- Next, ∀ k ∈ N, 1 ≤ k &lt; n: o_k = s_{k+1}, i.e. edges follow each other. 
- ¬ ∃ j, k ∈ N, 1 ≤ j &lt; k ≤ n: s_j = s_k or o_j = o_k or s_j = o_j,</formula><p>in other words, vertices do not repeat.</p><p>Given a particular path P_{v_S,v_T}, we can define its signature as the sequence of predicates of its edges, i.e. sig(P_{v_S,v_T}) = p_1, ..., p_n.</p><p>Directed paths can serve as another characteristic that is closely related to the process of evaluating queries based on SPARQL graph patterns.</p></div>
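<div xmlns="http://www.tei-c.org/ns/1.0"><p>Triple projections and star classes can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the method used in the experiments: triples are plain (subject, predicate, object) tuples, the function names are hypothetical, and a vertex without edges in the chosen direction is placed into the class of the empty signature, a case the definitions leave open.</p>
```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object)


def projection(triples, s=None, p=None, o=None):
    """Triples matching every fixed component; None leaves a component free."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


def star_signature(triples, v, direction="both"):
    """Set of predicates on the ingoing, outgoing or all edges of vertex v."""
    sig = set()
    for s, p, o in triples:
        if direction in ("out", "both") and s == v:
            sig.add(p)
        if direction in ("in", "both") and o == v:
            sig.add(p)
    return frozenset(sig)


def star_classes(triples, direction="both"):
    """Group vertices by star signature; each group is one star class."""
    verts = {t[0] for t in triples} | {t[2] for t in triples}
    classes = defaultdict(set)
    for v in verts:
        classes[star_signature(triples, v, direction)].add(v)
    return dict(classes)


T = {
    ("a", "knows", "b"),
    ("a", "name", "Alice"),
    ("b", "name", "Bob"),
    ("c", "knows", "b"),
}
s_projection = projection(T, s="a")             # S projection for x = a
sp_projection = projection(T, s="a", p="name")  # SP projection for x = a, y = name
outgoing = star_classes(T, direction="out")     # outgoing star classes
```
<p>The sketch rescans all triples for every vertex, so it only suits small samples; for datasets with millions of triples the same grouping would be computed inside a database.</p></div>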
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Features Summary</head><p>The following listing provides a simplified overview of all characteristics of RDF triples proposed in this paper:</p><p>- Term lengths: lengths of U and L terms viewed as strings.</p><p>- Term prefixes: lengths of prefixes and suffixes of U terms.</p><p>- Data triples: ratio of data and other triples in datasets.</p><p>- Triple projections: cardinality of S, P, O and SP, PO, SO projections.</p><p>- Star patterns: sizes of graph, ingoing and outgoing star classes.</p><p>- Path patterns: path occurrences according to their signatures.</p></div>
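<div xmlns="http://www.tei-c.org/ns/1.0"><p>The path pattern characteristic from the listing above can be sketched as follows. The code is a hedged illustration with hypothetical function names, not the scripts used in the experiments. It follows the reading that no vertex may occur twice on a path, so it also rejects paths that return to their source vertex, and it assumes a path length of at least one edge.</p>
```python
from collections import Counter, defaultdict


def paths(triples, n):
    """Yield directed paths of n >= 1 edges whose vertices do not repeat."""
    out = defaultdict(list)  # subject -> outgoing edges
    for t in triples:
        out[t[0]].append(t)

    def extend(path, seen):
        if len(path) == n:
            yield tuple(path)
            return
        # Continue from the object of the last edge on the path.
        for edge in out[path[-1][2]]:
            if edge[2] not in seen:
                yield from extend(path + [edge], seen | {edge[2]})

    for first in triples:
        if first[0] != first[2]:  # skip self-loops
            yield from extend([first], {first[0], first[2]})


def signature_counts(triples, n):
    """Count paths of length n per signature (the sequence of predicates)."""
    return Counter(tuple(e[1] for e in path) for path in paths(triples, n))


T = {
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("b", "name", "Bob"),
    ("c", "name", "Carol"),
}
counts_2 = signature_counts(T, 2)
counts_3 = signature_counts(T, 3)
```
<p>Calling counts_2.most_common() orders the signatures by the number of conforming paths, which is the quantity needed to ask how many paths the most frequent signatures cover.</p></div>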
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section, we first describe the publicly available datasets we have chosen for our experiments, then we outline the implementation and, finally, we present results over these datasets together with some general observations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Datasets Selection</head><p>The selection of appropriate datasets is probably one of the most important issues in any experimental study. The first option would be to download a representative sample of RDF triples from the entire Linked Data cloud. However, with respect to the planned usage of our querying framework, we decided to perform the experiments over a few selected datasets only. They come from different sources, cover different thematic areas and contain several million triples. Although we cannot omit DBPedia as one of the most important Linked Data sources, we also selected other interesting ones. The datasets are listed in the following summary, together with the abbreviations we use in the rest of the text:</p><p>- ACM (ACM publications): ACM proceedings dataset with author and publication information.</p><p>- DBCS (Czech DBPedia): information extracted from Czech Wikipedia infoboxes. This dataset contains less clean data, which is a common situation in sources that are automatically derived from non-structured data.</p><p>- DBEN (English DBPedia): information about persons (records such as date and place of birth) extracted from English and German Wikipedia, represented using the FOAF vocabulary.</p><p>- GO (Gene Ontology): one of the datasets of the Bio2RDF project, describing publicly available DNA sequences.</p><p>- MDB (Movie Database): database containing triples about actors, movies and their relationships.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Implementation Basics</head><p>We downloaded dumps of all the previously described datasets in one of these formats: RDF/XML, N-Triples or Notation 3. Then we parsed these dumps using scripts implemented in Java and Python.</p><p>After the necessary data cleaning (some datasets contained syntax errors), we stored all obtained triples in a MySQL database using Percona Server 5.5 running on the Debian operating system.</p><p>To compute the proposed characteristics efficiently, we based the database schema on three tables: the first table contains all URI prefixes, the second one full URI values, and the third one the triples themselves. Instead of URI terms, the third table stores references to the second table, and instead of literals it contains their MD5 hashes together with their original lengths. The simplified schema is shown in Figure <ref type="figure" target="#fig_1">1</ref>. </p></div>
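<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three-table layout described above might look roughly like the following sketch. It is only an approximation for illustration: it uses an in-memory SQLite database from the Python standard library instead of MySQL, and all table names, column names and helper functions are assumptions, since the text does not spell out the actual schema.</p>
```python
import hashlib
import sqlite3

# Hypothetical schema: prefixes, full URI values, and triples referencing URIs.
SCHEMA = """
CREATE TABLE prefix (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
CREATE TABLE uri (
    id INTEGER PRIMARY KEY,
    prefix_id INTEGER REFERENCES prefix(id),
    value TEXT UNIQUE
);
CREATE TABLE triple (
    s INTEGER REFERENCES uri(id),
    p INTEGER REFERENCES uri(id),
    o INTEGER REFERENCES uri(id),  -- NULL for data triples
    o_md5 TEXT,                    -- MD5 of the literal, NULL otherwise
    o_length INTEGER               -- original length of the literal
);
"""


def uri_id(db, prefix, value):
    """Return the id of a full URI, inserting its prefix and the URI if needed."""
    db.execute("INSERT OR IGNORE INTO prefix (value) VALUES (?)", (prefix,))
    pid = db.execute(
        "SELECT id FROM prefix WHERE value = ?", (prefix,)
    ).fetchone()[0]
    db.execute(
        "INSERT OR IGNORE INTO uri (prefix_id, value) VALUES (?, ?)", (pid, value)
    )
    return db.execute("SELECT id FROM uri WHERE value = ?", (value,)).fetchone()[0]


def add_data_triple(db, s, p, literal):
    """Store a data triple; the literal is kept only as an MD5 hash plus length."""
    digest = hashlib.md5(literal.encode("utf-8")).hexdigest()
    db.execute(
        "INSERT INTO triple (s, p, o, o_md5, o_length) VALUES (?, ?, NULL, ?, ?)",
        (s, p, digest, len(literal)),
    )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
person = uri_id(db, "http://example.org/person", "http://example.org/person/42")
name = uri_id(db, "http://xmlns.com/foaf/0.1", "http://xmlns.com/foaf/0.1/name")
add_data_triple(db, person, name, "Alice")
```
<p>Storing a fixed-size hash and a length instead of the literal keeps the triple table compact while still allowing equality tests and length statistics, at the cost of losing the literal text itself.</p></div>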
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Experiment Results</head><p>The majority of the proposed characteristics were computed using MySQL scripts (http://ksi.mff.cuni.cz/~starka/ld mysql.zip). The following text describes the most interesting observations together with detailed experiment results.</p><p>General Characteristics First, we present the basic characteristics of the data: the numbers of unique prefixes, URIs and triples. These results show the diversity of triples within particular datasets (see Table <ref type="table" target="#tab_0">1</ref>). We can see that there are only 11 unique prefixes in the ACM dataset, whereas there are 10,157 prefixes in the DBCS dataset. These numbers suggest that the ACM dataset is relatively closed (it contains mainly entities within its own domain, i.e. publications and their authors), while DBCS contains many dirty triples, i.e. triples where the object component is recognised as a URI but is not a part of DBPedia (and probably not a part of the Linked Data cloud either).</p><p>The results also show the average lengths of URIs. Although there are no extreme values, we can see that the ACM dataset differs. This is because the URIs (in all datasets) often contain artificial and often automatically generated identifiers combined with entity types and/or human-readable names.</p><p>The detailed distribution of both URI and literal term lengths with respect to the selected datasets can be seen in Figure <ref type="figure" target="#fig_3">2</ref>. Since each dataset has a different total number of triples, we normalised the computed lengths by the total number of terms in each dataset. As we can see, all datasets except ACM use URIs around 40 characters long. The exception arises because identifiers in the ACM dataset are padded with numeric values, so these URIs all have the same length. On the other hand, the lengths of literals mostly range from 1 to 20 characters. This is caused by the usage of common values, i.e. 
person names, dates, numbers, etc. Only the ACM dataset contains textual literals, in particular keywords and concatenated lists of authors.</p><p>Triple Projections Assume that T represents a particular dataset, Θ is a comparison operator over N (e.g. = or &gt;) and c ∈ {s, p, o} stands for a particular triple component. Then we can define size_{Θz,c} = |{ x : |Projection_{c=x}(T)| Θ z }| as a shortcut for the number of terms x whose projections Projection_{c=x}(T) according to a particular component c have exactly z triples in the case of =, more than z triples in the case of &gt; (and analogously for the other comparison operators).</p><p>We can also define size_{Θz,c_1,c_2} = |{ (x_1, x_2) : |Projection_{c_1=x_1,c_2=x_2}(T)| Θ z }| for double projections with both c_1, c_2 ∈ {s, p, o} and c_1 ≠ c_2, as expected.</p><p>These two notions help us to present interesting features of the triple projection characteristics. In other words, we study the distribution of terms (or pairs of terms in the case of double projections) according to their significance inside the given datasets. For example, with the condition Θz equal to = 1 and inspecting the object components, we are interested in terms x such that there exists exactly one triple in T with x at component o. Then, size_{=1,o} gives us the number of such x in T. Several projection results are presented in Table <ref type="table" target="#tab_1">2</ref>.</p><p>The results show that there are usually only a few unique predicates used in the triples. In DBCS, there are on average over 270 triples for each predicate, which is the lowest ratio among all datasets. In the other datasets, there are thousands of triples per predicate. For subjects, the average number varies from 2 to 20. In the second and third parts of the table, we show projections for O and for SP, PO, SO respectively. In each case we split the entire space into two disjoint parts: classes with size equal to 1 and classes with greater size. 
Interestingly, in most cases the projections contain exactly one triple. We can also say that a typical dataset contains only a very limited number of predicates. Subjects are mostly used more than once, but they do not form large hubs.</p><p>Star Patterns Assume that T is the set of triples of a particular dataset; then we can split the vertices V of the corresponding graph G = (V, T) into star classes according to the signatures of their star patterns, as we already know. Figure <ref type="figure" target="#fig_5">3</ref> depicts the distribution of star classes according to their sizes, separately for ingoing and outgoing stars. In other words, e.g. for ingoing star patterns, the horizontal axis represents different possible sizes of signatures (different numbers of predicates on ingoing edges) and the vertical axis represents the overall number of ingoing star classes having the given size.</p><p>The values are normalised in the same way as in the term length characteristic, i.e. by the total number of distinct star signatures in the particular dataset. We can see that most of the unique outgoing stars have a size (i.e. number of outgoing predicates) from 10 to 30. Similarly, most of the ingoing stars have a size from 10 to 30, except in the ACM dataset, where sizes are distributed uniformly. Moreover, for all datasets except DBCS, the first 10% of star signatures cover more than 80% of triples.</p><p>Path Patterns Similarly to star patterns, we also computed the path pattern characteristics. In particular, we considered paths of lengths 2 and 3, since longer paths were beyond our computational capacity. 
For each path length we determined the number of unique path signatures and the overall number of paths conforming to them, as shown in Table <ref type="table" target="#tab_2">3</ref>.</p><p>Moreover, we also studied another aspect: given a particular number of the most frequent path signatures, how many paths conform to these signatures? The number of paths with the most frequent signature is presented in the mentioned table, while the entire dependency is depicted in Figure <ref type="figure" target="#fig_6">4</ref>. Finally, according to the computed results, the ratio between unique path signatures and all paths is relatively low. In other words, for a frequent signature there are many paths conforming to it, which can be exploited in indexing techniques dealing with precomputed paths.</p><p>Although there exist several works on the analysis of semantic documents and Linked Data, there are still open questions to be discussed.</p><p>We start this overview of related work with one of our previous papers <ref type="bibr" target="#b9">[10]</ref>, where we proposed a system for automatic document acquisition and analysis. Although we primarily focused on structural characteristics of XML documents, some basic ideas and insight into the complexity of exported datasets can also be applied in the context of Linked Data. Ding et al. <ref type="bibr" target="#b3">[4]</ref> described the analysis of more than 1.5 million FOAF documents. In particular, they inspected the usage of the FOAF namespace, host names and particular properties, as well as the relationships of a person in a group and other components of a social network. 
In general, this work describes several interesting characteristics, but its impact and context are very limited.</p><p>Both previous works assumed analyses at the document level, whereas Rodriguez <ref type="bibr" target="#b8">[9]</ref> looked at datasets from the Linked Data cloud in a more complex way and computed some basic characteristics between them.</p><p>The general statistics of the Linked Data cloud are described in Bizer et al. <ref type="bibr" target="#b1">[2]</ref>. The authors aimed at characteristics and link statistics between selected datasets. These datasets were divided into thematic domains, for which several ingoing and outgoing statistics were computed. Provenance, licensing and dataset-level metadata published together with these datasets were also considered.</p><p>In this paper we focused on several characteristics of publicly available Linked Data datasets. The results show that although the datasets come from different areas and are published by different methods and institutions, some of their characteristics are similar; thus, the knowledge of these characteristics can be harnessed to make the management of RDF data more efficient.</p><p>We considered only a small sample of the Linked Data cloud and only a limited set of characteristics, dealing primarily with RDF triple components and structure. Nevertheless, we hope that some observations presented in this paper can be generalised, further extended and appropriately exploited. 
In our future work, we plan to enrich these characteristics and also encompass a wider set of datasets and triples themselves.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>where S in v = { e | e ∈ T , e = (s, p, o) and v = o } is an ingoing star around v composed from ingoing edges to v, and, analogously, S out v = { e | e ∈ T , e = (s, p, o) and v = s } is an outgoing star around v.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. Database schema</figDesc><graphic coords="7,152.06,115.84,311.25,88.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 .</head><label>2</label><figDesc>Figure 2. Distribution of literal and URI term lengths</figDesc><graphic coords="8,145.91,207.46,155.62,89.76" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 3 .</head><label>3</label><figDesc>Figure 3. Distribution of ingoing and outgoing star class sizes</figDesc><graphic coords="10,145.91,115.84,155.62,89.76" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 4 .</head><label>4</label><figDesc>Figure 4. Aggregated number of paths according to the signature frequency</figDesc><graphic coords="11,312.47,208.72,102.33,73.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Term and triple characteristics</figDesc><table><row><cell></cell><cell>ACM</cell><cell>DBCS</cell><cell>DBEN</cell><cell>GO</cell><cell>MDB</cell><cell>Total</cell></row><row><cell></cell><cell></cell><cell cols="2">Term Counts</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Unique Prefixes</cell><cell>11</cell><cell>10,157</cell><cell>137</cell><cell>195</cell><cell>5,204</cell><cell>15,704</cell></row><row><cell>Unique URIs</cell><cell cols="6">810,266 162,625 867,428 1,187,775 1,327,165 4,355,259</cell></row><row><cell></cell><cell></cell><cell cols="2">Triple Counts</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="7">Unique Triples 2,715,890 1,426,244 4,502,983 7,411,868 5,291,548 21,348,533</cell></row><row><cell>Data Triples</cell><cell cols="6">840,008 1,019,355 3,006,569 2,418,975 2,418,413 9,703,320</cell></row><row><cell></cell><cell></cell><cell cols="2">Term Lengths</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Term prefixes</cell><cell>31.55</cell><cell>54.52</cell><cell>48.87</cell><cell>57.27</cell><cell>47.66</cell><cell>52.21</cell></row><row><cell>Term suffixes</cell><cell>30.07</cell><cell>18.12</cell><cell>16.87</cell><cell>19.03</cell><cell>16.23</cell><cell>19.77</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2.</head><label>2</label><figDesc>Triple projections</figDesc><table><row><cell></cell><cell>ACM</cell><cell>DBCS</cell><cell>DBEN</cell><cell>GO</cell><cell>MDB</cell></row><row><cell></cell><cell></cell><cell cols="2">Term Counts</cell><cell></cell><cell></cell></row><row><cell>Unique subjects</cell><cell>810,248</cell><cell>68,946</cell><cell>790,703</cell><cell>776,698</cell><cell>694,399</cell></row><row><cell>Unique predicates</cell><cell>14</cell><cell>5,227</cell><cell>9</cell><cell>22</cell><cell>250</cell></row><row><cell>Unique objects</cell><cell>489,912</cell><cell>98,638</cell><cell>76,736</cell><cell>1,171,491</cell><cell>1,049,248</cell></row><row><cell></cell><cell></cell><cell cols="2">Simple Projections</cell><cell></cell><cell></cell></row><row><cell>size=1,o</cell><cell>403,425 (82.3%)</cell><cell>69,886 (70.9%)</cell><cell>45,551 (59.4%)</cell><cell>888,005 (75.8%)</cell><cell>657,484 (62.7%)</cell></row><row><cell>size&gt;1,o</cell><cell>86,487 (17.7%)</cell><cell>28,752 (29.1%)</cell><cell>31,185 (40.6%)</cell><cell>283,486 (24.2%)</cell><cell>391,764 (37.3%)</cell></row><row><cell></cell><cell></cell><cell cols="2">Double Projections</cell><cell></cell><cell></cell></row><row><cell>size=1,s,p</cell><cell>2,002,042 (89.1%)</cell><cell>881,317 (86.4%)</cell><cell>246,078 (94.2%)</cell><cell>6,429,816 (98.8%)</cell><cell>4,386,514 (94.5%)</cell></row><row><cell>size&gt;1,s,p</cell><cell>245,828 (10.9%)</cell><cell>139,254 (13.6%)</cell><cell>3,964,721 (5.8%)</cell><cell>79,925 (1.2%)</cell><cell>253,471 (5.5%)</cell></row><row><cell>size=1,p,o</cell><cell>403,425 (82.3%)</cell><cell>102,127 (79.2%)</cell><cell>56,018 (61.7%)</cell><cell>910,420 (76.2%)</cell><cell>873,643 (74.1%)</cell></row><row><cell>size&gt;1,p,o</cell><cell>86,487 (17.7%)</cell><cell>26,761 (20.8%)</cell><cell>34,809 (38.3%)</cell><cell>284,764 (23.8%)</cell><cell>306,150 (25.9%)</cell></row><row><cell>size=1,s,o</cell><cell>2,661,787 (99.1%)</cell><cell>381,569 (82.6%)</cell><cell>1,439,886 (64.0%)</cell><cell>4,980,199 (86.4%)</cell><cell>2,857,318 (80.9%)</cell></row><row><cell>size&gt;1,s,o</cell><cell>24,297 (0.9%)</cell><cell>80,359 (17.4%)</cell><cell>810,836 (36.0%)</cell><cell>783,039 (13.6%)</cell><cell>673,347 (19.1%)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3.</head><label>3</label><figDesc>Statistics of path classes</figDesc><table><row><cell>Length</cell><cell>Property</cell><cell>ACM</cell><cell>DBCS</cell><cell>DBEN</cell><cell>GO</cell><cell>MDB</cell></row><row><cell></cell><cell>Unique signatures</cell><cell>7</cell><cell>33,394</cell><cell>14</cell><cell>55</cell><cell>275</cell></row><row><cell>2</cell><cell>Number of paths</cell><cell>3,382,538</cell><cell>1,191,731</cell><cell>178</cell><cell>1,300,120</cell><cell>2,470,993</cell></row><row><cell></cell><cell>Greatest class</cell><cell>1,026,874</cell><cell>27,786</cell><cell>38</cell><cell>247,477</cell><cell>248,633</cell></row><row><cell></cell><cell>Unique signatures</cell><cell>0</cell><cell>67,107</cell><cell>0</cell><cell>206</cell><cell>664</cell></row><row><cell>3</cell><cell>Number of paths</cell><cell>0</cell><cell>1,428,871</cell><cell>0</cell><cell>26,863,416</cell><cell>15,804,941</cell></row><row><cell></cell><cell>Greatest class</cell><cell>0</cell><cell>15,531</cell><cell>0</cell><cell>2,754,908</cell><cell>550,887</cell></row></table></figure>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This work was supported by the Charles University Grant Agency grant 4105/2011, the Czech Science Foundation grant P202/10/0573 and the EU ICT FP7 project 257943 (LOD2 Project).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Matrix &quot;Bit&quot; loaded: A Scalable Lightweight Join Query Processor for RDF Data</title>
		<author>
			<persName><forename type="first">M</forename><surname>Atre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaoji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Hendler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 19th Int. Conf. on World Wide Web</title>
		<meeting>19th Int. Conf. on World Wide Web<address><addrLine>NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="41" to="50" />
		</imprint>
	</monogr>
	<note>WWW &apos;10</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jentzsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cyganiak</surname></persName>
		</author>
		<ptr target="http://www4.wiwiss.fu-berlin.de/lodcloud/state/" />
		<title level="m">State of the LOD Cloud</title>
				<imprint>
			<date type="published" when="2011-03">March 2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Linked Data - The Story So Far</title>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Heath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Semantic Web and Information Systems</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="1" to="22" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">How the Semantic Web is Being Used: An Analysis of FOAF Documents</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Finin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joshi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS&apos;05) - Track 4 - Volume 04</title>
		<meeting>38th Annual Hawaii International Conference on System Sciences (HICSS&apos;05) - Track 4 - Volume 04<address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="113" to="122" />
		</imprint>
	</monogr>
	<note>HICSS &apos;05</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Data Summaries for On-demand Queries over Linked Data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Harth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Karnstedt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polleres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">U</forename><surname>Sattler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Umbrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 19th Int. Conf. on World Wide Web</title>
		<meeting>19th Int. Conf. on World Wide Web<address><addrLine>NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="411" to="420" />
		</imprint>
	</monogr>
	<note>WWW &apos;10</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">RDF Primer</title>
		<author>
			<persName><forename type="first">F</forename><surname>Manola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Miller</surname></persName>
		</author>
		<ptr target="http://www.w3.org/TR/rdf-primer/" />
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">RDF-3X: A RISC-style Engine for RDF</title>
		<author>
			<persName><forename type="first">T</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow.</title>
		<imprint>
			<date type="published" when="2008-08">August 2008</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="647" to="659" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">SPARQL Query Language for RDF</title>
		<author>
			<persName><forename type="first">E</forename><surname>Prud'hommeaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Seaborne</surname></persName>
		</author>
		<ptr target="http://www.w3.org/TR/rdf-sparql-query/" />
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">A Graph Analysis of the Linked Data Cloud</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>CoRR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Analyzer - A Complex System for Data Analysis</title>
		<author>
			<persName><forename type="first">J</forename><surname>Starka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Svoboda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sochna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schejbal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mlynkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bednarek</surname></persName>
		</author>
		<idno type="DOI">10.1093/comjnl/bxr103</idno>
	</analytic>
	<monogr>
		<title level="j">The Computer Journal</title>
		<imprint>
			<date type="published" when="2011-10-13">October 13, 2011</date>
		</imprint>
	</monogr>
	<note>Advance Access published October 13, 2011</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Linked Data Indexing Methods: A Survey</title>
		<author>
			<persName><forename type="first">M</forename><surname>Svoboda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mlynkova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">On the Move to Meaningful Internet Systems: OTM 2011 Workshops</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="474" to="483" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Structure Index for RDF Data</title>
		<author>
			<persName><forename type="first">T</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ladwig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Semantic Data Management (SemData@VLDB)</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Hexastore: Sextuple Indexing for Semantic Web Data Management</title>
		<author>
			<persName><forename type="first">C</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Karras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bernstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow.</title>
		<imprint>
			<date type="published" when="2008-08">August 2008</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1008" to="1019" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
