Extending VoID for Expressing the Connectivity Metrics of a Semantic Warehouse

Michalis Mountantonakis1,2, Carlo Allocca1, Pavlos Fafalios1,2, Nikos Minadakis1, Yannis Marketakis1, Christina Lantzaki1,2, Yannis Tzitzikas1,2

1 Institute of Computer Science, FORTH-ICS, Greece
2 Computer Science Department, University of Crete, Greece
{mountant,carlo,fafalios,minadakn,marketak,kristi,tzitzik}@ics.forth.gr

Abstract. VoID (Vocabulary of Interlinked Datasets) has been proposed by W3C as the vocabulary for expressing metadata about RDF datasets. Despite its important contributions, VoID cannot express metadata that concern the connectivity of semantic warehouses. We use the term semantic warehouse to refer to a read-only set of RDF triples fetched (and transformed) from different sources that aims at serving a particular set of query requirements. Connectivity metrics are important for evaluating the value of a semantic warehouse, since they reflect its query capabilities; moreover, they quantify the contribution of each constituent source to the warehouse. To allow the representation, exchange and querying of such measurements, in this paper we describe an extension of VoID for expressing them. We demonstrate its applicability through the case of a real and operational semantic warehouse for the marine domain.

1 Introduction

An increasing number of datasets is already available as Linked Data. For exploiting this wealth of data and building domain-specific applications, in many cases there is a need for fetching and assembling pieces of information coming from more than one source (including SPARQL endpoints). These pieces can then be used for constructing a warehouse that offers more complete browsing and query services (in comparison to those offered by the underlying sources). We shall use the term Semantic Warehouse (for short, warehouse) to refer to a read-only set of RDF triples fetched (and transformed) from different sources that aims at serving a particular set of query requirements. There are various such warehouses (e.g. [13, 9, 6]), and there are various tools that can aid their construction, including ODCleanStore [8], Sieve [10] and MatWare [16]. However, putting triples together does not guarantee that they will be connected. In general, the aspect of "connectivity" concerns both schema and instances. One method to check and quantify the connectivity of a semantic warehouse is through the connectivity metrics proposed in [15]. These metrics provide an overview of the warehouse that reflects its query capabilities, and also quantify the contribution of each constituent source to the warehouse. In brief, the main metrics are: (a) the matrix of percentages of the common URIs and/or literals, (b) the complementarity factor of the entities of interest, (c) the increments in the average degree of each source, and (d) the unique triple contribution of each source. The values of (a), (b) and (c) allow evaluating the warehouse, while (c) and (d) mainly concern each particular source.

W3C proposed VoID (Vocabulary of Interlinked Datasets), a vocabulary for describing open and linked datasets [7]. It aims at building a bridge between the publishers and the users of a dataset and applications, ranging from data discovery to cataloging and archiving of datasets. Based on Dublin Core [11] for describing generic information, it is an RDF Schema vocabulary for expressing different types of metadata: general metadata (e.g. dc:title), access metadata (e.g. void:sparqlEndpoint), structural metadata (e.g. void:exampleResource), and descriptions of links between RDF datasets (e.g. void:Linkset).
The specification also provides deployment advice and discusses how well-known URIs can be used to locate a VoID file (a machine-readable description of an RDF dataset) for its discovery [7]. However, VoID cannot model the aforementioned connectivity metrics. For this reason, in this paper we describe an extension of VoID that allows the representation of such measurements, and thus enables their exchange and querying. We demonstrate its applicability through the case of a real and operational semantic warehouse for the marine domain. In a nutshell, the key contributions of our work are: (a) we motivate (through a concrete scenario) why VoID should be extended, (b) we propose an extension of VoID that models all metrics proposed in [15], and (c) we describe its applicability through a real and operational Semantic Warehouse of the marine domain.

The rest of this paper is organised as follows: Section 2 describes the required background, i.e. the VoID vocabulary and the connectivity metrics. Section 3 describes the proposed extension of VoID and demonstrates its applicability through the case of a real and operational Semantic Warehouse for the marine domain. Finally, Section 4 concludes the paper.

2 Background

Here we describe in brief VoID (in §2.1), we synopsize the warehouse connectivity metrics (in §2.2), and we discuss related work (in §2.3).

2.1 VoID

This section briefly describes the current version of VoID (Vocabulary of Interlinked Datasets) [7]. Informally, its design has been driven by the need to represent both domain-dependent features (e.g. which type of data a dataset contains) and domain-independent ones (e.g. who published it). Conceptually, it has been built around the notions of void:Dataset, void:Linkset and RDF links. A void:Dataset is a set of RDF triples that are published, maintained or aggregated by a single provider. A void:Linkset is a collection of RDF links between two datasets. An RDF link is an RDF triple whose subject and object are described in different void:Datasets. Based on Dublin Core [11], VoID provides properties that can be attached to both void:Dataset and void:Linkset to express the following types of metadata:

General metadata, helping users to decide whether the dataset is appropriate for their purpose. Based on the Dublin Core model, they refer to information such as dcterms:title, dcterms:description, dcterms:license, dcterms:subject, dcterms:creator, dcterms:publisher, dcterms:contributor, dcterms:created, dcterms:issued, dcterms:modified, void:feature.

Access metadata, describing methods of accessing the RDF data using various protocols: void:sparqlEndpoint, void:dataDump, void:rootResource, void:uriLookupEndpoint, void:openSearchDescription.

Structural metadata, providing high-level information about the schema and internal structure of a dataset, which can be helpful when exploring and querying it: void:exampleResource, void:uriSpace, void:uriRegexPattern, void:vocabulary, void:subset, void:classPartition, void:propertyPartition, void:triples, void:entities, void:classes, void:properties, void:distinctSubjects, void:distinctObjects, void:documents.

Descriptions of links between datasets, helpful for understanding how multiple datasets are related and can be used together: void:Linkset, void:target, void:linkPredicate; all patterns for describing datasets can equally be used for void:Linkset.
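To give a flavour of how such VoID metadata become machine-queryable, the following query is a minimal sketch (assuming the VoID description has been loaded into a triplestore) that retrieves, for each described dataset, its title, SPARQL endpoint and number of triples:

  PREFIX void: <http://rdfs.org/ns/void#>
  PREFIX dcterms: <http://purl.org/dc/terms/>

  SELECT ?dataset ?title ?endpoint ?size
  WHERE {
    ?dataset a void:Dataset ;
             dcterms:title ?title .
    # endpoint and triple count are optional in VoID descriptions
    OPTIONAL { ?dataset void:sparqlEndpoint ?endpoint }
    OPTIONAL { ?dataset void:triples ?size }
  }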
2.2 Connectivity Metrics

One method to check and quantify the connectivity of a semantic warehouse is through the connectivity metrics proposed in [15]. These metrics provide an overview of the warehouse that reflects its query capabilities, and also quantify the contribution of each constituent source to the warehouse. In brief, the main metrics are: (a) the matrix of percentages of the common URIs and/or literals (showing the percentage of common URIs/literals between every pair of sources), (b) the complementarity factor of the entities of interest (the number of sources that provided unique triples for each entity of interest), (c) the table with the increments in the average degree of each source (measuring the increment of the graph-theoretic degree of each entity when it becomes part of the warehouse graph), and (d) the unique triple contribution of each source (the number of triples provided by a source which are not provided by any other source). The values of (a), (b) and (c) allow evaluating the warehouse, while (c) and (d) mainly concern each particular source.

For reasons of self-containedness, here we summarize the definitions of the metrics. Table 1 introduces the symbols required for defining them. Note that, for computing the metrics, several policies can be followed for deciding whether two URIs or two literals should be considered equivalent; the proposed equivalence policies are shown in Table 2. Finally, Table 3 shows how each metric is defined.

Table 1. Definitions
  S = {S1, ..., Sk} : the set of underlying sources
  triples(Si) : the set of triples that source Si contributes to the warehouse
  Ui : the URIs that appear in triples(Si)
  Liti : the literals that appear in triples(Si)
  W : the triples in the warehouse
  E : the entities of interest, in the form of a set of literals and/or URIs
  T : a set of triples
  degT(e) : |{(s, p, o) ∈ T | s = e or o = e}|, i.e. the degree of an entity e in T
  degT(E) : avg_{e ∈ E}(degT(e)), i.e. the average degree of the entities E in T
  ui : a URI
  last(ui) : the string obtained by taking the substring after the last "/" or "#" of ui, turning its letters to lowercase and deleting any underscores
  sameAs : the sameAs relationship between two URIs according to the entity matching rules that are (or will eventually be) used for the warehouse

Table 2. Policies used to compare URIs coming from different sources
  (i) Exact String Equality: u1 = u2 ⇒ u1 ≡ u2
  (ii) Suffix Canonicalization: last(u1) = last(u2) ⇒ u1 ≡ u2
  (iii) Entity Matching: u1 sameAs u2 ⇒ u1 ≡ u2

Table 3. Connectivity Metrics
  Common URIs between two sources Si and Sj: |Ui ∩ Uj|
  Percentage of common URIs between Si and Sj: curi_{i,j} = |Ui ∩ Uj| / min(|Ui|, |Uj|)
  Common literals between Si and Sj: |Liti ∩ Litj|
  Percentage of common literals between Si and Sj: clit_{i,j} = |Liti ∩ Litj| / min(|Liti|, |Litj|)
  Increase in the average degree: (degW(E) - degS(E)) / degS(E)
  Unique triples of Si: triplesUnique(Si) = triples(Si) \ (∪_{1 ≤ j ≤ k, j ≠ i} triples(Sj))
  Percentage of unique triples of Si: |triplesUnique(Si)| / |triples(Si)|
  Complementarity factor of an entity e: cf(e) = |{ i | triplesW(e) ∩ triplesUnique(Si) ≠ ∅ }|
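As an illustration with hypothetical numbers, suppose |U1| = 1000, |U2| = 500 and |U1 ∩ U2| = 100; then

\[
curi_{1,2} = \frac{|U_1 \cap U_2|}{\min(|U_1|, |U_2|)} = \frac{100}{\min(1000, 500)} = \frac{100}{500} = 0.2,
\]

i.e. 20% of the URIs of the smaller source also occur in the other source. Similarly, if the entities of interest have an average degree of 4 within a source Si and 6 within the warehouse, the increase in the average degree is (6 - 4)/4 = 0.5, i.e. the entities become 50% more connected once the sources are integrated.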
The metrics are currently used by the tool MatWare [16], and Figure 1 shows the HTML report produced by this tool over a warehouse that integrates information from WoRMS (http://www.marinespecies.org/), Ecoscope (http://www.ecoscopebc.ird.fr/EcoscopeKB/ShowWelcomePage.action), FishBase (http://www.fishbase.org/), FLOD (http://www.fao.org/figis/flod/) and DBpedia (http://dbpedia.org/).

Fig. 1. Values of the metrics as computed by MatWare

Another metric, not proposed in [15] but useful to have, is the number of blank nodes that appear in the triples of a source Si, i.e. in triples(Si). If these triples contain blank nodes, then a blank node matching algorithm, like those proposed in [14], should be used for improving the connectivity between the sources.

2.3 Related Work

In this section we review the main related work on modeling characteristics of semantic data sources.

Completeness. In [4] the authors introduce a formal framework for the declarative specification of completeness statements about RDF data sources and underline how the framework can complement existing initiatives like VoID. They also show how to assess the completeness of query answering over plain and RDF/S data sources augmented with completeness statements, and they present an extension of the completeness framework for federated data sources.

Provenance. W3C has proposed the PROV family of documents (http://www.w3.org/TR/2013/NOTE-prov-overview-20130430/), which defines a model, corresponding serializations and other supporting definitions to enable the interoperable interchange of provenance information. In addition, the work in [5] presents approaches to integrate provenance information into the Web of Data and illustrates how this information can be consumed. In particular, the authors introduce a Provenance Vocabulary which, used together with VoID, assists providers of Linked Data in describing the provenance of their data using RDF. The authors also discuss possibilities to make such provenance metadata accessible as part of the Web of Data, and they describe how this metadata can be queried and consumed to identify outdated information.

Connectivity among Concepts / Fuzzy Linksets. In [2] the authors propose extensions of VoID for (i) distinguishing datasets, and (ii) describing fuzzy linksets, i.e. links between different datasets that are not explicitly stated. As regards (i), the authors define the notion of "semantic datasets", i.e. partitions of resources that share certain semantic features. Specifically, they propose two new classes (voidgen:ConnectedDataset and voidgen:ConceptualDataset) for identifying connected sets of resources or sets of conceptually similar resources. Thereby, given two such semantic datasets and respective linksets, one can, for instance, observe the connectivity among concepts. As regards (ii), they introduce the notion of k-similarity, where two subjects are k-similar if k of their predicate/object combinations are exact matches. For specifying a fuzzy linkset, the authors propose a new class voidgen:FuzzyLinkset and a new attribute voidgen:kSimilarity.

Statistics. The RDF Data Cube Vocabulary (http://www.w3.org/TR/2013/PR-vocab-data-cube-20131217/) [3] provides a means to publish multi-dimensional data, such as statistics, on the Web in such a way that it can be linked to related datasets and concepts.
The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange, http://sdmx.org/), an ISO standard for exchanging and sharing statistical data and metadata among organizations. As regards our case, we should stress that what we call a semantic warehouse is not necessarily a multi-dimensional dataset; therefore, the Data Cube vocabulary cannot replace the need for VoID and the extension that we propose. Of course, a semantic warehouse could contain one or more multi-dimensional datasets (as, for example, [12]), and such datasets could be described using the Data Cube vocabulary. For instance, the part of the MarineTLO-based warehouse [16] that contains occurrences of species could be expressed using the Data Cube vocabulary. However, for the connectivity metrics per se, the adoption of multi-dimensional modeling would not offer any benefit.

In [1] the authors describe LODStats, a statement-stream-based approach for gathering comprehensive statistics about RDF datasets. To represent the statistics, they use VoID and the Data Cube Vocabulary. In addition, they link a void:Dataset (a VoID class) to a qb:Observation (a Data Cube class) using a newly defined object property (void-ext:observation), which is a simple extension of VoID.

Difference of our approach. The main difference of our approach is that we focus on modeling metrics regarding the connectivity and the quality of a Semantic Warehouse; thus it can complement existing initiatives like VoID and can be used together with approaches that focus on provenance, completeness, statistics, etc. These metrics reflect the query capabilities of a warehouse (so they are important for evaluating its value) and also quantify the contribution of the underlying sources, allowing one to evaluate the importance of each source for the warehouse at hand.

3 The Proposed Extension of VoID

At first we discuss the requirements (in §3.1), then we describe the proposed conceptual model (in §3.2) and provide an example of using it (in §3.3). Subsequently, we show how the metrics can be computed (in §3.4) and how their values can be stored and used (in §3.5). Figure 2 gives a total view of the functionality of the metrics and the proposed extension.

Fig. 2. The overall metrics process

3.1 Requirements

In brief, the extension should allow all information expressed in the tables of Figure 1 to be expressed in a machine-processable (and queryable) manner. If such information is exposed in a machine-readable format, it can be exploited in various ways, e.g.:

– for producing visualizations that give an overview of the contents of a warehouse;
– for comparing different warehouses and producing comparative reports;
– for aiding the automatic discovery of related data, since software services/agents could use these metrics to decide which SPARQL endpoints to query based on time/cost constraints;
– for crediting good sources, since these metrics make the contribution of a source to the warehouse evident and quantifiable.

Another requirement is that the proposed extension should be compatible with the existing VoID vocabulary and the available VoID-based descriptions.

3.2 Conceptual Model

Figure 3 shows the core conceptual model as an implementation in RDF/S.
The implementation reuses classes and properties from VoID, Dublin Core, RDF/S and XML Schema, while the new modeling elements are defined in a separate namespace, named with the prefix vdw and depicted here as the default namespace.

Fig. 3. Core Conceptual Model

The model is centred on the notion of Measurement, which is a specialization of Event and therefore inherits the property date. A measurement is carried out by an agent, uses a specific metric, follows one (or more) configurations over one (or more) datasets (atomic or composite), and produces a value (i.e. a literal). Each metric is an individual with a URI and is defined by a resource (e.g. the DOI of the scientific paper that defined that metric). The notion of Configuration concerns issues that explain how the measurement was done. At this point, and for the requirements at hand, we need two subclasses: the first concerns the way URI equivalence is defined (e.g. through the policies given in Table 2), while the second concerns how the entities of interest are defined. Regarding the latter, the current modeling allows someone to specify the desired set of entities by providing a SPARQL query that returns them. The extension is currently published at http://www.ics.forth.gr/isl/VoIDWarehouse, and apart from the vocabulary it contains URIs for the connectivity metrics.
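For instance, a configuration could specify the entities of interest through a SPARQL query along the following lines (a sketch; the class IRI is hypothetical and would be replaced by the class of interest in the warehouse's ontology):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

  SELECT DISTINCT ?e
  WHERE {
    # return every instance of the (hypothetical) class of interest
    ?e rdf:type <http://example.org/ontology#Species> .
  }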
3.3 Using VoID and the Proposed Extension to Describe the MarineTLO-based Warehouse

Here we show how, with VoID and the proposed extension, we can describe the MarineTLO-based warehouse [16] and the corresponding connectivity metrics (this corresponds to the process "2. VoID extension" of Figure 2). At first we discuss what we can describe using only VoID. The description is presented in a modular way: Figure 4 shows the used namespaces and the general and administrative metadata of the warehouse, while Figure 5 shows the description of the constituent datasets and of their schema mappings with the warehouse. (IRIs elided in the listings are shown as <...>.)

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix owl: <http://www.w3.org/2002/07/owl#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix dcterms: <http://purl.org/dc/terms/> .
  @prefix dst: <...> .
  @prefix vdw: <...> .

  # General description of the MarineTLO-based warehouse
  dst:MarineTLOWarehouse rdf:type void:Dataset ;
    dcterms:title "Marine TLO Warehouse version 4" ;
    dcterms:publisher <...> ;
    dcterms:description "Warehouse for i-Marine Project version 4" ;
    dcterms:license "Open Database License (ODC-ODbL)" ;
    void:vocabulary <...> ;
    dcterms:subject <...> , <...> , <...> ;
    dcterms:issued "2014-02-02T02:10:30"^^xsd:dateTime ;
    void:sparqlEndpoint <...> ;
    void:feature <...> ;
    void:triples "3,500,000" .

Fig. 4. MarineTLO-based warehouse VoID triples (part 1)

  # Description of some of the components of the MarineTLO-based warehouse
  dst:MarineTLOWarehouse rdf:type void:Dataset ;
    void:subset dst:MarineTLO ;
    void:subset dst:FLODPart ;
    void:subset dst:EcoscopePart ;
    void:subset dst:Mappings .

  # MarineTLO
  dst:MarineTLO rdf:type void:Dataset ;
    dcterms:title "MarineTLO ontology" ;
    dcterms:publisher <...> ;
    dcterms:description "MarineTLO is a top-level ontology for the marine domain" ;
    dcterms:provenance <...> ;
    void:triples "4,000" .

  # FLODPart
  dst:FLODPart rdf:type void:Dataset ;
    dcterms:title "Part of FLOD source" ;
    dcterms:publisher <...> ;
    dcterms:description "Our part from FAO data for marine domains" ;
    dcterms:provenance <...> ;
    void:triples "750,000" .

  # EcoscopePart
  dst:EcoscopePart rdf:type void:Dataset ;
    dcterms:title "Part of Ecoscope source" ;
    dcterms:publisher <...> ;
    dcterms:description "Part of Ecoscope data for marine domains" ;
    dcterms:provenance <...> ;
    void:triples "150,000" .

  # Mappings
  dst:Mappings rdf:type void:Dataset ;
    void:subset dst:FLODPart2MarineTLO ;
    void:subset dst:EcoscopePart2MarineTLO ;
    void:subset dst:EcoscopePart2FLODPart .

  # Instance matching EcoscopePart - FLODPart
  dst:EcoscopePart2FLODPart rdf:type void:Linkset ;
    void:target dst:EcoscopePart ;
    void:target dst:FLODPart ;
    void:linkPredicate owl:sameAs .

  # Schema mappings EcoscopePart - MarineTLO (subClassOf)
  dst:EcoscopePart2MarineTLO rdf:type void:Linkset ;
    void:target dst:EcoscopePart ;
    void:target dst:MarineTLO ;
    void:linkPredicate rdfs:subClassOf .

  # Schema mappings FLODPart - MarineTLO (subClassOf)
  dst:FLODPart2MarineTLO rdf:type void:Linkset ;
    void:target dst:FLODPart ;
    void:target dst:MarineTLO ;
    void:linkPredicate rdfs:subClassOf .

Fig. 5. MarineTLO-based warehouse VoID triples (part 2)

Figure 6 illustrates how the value of the connectivity metric common URIs, as computed over FLOD and Ecoscope, is represented using the proposed extension. We can see that MatWare carried out this measurement and computed 1,738 common URIs between these two sources, according to the suffix-canonicalization-based URI equivalence. Notice also that the exact definitions of suffix canonicalization and commonUris are given in [15], whose URL is connected with the measurement at hand. The description of the measurement in triples is shown in Figure 7.

Fig. 6. Example of using the extension for expressing a measurement

  dst:EcoscopePart rdf:type void:Dataset .
  dst:FLODPart rdf:type void:Dataset .
  vdw:measurement_1 rdf:type vdw:Measurement ;
    vdw:carriedOutBy dst:MatWare ;
    dcterms:date "2014-03-14T03:18:56"^^xsd:dateTime ;
    vdw:over dst:EcoscopePart ;
    vdw:over dst:FLODPart ;
    vdw:accordingTo vdw:suffixCanonicalization ;
    vdw:producesValue "1,738" ;
    vdw:usesMetric vdw:commonUris .
  vdw:commonUris rdfs:isDefinedBy <...> .
  vdw:suffixCanonicalization rdfs:isDefinedBy <...> .
  dst:MatWare rdf:type dcterms:Agent .

Fig. 7. Example (in triples) of using the extension for expressing a measurement

3.4 Computing the Connectivity Metrics using SPARQL Queries

To make the entire life cycle clear, here we show how the values of the connectivity metrics can be computed using SPARQL queries (this corresponds to the process "1. SPARQL compute" of Figure 2).

Common URIs. The metric Common URIs over two sources Si and Sj can be computed with the following query:

  SELECT (COUNT(DISTINCT ?o) AS ?commonURIs)
  WHERE {
    graph :Si { { ?s1 ?p1a ?o } UNION { ?o ?p1b ?o1 } }
    FILTER(isURI(?o))
    graph :Sj { { ?s2 ?p2a ?o } UNION { ?o ?p2b ?o2 } }
  }

In the context of the warehouse, this metric should be computed over all pairs of sources, i.e. all (Si, Sj) such that Si, Sj ∈ S and i ≠ j. Note that this metric is symmetric, i.e. the value for the pair (Si, Sj) is equal to the value for (Sj, Si).

Common Literals. The Common Literals between two sources Si and Sj can be computed in a similar manner:

  SELECT (COUNT(DISTINCT ?o) AS ?commonLiterals)
  WHERE {
    graph :Si { ?s ?p ?o }
    FILTER(isLiteral(?o))
    graph :Sj { ?a ?b ?o }
  }

Again, this metric should be computed over all pairs (Si, Sj) of the warehouse.
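The two queries above return the raw intersection sizes; the corresponding percentages of Table 3 additionally need the size of the smaller of the two URI (or literal) sets. A possible way to obtain curi_{i,j} directly in a single SPARQL 1.1 query is sketched below (again with :Si and :Sj standing for the named graphs of the two sources):

  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

  SELECT (xsd:double(?common) / ?minSize AS ?curi)
  WHERE {
    # |Ui ∩ Uj|: URIs occurring in both sources
    { SELECT (COUNT(DISTINCT ?o) AS ?common)
      WHERE {
        graph :Si { { ?s1 ?p1 ?o } UNION { ?o ?p2 ?o1 } }
        FILTER(isURI(?o))
        graph :Sj { { ?s2 ?p3 ?o } UNION { ?o ?p4 ?o2 } }
      } }
    # |Ui|
    { SELECT (COUNT(DISTINCT ?u) AS ?ui)
      WHERE { graph :Si { { ?s ?p ?u } UNION { ?u ?p ?x } } FILTER(isURI(?u)) } }
    # |Uj|
    { SELECT (COUNT(DISTINCT ?v) AS ?uj)
      WHERE { graph :Sj { { ?s ?p ?v } UNION { ?v ?p ?x } } FILTER(isURI(?v)) } }
    BIND(IF(?ui < ?uj, ?ui, ?uj) AS ?minSize)
  }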
Unique Triples Contribution. To compute the unique triple contribution of a source, say S1, to the warehouse S = {S1, ..., Sk}, we have to count the number of triples of S1 that do not intersect with the triples of any other source of S (i.e. with none of the sources S2, ..., Sk). This can be done using the following query:

  SELECT (COUNT(*) AS ?uniqueTriples)
  WHERE {
    graph :S1 { ?s ?p1 ?o }
    FILTER NOT EXISTS { graph :S2 { ?s ?p2 ?o } }
    ...
    FILTER NOT EXISTS { graph :Sk { ?s ?pk ?o } }
  }

Complementarity Factor. This metric is computed for a specific entity over all sources of the warehouse. In particular, the complementarity factor of an entity e is increased by 1 for each source Si ∈ S that contains at least one unique triple involving e. This means that if all sources in S contain unique triples for e, then its complementarity factor is k. The query below gives the complementarity factor of an entity e over S. Notice that the WHERE clause contains k graph patterns; graph pattern i returns 1 if Si contains unique triples for the entity e, and 0 otherwise.

  SELECT ((?CF1 + ... + ?CFk) AS ?CF)
  WHERE {
    { SELECT (xsd:integer(COUNT(*) > 0) AS ?CF1)
      WHERE {
        graph :S1 { ?s ?p1 ?o }
        FILTER NOT EXISTS { graph :S2 { ?s ?p2 ?o } }
        ...
        FILTER NOT EXISTS { graph :Sk { ?s ?pk ?o } }
        FILTER (regex(?s, e, 'i') || regex(?o, e, 'i'))
      } }
    ...
    { SELECT (xsd:integer(COUNT(*) > 0) AS ?CFk)
      WHERE {
        graph :Sk { ?s ?pk ?o }
        FILTER NOT EXISTS { graph :S1 { ?s ?p1 ?o } }
        ...
        FILTER NOT EXISTS { graph :Sk-1 { ?s ?pk-1 ?o } }
        FILTER (regex(?s, e, 'i') || regex(?o, e, 'i'))
      } }
  }

Increase in the Average Degree. Let E be a set of entities coming from a source Si. To compute the increase in the average degree of these entities when they "enter" the warehouse, the following query computes both average values (before and after the entrance into the warehouse) and reports the increase. Note that the query considers the "entity matching" policy of Table 2.

  SELECT (((?avgDW - ?avgDS) / ?avgDS) AS ?IavgD)
  WHERE {
    { SELECT (xsd:double(COUNT(?in) + COUNT(?out)) / xsd:double(COUNT(DISTINCT ?e)) AS ?avgDS)
      FROM :Si
      WHERE { ?e rdf:type :E . { ?e ?in ?o } UNION { ?o1 ?out ?e } } }
    { SELECT (xsd:double(COUNT(?in) + COUNT(?out)) / xsd:double(COUNT(DISTINCT ?e)) AS ?avgDW)
      FROM :W
      WHERE { ?e rdf:type :E . { ?e ?in ?o } UNION { ?o1 ?out ?e } } }
  }

Time efficiency. Table 4 shows the query execution times for computing the metric Common URIs for each of the three policies of Table 2, i.e. Exact String Equality, Suffix Canonicalization and Entity Matching. (The experiments were conducted using OpenLink Virtuoso 6.1 on a quad-core machine with 4 GB RAM running Ubuntu 12.10 64-bit.) The first row corresponds to the pure SPARQL approach presented earlier. The second row corresponds to a hybrid approach, where simpler queries are used for getting the resources of interest (i.e. the two sets of URIs, one for each source Si, Sj), and Java code is used for computing their intersection. We observe that the hybrid approach is faster than the pure SPARQL one, as the comparisons are implemented faster in Java. In general, we have observed that the hybrid approach loses in time efficiency when the queries return a large amount of data (as in the case of Unique Triples Contribution), while it is faster than pure SPARQL at comparisons.

Table 4. Times (in minutes) needed to compute the metric Common URIs under the various approaches and policies

  Computation Method | Policy (i) | Policy (ii) | Policy (iii)
  pure SPARQL        |     7      |     20      |      8
  hybrid             |     3      |      4      |      4

Regarding the second policy (Suffix Canonicalization), the pure SPARQL approach becomes less efficient, as the string comparisons cost more when implemented over the endpoint. Regarding the third policy, both approaches need one additional minute. This uniform increase is reasonable, as an additional graph containing the triples with the sameAs properties has to be taken into account. The following query shows how suffix canonicalization can be applied on URIs in SPARQL, using some Virtuoso-specific functions:

  SELECT DISTINCT (bif:lower(bif:regexp_substr('[^#|/]+$', ?o, 0)) AS ?suffix)
  FROM :Si
  WHERE {
    { ?s ?p ?o } UNION { ?o ?p ?s }
    FILTER(isURI(?o))
  }
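In the same spirit, the blank-node metric mentioned in §2.2 (the number of blank nodes appearing in triples(Si)) could be computed with a query along the following lines (a sketch):

  SELECT (COUNT(DISTINCT ?b) AS ?bnodes)
  WHERE {
    graph :Si { { ?b ?p ?o } UNION { ?s ?q ?b } }
    FILTER(isBlank(?b))
  }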
3.5 Storing and Querying the Values of the Connectivity Metrics

The measurements computed by the aforementioned queries can be represented and exchanged using the VoID extension. They can also be stored in the triplestore itself; indeed, MatWare computes and stores these triples in a separate graph space of the same SPARQL endpoint (this corresponds to the process "3. store" of Figure 2). Figure 8 gives an example of this procedure for the metric Common Literals over the sources Ecoscope and FishBase: the query both computes the metric and inserts the computed value (expressed using the VoID extension) into a graph space.

  prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  prefix xsd: <http://www.w3.org/2001/XMLSchema#>
  prefix dcterms: <http://purl.org/dc/terms/>
  prefix dst: <...>
  prefix vdw: <...>

  INSERT INTO dst:Metrics {
    vdw:measurement_2 rdf:type vdw:Measurement ;
      vdw:usesMetric vdw:commonLiterals ;
      vdw:producesValue ?commonLiterals ;
      dcterms:date "2014-03-14T03:19:45"^^xsd:dateTime ;
      vdw:carriedOutBy dst:MatWare ;
      vdw:over dst:EcoscopePart ;
      vdw:over dst:FishbasePart .
  }
  WHERE {
    { SELECT (COUNT(DISTINCT ?o) AS ?commonLiterals)
      WHERE {
        graph dst:EcoscopePart { ?s ?p ?o }
        FILTER(isLiteral(?o))
        graph dst:FishbasePart { ?a ?b ?o }
      } }
  }

Fig. 8. A SPARQL query which both computes the common literals between Ecoscope and FishBase and inserts the computed value into a graph space

After that, one can retrieve these values from the warehouse using SPARQL queries (this corresponds to the process "4. query" of Figure 2). For example, the query in Figure 9 returns all triples of the warehouse that concern the common URIs metric.

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX vdw: <...>

  SELECT DISTINCT ?si ?sj ?commonUris
  WHERE {
    ?measurement rdf:type vdw:Measurement .
    ?measurement vdw:usesMetric vdw:commonUris .
    ?measurement vdw:over ?si , ?sj .
    ?measurement vdw:producesValue ?commonUris .
    FILTER (?si != ?sj)
  }
  ORDER BY (?si)

Fig. 9. A SPARQL query which returns the common URIs between all sources of the warehouse
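Since each measurement also carries the date on which it was performed and the agent that carried it out, the stored values could further be used to track how the connectivity of a pair of sources evolves across warehouse versions. A sketch follows; the vdw namespace IRI is an assumption, standing for the one published at http://www.ics.forth.gr/isl/VoIDWarehouse:

  PREFIX vdw: <http://www.ics.forth.gr/isl/VoIDWarehouse#>  # assumed namespace IRI
  PREFIX dcterms: <http://purl.org/dc/terms/>

  SELECT ?si ?sj ?date ?value
  WHERE {
    ?m a vdw:Measurement ;
       vdw:usesMetric vdw:commonUris ;
       vdw:over ?si , ?sj ;
       vdw:producesValue ?value ;
       dcterms:date ?date .
    FILTER(?si != ?sj)
  }
  ORDER BY ?si ?sj ?date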
4 Concluding Remarks

W3C has proposed VoID as the vocabulary for describing interlinked and open datasets. Motivated by a concrete scenario of modeling connectivity metrics in the context of an operational semantic warehouse for the marine domain, we have proposed an extension of VoID that is able to represent these metrics. The benefit of the proposed extension is that it allows one to publish the metrics and their associated values in a standard and machine-processable way. Finally, we have shown how the metrics can be computed, and we have reported the times required for computing them either using solely SPARQL, or SPARQL combined with programming-language code.

Acknowledgement. This work was partially supported by the ongoing project iMarine (FP7 Research Infrastructures, 2011-2014).

References

1. S. Auer, J. Demter, M. Martin, and J. Lehmann. LODStats - an Extensible Framework for High-Performance Dataset Analytics. In Knowledge Engineering and Knowledge Management, pages 353–362. Springer, 2012.
2. C. Böhm, J. Lorey, and F. Naumann. Creating VoID Descriptions for Web-Scale Data. Web Semantics: Science, Services and Agents on the World Wide Web, 9(3):339–345, 2011.
3. R. Cyganiak, S. Field, A. Gregory, W. Halb, and J. Tennison. Semantic Statistics: Bringing Together SDMX and SCOVO. LDOW, 628, 2010.
4. F. Darari, W. Nutt, G. Pirrò, and S. Razniewski. Completeness Statements about RDF Data Sources and their Use for Query Answering. In The Semantic Web - ISWC 2013, pages 66–83. Springer, 2013.
5. O. Hartig and J. Zhao. Publishing and Consuming Provenance Metadata on the Web of Linked Data. In Provenance and Annotation of Data and Processes, pages 78–90. Springer, 2010.
6. Y. Hu, K. Janowicz, G. McKenzie, K. Sengupta, and P. Hitzler. A Linked-Data-driven and Semantically-enabled Journal Portal for Scientometrics. In The Semantic Web - ISWC 2013, pages 114–129. Springer, 2013.
7. K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing Linked Datasets with the VoID Vocabulary. W3C Interest Group Note, 2011.
8. T. Knap, J. Michelfeit, J. Daniel, P. Jerman, D. Rychnovský, T. Soukup, and M. Nečaský. ODCleanStore: a Framework for Managing and Providing Integrated Linked Data on the Web. In Web Information Systems Engineering - WISE 2012, pages 815–816. Springer, 2012.
9. K. Makris, G. Skevakis, V. Kalokyri, P. Arapi, S. Christodoulakis, J. Stoitsis, N. Manolis, and S. L. Rojas. Federating Natural History Museums in Natural Europe. In Metadata and Semantics Research, pages 361–372. Springer, 2013.
10. P. N. Mendes, H. Mühleisen, and C. Bizer. Sieve: Linked Data Quality Assessment and Fusion. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, pages 116–123. ACM, 2012.
11. A. Powell, M. Nilsson, A. Naeve, and P. Johnston. Dublin Core Metadata Initiative - Abstract Model, 2005. White Paper.
12. M. Sabou, I. Arsal, and A. M. Braşoveanu. TourMISLOD: a Tourism Linked Data Set. Semantic Web, 4(3):271–276, 2013.
13. Y. Tzitzikas, C. Allocca, C. Bekiari, Y. Marketakis, P. Fafalios, M. Doerr, N. Minadakis, T. Patkos, and L. Candela. Integrating Heterogeneous and Distributed Information about Marine Species through a Top Level Ontology. In Proceedings of the 7th Metadata and Semantics Research Conference (MTSR'13), Thessaloniki, Greece, November 2013.
14. Y. Tzitzikas, C. Lantzaki, and D. Zeginis. Blank Node Matching and RDF/S Comparison Functions. In International Semantic Web Conference (1), pages 591–607. Springer, 2012.
15. Y. Tzitzikas, N. Minadakis, Y. Marketakis, P. Fafalios, C. Allocca, and M. Mountantonakis. Quantifying the Connectivity of a Semantic Warehouse. In Proceedings of the 4th International Workshop on Linked Web Data Management (LWDM 2014), in conjunction with the 17th International Conference on Extending Database Technology (EDBT 2014), 2014.
16. Y. Tzitzikas, N. Minadakis, Y. Marketakis, P. Fafalios, C. Allocca, M. Mountantonakis, and I. Zidianaki. MatWare: Constructing and Exploiting Domain Specific Warehouses by Aggregating Semantic Data. In 11th Extended Semantic Web Conference (ESWC'14), Anissaras, Crete, Greece, May 2014.