The Sevod Vocabulary for Dataset Descriptions for Federated Querying

Stasinos Konstantopoulos, Angelos Charalambidis, Antonis Troumpoukis, Giannis Mouchakis, and Vangelis Karkaletsis

Institute of Informatics and Telecommunications, NCSR ‘Demokritos’, Athens, Greece
{konstant,acharal,antru,gmouchakis,vangelis}@iit.demokritos.gr

Abstract. Dataset description vocabularies focus on provenance, versioning, licensing, and similar metadata. VoID is a notable exception, providing some expressivity for describing subsets and their contents and can, to some extent, be used for discovering relevant resources and for optimizing querying. In this paper we describe the Sevod vocabulary, an extension of VoID that provides the expressivity needed in order to support the query planning methods typically used in federated querying. We also present a tool for automatically scraping such metadata from RDF dumps and give statistics about the size of the descriptions for the FedBench datasets.

Keywords: RDF store histograms, RDF vocabulary, optimizing federated SPARQL query processing

1 Introduction

Machine-readable descriptions of Web data, such as general metadata, quality features, statistical information about the entities and how they link, and licensing/provenance metadata, are becoming an increasingly important element of the architecture of the Web. Without such descriptions, client applications cannot be informed of the nature, scope and characteristics of data from particular sources, limiting applications to consuming well-known reference datasets.

Existing vocabularies operate both at the level of cataloguing datasets, as well as at the more detailed level of statistics about the resources contained and described in each dataset [6, Section 5.6]. The Data Catalogue Vocabulary (DCAT) [17] is the W3C Recommendation for describing datasets in data catalogs. DCAT is used to publish attribution, licensing, and thematic information about complete datasets.
The Asset Description Metadata Schema (ADMS) [3] is the DCAT extension that targets semantic interoperability by linking datasets to semantic assets (schemas, taxonomies, codelists, etc.). DCAT and ADMS afford discoverability and the expression of provenance and licensing, while linking to the semantic assets used by each dataset works towards enabling client applications to assess if data from multiple datasets is interoperable.

The Vocabulary of Interlinked Datasets (VoID) [2], published as a W3C Interest Group Note, provides terms for publishing details about the internal structure of datasets. Besides the schemas used in a dataset, VoID also allows declaring the namespace of the resources described in it. This further enhances discoverability, as a client application can reason about which datasets might contain data that describes a given resource. Finally, VoID also supports query optimization use cases, by foreseeing terms for providing multifaceted instance-level statistics, such as the number of triples with a specific predicate, the number of triples with subjects that are members of a specific class, the number of distinct subjects, predicates, and objects, and similar.

Although dataset statistics in VoID go some of the way towards helping client applications plan and execute queries efficiently, they do not cover all information needs of modern federated querying systems. Federated querying is a key technology for sustaining the decentralized nature of the Web and has been formally added to the architecture of the Semantic Web in the new edition of the SPARQL specification.

In this paper, we present Sevod, an extension of VoID that specifically addresses the aspects of dataset description that are relevant to efficient and transparent federated SPARQL query processing. More specifically, we first review federated SPARQL query processing, focusing on the data descriptions needed by the different federation engines (Section 2).
Based on this review, we extract requirements for a dataset description schema that can support the state of the art of federated SPARQL query processing and discuss how well these requirements are met by VoID (Section 3). We then proceed to present Sevod and to explain how the proposed extensions cover the requirements that are not met by VoID (Section 4), present a method for generating Sevod descriptions (Section 5), and conclude (Section 6).

2 Federated SPARQL Query Processing

Given a non-trivial query, there are multiple alternative query execution plans that will all produce the same response, or equally valid responses. Although any of these can be used, they might vary radically in their cost, that is, in the resources and the amount of time that they need to execute. Query optimization, i.e., selecting the most cost-efficient plan, has been studied in a variety of contexts and from many different angles [10]. These different approaches converge on relying on a cost function that assigns a numerical cost to each candidate plan. This cost is meant to reflect the resources needed to execute the plan.

To calculate cost, these functions refer to statistics that estimate the volume of data that needs to be lifted from the persistent store and to be transferred. Such statistics are the cardinality of query patterns, i.e., the number of tuples that match a given pattern, and the selectivity of joins, i.e., the ratio of a pattern's cardinality that joins against another query pattern. In conventional databases, these statistics are stored in histograms [14], internal data structures that are part of the database index. Prominent relevant work includes statistics on intermediate tables (SITs), one-dimensional cardinality counts for an attribute [7]. SITs are organized in buckets; each bucket stores a range of values for a given attribute and the number of tuples in this range.
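As an illustration of how a one-dimensional bucket histogram yields cardinality estimates, the following minimal Python sketch estimates the number of tuples in a value range. The `Bucket` class, the function name, and the sample data are hypothetical, and the sketch assumes that values are spread uniformly inside each bucket, which is an assumption of this example rather than something prescribed by the systems cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    low: float   # inclusive lower bound of the attribute range
    high: float  # exclusive upper bound of the attribute range
    count: int   # number of tuples whose attribute value lies in [low, high)


def estimate_cardinality(buckets, lo, hi):
    """Estimate how many tuples have an attribute value in [lo, hi).

    Assumes a uniform distribution of values inside each bucket, so a
    bucket contributes its count scaled by the overlapping fraction.
    """
    total = 0.0
    for b in buckets:
        overlap = min(hi, b.high) - max(lo, b.low)
        if overlap > 0:
            total += b.count * overlap / (b.high - b.low)
    return total


# Invented sample histogram over a numeric attribute.
histogram = [Bucket(0, 10, 100), Bucket(10, 20, 40), Bucket(20, 40, 10)]
```

A query optimizer would call such a function once per candidate plan fragment and feed the estimates into its cost function.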
Multidimensional histograms [1] avoid the propagation of errors through a sequence of operators by directly storing statistics about more complex sub-expressions that match longer intermediate sub-expressions of the query, rather than individual attributes. Naturally, not all possible combinations can be stored. To limit space requirements, buckets are merged when their statistics are comparable and merging does not degrade the estimations, and sub-buckets are created whenever there is a narrow value range with diverging statistics.

As histograms are internally used by the database’s own query processor, they never received an explicit representation or serialization. This includes distributed databases, where the central node controls the way data is distributed and also maintains such histograms; or where distributed histograms are communicated in implementation-specific serializations. The open nature of the Web, however, creates new use cases where federated query processors are less tightly integrated. In addition to the optimizations performed internally to each data source, federated systems also need to identify data sources that contain relevant data and to optimize the execution of the various sub-queries among them [20, 21].

Some federated systems do not rely on any prior knowledge about the data sources they federate, and base their planning on universal, hard-wired assumptions [25, FedX] or information they discover during query execution [5, Avalanche]. But most systems base query planning on histograms, adapting methods originally proposed in the databases literature and assigning explicit semantics to these originally internal data structures. Federated querying systems such as DARQ [19], SPLENDID [12], LHD [27], and Semagrow [9] consume detailed, instance-level VoID descriptions for data source discovery and for query plan cost estimation.
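Data source discovery from property-level descriptions can be pictured as a lookup from predicates to endpoints. The Python sketch below is illustrative only: the endpoint names, predicates, and counts are invented, and the dictionary layout is an assumption of this example, not the internal representation of any of the systems named above.

```python
# Hypothetical endpoint summaries: predicate -> number of triples.
# Endpoint names, predicates, and counts are invented for illustration.
DESCRIPTIONS = {
    "endpointA": {"ex:title": 500, "owl:sameAs": 120},
    "endpointB": {"ex:population": 900, "owl:sameAs": 4000},
}


def select_sources(descriptions, predicate):
    """Return the endpoints that may answer a triple pattern with this predicate.

    `predicate` is None when the pattern's predicate is a variable, in
    which case no endpoint can be ruled out.
    """
    if predicate is None:
        return sorted(descriptions)
    return sorted(name for name, counts in descriptions.items()
                  if predicate in counts)


def pattern_cardinality(descriptions, endpoint, predicate):
    """Number of triples with this predicate at the endpoint, i.e. the
    cardinality of the pattern (?s predicate ?o) at that endpoint."""
    return descriptions[endpoint].get(predicate, 0)
```

A predicate that is listed by a single endpoint pins the pattern to that endpoint, whereas a ubiquitous predicate such as owl:sameAs matches every endpoint and calls for finer-grained descriptions.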
Several systems, however, have outgrown VoID and use methods that require more sophisticated data source descriptions. Recent versions of Semagrow use an inclusion hierarchy of multi-dimensional histogram buckets, each providing statistics about triple patterns or sets of joined triple patterns [28]. Although VoID subsets can represent the inclusion hierarchy of buckets, VoID focuses on star-shaped descriptions of resources and more complex joins of triple patterns cannot be represented. The Odyssey system [18] uses characteristic sets to represent statistics about how resources are linked, which also fall outside VoID’s ability to express information about the objects of triples. Finally, the QUETSAL system integrates a line of research on sophisticated data source selection [23, 22, 8]. QUETSAL advances beyond source selection based on isolated triple pattern matching to also consider the joins between triple patterns.

All of these recent developments necessitate the extension of VoID with the expressivity needed to represent not only statistics about how resources are described in star-shaped joins, but also how they link and combine in arbitrary joins. This will allow making explicit information structures that are currently internally computed and stored by these federation systems, so that they can be shared and, hopefully, exposed directly by future triple store implementations.

3 Vocabulary Requirements

As discussed in the previous section, database histograms comprise buckets. A bucket holds a value range for an attribute and the cardinality of this attribute range, that is, the number of tuples where the given attribute has a value within the given range. Following what is common practice in representing relational data in RDF, an attribute value in a relational table’s tuple would be a triple where the subject is the key of the tuple, the predicate is the attribute name, and the object is the attribute value.
We will generalize buckets here, to allow not only the object (attribute value) but any of the elements of a triple to be specified by a range rather than fixed. We will also call buckets subsets, to match more familiar terminology:

Requirement 1 The elementary unit of information in a Web dataset description is a subset that holds:
– the subject URI or a specification of a range of subjects
– the predicate URI or a specification of a range of predicates
– the object URI or value, or a specification of a range of objects
– the cardinality: the number of triples that match the above.

This requirement makes subsets useful for retrieving the cardinality of isolated SPARQL triple patterns, by searching in the histogram for a subset where the bound values in the triple pattern fall within the range given in the subset.

Multidimensional buckets hold the number of tuples that have values within the range given in the bucket for all the attributes covered by the bucket. Again, we transfer this to RDF in a way that reflects SPARQL queries, requiring that histograms can also capture the cardinality of triple pattern joins:

Requirement 2 A subset can be specified by a set of triple specifications like those in Requirement 1, with added constraints on triple elements that should be identical.

Buckets are organized in an inclusion hierarchy, where the attribute ranges in a child bucket are subsets of the attribute ranges in its parent. Since the children of a bucket might or might not be a partitioning of the parent, depending on the histogram algorithm used, it must also be possible to specify if the children of a given bucket are a partitioning or not.

Requirement 3 Subsets are organized in an inclusion hierarchy, where the triples matching a child subset are included in the triples matching the parent subset. The children of a subset might or might not be a partitioning, but if they are it must be possible to express this fact.
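The lookup implied by Requirements 1 and 3 can be sketched as a descent through the inclusion hierarchy. The Python below is a minimal illustration under stated assumptions: the `Subset` class, the use of full-match regular expressions to specify ranges, and all URIs and counts are hypothetical choices made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Subset:
    cardinality: int                 # number of triples in this subset
    subject: Optional[str] = None    # regex over subjects; None = unconstrained
    predicate: Optional[str] = None  # regex over predicates
    obj: Optional[str] = None        # regex over objects
    children: list = field(default_factory=list)

    def covers(self, pattern):
        """True if every triple matching `pattern` must lie in this subset.

        `pattern` is a (subject, predicate, object) tuple in which None
        stands for an unbound variable.
        """
        for regex, value in zip((self.subject, self.predicate, self.obj), pattern):
            if regex is None:
                continue
            # A constrained position must be bound to a matching value.
            if value is None or not re.fullmatch(regex, value):
                return False
        return True


def estimate(root, pattern):
    """Cardinality of the most specific subset covering the pattern,
    which serves as an upper bound on the pattern's cardinality."""
    node = root
    while True:
        child = next((c for c in node.children if c.covers(pattern)), None)
        if child is None:
            return node.cardinality
        node = child


# Invented hierarchy: whole dataset > one predicate > one subject range.
P = r"http://example\.org/p"
root = Subset(1000, children=[
    Subset(200, predicate=P, children=[
        Subset(50, predicate=P, subject=r"http://example\.org/a/.*"),
    ]),
])
```

The more elements of the triple pattern are bound, the deeper the descent can go and the tighter the returned bound becomes.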
A further consideration is that, depending on the method used to construct the histogram, selectivity and cardinality values might be approximated or known to be within a given range, rather than exactly known. Srivastava et al. [26] and Kaushik and Suciu [15], for instance, presented methods that model cardinalities and distinct value counts based on the entropy maximization principle.

Requirement 4 Subset statistics do not need to be scalar values, but may also be more complex representations of constraints over or distributions of scalar values that are not precisely known.

4 The Sevod Vocabulary

Sevod extends VoID to provide the expressivity needed in order to support the query planning methods typically used in federated querying. The top Sevod classes and their relationship to their VoID super-classes are shown in Figure 1. The svd:Join class connects the void:Dataset instances that bind joins of triple patterns with the selectivity of these joins. The svd:Partition class connects the void:Dataset instances that form a partitioning of another void:Dataset instance and allows expressing the local closure of the data that might be present in this superset. The vocabulary can be downloaded from its namespace URL, http://www.w3.org/2015/03/sevod

[Figure 1: class diagram relating void:Linkset, void:Dataset, void:DatasetDescription, void:TechnicalFeature, svd:Partition, and svd:Join through the properties void:target, void:subset, void:feature, foaf:topic, foaf:primaryTopic, svd:partitions, svd:part, svd:joinSubject, svd:joinPredicate, and svd:joinObject.]

Fig. 1. Sevod top classes and relationship to VoID.

4.1 Joins of Triple Patterns

For the purposes of resource discovery, VoID is adequate for finding sources for grounding star patterns in the query, where the properties of a known ‘central’ resource need to be retrieved. It is less well-suited for finding sources for path and sink patterns, since VoID makes no assertions regarding the individuals in the object position of triples.
In order to fully cover Requirement 1, Sevod defines the following properties:

Definition 1 (Triple Pattern) The properties svd:subjectRegexPattern, svd:predicateRegexPattern, and svd:objectRegexPattern denote that all triples in a dataset have subject URIs, predicate URIs, and object URIs/lexical forms that match the given regular expressions.

    svd:subjectRegexPattern rdf:type rdf:Property ;
        rdfs:domain void:Dataset ;
        rdfs:range xsd:string .

    svd:predicateRegexPattern rdf:type rdf:Property ;
        rdfs:domain void:Dataset ;
        rdfs:range xsd:string .

    svd:objectRegexPattern rdf:type rdf:Property ;
        rdfs:domain void:Dataset ;
        rdfs:range xsd:string .

To cover Requirement 2, Sevod introduces the svd:Join class. A svd:Join instance expresses that two void:Dataset instances can be joined.

Definition 2 (Join) The svd:joins property links a svd:Join instance with the void:Dataset instances that are joined.

    svd:Join rdf:type rdfs:Class .

    svd:joins rdf:type rdf:Property ;
        rdfs:domain svd:Join ;
        rdfs:range void:Dataset .

The svd:joins property is refined into three sub-properties, so that a svd:Join instance can also specify on which triple element (subject, predicate, object) the two datasets join.

Definition 3 (joinSubject) The svd:joinSubject property links a svd:Join instance j with a void:Dataset instance d1 iff
– there is also a triple j svd:joins d2; and
– for all joinable triples (S P O) of d1, the corresponding joined triples in d2 have S as one of their elements.

    svd:joinsSubject rdfs:subPropertyOf svd:joins .

Definition 4 (joinPredicate) The svd:joinPredicate property links a svd:Join instance j with a void:Dataset instance d1 iff
– there is also a triple j svd:joins d2; and
– for all joinable triples (S P O) of d1, the corresponding joined triples in d2 have P as one of their elements.

    svd:joinsPredicate rdfs:subPropertyOf svd:joins .
Definition 5 (joinObject) The svd:joinObject property links a svd:Join instance j with a void:Dataset instance d1 iff
– there is also a triple j svd:joins d2; and
– for all joinable triples (S P O) of d1, the corresponding joined triples in d2 have O as one of their elements.

    svd:joinsObject rdfs:subPropertyOf svd:joins .

Different join patterns can be expressed using these properties, such as stars (using svd:joinSubject only), paths (using both svd:joinObject and svd:joinSubject), and sinks (using svd:joinObject only).

Naturally, svd:Join instances hold statistics about the selectivity of the join. To cover Requirement 4, we define the range of the svd:selectivity property to be a class instead of simple numerical fillers. In this manner, we encapsulate statistics under a class that can be extended to cover application-specific requirements. In order to ensure compatibility, we further require that instances of this svd:SelectivityValue class must have an rdf:value property and that this property has as value an xsd:integer. Other, application-specific, properties may be defined as needed for extensions of this class.

Definition 6 (selectivity) The svd:selectivity property links an instance of svd:Join with an instance of svd:SelectivityValue that is (or estimates or approximates) the selectivity of the join denoted by the svd:Join instance. Instances of the svd:SelectivityValue class denote an exact, estimated, or approximated measurement. svd:SelectivityValue instances must have the rdf:value property with a filler that is an xsd:integer that is either the value itself (if the measurement is exact and certain) or a value that is appropriate to use by applications that do not take into account uncertainty or approximation parameters. Where appropriate, uncertainty or approximation parameters are given by application-specific properties.

    svd:selectivity rdf:type rdf:Property ;
        rdfs:domain svd:Join ;
        rdfs:range svd:SelectivityValue .
    svd:SelectivityValue rdf:type rdfs:Class .

Join Example

To give an example, consider the following SPARQL query which retrieves from the New York Times dataset topic pages about places in Greece, using links to Geonames to identify which of the places in the New York Times dataset are in Greece:

    SELECT ?Page
    WHERE {
      ?X nyt:topicPage ?Page .
      ?X owl:sameAs ?G .
      ?G geonames:countryCode "GR" .
    }

Let us assume that two SPARQL endpoints serving relevant data can be identified, based on void:property annotations: the nyt:topicPage predicate only appears in the New York Times endpoint and the geonames:countryCode predicate only appears in the Geonames endpoint. For the more ubiquitous owl:sameAs predicate, a finer-grained description that gives information about the subjects and objects is used to point at the New York Times endpoint. One piece of information needed for efficient query execution planning is the cardinality of the query fragment that will be executed at each endpoint. Figure 2 gives an example of how svd:Join instances represent statistics about joins of triple patterns.

[Figure 2: graph of the following nodes, connected by edges labelled void:propertyPartition, svd:joinsSubject, and svd:selectivity.]

    :nyt a void:Dataset ;
        void:triples "345889"^^xsd:long .

    :nyt-1 a void:Dataset ;
        void:property nyt:topicPage ;
        void:triples "7432"^^xsd:long ;
        void:distinctSubjects "7432"^^xsd:int ;
        void:distinctObjects "7354"^^xsd:int .

    :nyt-2 a void:Dataset ;
        void:property owl:sameAs ;
        void:triples "31673"^^xsd:long ;
        void:distinctSubjects "11133"^^xsd:int ;
        void:distinctObjects "30614"^^xsd:int .

    _:_ a svd:Join

    _:_ a svd:SelectivityValue
        rdf:value "21236"^^xsd:long .

Fig. 2. Example usage of svd:Join.

4.2 Partitions

VoID defines the void:subset property for structuring the overall dataset into homogeneous subsets. This partially covers Requirement 3 by defining the vocabulary needed for expressing inclusion hierarchies of subsets.
To fully cover Requirement 3, Sevod introduces classes and properties that express that the subsets of a given dataset are exhaustive and no other subsets exist.

Definition 7 (Partition) svd:Partition denotes that a set of void:Dataset instances are a partition of another void:Dataset instance. This is done by using the property svd:part to link the svd:Partition instance with the instances that make up the partition and the property svd:partitions to link it with the instance that is partitioned by them.

    svd:Partition rdf:type rdfs:Class .

Definition 8 (partitions) svd:partitions is the functional property that links a svd:Partition instance with the void:Dataset for which it is a partition.

    svd:partitions rdf:type rdf:Property ;
        rdfs:domain svd:Partition ;
        rdfs:range void:Dataset .

Definition 9 (part) The svd:part property links a svd:Partition instance with each of the void:Dataset instances that make up the partition. All fillers of this property must also be fillers of the void:subset property of the void:Dataset instance that fills the svd:Partition instance’s svd:partitions property.

    svd:part rdf:type rdf:Property ;
        rdfs:domain svd:Partition ;
        rdfs:range void:Dataset .

Partition Example

Building on our previous example, Figure 3 gives a more detailed description of the contents of :nyt-2, the owl:sameAs subset of the New York Times dataset. As already hinted in the discussion of joins, details about the subjects and objects of ubiquitous predicates such as rdf:type or owl:sameAs are needed to avoid involving irrelevant datasets in a query. Figure 3 shows how svd:Partition instances can be used to express two partitionings of the same data along two facets (subject namespace and object namespace).

5 Generating Sevod Descriptions

Detailed instance-level descriptions are cumbersome to acquire and maintain, ruling out manual curation. The process can, however, be automated by observing the responses to queries.
SWOOGLE [11], RDFStats [16], LODStats [4] and STRHist [28] generate rich enough statistics to populate the VoID and Sevod models. Useful as they might be, these estimators of what lies behind a SPARQL endpoint are developed to work around the lack of an authoritative data source description by the data source itself. Although providing a DCAT or VoID description is becoming increasingly popular, these descriptions are restricted to general metadata about the dataset and schema-level descriptions, even when VoID is used.

We present here Sevod Scraper, a tool which extracts Sevod metadata directly from the actual data, operating over an RDF dump file of the dataset (open source, available at https://github.com/semagrow/sevod-scraper). Sevod Scraper is meant to be used by the data providers to prepare and maintain dataset descriptions to be published together with the actual data.

[Figure 3: graph of the node :nyt-2, eight subset nodes _:p2-1 to _:p2-8, and two nodes "_:_ a svd:Partition", connected by edges labelled void:subset, svd:partitions, and svd:part. Each subset node is "a void:Dataset ; void:property owl:sameAs" and carries an outgoing svd:subjectVocabulary or svd:objectVocabulary edge, as listed below.]

    :nyt-2 a void:Dataset ;
        void:property owl:sameAs ;
        void:triples "31673"^^xsd:long .

    Node      void:triples           Vocabulary edge
    _:p2-1    "9939"^^xsd:long       svd:objectVocabulary
    _:p2-2    "1789"^^xsd:long       svd:objectVocabulary
    _:p2-3    "5"^^xsd:long          svd:objectVocabulary
    _:p2-4    "7"^^xsd:long          svd:objectVocabulary
    _:p2-5    "30423"^^xsd:long      svd:subjectVocabulary
    _:p2-6    "1250"^^xsd:long       svd:subjectVocabulary
    _:p2-7    "9990"^^xsd:long       svd:objectVocabulary
    _:p2-8    "9943"^^xsd:long       svd:objectVocabulary

Fig. 3. Example usage of svd:Partition.
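Because the parts of a partition are disjoint and exhaustive, their triple counts must add up to the triple count of the partitioned dataset. The short Python sketch below checks this necessary (but not sufficient) condition on the counts shown in Fig. 3. The function name is hypothetical, and the grouping of the eight subsets into a subject-namespace and an object-namespace partitioning is an assumption made for this example, read off the vocabulary labels in the figure.

```python
def counts_consistent(whole_triples, part_triples):
    """Necessary condition for a partition: the parts' triple counts
    sum to the triple count of the partitioned dataset."""
    return sum(part_triples) == whole_triples


NYT_2_TRIPLES = 31673  # void:triples of :nyt-2 in Fig. 3

# Assumed grouping of the Fig. 3 subsets into the two partitionings.
by_subject_namespace = [30423, 1250]
by_object_namespace = [9939, 1789, 5, 7, 9990, 9943]
```

Such a check cannot prove that the parts are disjoint, but it is a cheap way to catch a description that has gone out of sync with the data.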
Sevod Scraper partitions the dataset into several (possibly overlapping) subsets. Since we need to generate metadata not only for properties, but for subjects and objects as well, we generate one subset for each property, one subset for some subject URI prefixes and one subset for some object URI prefixes. The decision to use prefixes to specify URI ranges is based on the observation [16] that string (including URI) range estimations can be given:
– using one bucket for each distinct string or storing the range as a set of strings, resulting in large histograms;
– using a hash function to reduce the number of distinct strings. However, hashing the string representation of URIs fails to take into account the semantic similarity between resources, so it is unlikely that a universally good function can be identified [13]; or
– by reducing strings to prefixes.

Determining which URI prefixes to keep is a balance between the size and the detail of the resulting metadata. To make this decision easier, the user sets parameters in terms of size limits and the tool decides how to adhere to these by selecting which URI prefixes to include in the description. Table 1 gives a sense of the size of the descriptions generated by Sevod Scraper using default parameters. The table lists the number of triples used to describe well-known datasets from the FedBench suite [24].

Table 1. FedBench dataset statistics and number of metadata triples obtained by Sevod Scraper using the default parameters.

    Dataset    Triples       Properties  Distinct subjects  Distinct objects  Metadata triples
    ChEBI      4,775,935     49          51,297             772,930           437
    DBpedia    42,852,838    1,080       9,496,685          13,518,436        6,282
    DrugBank   520,252       141         20,511             408,964           1,142
    GeoNames   107,953,314   48          7,480,534          35,454,090        367
    Jamendo    1,052,868     49          336,745            593,727           2,777
    KEGG       1,094,059     42          35,080             940,041           426
    LMDB       6,151,225     244         695,220            2,053,739         2,143
    NYT        338,426       57          22,486             191,288           420
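To make the trade-off between size and detail concrete, the following Python sketch reduces a set of URIs to prefix patterns under a bound on the number of children per trie node. It is a simplified illustration and not the Sevod Scraper implementation: the function name is hypothetical, URIs are split on `/`, and an overfull node is collapsed entirely into a single `*` wildcard.

```python
import re

END = ""  # marker key: some URI ends exactly at this trie node


def summarize_prefixes(uris, bound):
    """Reduce URIs to prefix patterns so that no trie node keeps more
    than `bound` children; an overfull node becomes '<prefix>*'."""
    trie = {}
    for uri in uris:
        node = trie
        # Split into path components, keeping the trailing slashes.
        for part in re.findall(r"[^/]+/*", uri):
            node = node.setdefault(part, {})
        node[END] = {}

    patterns = []

    def walk(node, prefix):
        children = {k: v for k, v in node.items() if k != END}
        if len(children) > bound:
            # Too much detail below this prefix: summarize with a wildcard.
            patterns.append(prefix + "*")
            return
        if END in node:
            patterns.append(prefix)
        for part, child in sorted(children.items()):
            walk(child, prefix + part)

    walk(trie, "")
    return patterns


# Invented sample URIs.
sample = [
    "http://a.org/x/1",
    "http://a.org/x/2",
    "http://a.org/x/3",
    "http://b.org/y",
]
```

A smaller bound yields fewer and coarser patterns, and therefore a smaller description, at the price of less precise source selection.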
Given a user-provided bound B, if k > B URI prefixes share a common prefix, we would like to replace them with their longest common prefix. For this reason, Sevod Scraper uses two path tries, one for the subject URIs and one for the object URIs. In these path tries each edge corresponds to a path component, and the special component * denotes any number of characters or the empty string. If an insertion leaves a node with more than B children, some of its children are combined into a single node using their common URI prefix. Instead of subject and object URIs, these nodes are specified using the svd:subjectRegexPattern and svd:objectRegexPattern properties. This hierarchy of trie nodes is then represented as a hierarchy of VoID subsets.

For each subset, Sevod Scraper extracts the standard VoID statistics, namely the number of triples, properties, distinct subjects, and distinct objects of the subset. Especially for the property subsets, the authority component of all subjects and objects is extracted and added to the description using the svd:subjectVocabulary and svd:objectVocabulary terms. (The authority component is the URI component after the scheme, if foreseen by the scheme; for the http scheme used here, the authority is the host name between http:// and the immediately following /.) Finally, the scraper also computes join selectivity metadata. For every pair of property subsets, the tool computes the selectivity values of the star (i.e., subject-subject), the sink (i.e., object-object) and the path (i.e., object-subject) joins between these subsets.

6 Conclusions

We presented the Sevod vocabulary, an extension of VoID that specifically addresses the aspects of dataset description that are relevant to efficient and transparent federated SPARQL query processing.
The extension is designed to address federated SPARQL requirements extracted by analysing the information needs of current federated SPARQL query processors. Sevod is the first vocabulary that makes explicit and shareable the data summaries used to optimize query processing.

The adoption and maintenance of Sevod can facilitate the transfer of optimization methods between the databases and the Semantic Web communities. But of more interest to the Semantic Web is the ability to publish these detailed data summaries, allowing endpoints to provide the metadata needed to be discovered as relevant to federated queries and to be included in an efficient query execution plan. In this manner, federated querying can be made as efficient as the querying of distributed databases while maintaining the dynamic and decentralized nature of the Semantic Web.

In order to realize this ability, we have started by making it easy for data providers to publish Sevod descriptions. We developed the Sevod Scraper tool, which automates the generation of Sevod descriptions of varying detail according to the intended description size. Future work on the Scraper will use past query load to make more informed decisions about where the descriptions should be more detailed and where they can be left more shallow, in order to adhere to the publishers’ maximum description size requirement with minimal loss of query plan efficiency. This will be based on our previous work on workload-aware histogram construction on the client side [28], but extended to take advantage of the direct access to the data enjoyed by the publisher.

The next step will be to develop tools for serializing and deserializing Sevod descriptions for the most prominent current federated query processors. The expectation is that this will close the loop between data publishers and data consumers, and kick-start the adoption of Sevod.
Acknowledgements The work described here has received funding from the European Union’s Hori- zon 2020 research and innovation programme under grant agreement No 644564. For more details, please visit https://www.big-data-europe.eu Bibliography [1] Aboulnaga, A., Chaudhuri, S.: Self-tuning histograms: Building histograms without looking at data. In: Proceedings of the 1999 ACM Interna- tional Conference on Management of Data (SIGMOD ’99). pp. 181–192. ACM, New York, NY, USA (1999), http://doi.acm.org/10.1145/ 304182.304198 [2] Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets with the VoID vocabulary. W3C Interest Group Note, 3 March 2011 (Mar 2011), http://www.w3.org/TR/void [3] Archer, P., Shukair, G.: Asset description metadata schema (ADMS). W3C Working Group Note, 1 August 2013 (2013), http://www.w3.org/TR/ vocab-adms, this version of the WG Note is based on M. Dekkers (ed., 2012), ADMS Draft Specification, ISA Deliverable D1.1. URL https:// joinup.ec.europa.eu/asset/adms [4] Auer, S., Demter, J., Martin, M., Lehmann, J.: LODStats – an extensible framework for high-performance dataset analytics. In: Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW ’12). pp. 353–362. Springer-Verlag, Berlin, Heidelberg (2012), http://dx.doi.org/10.1007/978-3-642-33876-2_31 [5] Basca, C., Bernstein, A.: Querying a messy web of data with Avalanche. Journal of Web Semantics 26(1), 1–28 (2014), http://dx.doi.org/10. 1016/j.websem.2014.04.002 [6] Ben Ellefi, M., Bellahsene, Z., Breslin, J., Demidova, E., Dietze, S., Szyman- ski, J., Todorov, K.: RDF dataset profiling: A survey of features, methods, vocabularies and applications. Semantic Web Journal, accepted for publi- cation (2017). [7] Bruno, N., Chaudhuri, S.: Exploiting statistics on query expressions for optimization. In: Proceedings of the 2002 ACM International Conference on Management of Data (SIGMOD ’02). pp. 263–274. 
ACM, New York, NY, USA (2002), http://doi.acm.org/10.1145/564691.564722 [8] Cem Ozkan, E., Saleem, M., Dogdu, E., Ngonga Ngomo, A.C.: UPSP: unique predicate-based source selection for SPARQL endpoint federation. In: Proceedings of 3rd International Workshop on Dataset Profiling and Federated Search for Linked Data (PROFILES 2016), held on 30 May 2016 at ESWC 2016, Anissaras, Crete, Greece (2016) [9] Charalambidis, A., Troumpoukis, A., Konstantopoulos, S.: SemaGrow: Op- timizing federated SPARQL queries. In: Proceedings of the 11th Inter- national Conference on Semantic Systems (SEMANTiCS 2015), Vienna, Austria, 15-18 September 2015 (2015), http://dx.doi.org/10.1145/ 2814864.2814886 [10] Chaudhuri, S.: An overview of query optimization in relational systems. In: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS ’98). pp. 34–43 (1998) [11] Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V., Sachs, J.: Swoogle: A search and metadata engine for the se- mantic web. In: Proceedings of the Thirteenth ACM International Confer- ence on Information and Knowledge Management. pp. 652–659. CIKM ’04, ACM, New York, NY, USA (2004), http://doi.acm.org/10.1145/ 1031171.1031289 [12] Görlitz, O., Staab, S.: SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In: Proceedings of the 2nd International Workshop on Consuming Linked Data (COLD 2011), Bonn, Germany, October 23, 2011. CEUR Workshop Proceedings, vol. 782 (2011) [13] Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.U., Umbrich, J.: Data summaries for on-demand queries over linked data. In: Proceedings of the 19th International World Wide Web Conference (WWW 2010), Raleigh, NC, USA, 26-30 April 2010 (2010) [14] Ioannidis, Y.: The history of histograms (abridged). 
In: Proceedings of the 29th International Conference on Very Large Databases (VLDB 2003), Berlin, Germany (2003), Ten-Year Best Paper Award
[15] Kaushik, R., Suciu, D.: Consistent histograms in the presence of distinct value counts. Proc. VLDB Endow. 2(1), 850–861 (Aug 2009), http://dx.doi.org/10.14778/1687627.1687723
[16] Langegger, A., Wöss, W.: RDFStats – an extensible RDF statistics generator and library. In: 23rd International Workshop on Database and Expert Systems Applications. pp. 79–83. IEEE Computer Society, Los Alamitos, CA, USA (2009)
[17] Maali, F., Erickson, J., Archer, P.: Data catalog vocabulary (DCAT). W3C Recommendation 16 January 2014 (Jan 2014), http://www.w3.org/TR/vocab-dcat
[18] Montoya, G., Skaf-Molli, H., Hose, K.: The Odyssey approach for optimizing federated SPARQL queries. In: 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, 23–25 October 2017 (2017), preprint available at https://arxiv.org/abs/1705.06135
[19] Quilitz, B., Leser, U.: Querying distributed RDF data sources with SPARQL. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) Proceedings of the 5th European Semantic Web Conference (ESWC 2008), Tenerife, Spain, 1–5 June 2008. Lecture Notes in Computer Science, vol. 5021 (2008)
[20] Rakhmawati, N.A., Hausenblas, M.: On the impact of data distribution in federated SPARQL queries. In: Proceedings of the Sixth IEEE International Conference on Semantic Computing (ICSC 2012). pp. 255–260 (2012)
[21] Rakhmawati, N.A., Umbrich, J., Karnstedt, M., Hasnain, A., Hausenblas, M.: Querying over federated SPARQL endpoints: A state of the art survey. Tech. rep., DERI (Jun 2013), http://arxiv.org/abs/1306.1723
[22] Saleem, M., Ngonga Ngomo, A.C.: HiBISCuS: Hypergraph-based source selection for SPARQL endpoint federation. In: Proceedings of the 11th ESWC Conference, Anissaras, Crete, Greece, 25–29 May 2014. pp.
176–191 (2014), http://dx.doi.org/10.1007/978-3-319-07443-6_13
[23] Saleem, M., Ngonga Ngomo, A.C., Xavier Parreira, J., Deus, H.F., Hauswirth, M.: DAW: Duplicate-aware federated query processing over the Web of Data. In: Proceedings of the 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, 21–25 October 2013, Part I. Lecture Notes in Computer Science, vol. 8218. Springer (2013), https://doi.org/10.1007/978-3-642-41335-3_36
[24] Schmidt, M., Görlitz, O., Haase, P., Ladwig, G., Schwarte, A., Tran, T.: FedBench: A benchmark suite for federated semantic data query processing. In: Proceedings of the 10th International Semantic Web Conference (ISWC 2011), Bonn, Germany, 23–27 October 2011. Lecture Notes in Computer Science, vol. 7031, pp. 585–600. Springer (2011), http://doi.org/10.1007/978-3-642-25073-6_37
[25] Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: A federation layer for distributed query processing on Linked Open Data. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC 2011), Heraklion, Crete, Greece, May 29 – June 2, 2011. Lecture Notes in Computer Science, vol. 6644, pp. 481–486. Springer (2011)
[26] Srivastava, U., Haas, P.J., Markl, V., Kutsch, M., Tran, T.M.: ISOMER: Consistent histogram construction using query feedback. In: Proceedings of the 22nd International Conference on Data Engineering (ICDE ’06). IEEE Computer Society, Washington, DC, USA (2006), http://dx.doi.org/10.1109/ICDE.2006.84
[27] Wang, X., Tiropanis, T., Davis, H.C.: LHD: optimising linked data query processing using parallelisation. In: Proceedings of Linked Data on the Web (LDOW 2013), Rio de Janeiro, 14 May 2013 (2013)
[28] Zamani, K., Charalambidis, A., Konstantopoulos, S., Zoulis, N., Mavroudi, E.: Workload-aware self-tuning histograms for the Semantic Web. Transactions on Large Scale Data and Knowledge-Centered Systems 28 (Sep 2016), http://dx.doi.org/10.1007/978-3-662-53455-7_6, published as LNCS 9940