Assessing Linked Data Versioning Systems: The Semantic Publishing Versioning Benchmark*

Vassilis Papakonstantinou, Irini Fundulaki, and Giorgos Flouris
Institute of Computer Science-FORTH, Greece

Abstract. As the Linked Open Data Cloud is constantly evolving, both at schema and instance level, there is a need for systems that efficiently support storing and querying of such data. However, there is a limited number of such systems and even fewer benchmarks that test their performance. In this paper, we describe in detail the Semantic Publishing Versioning Benchmark (SPVB) that aims to test the ability of versioning systems to efficiently manage versioned Linked Data datasets and queries evaluated on top of these datasets. We discuss the benchmark data and SPARQL query generation process, as well as the evaluation methodology we followed for assessing the performance of a benchmarked system. Finally, we describe a set of experiments conducted with the R43ples and Virtuoso systems using SPVB.

Keywords: RDF, Linked Data, Versioning, SPARQL, Benchmarking

* This work will be published as part of the book "Emerging Topics in Semantic Technologies. ISWC 2018 Satellite Events. E. Demidova, A.J. Zaveri, E. Simperl (Eds.), ISBN: 978-3-89838-736-1, 2018, AKA Verlag Berlin".

1 Introduction

A key step towards abolishing the barriers to the adoption and deployment of Big Data is to provide companies with open benchmarking reports that allow them to assess the fitness of existing solutions for their purposes. For this reason, there exist a number of benchmarks that test the ability of Linked Data systems to store and query data in an efficient way. However, to the best of our knowledge, only a limited number of systems (mostly academic) and benchmarks exist for handling versioned datasets and for testing the proposed solutions, respectively.

The existence of such systems and benchmarks is nevertheless of utmost importance, as dynamicity is an indispensable part of the Linked Open Data (LOD) initiative [1, 2]. In particular, both the data and the schema of LOD datasets are constantly evolving for several reasons, such as the inclusion of new experimental evidence or observations, or the correction of erroneous conceptualizations [3]. The open nature of the Web implies that these changes typically happen without any warning, centralized monitoring, or reliable notification mechanism; this raises the need to keep track of the different versions of the datasets and introduces new challenges related to assuring the quality and traceability of Web data over time. The tracking of frequent changes is called versioning, and the systems that can handle such versioned data, versioning systems. Note that data versioning differs slightly from data archiving, since the latter refers to inactive and rarely modified data that needs to be retained for long periods of time.

In this paper, we discuss the Semantic Publishing Versioning Benchmark (SPVB), developed in the context of the H2020 European HOBBIT project (https://project-hobbit.eu/) for testing the ability of versioning systems to efficiently manage versioned datasets. SPVB acts as a benchmark generator, as it generates both the data and the queries needed to test the performance of versioning systems. The main component of the benchmark is the Data Generator, which is highly configurable.
SPVB is not tailored to any versioning strategy (see Section 2.1) and can produce data of different sizes that can be altered in order to create arbitrary numbers of versions using configurable insertion and deletion ratios. It uses the data generator of the Linked Data Benchmark Council's (LDBC, ldbcouncil.org) Semantic Publishing Benchmark (SPB) as well as DBpedia [4] data. LDBC-SPB leverages the scenario of the BBC media organization, which makes extensive use of Linked Data technologies such as RDF and SPARQL. SPVB's Data Generator is also responsible for producing the SPARQL queries (the so-called tasks) that have to be executed by the system under test. Such queries are of different types (see Section 2.2) and are partially based on a subset of the 25 query templates defined in the context of the DBpedia SPARQL Benchmark (DBPSB) [5]. SPVB evaluates the correctness and performance of the tested system via the following Key Performance Indicators (KPIs): i) Query failures, ii) Initial version ingestion speed, iii) Applied changes speed, iv) Storage space cost, v) Average Query Execution Time and vi) Throughput.

The outline of the paper is the following. In Section 2, we discuss background information regarding versioning strategies and query types. We present the already proposed benchmarks for systems handling versioned RDF data in Section 3. In Sections 4 and 5 we describe the versioning benchmark SPVB and the experiments conducted with it, respectively. Finally, Section 6 concludes and outlines future work.

2 Background

2.1 Versioning Strategies

Three alternative RDF versioning strategies have been proposed in the literature. The first one is full materialization, where all different versions are explicitly stored [6]. Next, there is the delta-based strategy, where one full version of the dataset needs to be stored and, for each new version, only the set of changes with respect to the previous/next version (also known as the delta) has to be kept [7-11]. Finally, the annotated triples strategy is based on the idea of augmenting each triple with its temporal validity, which is usually composed of two timestamps that determine when the triple was created and deleted [12]. Hybrid strategies [13] that combine the above have also been considered. Such strategies try to enjoy most of the advantages of each approach, while avoiding many of their respective drawbacks.

2.2 Query Types

An important novel challenge imposed by the management of multiple versions is the generation of different types of queries (e.g., queries that access multiple versions and/or deltas). There have been some attempts in the literature [14, 13, 15, 16] to identify and categorize these types of queries. Our suggestion, which is a combination of such efforts and was presented in [17], is shown in Figure 1.

Fig. 1: Different queries organized by focus and type (see also [17])

Firstly, queries are distinguished by focus (i.e., target) into version and delta queries. Version queries consider complete versions, whereas delta queries consider deltas. Version queries can be further classified into modern and historical, depending on whether they require access to the latest version (the most common case) or a previous one. Obviously, the latter categorization cannot be applied to delta queries, as they refer to changes between versions (i.e., intervals) rather than to a specific point in time.
In addition, queries can be further classified according to type into materialization, single-version and cross-version queries. Materialization queries essentially request the entire respective data (a full version, or a full delta); single-version queries can be answered by imposing appropriate restrictions and filters over a single dataset version or a single delta; whereas cross-version queries request data related to multiple dataset versions (or deltas). Of course, the above categories are not exhaustive; one could easily imagine queries that belong to multiple categories, e.g., a query requesting access to a delta as well as to multiple versions. These types of queries are called hybrid queries. More specifically, the types of queries that we consider in SPVB are:

– QT1 - Modern version materialization queries ask for a full current version to be retrieved.
– QT2 - Modern single-version structured queries are performed on the current version of the data.
– QT3 - Historical version materialization queries, on the other hand, ask for a full past version.
– QT4 - Historical single-version structured queries are performed on a past version of the data.
– QT5 - Delta materialization queries ask for a full delta to be retrieved from the repository.
– QT6 - Single-delta structured queries are performed on the delta of two consecutive versions.
– QT7 - Cross-delta structured queries are evaluated on the changes of several versions of the dataset.
– QT8 - Cross-version structured queries are evaluated on several versions of the dataset.
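To make the taxonomy concrete, the sketch below shows one possible shape of a cross-version structured query (QT8), assuming, as SPVB's generated workload later does (Section 4.2), that every version is kept in its own named graph. The graph IRIs, the cwork: prefix and the DBpedia entity IRI are illustrative placeholders, not part of the benchmark's fixed vocabulary.

# Hypothetical QT8 (cross-version structured) query: creative works that are
# about the same entity in two different versions, assuming each version is
# stored in its own named graph (all IRIs are illustrative).
PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>

SELECT DISTINCT ?creativeWork
WHERE {
  GRAPH <http://example.org/versions/0> {
    ?creativeWork cwork:about <http://dbpedia.org/resource/Example_Entity> .
  }
  GRAPH <http://example.org/versions/3> {
    ?creativeWork cwork:about <http://dbpedia.org/resource/Example_Entity> .
  }
}

A system that does not store versions as named graphs would have to rewrite such a query into its own versioning dialect, which is exactly what SPVB expects benchmarked systems to do (see Section 4.2).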
3 Related Work

A benchmark is a set of tests against which the performance of a system is evaluated; it allows system developers to compare and assess their systems' performance in order to make them more efficient and competitive. To the best of our knowledge, only two benchmarks for versioning systems have been proposed in the literature; they are described below (see [17] for more details).

The BEAR [18, 14] benchmark is an implementation and evaluation of a set of operators that cover crucial aspects of querying and versioning Semantic Web data for the three versioning strategies (full materialization, delta-based and annotated triples) described in Section 2.1. As a basis for comparing the different strategies, BEAR introduces a set of features that describe the dataset configuration: i) the data dynamicity, which measures the number of changes between versions, ii) the data static core, which contains the triples that exist in all dataset versions, iii) the total version-oblivious triples, which counts the total number of different triples in a version, and, finally, iv) the RDF vocabulary, which describes the different subjects, predicates and objects in a single version. Regarding the generation of the benchmark queries, the result cardinality and selectivity of the queries are considered, so as to guarantee that differences in query response times are attributed to the versioning strategy. In order to be able to judge the different strategies, BEAR introduces five foundational query atoms to cover the broad spectrum of emerging retrieval demands in RDF archiving. In particular, the authors propose i) queries on versions (QT2, QT4 from our categorization), ii) queries on deltas (QT5, QT6, QT7), iii) Version queries, which return the results of a query Q annotated with the version label in which each result exists, iv) Cross-version join queries, which join the results of some Q1 in Vi with the results of some Q2 in Vj, and v) Change materialization queries, which provide those consecutive versions in which a given query Q produces different results.

Even though BEAR provides a detailed theoretical analysis of the features that are useful for designing a benchmark, it lacks configurability and scalability, as its data workload is composed of a static, non-configurable dataset. Also, it focuses on the evaluation of the versioning strategies instead of the systems that implement them, which is the main objective of SPVB.

EvoGen [16] is a generator for versioned RDF data used for benchmarking versioning and change detection systems. EvoGen is based on the LUBM generator [19], extending its schema with 10 RDF classes and 19 properties to support schema evolution. Its benchmarking methodology is based on a set of requirements and parameters that affect the data generation process, the context of the tested application and the query workload, as required by the nature of the evolving data. EvoGen is an extensible and configurable benchmark generator in terms of the number of generated versions and the number of changes occurring from version to version. The query workload produced by EvoGen leverages the 14 LUBM queries, appropriately adapted to apply to versions. In particular, the following six types of queries are generated: i) retrieval of a diachronic dataset, ii) retrieval of a specific version (QT1, QT3 from our categorization), iii) snapshot queries (QT2, QT4), iv) longitudinal (temporal) queries (QT8), v) queries on changes (QT5, QT6) and vi) mixed queries.

EvoGen is a more complete benchmark, as it is a highly configurable and extensible benchmark generator. However, its query workload seems to be approach-dependent, in the sense that the delta-based queries require the benchmarked systems to store metadata about the underlying deltas (additions/deletions of classes, class instances etc.) in order to be answered. Moreover, to successfully answer 11 of the 14 original LUBM queries, the benchmarked systems must support RDFS reasoning (forward or backward).

4 Semantic Publishing Versioning Benchmark

Here, we present the second version of SPVB (the first version was presented in [20]), developed in the context of the HOBBIT H2020 project. The benchmark is built on top of the HOBBIT platform, which is available online (https://master.project-hobbit.eu/), but it can also be deployed locally (https://github.com/hobbit-project/platform). The source code of the benchmark can be found in the project's GitHub repository (https://github.com/hobbit-project/versioning-benchmark). SPVB is the first benchmark for versioned RDF data that uses both realistic synthetic data and real DBpedia data, while at the same time its query workload is mainly based on real DBpedia queries.

The new version of SPVB, when compared to its previous version, has been improved in various respects. First, the older version supported only additions from one version to another, and the generated data was split into different versions of equal size according to their creation date.
As a result, the benchmark was not useful for systems that had to be tested on versioned datasets of increasing or decreasing size. On the contrary, the new version of SPVB allows one to configure the number of additions and deletions for the different versions. This is a critical feature, as it allows one to produce benchmarks that cover a broad spectrum of situations regarding versioned datasets and thus reveal the benefits or pitfalls of the systems under test. A second improvement is related to the form of the data. In the previous version of the benchmark, each data version was sent to the system only as an independent copy, whereas now the data generator can send data in different forms (independent copies, change-sets or both). As a result, systems do not need to post-process the data in order to load them, since they can choose the form in which to receive the data according to the versioning strategy they implement. Finally, the generated queries were previously exclusively synthetic. Conversely, in the current version, the use of DBpedia data allows us to use real DBpedia queries that arise from real-world situations, making the benchmark more realistic.

SPVB consists of the following four main components: the Data Generator, the Task Provider, the Evaluation Storage and the Evaluation Module. Each of these components is described in detail in the following sections, and their architecture is shown graphically in Figure 2.

Fig. 2: Architecture of SPVB

4.1 Data Generation

The Data Generator, as shown in Figure 2, is the main component of SPVB. It is responsible for creating both the versions and the SPARQL queries, as well as for computing the expected results (gold standard) for the benchmark queries. SPVB's data generator is highly configurable, since it allows the user to generate data with different characteristics and of different forms. In particular, the following parameters can be set to configure the data generation process:

1. Number of versions: defines the number of versions to produce.
2. Size: defines the size of the initial version of the dataset in terms of triples.
3. Version insertion ratio: defines the proportion of added triples between two consecutive versions (originally proposed in [18]). In particular, given two versions $V_i$ and $V_{i+1}$, the version insertion ratio $\delta^{+}_{i,i+1}$ is computed by the formula $\delta^{+}_{i,i+1} = |\Delta^{+}_{i,i+1}| / |V_i|$, where $|\Delta^{+}_{i,i+1}|$ is the number of added triples from version $i$ to version $i+1$, and $|V_i|$ is the total number of triples of version $i$ (to which the triples are added). For example, with $|V_i| = 100K$ triples and an insertion ratio of 15%, 15K triples are added when producing version $i+1$.
4. Version deletion ratio: defines the proportion of deleted triples between two consecutive versions (also proposed in [18]). Given two versions $V_i$ and $V_{i+1}$, the version deletion ratio $\delta^{-}_{i,i+1}$ is computed by the formula $\delta^{-}_{i,i+1} = |\Delta^{-}_{i,i+1}| / |V_i|$, where $|\Delta^{-}_{i,i+1}|$ is the number of deleted triples from version $i$ to version $i+1$, and $|V_i|$ is the total number of triples of version $i$ (from which the triples are deleted).
5. Generated data form: each system implements a different versioning strategy, so it requires the generated data in a specific form.
SPVB's data generator can output the data of each version i) as an independent copy (suitable for systems implementing the full materialization strategy), ii) as a change-set, i.e., the sets of added and deleted triples (suitable for systems implementing the delta-based or annotated triples strategies), or iii) both as an independent copy and a change-set (suitable for systems implementing a hybrid strategy).
6. Generator seed: sets the random seed for the data generator. This seed is used to control all random data generation happening in SPVB.

Based on these parameters the generator produces a versioned dataset that contains both realistic synthetic data and real DBpedia data. Regarding the generation of synthetic data, the data generator of SPVB uses the data generator of LDBC-SPB [21] for producing the initial version of the dataset as well as the triples that will be added from one version to another. To do so, it uses seven core and three domain RDF ontologies (as described in [21]) and a reference dataset of DBpedia.

The data generator of LDBC-SPB produces RDF descriptions of creative works that are valid instances of the BBC Creative Work core ontology. A creative work can be defined as metadata about a real entity (or entities) that exists in the reference dataset of DBpedia. A creative work has a number of properties such as title, shortTitle, description, dateCreated, audience and format; it has a category and can be about or mention any entity from the DBpedia reference dataset. Thus, a creative work provides metadata about one or more entities and defines relations between them. The data generator of LDBC-SPB models three types of relations in the data, and for each one produces 1/3 of the number of creative works. These relations are the following:

– Correlations of entities. The correlation effect is produced by generating creative works about two or three entities from the reference data in a fixed period of time.
– Clustering of data. The clustering effect is produced by generating creative works about a single entity from the reference dataset and for a fixed period of time. The number of creative works referencing an entity starts with a high peak at the beginning of the clustering period and follows a smooth decay towards its end.
– Random tagging of entities. Random data distributions are defined with a bias towards popular entities created when the tagging is performed.

For producing the initial version of the dataset, the data generator of LDBC-SPB runs with the Size configuration parameter as input. For producing the triples to add, we first compute the number of triples to be added using the version insertion ratio, and then run the data generator with this number as input. As a result, the set of creative works that have to be added is produced. At this point we let the generator produce random data distributions, as we do not want to "break" the existing relations among creative works and entities (clustering or correlation of entities) of the initial version.
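Before turning to deletions, the following sketch illustrates the rough shape of the added triples for a single generated creative work, written as a SPARQL Update against the named graph of the target version. The property names follow the description above; the cwork: namespace, graph IRI, entity IRIs and literal values are purely illustrative and not taken from the benchmark itself.

# Illustrative change-set for one generated creative work (all IRIs and
# values are hypothetical); the added triples are inserted into the graph
# holding the target version.
PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>

INSERT DATA {
  GRAPH <http://example.org/versions/1> {
    <http://example.org/creativeworks/42>
        cwork:title       "An illustrative creative work" ;
        cwork:description "Metadata about a DBpedia entity" ;
        cwork:dateCreated "2016-05-10T09:30:00Z"^^xsd:dateTime ;
        cwork:about       <http://dbpedia.org/resource/Example_Entity> ;
        cwork:mentions    <http://dbpedia.org/resource/Another_Entity> .
  }
}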
For producing the triples that will be deleted, we first compute the number of triples to be deleted based on the version deletion ratio. Then, we take as input the triples that were produced randomly in the previous versions, and we choose creative works in a random manner until we reach the targeted number of triples to delete. The reason we only choose creative works that were previously produced randomly is the same as in the case of additions (we do not want to "break" the clustering or correlations for the already existing entities). Recall that in the initial version of the dataset, the maximum number of triples that can be produced randomly equals 1/3 of the total number of triples. So, if the generator is configured to delete more than 1/3 of the triples for a version, that is, if the triples to be deleted exceed the triples that were randomly produced, we impose a threshold of 33% on $\delta^{-}_{i,i+1}$.

As mentioned earlier, besides creating the versioned dataset of creative works, SPVB's data generator supports the versioning of the reference dataset of DBpedia employed by LDBC-SPB to annotate the creative works. In particular, we maintain 5 different versions of DBpedia, from year 2012 to year 2016 (one for each year). Such versions contain the subgraph of each entity used by the data generator of LDBC-SPB to annotate creative works through the about and mentions properties. In practice, all DBpedia triples in which the entity URI is in the subject position are maintained. However, LDBC-SPB uses about 1 million entities for the annotation of creative works. Obviously, it is not possible to burden the generated creative works with such a volume of data. So, from those 1 million entities we keep the 1000 most popular ones, based on the score provided by LDBC-SPB, and we extract their RDF graphs from the 5 different versions of DBpedia. By doing so, we end up with 5 DBpedia subgraphs (one for each version) containing 40K, 35K, 65K, 60K and 71K triples respectively. These versions enhance the generated creative works and are evenly distributed over the total number of versions that the data generator is configured to produce. E.g., assuming that 10 versions are produced by the data generator, the triples of the 5 versions of DBpedia will be added to versions 0, 2, 5, 7 and 9 respectively. In case the data generator is configured to produce fewer than 5 versions (say N), we only keep the first N DBpedia versions.

After the generation of both creative works and DBpedia data has finished, they are loaded into a Virtuoso triplestore (http://vos.openlinksw.com). This way, we can later evaluate the produced SPARQL queries and compute the expected results that are required by the Evaluation Module (Section 4.3) to assess the correctness of the results reported by the benchmarked system.

4.2 Tasks Generation

As shown in Figure 2, the generation of the SPARQL queries that have to be executed by the systems under test is a process that also takes place in the Data Generator component. Given that there is neither a standard language nor an official SPARQL extension for querying versioned RDF data, the generated queries of SPVB assume that each version is stored in its own named graph. Each benchmarked system should express these queries in the query language that it supports in order to be able to execute them.

The queries produced by the Data Generator are based on a set of query templates. In particular, for each one of the eight versioning query types (Section 2.2), we have defined one or more query templates. We show an example in Listing 1.1 that retrieves, from a past version, creative works that are about different topics, along with the topic's type. The full list of the query templates can be found online (https://hobbitdata.informatik.uni-leipzig.de/MOCHA_ESWC2018/Task3/query_templates/).
Such query templates contain placeholders of the form {{{placeholder}}}, which may refer either to the queried version ({{{historicalVersion}}}) or to an IRI from the reference dataset of DBpedia ({{{cwAboutUri}}}). The placeholders are replaced with concrete values in order to produce a set of similar queries.

1 SELECT DISTINCT ?creativeWork ?v1
2 FROM {{{historicalVersion}}}
3 WHERE {
4   ?creativeWork cwork:about {{{cwAboutUri}}} .
5   {{{cwAboutUri}}} rdf:type ?v1 .
6 }

Listing 1.1: Historical single-version structured query template

For the query types that refer to structured queries on one or more versions (QT2, QT4 and QT8), we use 6 of the 25 query templates proposed by the DBpedia SPARQL Benchmark (DBPSB) [5], which are based on real DBpedia queries. The reason we chose these 6 templates is that, when evaluated on top of the 5 versions of the reference dataset of DBpedia with their corresponding placeholders replaced by variables, they always return results. At the same time, they cover most SPARQL features (FILTER, OPTIONAL, UNION, etc.). Note that the DBPSB query templates were generated to be executed on top of DBpedia data only, whereas SPVB generates data that combine DBpedia and creative works. So, in order for the 6 DBPSB query templates to be applicable to the data generated by SPVB, we added to them an extra triple pattern that "connects" the creative works with DBpedia through the about or mentions properties, as shown in line 4 of Listing 1.1.

As mentioned earlier, the placeholders may refer either to the queried version or to an IRI from the reference dataset of DBpedia. The ones that refer to the queried version are replaced in such a way that a wide range of the available versions is covered. For example, assume that we have the query template shown in Listing 1.1 and the generator is configured to produce n versions in total. The {{{historicalVersion}}} placeholder will be replaced with the graph names denoting i) the initial version, ii) an intermediate version and iii) version n−1. The placeholders that refer to an IRI are the same placeholders used in the DBPSB query templates. To replace them with concrete values we use a technique similar to the one used by DBPSB: we run each of the 6 DBPSB query templates offline on top of the 5 different versions of DBpedia, having replaced their placeholder with a variable, and keep at most 1000 possible concrete values with which each placeholder may be replaced. Then, according to the queried version, we can randomly pick one of those values. This technique guarantees that the produced query will always return results.

After replacing all the placeholders of each query template, a set of similar queries is generated (a complete set of such queries for a specific configuration can be found online at https://hobbitdata.informatik.uni-leipzig.de/MOCHA_ESWC2018/Task3/queries). As shown in Figure 2, these queries are evaluated on top of Virtuoso (http://vos.openlinksw.com), into which the already generated versions have been loaded, in order to compute the expected results. After the expected results have been computed, the Data Generator sends the queries along with the expected results to the Task Provider component, and the generated data to the system under test.
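As an illustration of the instantiation step described above, the query below shows how the template of Listing 1.1 might look once its placeholders have been replaced. The version graph IRI and the DBpedia entity are hypothetical values of the kind the generator would pick, and the prefix declarations are added here for self-containment.

# Hypothetical instantiation of the template in Listing 1.1: the
# {{{historicalVersion}}} placeholder has been replaced by an (illustrative)
# version graph IRI and {{{cwAboutUri}}} by a DBpedia entity IRI.
PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?creativeWork ?v1
FROM <http://example.org/versions/2>
WHERE {
  ?creativeWork cwork:about <http://dbpedia.org/resource/Example_Entity> .
  <http://dbpedia.org/resource/Example_Entity> rdf:type ?v1 .
}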
The job of the Task Provider is to sequentially send the SPARQL queries to the system under test and the expected results to the Evaluation Storage component. The system under test then evaluates the queries on top of the data (after the appropriate rewritings, if necessary) and reports the results to the Evaluation Storage.

4.3 Evaluation Module

The final component of the benchmark is the Evaluation Module. The Evaluation Module receives from the Evaluation Storage component the results that were sent by the system under test, as well as the expected results sent by the Task Provider, and evaluates the performance of the system under test. To do so, it calculates the following Key Performance Indicators (KPIs), covering a broad spectrum of aspects that one needs for assessing the performance of a versioning system:

– Query failures: The number of queries that failed to execute. Failure means that the system under test returns a result set $RS_{sys}$ that is not equal to the expected one, $RS_{exp}$. The two result sets are considered equal when i) $RS_{sys}$ has the same size as $RS_{exp}$, and ii) every row in $RS_{sys}$ has one matching row in $RS_{exp}$, and vice versa (a row is only matched once). If the size of the result set is larger than 50,000 rows, only condition i) is checked, for performance reasons.
– Initial version ingestion speed (triples/second): the number of triples of the initial version that can be loaded per second. We distinguish this from the ingestion speed of the other versions because loading the initial version greatly differs from loading the following ones, where underlying processes such as computing deltas, reconstructing versions, or storing duplicate information between versions may take place.
– Applied changes speed (changes/second): tries to quantify the overhead of the underlying processes that take place when a set of changes is applied to a previous version. To do so, this KPI measures the average number of changes that could be stored by the benchmarked system per second after the loading of all new versions.
– Storage space cost (MB): measures the total storage space required to store all versions, in MB.
– Average Query Execution Time (ms): the average execution time, in milliseconds, for each one of the eight versioning query types described in Section 2.2.
– Throughput (queries/second): the execution rate per second for all queries.

5 Experiments

Although we aimed to test the implementation of SPVB on top of the HOBBIT platform with the different versioning systems described in [17], we were able to conduct experiments only with R43ples (Revision for triples) [8], which uses Jena TDB as its underlying storage/querying layer. For the rest, since most of them are academic prototypes, we encountered various difficulties related to installation and use (e.g., no documentation on how to use a given API, no way to contact the developers for further instructions, no access to the versioning extensions, or no option for a local deployment that could be tested). In order to have a baseline system, we implemented the full materialization versioning strategy on top of the Virtuoso triplestore (https://virtuoso.openlinksw.com/), which does not itself handle versioned data, by storing each version in its own named graph.
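For such a baseline, the deltas requested by QT5 queries are not stored explicitly and have to be computed at query time. A minimal sketch of how the added part of a delta can be retrieved with plain SPARQL over the per-version named graphs is shown below; the graph IRIs are illustrative and not the ones produced by the benchmark.

# Sketch of a delta materialization (QT5) over the full-materialization
# baseline: triples present in version 2 but not in version 1, i.e., the
# added part of the delta (swapping the two graphs yields the deleted part).
SELECT ?s ?p ?o
WHERE {
  GRAPH <http://example.org/versions/2> { ?s ?p ?o }
  FILTER NOT EXISTS {
    GRAPH <http://example.org/versions/1> { ?s ?p ?o }
  }
}

Computing deltas in this way at query time is also what makes delta materialization noticeably more expensive than version materialization for this baseline, as discussed below.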
The online instance of the HOBBIT platform is deployed on a server cluster. Each of the Data Generator, Task Provider, Evaluation Module and System components is created and runs on its own node, which has 32 threads in total (2 sockets, each with 8 hyper-threaded cores) and 256GB of RAM. For our experiments, we produced 3 datasets of different initial sizes corresponding to 100K, 500K and 1M triples. For each dataset, we produced 5 different versions following a version insertion and deletion ratio of 15% and 10% respectively. As a result, for the initial datasets of 100K, 500K and 1M triples we produced final versions of 141K, 627K and 1.235M triples respectively. The number of triples in the final versions includes the versioned data of the reference dataset of DBpedia as well. For fairness, we ran three experiments per dataset size and computed the average values for all reported results. The experiment timeout was set to 1 hour for both the R43ples and Virtuoso systems. This was necessary in order to make good use of the HOBBIT platform, since the submitted experiments of more than twenty benchmarks run sequentially.

Regarding the full materialization strategy that we implemented on top of Virtuoso, for the datasets of 100K, 500K and 1M triples the queries were executed at a rate of 1.4, 0.3 and 0.16 queries per second respectively, and all of them returned the expected results. For the rest of the KPIs we report the results graphically.

Fig. 3: Ingestion speeds
Fig. 4: Storage space overhead

In the left bar chart of Figure 3 we can see the initial version ingestion speed for all datasets. For the ingestion of new triples we used the bulk loading process offered by Virtuoso, with 12 RDF loaders, so as to parallelize the data load and hence maximize loading speed. As we can see, the speed ranges from 24K to 140K triples per second and increases with the dataset size, and consequently with the size of its initial version. This is an expected result, as Virtuoso bulk loads files containing many more triples as the dataset size increases. The same holds for the applied changes speed, shown on the right side of the same figure, which increases from 14K to 53K changes per second. We can observe that the initial version ingestion speed outperforms the applied changes speed. This is an overhead of the chosen versioning strategy, i.e., full materialization: the unchanged information between versions is duplicated whenever a new version arrives, so the time required for applying the changes of a new version is significantly increased, as it includes the loading of data from previous versions.

In Figure 4 we can see the storage space required for storing the data of all different datasets. As expected, the space requirements increase with the total number of triples, from 80MB to 800MB. This significant overhead in storage space is due to the archiving strategy used (i.e., full materialization).

In Figures 5, 6, 7, 8 and 9 we present the average execution time (in ms) for the queries of each versioning query type, and for each dataset size.

Fig. 5: Execution times for materialization queries

In Figure 5 we can see the time required for materializing i) the modern (current) version, ii) a historical (past) one, and iii) the difference between two versions (delta). The left and middle bars present the times required for materializing the modern and a historical version, respectively.
As expected, the execution time increases with the dataset size, and the time required for materializing a historical version is considerably smaller than that of the modern one, as it contains fewer triples. In both cases, we observe that execution times are quite small, as all the versions are already materialized in the triplestore. On the right side of the same figure we can see the time required for materializing a delta. Since deltas have to be computed on the fly when the queries are evaluated, we see a significant overhead in the time required for evaluation.

In Figures 6, 7, 8 and 9 we can see the execution times for all types of structured queries. In all cases, similarly to materialization queries, the execution time increases as the number of triples increases. Although one might expect the delta-based queries to be slower than the version-based ones, as deltas have to be computed on the fly, this does not seem to be the case. This happens because the version-based queries (which are based on DBPSB query templates) are much harder to evaluate than the delta-based ones.

Fig. 6: Execution times for single-version structured queries
Fig. 7: Execution times for cross-version structured queries
Fig. 8: Execution times for single-delta structured queries
Fig. 9: Execution times for cross-delta structured queries

Regarding the R43ples system, we only managed to run experiments for the first dataset of 100K triples, since for the remaining ones the experiment time exceeded the timeout of 1 hour. In most cases the response times are order(s) of magnitude slower, so we do not report and compare the results graphically against the corresponding ones of Virtuoso. First, in Table 1 we can see the initial version ingestion speed and applied changes speed. The changes are applied at a slower rate than the triples of the initial version are loaded, as the version that is materialized is always the current one, so for every new delta the current version has to be computed. Compared to Virtuoso, R43ples is 1 order of magnitude slower even in the case of applied changes speed, where we would expect it to be faster. The storage space overhead, shown in the same table, is double the storage space of Virtuoso. One might expect R43ples to outperform Virtuoso, as in Virtuoso we implemented the full materialization strategy, but this is not the case, since the underlying storage strategies of Virtuoso and Jena TDB seem to be very different.

Next, regarding the execution of queries, at a glance we can see that queries are executed at a much lower rate and that many of them failed to return the correct results. Such failures are possibly due to the fact that R43ples failed to correctly load all the versioned data; e.g., in the final version R43ples maintained a total of 140,447 triples instead of 141,783. Concerning query execution times, for materializing the current version (QT1) or executing a structured query on it (QT2), R43ples required time similar to Virtuoso (although in some cases it returned slightly fewer results). This is something we expected, as the current version is kept materialized, just like in Virtuoso. This does not hold for the rest of the query types, where R43ples is 1 to 3 orders of magnitude slower than Virtuoso. This is also an expected result, as R43ples needs to reconstruct the queried version on the fly.
Metric                               Result
V0 ingestion speed (triples/sec)     3502.39
Changes speed (changes/sec)          2767.56
Storage cost (MB)                    197378
Throughput (queries/second)          0.09
Queries failed                       25

Query type   Avg. execution time (ms)   Succeeded queries
QT1          13887.33                   0/1
QT2          146.28                     25/30
QT3          18265.78                   0/3
QT4          11681.49                   13/18
QT5          31294.00                   0/4
QT6          12299.58                   4/4
QT7          35294.33                   2/3
QT8          19177.33                   30/36

Table 1: Results for the R43ples system

6 Conclusions and Future Work

In this paper we described the state-of-the-art approaches for managing and benchmarking evolving RDF data, presented in detail SPVB, a novel benchmark for versioning systems, and reported a set of experimental results. We plan to keep SPVB an evolving benchmark, so that we can attract more versioning systems to use it, as well as to assess the performance of more versioning systems ourselves. To achieve this, we plan to add some extra functionalities and improve existing ones. In particular, we want to move the computation of the expected results out of the Data Generator component; by doing so, SPVB will be able to generate data in parallel through multiple Data Generators. Also, we want to make the query workload more configurable, by giving the benchmarked system the ability to include or exclude specific query types. Moreover, we want to optimize the evaluation of responses that takes place in the Evaluation Module component, as for tens of thousands of results this evaluation may become very costly. Finally, we plan to add functionalities that the second version of the HOBBIT platform offers, such as graphical visualization of the KPIs.

Acknowledgments

This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).

References

1. T. Käfer, A. Abdelrahman, et al. Observing linked data dynamics. In ESWC, 2013.
2. J. Umbrich, S. Decker, et al. Towards dataset dynamics: Change frequency of Linked Open Data sources. In LDOW, 2010.
3. F. Zablith, G. Antoniou, et al. Ontology evolution: a process-centric survey. Knowledge Eng. Review, 30(1):45-75, 2015.
4. C. Bizer, T. Heath, et al. Linked data - the story so far. Semantic Services, Interoperability and Web Applications: Emerging Concepts, pages 205-227, 2009.
5. M. Morsey, J. Lehmann, et al. DBpedia SPARQL benchmark - performance assessment with real queries on real data. In ISWC, pages 454-469. Springer, 2011.
6. M. Völkel and T. Groza. SemVersion: An RDF-based ontology versioning system. In IADIS, volume 2006, page 44, 2006.
7. S. Cassidy and J. Ballantine. Version control for RDF triple stores. ICSOFT, 2007.
8. M. Graube, S. Hensel, et al. R43ples: Revisions for triples. LDQ, 2014.
9. M. Vander Sande, P. Colpaert, et al. R&Wbase: git for triples. In LDOW, 2013.
10. D. H. Im, S. W. Lee, et al. A version management framework for RDF triple stores. IJSEKE, 22(01):85-106, 2012.
11. H. Kondylakis and D. Plexousakis. Ontology evolution without tears. Journal of Web Semantics, 19, 2013.
12. T. Neumann and G. Weikum. x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. VLDB Endowment, 3(1-2):256-263, 2010.
13. K. Stefanidis, I. Chrysakis, et al. On designing archiving policies for evolving RDF datasets on the Web. In ER, pages 43-56. Springer, 2014.
14. J. D. Fernández, et al. BEAR: Benchmarking the Efficiency of RDF Archiving. Technical report, 2015.
15. M. Meimaris, G. Papastefanatos, et al. A query language for multi-version data web archives. Expert Systems, 33(4):383-404, 2016.
16. M. Meimaris and G. Papastefanatos. The EvoGen Benchmark Suite for Evolving RDF Data. MeDAW, 2016.
17. V. Papakonstantinou, G. Flouris, et al. Versioning for linked data: Archiving systems and benchmarks. In BLINK@ISWC, 2016.
18. J. D. Fernández, J. Umbrich, et al. Evaluating query and storage strategies for RDF archives. In Proceedings of the 12th International Conference on Semantic Systems, pages 41-48. ACM, 2016.
19. Y. Guo, Z. Pan, et al. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the WWW, 3(2):158-182, 2005.
20. V. Papakonstantinou, G. Flouris, et al. SPBv: Benchmarking linked data archiving systems. 2017.
21. V. Kotsev, N. Minadakis, et al. Benchmarking RDF Query Engines: The LDBC Semantic Publishing Benchmark. In BLINK, 2016.