=Paper=
{{Paper
|id=Vol-1585/mepdaw2016_paper_03
|storemode=property
|title=The EvoGen Benchmark Suite for Evolving RDF Data
|pdfUrl=https://ceur-ws.org/Vol-1585/mepdaw2016_paper_03.pdf
|volume=Vol-1585
|authors=Marios Meimaris,George Papastefanatos
|dblpUrl=https://dblp.org/rec/conf/esws/MeimarisP16
}}
==The EvoGen Benchmark Suite for Evolving RDF Data==
Marios Meimaris¹,² and George Papastefanatos²
¹ University of Thessaly, Greece
² ATHENA Research Center, Greece
m.meimaris@imis.athena-innovation.gr, gpapas@imis.athena-innovation.gr

Abstract. Artificial and synthetic data are widely used for benchmarking and evaluating database, storage and query engines. This is usually performed in static contexts with no evolution in the data. In the context of evolution management, the community lacks systems and tools for benchmarking versioning and change detection approaches. In this paper, we address the generation of synthetic, evolving data represented in the RDF model, and we discuss the requirements and parameters that drive this process. Furthermore, we discuss query workloads in the context of evolution. To this end, we present EvoGen, a generator for evolving RDF data that offers functionality for instance- and schema-based evolution, fine-grained change representation between versions, as well as custom workload generation.

1 Introduction

The Resource Description Framework (RDF, http://www.w3.org/RDF/) is a W3C recommendation for representing and publishing datasets in the form of Linked Open Data, a core technology of the Data Web. The highly distributed and dynamic nature of the Data Web gives rise to constantly evolving datasets curated and managed under no centralized control, with changes found at both the schema and the instance levels. In this context, evolution management - whether handled by each individual source during the publishing process or by third-party aggregators during the harvesting and archiving processes - becomes increasingly important. The significance of systems, frameworks and techniques for evolution management and archiving in the Data Web has been repeatedly pointed out in the literature as a means of addressing quality issues such as provenance tracking, timeline querying, change detection, change analysis, and so on [6,1,21].

Following the proliferation of RDF stores and SPARQL engines, there is a variety of benchmarking efforts, such as the Lehigh University Benchmark [9], the Berlin SPARQL Benchmark (BSBM) Specification [3] and the DBpedia SPARQL Benchmark [18]. Most of them provide real or synthetic datasets of varying size, query workloads and metrics for assessing the performance and functionality of research prototypes or commercial products. Although they offer various parameters for configuring the characteristics of the synthetic data, such as its size and schema complexity, or the type of the generated query workload, their primary goal is to assess the storage efficiency and query performance of RDF systems that operate in a static context.

These issues, however, have not been thoroughly addressed in versioning and evolving contexts, where performance and storage efficiency are greatly affected by evolution-specific parameters. Evolution in RDF data stems from low-level changes (or deltas) in the datasets, i.e., additions and deletions of triples at different time points. These deltas are too semantically poor to capture the semantics of dataset evolution, and other parameters come into play when benchmarking evolution management systems, such as schema vs. instance evolution, change complexity, change frequency, data freshness and so on.
Hence, any experimentation on versioning and archiving systems must rely on evolution-aware data generators that produce arbitrarily large and complex synthetic data, incorporating configurable evolution-specific parameters in the data generation and the query workload production. Two main aspects must be considered towards this goal. First, benchmarking systems must be able to generate synthetic datasets of varying size, schema complexity and change granularity, in order to approximate different cases of evolution. Second, benchmarking systems must be configurable in generating representative query workloads with temporal and evolution characteristics.

In this paper, we present EvoGen, a synthetic benchmark suite for evolving RDF that offers synthetic data and workload generation capabilities. EvoGen is based on the widely adopted Lehigh University Benchmark. A preliminary version of EvoGen [14] addressed the generation of successive versions with a configurable shift (i.e., change in size between versions) parameter, without affecting the overall schema of the generated data. We extend the implementation of EvoGen to include configurable schema evolution, change logging and representation between versions, as well as query workload generation functionality. To this end, we build on LUBM's existing benchmark queries, and we provide new ones that address the types of queries commonly performed in evolving settings [17], such as temporal querying, queries on changes, longitudinal queries across versions, etc. EvoGen primarily enables the benchmarking of versioning and archiving RDF systems and of change detection and management tools; the provided synthetic workload can also be used for assessing the temporal functionality of traditional RDF engines.

Contributions. The contributions of this paper are summarized as follows:
– we present the requirements and characteristics for generating synthetic versioned RDF,
– we extend the LUBM ontology with 10 new classes and 19 new properties,
– we extend EvoGen with configurable schema evolution based on our extended LUBM ontology,
– we implement a change logging mechanism within EvoGen that produces logs of the changes between consecutive versions, following the representational schema of the change ontology described in [22],
– we provide an implementation for adaptive query workload generation, based on the evolutionary aspects of the data generation process.

This paper is outlined as follows. Section 2 provides an overview of related work. Section 3 discusses requirements for the benchmark, and Section 4 discusses the parameters of the benchmark in the context of the EvoGen system. Section 5 describes the system's implementation, and Section 6 concludes the paper.

2 Related Work

There exists a rich body of literature on RDF and SPARQL benchmarking. Existing works focus on several dimensions, such as datasets, workloads, and use cases [7]. For the dataset and workload dimensions, the requirements for static RDF benchmarks are concerned with providing datasets that are able to represent real-world scenarios and workloads that simulate real-world use cases. For instance, the Berlin SPARQL Benchmark [3] defines two use cases that address different usage scenarios of the data, namely the Explore use case, which aims at approximating navigational behaviour from customers, and the Business Intelligence use case, which simulates analytical types of queries.
Other requirements that authors of benchmarks often cite include quality and quantity metrics for the generated data, such as distinct counts of resources, properties, classes etc., as well as maintaining the selectivity of query patterns between synthetic datasets generated with different tuning parameters (e.g., in [9] and [25]). While most of the existing approaches focus on static benchmarks, the core ideas and motives remain the same when applied to versioned data.

Generally, two types of datasets are considered in benchmarking scenarios: synthetic data, which are artificially generated, and real-world data, which are taken from existing sources. In this work, we extend the Lehigh University Benchmark (LUBM) [9], a widely adopted benchmark for RDF and OWL datasets. LUBM includes an implementation for generating synthetic data in OWL and DAML formats. Its broader scope includes benchmarking reasoning systems, as well as RDF storage and SPARQL querying engines [28,11,12,2,4,10,20]. It provides an ontology expressed in OWL, in which various relationships between classes exist so that reasoners can perform inferencing. Furthermore, LUBM comes with 14 SPARQL queries with varying sizes of query patterns, ranging from 1 to 6 triple patterns. Because these can be limiting when stress testing SPARQL engines, they have been extended in the literature in order to provide more complex patterns (e.g., in [11]). SP²Bench [25] is a generator for RDF data, with the purpose of evaluating SPARQL querying engines. Its scope is mostly query efficiency rather than inferencing, and it has been widely adopted in the literature [26,10,13]. Other approaches in the context of RDF and SPARQL benchmarking, such as FedBench [24] and the Berlin SPARQL Benchmark (BSBM) [3], provide fixed data rather than custom data generation; hence they are not readily capable of providing a benchmark for evolving and otherwise versioned datasets. Nevertheless, the Berlin SPARQL Benchmark does define an update-driven use case, which works on top of the Explore use case. This use case, however, deals with triple additions and deletions that do not affect the schema of the data.

Voigt et al. [29] present real-world datasets associated with query workloads as a means to benchmark RDF engines realistically. Specifically, they draw data from the New York Times linked data API (http://data.nytimes.com/), which includes data about articles, people, and organizations; Jamendo (http://dbtune.org/jamendo/), which is an RDF dump of Creative Commons licensed music; Movie DB (https://datahub.io/dataset/linkedmdb), which is a dataset of movies drawn from Wikipedia, Geonames and Freebase; and YAGO2 (http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/), a linked knowledge base with data from Wikipedia, Geonames and WordNet. The authors also provide 15 queries for each dataset, and define 5 broad metrics concerning loading time, memory requirements upon loading, performance per query type, success rate for queries, and multi-client support.

The reader is referred to [5] for an extensive study and comparison of RDF benchmarks. Finally, Fernández et al. [6] discuss a series of metrics for benchmarking archiving systems in Linked Data contexts.

Our approach aims at providing a highly customizable benchmarking suite for creating synthetic and evolving data, with instance-level and schema-level evolution and adaptive query workload generation. For this purpose, we extend LUBM and build on top of EvoGen, an existing synthetic RDF generator. The static component of LUBM is left as-is.
For the dynamic (i.e., evolving) data generation, we have implemented tunable functionality where the user can define the number of versions and the percentage of changes between datasets. Furthermore, we extended the original LUBM ontology with 10 new classes and 19 new properties, so that users are able to tune schema evolution as well.

3 Requirements

Benchmarking processes adhere to several functional and non-functional requirements for the generation of synthetic data and query workloads, usually determined by specific application use cases in each domain. According to [8], domain-specific benchmarks (in contrast to generic solutions) can provide fine-grained metrics and appropriate datasets for experimenting with and assessing the details of a system operating in the context of that domain. In the case of evolving data, there is a multitude of dimensions to address when tailoring the benchmark to custom needs.

3.1 Configurability of the data and change generation process

The benchmark should provide a viable degree of configurability through tunable parameters, regarding the data generation process, the context of the application that will be tested, and the adaptability of the query workload to the specificities of the generated evolving data. What differentiates benchmarks for evolving settings from benchmarks for static settings is that the temporal dimension and the archiving strategy can randomize the data generation process, affecting not only the size but also the freshness and change frequency of the data (e.g., a large number of versions is produced with few changes between them), the type and granularity of changes produced between versions, and the schema of the generated data. For example, the benchmark must be configurable to the different strategies employed by the evaluated RDF archiving system: a full materialization strategy requires the generation of all versions of a dataset; a delta-based strategy requires the generation of one data version (either the first or the most current) and all changes between versions; whereas a hybrid strategy combines these two approaches, requiring a mixed generation of data and changes. For a discussion of different archiving strategies for RDF, the reader is referred to [6,27]. This also implies that a dynamic and adaptive workload is required in order to be consistent with the schema and change information of each generated dataset version.
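To make the strategy-dependent output concrete, the following minimal Python sketch shows what a generator would have to serialize under each of the three strategies just described. The enum, file names, and the snapshot-every-third-version hybrid policy are illustrative assumptions, not EvoGen's actual output layout.

```python
from enum import Enum

class ArchivingStrategy(Enum):
    FULL_MATERIALIZATION = "full"   # serialize every version in full
    DELTA_BASED = "delta"           # serialize one version plus all change sets
    HYBRID = "hybrid"               # mix of full snapshots and deltas

def outputs_for(strategy: ArchivingStrategy, n_versions: int) -> list[str]:
    """List the artifacts a generator must produce under each strategy."""
    if strategy is ArchivingStrategy.FULL_MATERIALIZATION:
        return [f"version_{k}.nt" for k in range(n_versions)]
    if strategy is ArchivingStrategy.DELTA_BASED:
        return ["version_0.nt"] + [
            f"delta_{k}_{k + 1}.nt" for k in range(n_versions - 1)
        ]
    # Hybrid (assumed policy): a full snapshot every third version,
    # deltas in between.
    return [
        f"version_{k}.nt" if k % 3 == 0 else f"delta_{k - 1}_{k}.nt"
        for k in range(n_versions)
    ]
```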
3.2 Extensibility with evolution-based parameters

As evolving data are by definition dynamic in nature, new requirements are bound to arise as application contexts expand. For this reason, the benchmark is not considered to be exhaustive. Instead, we consider extensibility to be a crucial requirement when designing the parameters of the data generation process and the query workload. For example, there are different approaches for embedding temporal and version information in RDF, based on the choices made by the model designer or the capabilities of the RDF store employed: an RDF reification approach uses an extra triple for annotating a resource, whereas a named graph approach uses quadruples to model time and group together resources with the same time or version information. The benchmark datasets and the workloads generated must be easily extensible to accommodate both approaches, or extended to other temporal model alternatives imposed by the application domain.

3.3 Evolution-aware workload generation

For the aforementioned reasons, the workload of the benchmark must be generated adaptively with respect to the required parameters and the generated data. Many traditional benchmarking techniques with data generation functionality rely on standardized or otherwise fixed query workloads operating on top of the fixed-schema generated data. For instance, the LUBM benchmark that provides the foundations of EvoGen offers a set of 14 predefined queries that address a variety of interesting query patterns with varying complexities. We argue that in evolving and versioning contexts, fixed queries can only represent static contexts, and it is thus crucial to be able to extend the workload and provide adaptive workloads that reflect the generation process, which is in turn tailored to the user's custom needs. For example, the number of versions and their variations, as well as the complexity of changes between the versions, lead to significantly different outcomes that can impact the same set of benchmarking tests in varying and possibly unpredictable ways.

4 EvoGen Characteristics

4.1 Synthetic data description

EvoGen is based on the prototype implementation presented in [14], which served as a first attempt at a synthetic RDF data generator for evolving contexts. It is based on the widely used LUBM generator, which uses an ontology of concepts drawn from the world of academia. Specifically, LUBM creates a configurable number of university entities, which are split into departments. Furthermore, LUBM generates entities that describe university staff and students, research groups, and publications. Most of these classes are provided in different types of specializations, as defined in the LUBM schema ontology. For example, the generator creates varying numbers of lecturers, full professors, associate professors and assistant professors, as well as undergraduate and postgraduate students. The created entities are interrelated via direct properties (e.g., a professor can be an advisor of a student) or indirect properties (e.g., professors and students can be co-authors of publications), and their cardinalities adhere to relative ranges that are hard-coded in the generator. LUBM relies heavily on randomization over these types of associations; however, it is guaranteed that the schema will be populated relatively evenly across different runs.

While in [14] we did not extend the original schema, in this work we provide 10 new classes and 19 new properties. The new classes are both specializations (subclasses) of existing ones (e.g., visiting professor, conference publication) and novel concepts in the ontology (e.g., research project, scientific event). Through this extension we are able to implement schema evolution, which was not supported in the original version, and at the same time keep the original LUBM schema intact to allow backwards compatibility with existing approaches.

Finally, the current version of EvoGen follows the DIACHRON model [16], a named graph approach, for annotating datasets with temporal information. According to this, time is represented at the granularity of the dataset; all dataset resources refer to the same time point, and separate named graphs are used for grouping resources into dataset versions. EvoGen, however, can be easily modified to accommodate other temporal modellings for data generation. Also, time is represented in an ordinal manner, in which the time validity of the dataset versions is denoted by natural numbers. Again, other temporal representations, such as absolute time or time intervals, can be used.
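As an illustration of this dataset-level, named-graph-based annotation, the short Python sketch below groups the triples of each version under a version graph whose URI encodes the ordinal time point. All URIs and the quad structure are illustrative assumptions rather than EvoGen's exact serialization.

```python
EX = "http://example.org/"

def version_graph(version: int) -> str:
    """Named graph URI grouping all resources of one dataset version;
    ordinal time is encoded as a natural number."""
    return f"{EX}dataset/version/{version}"

# Quads: (subject, predicate, object, named_graph). The same diachronic
# resource appears in two instantiations, grouped by version graph.
quads = [
    (f"{EX}AssociateProfessor13", f"{EX}worksAt", f"{EX}University609",
     version_graph(1)),
    (f"{EX}AssociateProfessor13", f"{EX}worksAt", f"{EX}University610",
     version_graph(2)),
]
```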
4.2 Change Generation

We design and implement a component for semantic change generation, which relies on the change representation scheme presented in [22]. Changes are represented as entities of the Change Ontology, which is able to capture both high-level changes, such as adding a superclass, and low-level changes, such as triple insertions and deletions. The Change Ontology has been adopted by the community and used for change detection and change representation in [22], and in [23] for designing and representing multi-level changes. It is also tightly integrated with the temporal query language DIACHRON QL [17]. EvoGen optionally creates a change set between two consecutive versions that includes all changes between the versions, both on the instance and on the schema level. Listing 1 shows example change entities as recorded in the change log.

```
ex:change1  rdf:type   co:Add_Type_Class ;
            co:atc_p1  lubm:VisitingProfessor .
ex:change2  rdf:type   co:Add_Super_Class ;
            co:asc_p1  lubm:VisitingProfessor ;
            co:asc_p2  lubm:Professor .
ex:change3  rdf:type   co:Add_Property_Instance ;
            co:api_p1  AssociateProfessor13 ;
            co:api_p2  lubm:doctoralDegreeFrom ;
            co:api_p3  University609 .
```
Listing 1: Example RDF in the change log

4.3 EvoGen Parameters

We follow the approach established in our previous work [14], where we drive the generation process through a set of abstract parameters that reflect the user's needs with respect to the type and amount of changes. Specifically, we reuse the notions of shift, monotonicity and strictness as high-level characteristics of the generation process, and we define an extra parameter for class-centric schema evolution. In what follows, we describe these notions.

Parameters regarding instance evolution. Following the definitions provided in [17,15], we treat evolution on the dataset level by default. In this context, a dataset D is diachronic when it provides a time-agnostic representation of its content. The instantiation of a diachronic dataset at a given time point t_i denotes the annotation of the dataset's contents with temporal information regarding this time point. Given this, let D be a diachronic dataset, and D_i ... D_{i+n} a set of dataset instantiations at time points t_i ... t_{i+n}. Then, the shift of dataset D between t_i and t_{i+n}, denoted as h(D)|_{t_i}^{t_{i+n}}, is defined as the ratio of change in the size of the instantiations D_i ... D_{i+n} of D:

h(D)|_{t_i}^{t_{i+n}} = \frac{|D_{i+n}| - |D_i|}{|D_i|}    (1)

The shift parameter shows how a dataset evolves with respect to its size, i.e., the number of resources contained in each version. Its directionality is captured by signed values, i.e., a positive shift points to the generation of versions with increasing size, whereas a negative shift points to versions with decreasing size. It essentially captures the relative difference of additions and deletions between two fixed time points, and as a parameter it allows for generating increasingly larger or decreasingly smaller versions through the generation process. In the current version of EvoGen, given an input shift h(D)|_{t_i}^{t_{i+n}}, the changes are distributed evenly among all versions D_i ... D_{i+n}. This is a limitation of the current version of EvoGen, which will be lifted in the future.
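A minimal sketch of this even distribution, assuming version size is measured in triples and changes are interpolated linearly between the first and last instantiation (the function name and rounding policy are assumptions, not EvoGen's API):

```python
def version_sizes(initial_size: int, shift: float, n_versions: int) -> list[int]:
    """Distribute a total shift h(D), per equation (1), evenly across
    n_versions consecutive dataset instantiations."""
    if n_versions < 2:
        return [initial_size]
    total_delta = initial_size * shift      # |D_{i+n}| - |D_i|
    step = total_delta / (n_versions - 1)   # even per-version change
    return [round(initial_size + k * step) for k in range(n_versions)]

# A +20% shift over 5 versions, starting from 100,000 triples:
print(version_sizes(100_000, 0.20, 5))
# [100000, 105000, 110000, 115000, 120000]
# -> shift = (120000 - 100000) / 100000 = 0.2, per equation (1)
```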
The monotonicity of a dataset D determines whether a positive or negative shift changes D monotonically in a given time period [t_i, t_j]. A monotonic shift denotes that additions and deletions do not coexist within the same time period. Note that monotonicity is not necessarily an aspect of evolving datasets. However, it can be invoked by the user in order to simulate datasets that are strictly increasing or decreasing in size, such as sensory data and historical data. Therefore, the set of triples that occur in a series of consecutive versions of D between t_i and t_j will be strictly increasing for a monotonic positive shift, and strictly decreasing for a monotonic negative shift. In order to make the ratio of low-level increasing (i.e., triple insertions) to decreasing (i.e., triple deletions) changes quantifiable, we use the notion of the monotonicity rate, denoted as m(D)|_{t_i}^{t_{i+n}}, a parameter between 0 and 1:

m(D)|_{t_i}^{t_{i+n}} = \frac{|t_a|_i^{i+n}}{|t_a|_i^{i+n} + |t_d|_i^{i+n}}    (2)

where |t_a|_k^l and |t_d|_k^l are the numbers of added and deleted triples between time points t_k and t_l. Formally, we define a dataset D to be monotonically increasing when

h(D)|_{t_k}^{t_l} > 0 and m(D)|_{t_k}^{t_l} = 1,

or, more intuitively, when the shift is positive and there are no triple deletions between t_k and t_l. In a similar way, we define a dataset to be monotonically decreasing when

h(D)|_{t_k}^{t_l} < 0 and m(D)|_{t_k}^{t_l} = 0,

or, more intuitively, when the shift is negative and there are no triple additions between t_k and t_l.

Parameters regarding schema evolution. The ontology evolution parameter of a dataset represents the change on the ontology (i.e., schema) level, based on the change in the number of total classes in the schema. It can be used in conjunction with the schema variation parameter that will be defined in what follows. The ontology evolution parameter, denoted as e(D)|_{t_k}^{t_l}, is the ratio of new classes to the total number of classes in t_l:

e(D)|_{t_i}^{t_{i+n}} = \frac{|c_{i+n}| - |c_i|}{|c_i|}    (3)

where |c_i| is the total number of ontology classes at time t_i.

Next, we define the schema variation parameter, based on our former notion of strictness presented in [14]. The schema variation property, denoted as v(D)|_{t_k}^{t_l}, captures the different schema variations that a dataset D exhibits through time. Because of the schema looseness typically associated with RDF, we recall the notion of Characteristic Sets [19] as the basis for v(D). A characteristic set of a subject node s is essentially the collection of properties p that appear in triples with s as subject. Given an RDF dataset D and a subject s, the Characteristic Set S_c(s) of s is:

S_c(s) = \{p \mid \exists o : (s, p, o) \in D\}

and the set of all S_c for a dataset D at time t_i is:

S_c(D) = \{S_c(s) \mid \exists p, o : (s, p, o) \in D\}

The total number of combinations of the properties associated with a given class gives a maximum number of 2^n − 1 characteristic sets associated with that class. This is shown in the example of Figure 1, where two instances of the professor class correspond to two different characteristic sets; the first instance has the name, origin, worksAt and type properties, while the second instance does not have a worksAt property, but has a studiesAt property instead.

Fig. 1: Two resources of type Person with different characteristic sets. The characteristic sets are shown at the bottom.
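Computing characteristic sets is straightforward; the following sketch (a plain-Python assumption, not EvoGen code) derives S_c(s) for every subject and the set S_c(D) of distinct characteristic sets, reproducing the two-set outcome of the Figure 1 example:

```python
from collections import defaultdict

def characteristic_sets(triples):
    """Return S_c(s) = {p | exists o: (s, p, o) in D} per subject,
    and the set of distinct characteristic sets S_c(D)."""
    props = defaultdict(set)
    for s, p, _o in triples:
        props[s].add(p)
    per_subject = {s: frozenset(ps) for s, ps in props.items()}
    return per_subject, set(per_subject.values())

# The Figure 1 example: two Person instances with differing property sets.
triples = [
    ("person1", "type", "Professor"), ("person1", "name", "A"),
    ("person1", "origin", "GR"),      ("person1", "worksAt", "Univ1"),
    ("person2", "type", "Professor"), ("person2", "name", "B"),
    ("person2", "origin", "GR"),      ("person2", "studiesAt", "Univ2"),
]
_, sc_d = characteristic_sets(triples)
print(len(sc_d))  # 2 distinct characteristic sets
```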
Given this, we consider v(D)|_{t_k}^{t_l} to be a constant parameter between 0 and 1 that quantifies the percentage of different characteristic sets, with respect to the total number of possible characteristic sets, that the generator will produce in the evolution process. Therefore, over all classes |c| of a dataset D, the percentage of characteristic sets for a given time period is given by the following:

E(D)|_{t_k}^{t_l} = v(D)|_{t_k}^{t_l} \times \sum_{i=1}^{|c|} (2^i - 1)    (4)

We call E the schema evolution parameter. In essence, (4) quantifies the number and quality of schema changes in the dataset as time passes.

Parameters regarding query workload generation. EvoGen generates a query workload that is based on the six query types associated with evolving data defined in [17]. We briefly provide an overview of the query types and the generated workload in the following (a sketch of the query wrapping used by types 3 and 4 follows this list):

1. Retrieval of a diachronic dataset. This type of query is used to retrieve all information associated with a particular diachronic dataset, for all of its instantiations. It is a workload-heavy CONSTRUCT query that either retrieves already fully materialized versions, or has to reconstruct past versions based on the associated changes.
2. Retrieval of a specific version. This is a specialization of the previous type, focusing on a specific (past) version of a dataset. The generator has to be aware of the context of the process, and create a query that refers to an existing past version.
3. Snapshot queries on the data. For this type of query, we use the original 14 LUBM queries and wrap them with a named graph associated with a generated version.
4. Longitudinal (temporal) queries. These queries retrieve the timeline of particular subgraphs through a subset of past versions. For this reason, we use the 14 LUBM queries and wrap them with variables that take values from particular version ranges, and we order by ascending version in order to provide a valid timeline.
5. Queries on changes. This type of querying is associated with the high-level changes that are logged in the change set between two successive versions. We provide a set of simple change queries that provide the ability to benchmark implementations that extract and store changes between RDF dataset versions, represented in the change ontology model.
6. Mixed queries. These queries use sub-queries from a mixture of the rest of the query types, and provide a way to test implementations that store changes alongside the data and its past instantiations.
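A hedged sketch of this wrapping for types 3 and 4, assuming a named-graph URI per version; the pattern shown is from LUBM's first query, and the graph-URI scheme and function names are illustrative, not EvoGen's exact generated text:

```python
LUBM_Q1_PATTERN = (
    "?x rdf:type ub:GraduateStudent . "
    "?x ub:takesCourse "
    "<http://www.Department0.University0.edu/GraduateCourse0> ."
)

def snapshot_query(pattern: str, version_graph: str) -> str:
    """Type 3: evaluate a static LUBM pattern inside one version's graph."""
    return f"SELECT ?x WHERE {{ GRAPH <{version_graph}> {{ {pattern} }} }}"

def longitudinal_query(pattern: str, version_graphs: list[str]) -> str:
    """Type 4: evaluate the pattern over a range of versions, ordered by
    version graph to yield a valid timeline."""
    values = " ".join(f"<{g}>" for g in version_graphs)
    return (
        f"SELECT ?g ?x WHERE {{ VALUES ?g {{ {values} }} "
        f"GRAPH ?g {{ {pattern} }} }} ORDER BY ASC(?g)"
    )
```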
Parameters regarding the type of the archiving strategy. Finally, we provide some degree of configurability with respect to EvoGen's serialized output. More specifically, we allow the user to request fully materialized versions, or the full materialization of the first version followed by a series of deltas. This allows using the generated data in scenarios where the archiving process uses different archiving strategies, such as full materialization of datasets, delta-based storage, and hybrid storage, which is a combination of the two.

5 Implementation

We build on the original EvoGen implementation (source code available at https://github.com/mmeimaris/EvoGen), a prototype generator for evolving RDF data with configurable parameters. Specifically, we extended the configurability of EvoGen by implementing change logging, schema evolution, and query workload generation offered under different types of archiving policies, as discussed in Section 4. The system extends the Lehigh University Benchmark (LUBM) generator, a Java-based synthetic data generator. In this version, LUBM's schema is extended to include 10 new classes and 19 new properties, which served as a basis for implementing the schema evolution functionality. Specifically, we have implemented schema evolution on top of the original ontology, without affecting the original ontology's structure, for backwards compatibility.

The high-level architecture of EvoGen can be seen in Figure 2.

Fig. 2: High-level architecture of EvoGen.

The implemented functionality includes instance-level monotonic shifts, as well as schema-level evolution by class-centric generation of characteristic sets, as defined in Section 4. The parameters that can be provided as input by the user in EvoGen are as follows:

1. number of versions: an integer denoting the total number of consecutive versions. The number of versions needs to be larger than 1 for evolving data generation; otherwise the original LUBM generator is triggered.
2. shift: the value of the shift as defined in equation (1) of Section 4, i.e., h(D)|_{t_i}^{t_j} for a time range [t_i, t_j], representing the percentage of change in size (measured in triples) between versions D_i and D_j. Currently, EvoGen generates monotonically incremental and decremental shifts between consecutive versions, and distributes the changes between all pairs of consecutive versions.
3. monotonicity: a boolean denoting the existence of monotonicity in the shift, or lack thereof.
4. ontology evolution: this parameter denotes the change in ontology classes with respect to the original ontology of LUBM, as given by equation (3).
5. schema variation: this parameter quantifies the total number of permuted characteristic sets that will be created for each new class introduced in the schema, as defined by equation (4) in Section 4.

The Version Management component and the Change Creation component are the main components that translate the input parameters into actual instance/schema cardinalities and weights. They compute how many new instances have to be created, or how many existing instances have to be deleted, for each class of the LUBM ontology, without affecting the structure of the data and the distribution of instances per class. The functionality is exposed through a Java API that can be invoked by importing EvoGen's libraries into third-party projects.

The actual distribution of triple insertions and deletions is performed dynamically, in a process that takes into account session information on the evolution context of the generation. The process also involves several degrees of randomization with respect to URI and literal values, cardinalities of inter-class properties, selection of characteristic set permutations and so on. This component is responsible for all interactions with the Extended LUBM Generator component, which performs the actual serialization of dataset versions in the file system. In order to distribute the computed changes, we assign a weight to each class and derive concrete numbers for the instance cardinalities. This weighting is done in the Weight Assignment Module, which uses normalized weights in the range 0..1 for each class, based on studying LUBM's original data structure and the total instances per class for various input dataset sizes. By multiplying these weights with the desired shift value h(D)|_{t_i}^{t_j}, we end up with an approximation of the total number of instances per class.
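A minimal sketch of this weighting step, with entirely illustrative weights (the real values are derived from measurements of LUBM's output and are not given in the paper):

```python
# Normalized per-class weights in [0, 1]; placeholder values, summing to 1.
CLASS_WEIGHTS = {
    "UndergraduateStudent": 0.55,
    "GraduateStudent": 0.20,
    "Publication": 0.15,
    "AssistantProfessor": 0.10,
}

def instances_per_class(total_delta: int, weights: dict) -> dict:
    """Turn the overall instance delta implied by the shift into concrete
    per-class cardinalities via the normalized class weights."""
    return {cls: round(total_delta * w) for cls, w in weights.items()}

print(instances_per_class(10_000, CLASS_WEIGHTS))
# {'UndergraduateStudent': 5500, 'GraduateStudent': 2000,
#  'Publication': 1500, 'AssistantProfessor': 1000}
```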
The Change Materialization module is responsible for creating the change log file. It interacts with the Change Creation module sequentially, and creates an instance of the Change Ontology for each insertion and deletion of class instances.

The Version Management component keeps session information on each version during runtime: the schema of the dataset, the newly introduced classes and characteristic sets per version, the mapping of dataset versions to their respective files and folders in the file system, and so on. It is also responsible for generating different types of archives based on the user input; it can generate successive fully materialized datasets without any change sets produced, change-based archives that include an initial dataset with all successive deltas, or combinations of these approaches (hybrid storage).

6 Conclusions and Future Work

In this paper, we describe the latest version of EvoGen, a system for synthetic, evolving data generation with instance- and schema-level capabilities. Furthermore, EvoGen provides custom workload generation that creates queries based on the user's choice of query types and the context of the generated data. As existing RDF benchmarks do not address dynamic data, i.e., data that change over time, we aim to bridge this gap with EvoGen by providing a means to generate synthetic data with evolving entities and evolving structure. For these reasons, we have defined and discussed several requirements and characteristics that concern the data generation, and we have implemented several of these characteristics within EvoGen.

As future work, we intend to extend the requirements presented herein, address issues of scalability and efficiency, and provide a thorough experimental evaluation of the system by using it to benchmark existing RDF versioning solutions.

Acknowledgements. This work is supported by the EU-funded ICT project SlideWiki (agreement no 688095).

References

1. S. Auer, T. Dalamagas, H. Parkinson, F. Bancilhon, G. Flouris, D. Sacharidis, P. Buneman, D. Kotzinos, Y. Stavrakas, V. Christophides, et al. Diachronic linked data: towards long-term preservation of structured interrelated information. In Proceedings of the First International Workshop on Open Data, pages 31–39. ACM, 2012.
2. A. Bernstein, M. Stocker, and C. Kiefer. SPARQL query optimization using selectivity estimation. In Poster Proceedings of the 6th International Semantic Web Conference (ISWC), 2007.
3. C. Bizer and A. Schultz. The Berlin SPARQL benchmark, 2009.
4. M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, and B. Bhattacharjee. Building an efficient RDF store over a relational database. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 121–132. ACM, 2013.
5. S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 145–156. ACM, 2011.
6. J. D. Fernández, A. Polleres, and J. Umbrich. Towards efficient archiving of dynamic linked open data. In Proceedings of the 1st DIACHRON Workshop, 2015.
7. I. Foundoulaki and A. Kementsietsidis. Assessing the performance of RDF engines: Discussing RDF benchmarks. http://www.ics.forth.gr/isl/RDF-Benchmarks-Tutorial/index.html. Accessed: 2016-04-22.
8. J. Gray. Benchmark Handbook: For Database and Transaction Processing Systems.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
9. Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2):158–182, 2005.
10. M. F. Husain, L. Khan, M. Kantarcioglu, and B. Thuraisingham. Data intensive query processing for large RDF graphs using cloud computing tools. In Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, pages 1–10. IEEE, 2010.
11. E. G. Kalayci, T. E. Kalayci, and D. Birant. An ant colony optimisation approach for optimising SPARQL queries by reordering triple patterns. Information Systems, 50:51–68, 2015.
12. Z. Kaoudi, K. Kyzirakos, and M. Koubarakis. SPARQL query optimization on top of DHTs. In The Semantic Web–ISWC 2010, pages 418–435. Springer, 2010.
13. A. Letelier, J. Pérez, R. Pichler, and S. Skritek. Static analysis and optimization of semantic web queries. ACM Transactions on Database Systems (TODS), 38(4):25, 2013.
14. M. Meimaris. EvoGen: a generator for synthetic versioned RDF. In T. Palpanas, E. Pitoura, W. Martens, S. Maabout, and K. Stefanidis, editors, Proceedings of the Workshops of the EDBT/ICDT 2016 Joint Conference (EDBT/ICDT 2016), number 1558 in CEUR Workshop Proceedings, Aachen, 2016.
15. M. Meimaris, G. Papastefanatos, and C. Pateritsas. An archiving system for managing evolution in the data web. In Proceedings of the 1st DIACHRON Workshop, 2015.
16. M. Meimaris, G. Papastefanatos, C. Pateritsas, T. Galani, and Y. Stavrakas. Towards a framework for managing evolving information resources on the data web. In PROFILES@ESWC, 2014.
17. M. Meimaris, G. Papastefanatos, S. Viglas, Y. Stavrakas, and C. Pateritsas. A query language for multi-version data web archives. arXiv preprint arXiv:1504.01891, 2015.
18. M. Morsey, J. Lehmann, S. Auer, and A. N. Ngomo. DBpedia SPARQL benchmark - performance assessment with real queries on real data. In 10th International Semantic Web Conference - ISWC 2011, Bonn, Germany, October 23-27, 2011.
19. T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 984–994. IEEE, 2011.
20. N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF: adaptive query processing on RDF data in the cloud. In Proceedings of the 21st International Conference Companion on World Wide Web, pages 397–400. ACM, 2012.
21. G. Papastefanatos. Challenges and opportunities in the evolving data web. In Proceedings of the 1st International Workshop on Modeling and Management of Big Data (with ER 2013), pages 23–28, 2013.
22. V. Papavasileiou, G. Flouris, I. Fundulaki, D. Kotzinos, and V. Christophides. High-level change detection in RDF(S) KBs. ACM Transactions on Database Systems (TODS), 38(1):1, 2013.
23. Y. Roussakis, I. Chrysakis, K. Stefanidis, G. Flouris, and Y. Stavrakas. A flexible framework for understanding the dynamics of evolving RDF datasets. In The Semantic Web–ISWC 2015, pages 495–512. Springer, 2015.
24. M. Schmidt, O. Görlitz, P. Haase, G. Ladwig, A. Schwarte, and T. Tran. FedBench: A benchmark suite for federated semantic data query processing. In The Semantic Web–ISWC 2011, pages 585–600. Springer, 2011.
25. M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP²Bench: A SPARQL performance benchmark. In ICDE, Shanghai, China, 2009.
26. M. Schmidt, M. Meier, and G. Lausen. Foundations of SPARQL query optimization. In Proceedings of the 13th International Conference on Database Theory, pages 4–33. ACM, 2010.
27. K. Stefanidis, I. Chrysakis, and G. Flouris. On designing archiving policies for evolving RDF datasets on the web. In Conceptual Modeling, pages 43–56. Springer, 2014.
28. M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th International Conference on World Wide Web, pages 595–604. ACM, 2008.
29. M. Voigt, A. Mitschick, and J. Schulz. Yet another triple store benchmark? Practical experiences with real-world data. In SDA, pages 85–94. Citeseer, 2012.