=Paper=
{{Paper
|id=Vol-1558/paper9
|storemode=property
|title=EvoGen: a Generator for Synthetic Versioned RDF
|pdfUrl=https://ceur-ws.org/Vol-1558/paper9.pdf
|volume=Vol-1558
|authors=Marios Meimaris
|dblpUrl=https://dblp.org/rec/conf/edbt/Meimaris16
}}
==EvoGen: a Generator for Synthetic Versioned RDF==
EvoGen: a Generator for Synthetic Versioned RDF
Marios Meimaris
University of Thessaly and ATHENA Research Center, Greece
m.meimaris@imis.athena-innovation.gr
ABSTRACT
Synthetic data are widely used for evaluation, testing, and experimentation. However, there is a lack of systems, tools, and datasets that can be used for benchmarking in the context of evolution. In the case of RDF, the generation of synthetic data that change through time must take into account evolution paradigms and characteristics that make sense, rather than arbitrary insertions and deletions of triples. In this paper, we discuss requirements for the generation of synthetic evolving datasets by abstracting several characteristics of the process, and we present EvoGen, a tool for evolving dataset generation that is based on the widely used Lehigh University Benchmark (LUBM) generator.

Categories and Subject Descriptors
H.2.8 [Information Systems Applications]: Database Management—Database Applications

Keywords
RDF, Data Management, Benchmarks, Synthetic Data

1. INTRODUCTION
The Resource Description Framework (RDF, http://www.w3.org/RDF/) is a W3C recommendation for modelling and publishing data in Linked Data and Semantic Web contexts. Due to the wide adoption of RDF, as well as the dynamicity of data published in the Data Web, the need for handling evolution in such datasets becomes increasingly relevant. The importance of evolution management in the context of the Data Web has been stressed in the literature as a means of addressing issues such as provenance tracking, longitudinal querying, semantic change detection and representation, dealing with broken URIs, and so on [6, 1]. Evolution in RDF data, in its simplest form, is the act of inserting and deleting triples over time. However, in real settings, evolution takes place in different schemes and settings, depending on the context. For instance, versioning can take place on either the resource level or the dataset level, and at the same time, changes can be detected and represented as simple (e.g. triple additions/deletions) or more complex (e.g. schema changes, or groupings of triple additions that result in higher-level changes). Moreover, other requirements that are often present in evolving RDF have to do with metadata, such as provenance and temporal annotations. There exist several tools for storage and archiving that come with varying functionalities; however, there is a lack of support when it comes to benchmarking them. Two main aspects must be considered towards this goal. First, the existence of real and synthetic datasets of varying size and complexity that represent several cases of evolution is an important element of such a benchmark. Second, the performance evaluation of these tools requires the existence of appropriate query workloads and representative evolving operations. In this paper, we address the former, i.e. the generation of synthetic datasets in evolving contexts.

Contributions. The contributions of this paper can be summarized as follows:

• we discuss the requirements and characteristics of the process of creating synthetic versioned RDF data,

• we describe EvoGen, a prototype implementation for configurable synthetic dataset generation in evolving contexts that extends the Lehigh University Benchmark (LUBM) generator with support for the defined characteristics.

This paper is outlined as follows. Section 1 introduces the subject. Section 2 provides an overview of related work. Section 3 discusses conceptual aspects of different evolution paradigms in RDF. Section 4 describes the implemented system, Section 5 presents a preliminary evaluation, and Section 6 concludes the paper.
© 2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC BY-NC-ND 4.0.

2. RELATED WORK
The Lehigh University Benchmark (LUBM) [7] includes an implementation for RDF synthetic data generation. LUBM was originally aimed at providing datasets for benchmarking reasoners and systems with reasoning/inferencing capabilities for OWL and DAML ontologies. In fact, the generator creates both explicit and implicit relationships between the data. Nevertheless, it has also been used extensively in the evaluation of SPARQL engines and RDF stores in general [19, 9, 10, 2, 4, 8, 15]. LUBM provides a set of 14 benchmark queries consisting of 1-6 conjunctive triple patterns; however, these have been extended with more complicated patterns in several other works, especially when it comes to evaluating SPARQL engines that need more complicated queries (e.g. in [9]). SP²Bench [17] is a benchmark RDF dataset generator for evaluating SPARQL engines. It is targeted at query efficiency rather than reasoning and has been widely used in the literature [18, 8, 11]. There exist several other works in benchmarking RDF systems and tools, such as FedBench [16] and the Berlin SPARQL Benchmark [3]; however, not all of them provide synthetic data generation capabilities, even less so in the case of evolving data. An extensive comparison of RDF benchmarks is done in [5]. Fernandez et al. [6] provide metrics for benchmarking archiving systems in RDF and Linked Data settings.

Our approach aims at filling the gap of synthetic data generation in evolving contexts. For this purpose, we have chosen to extend LUBM in order to provide capabilities for versioned data, because it is a widely adopted and established benchmark, and it can be used in storage, querying, as well as reasoning scenarios.

3. CHARACTERISTICS OF EVOLUTION
Generation processes for evolving datasets have to meet several functional as well as non-functional requirements regarding the data creation. These include parameterization and configurability, as well as scalability and efficiency. The generation of evolving datasets must be abstracted from the generation of synthetic data in general, in order to identify, and consequently quantify, the inherent characteristics that are specific to evolving contexts.

In order to define and quantify input parameters, we go on to outline high-level, dataset-independent characteristics that will enable parameterization in the generation process. We consider a diachronic RDF dataset D as defined in [13, 12], and a set of n + 1 distinct temporal instantiations of D at time points t_i ... t_{i+n}. A diachronic dataset D is a time-agnostic representation that is used to refer to D statically through time, without referencing its particular versions. In the scope of this work, we regard versions as discrete snapshots of a dataset through time. Given these, we introduce the notions of shift, monotonicity, and strictness to capture the type, volume, and structural aspects of the changes that D undergoes between t_i and t_{i+n}. These are defined as follows.

1. Shift: the shift of an evolving dataset captures the modification of its size through different versions. In essence, it represents the direction towards which its size leans through the passing of time. The shift of a dataset is given with respect to a time period [t_i, t_{i+n}], and depending on its value, it can lead to incremental, decremental, or unchanged data. For example, when comparing two versions of an RDF dataset D at times t_i and t_j, the shift of D will be incremental if the dataset size increased (i.e. more insertions than deletions), decremental if it decreased (i.e. more deletions than insertions), and unchanged if it remained the same (i.e. an equal number of insertions and deletions). In order to quantify the shift h of D between t_i and t_{i+n}, we regard it as a function h(D)|_{t_i}^{t_{i+n}} of low-level changes (triple insertions/deletions) in D between t_i and t_{i+n}, with respect to t_i:

   h(D)|_{t_i}^{t_{i+n}} = (|D_{i+n}| - |D_i|) / |D_i|    (1)

   When defining h over a given time period, the changes are distributed across all versions that exist within that period.

2. Monotonicity: the monotonicity of a dataset captures whether or not a dataset with an incremental or decremental shift changes monotonically in a given time period [t_i, t_j]. A shift is monotonic when only additions, or only deletions, of triples occur in all consecutive versions of a dataset between t_i and t_j. For example, the evolution of a dataset between t_i and t_j with incremental (decremental) shift is monotonic iff between t_i and t_j only triple additions (deletions) take place. More formally, an RDF dataset D is monotonically incremental if it has an incremental shift and the following holds for any arbitrary time points t_k and t_l:

   ∄ t_k, t_l : |D_{t_l}| < |D_{t_k}|, t_i ≤ t_k < t_l ≤ t_j    (2)

   and monotonically decremental if it has a decremental shift and

   ∄ t_k, t_l : |D_{t_l}| > |D_{t_k}|, t_i ≤ t_k < t_l ≤ t_j    (3)

   Monotonicity is an important characteristic that can be used for supporting use cases where datasets change in only one direction. In real-world scenarios, this is especially useful for creating datasets that are only increasing (e.g. log files, publication databases, etc.).

3. Strictness: strictness is a boolean property that a dataset exhibits when it follows predictable schema patterns through time. Because of the schema looseness typically associated with RDF, an abstraction of the schema of RDF datasets is needed in order to define strictness. We recall the notion of Characteristic Sets [14] as the needed abstraction. A characteristic set of a subject node s is essentially the collection of all distinct properties p that appear in triples with s as subject. Given an RDF dataset D and a subject s, the Characteristic Set S_c(s) of s is:

   S_c(s) = {p | ∃o : (s, p, o) ∈ D}

   and the set of all S_c for a dataset D at time t_i is:

   S_c(D) = {S_c(s) | ∃p, o : (s, p, o) ∈ D}

A dataset D is strict in a given time period [t_i, t_j] if the set of all Characteristic Sets S_c(D) remains the same within this time period:

   ∄ t_k, t_l : S_c(D_{t_k}) ≠ S_c(D_{t_l}), t_i ≤ t_k < t_l ≤ t_j    (4)

Whether the actual subject and object nodes change is not relevant unless it causes a change in S_c(D), as we are essentially interested in structural changes rather than instance-level changes. In essence, (4) holds when there are no inherent structural changes between versions of D within a given time period.
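Taken together, the three characteristics can be checked mechanically on a sequence of version snapshots. The following Python sketch is illustrative only (EvoGen itself is written in Java, and none of these function names belong to any EvoGen API): versions are modeled as sets of (s, p, o) triples, shift follows equation (1), is_monotonic follows equations (2)/(3), and is_strict compares characteristic sets as in equation (4).

```python
# Illustrative sketch of Section 3's definitions. A version is a set of
# (subject, predicate, object) triples; names here are ours, not EvoGen's.

def shift(versions):
    """Equation (1): relative size change between first and last version."""
    first, last = versions[0], versions[-1]
    return (len(last) - len(first)) / len(first)

def is_monotonic(versions, incremental=True):
    """Equations (2)/(3): no version pair moves against the shift direction."""
    sizes = [len(v) for v in versions]
    if incremental:  # no later version may be smaller
        return all(a <= b for a, b in zip(sizes, sizes[1:]))
    return all(a >= b for a, b in zip(sizes, sizes[1:]))

def characteristic_sets(version):
    """S_c(D): the set of characteristic sets S_c(s) over all subjects."""
    props = {}
    for s, p, o in version:
        props.setdefault(s, set()).add(p)
    return {frozenset(ps) for ps in props.values()}

def is_strict(versions):
    """Equation (4): S_c(D) is identical in every version of the period."""
    sets = [characteristic_sets(v) for v in versions]
    return all(cs == sets[0] for cs in sets)

# Three versions of a toy dataset: triples are only added (monotonically
# incremental), and every subject keeps the same property structure (strict).
v0 = {("a", "type", "Professor"), ("a", "teaches", "c1")}
v1 = v0 | {("b", "type", "Professor"), ("b", "teaches", "c2")}
v2 = v1 | {("c", "type", "Professor"), ("c", "teaches", "c3")}

print(shift([v0, v1, v2]))          # (6 - 2) / 2 = 2.0
print(is_monotonic([v0, v1, v2]))   # True
print(is_strict([v0, v1, v2]))      # True: every subject has {type, teaches}
```

Note that the monotonicity check compares consecutive sizes only; by transitivity this is equivalent to the pairwise conditions in (2) and (3).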
4. IMPLEMENTATION
For our purposes, we implemented EvoGen (source code available at https://github.com/mmeimaris/EvoGen), a generator of RDF datasets with configurable parameters according to the characteristics discussed in Section 3. The system is an extension of the existing Lehigh University Benchmark (LUBM) generator, which is written in Java and is a pure write-only solution that does not contain any abstractions and models for RDF datasets. While the original LUBM provides support for the generation of OWL as well as DAML files, our implementation is only aimed at RDF/XML files, mainly because, unlike the original LUBM generator, EvoGen is not meant to be a benchmark for reasoning systems. The high-level architecture of EvoGen can be seen in Figure 1.

Figure 1: High-level architecture of EvoGen.

In this first version of EvoGen, the requirement for configurable strictness has not been implemented and is left as future work. Thus, the implemented functionality only concerns changes on the instance level, with a strict structure. In fact, the strictness of the structure is inherited from the schema used by LUBM in its original implementation.

In addition to LUBM's original parameter concerning the size of the dataset (in number of universities), EvoGen's generation process requires the following parameters:

1. number of versions: an integer denoting the total number of distinct versions (dataset materializations at different time points). The number of versions needs to be larger than 1 in order for EvoGen to generate versioned datasets; otherwise the original LUBM generator is invoked.

2. shift: this is the value h(D)|_{t_i}^{t_j} defined in equation (1) for a time period [t_i, t_j], i.e. a percentage of changes between versions D_i and D_j with respect to the size of D_i. In this version of EvoGen, only strictly incremental and decremental shifts between consecutive versions have been implemented. This means that every version will be shifted with respect to its previous version, instead of generating an aggregated shift between the first and the last version, which would require distributing the required changes along a number of versions and maintaining the h(D) value only between the first and the last version in the given period. Instead, this is left as future work.

3. monotonicity: a boolean denoting the existence of monotonicity in the shift, or lack thereof.

The above parameters instantiate the generation process with the use of two basic components, namely the Version Management component and the Change Creation component. These are responsible for translating input parameters dynamically and mapping them to appropriate structures, holding information on the types of entities to create, the size and quantity of the created entities, as well as the decisions on which existing entities to evolve. These sit on top of an extended LUBM generator, which in essence is the core LUBM generator modified to accommodate serializations of different versions in the file system. The functionality is exposed through a Java API that can be invoked by importing the system's jar file and accessing its methods directly.

The Change Creation component is responsible for creating the elements to be added, or determining the elements to be deleted, calculating the actual number of insertions/deletions for each class based on dynamic weighting of the instances of each class with respect to the total dataset size, randomizing parts of the process with respect to the actually created/removed instances, and so on. This component is responsible for communicating with the Extended LUBM Generator component, which performs the actual serialization of dataset versions in the file system. The core LUBM implementation generates data based on a fixed schema, by iterating through each class and creating elements in each respective schema-imposed sub-structure. It performs some degree of randomization on URIs and on the mappings of elements to other elements (e.g. Professor32 teaches at University21). In order to distribute the required changes so that the generation process outputs versions with the desired shift, we assign weights to each class-based sub-structure and dictate the cardinality of the instances of each class to the generator. This weighting is done in the Weight Assignment Module, which uses hard-coded, normalized weights in the range 0...1 for each class, based on ranges acquired by observing the output of the original, non-versioned LUBM datasets for varying sizes. By multiplying these weights with the desired shift value h(D)|_{t_i}^{t_j}, we get an approximate number of instances that need to be created for each class.

The Version Management component holds and updates session information on each version during runtime, the schema of the dataset, the mapping of versions to descriptive metadata and files in the file system, and so on.

5. PRELIMINARY EVALUATION
In order to preliminarily evaluate and validate the output of EvoGen, we perform a series of generation tasks for different combinations of numbers of universities and changes, and a fixed number of 10 versions, and we measure the achieved shift with respect to the required one. Specifically, we perform 10 runs of generation for three different values of h, namely h(D) = 0.2, h(D) = 0.4, and h(D) = 0.6, and we report the percentage difference between the mean of the achieved h and the required one. The results can be seen in Figure 2. With a small number of universities, the achieved shift differs significantly from the required one, but as the number of universities, i.e. the dataset size, increases, the error decreases. Therefore, for a reasonably large dataset size (e.g. > 5 universities), EvoGen performs as expected. Note, however, that this preliminary evaluation does not take into account scalability and efficiency issues, which are left as future work.
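The weight-based distribution of changes described in Section 4 can be illustrated with a short sketch. This is not EvoGen's actual Java code; the class names, weights, and rounding policy below are our own assumptions, used only to show how normalized per-class weights and a desired shift value h translate into per-class instance counts.

```python
# Hypothetical sketch of the Weight Assignment idea: normalized per-class
# weights multiplied by the desired shift h (equation (1)) yield an
# approximate number of instances to create or remove per class.
# Class names, weights, and rounding are illustrative, not EvoGen's values.

def distribute_changes(class_counts, weights, h):
    """Given current instance counts per class, normalized weights, and a
    desired shift h (relative to the current total size), return how many
    instances to add (h > 0) or remove (h < 0) for each class."""
    total = sum(class_counts.values())
    total_changes = abs(h) * total           # overall number of instances to change
    changes = {}
    for cls, w in weights.items():
        n = round(w * total_changes)         # each class's share of the changes
        # never remove more instances than a class currently has
        changes[cls] = n if h > 0 else -min(n, class_counts[cls])
    return changes

counts = {"Professor": 100, "Student": 800, "Course": 100}
weights = {"Professor": 0.1, "Student": 0.8, "Course": 0.1}  # sums to 1.0

# An incremental shift of h = 0.2 over 1000 instances -> 200 new instances,
# split 20/160/20 according to the weights.
print(distribute_changes(counts, weights, 0.2))
# → {'Professor': 20, 'Student': 160, 'Course': 20}
```

In EvoGen the actual weights are hard-coded from observed, non-versioned LUBM output, and parts of the process are randomized; the sketch above keeps only the deterministic core of the calculation.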
Figure 2: Average of achieved shift over 10 runs for 10 versions, for increasing number of universities.

6. CONCLUSIONS AND FUTURE WORK
In this paper, we discuss several characteristics of generating synthetic versioned RDF datasets, and we describe EvoGen, an implementation that addresses a subset of these characteristics. Furthermore, we perform a preliminary evaluation of the system in order to measure how well the desired shift is achieved.

As future work, we intend to fully address the discussed characteristics, enable the insertion and deletion of schema elements as well, and address issues of scalability and efficiency when creating very large datasets. Finally, we intend to design query workloads based on the created data, in order to further support benchmarking versioning and evolution management systems for RDF datasets.

Acknowledgements. This work is supported by the EU-funded ICT project "DIACHRON" (agreement no. 601043).

7. REFERENCES
[1] S. Auer, T. Dalamagas, H. Parkinson, F. Bancilhon, G. Flouris, D. Sacharidis, P. Buneman, D. Kotzinos, Y. Stavrakas, V. Christophides, et al. Diachronic linked data: Towards long-term preservation of structured interrelated information. In Proceedings of the First International Workshop on Open Data, pages 31–39. ACM, 2012.
[2] A. Bernstein, M. Stocker, and C. Kiefer. SPARQL query optimization using selectivity estimation. In Poster Proceedings of the 6th International Semantic Web Conference (ISWC), 2007.
[3] C. Bizer and A. Schultz. The Berlin SPARQL benchmark, 2009.
[4] M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, and B. Bhattacharjee. Building an efficient RDF store over a relational database. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 121–132. ACM, 2013.
[5] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and oranges: A comparison of RDF benchmarks and real RDF datasets. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 145–156. ACM, 2011.
[6] J. D. Fernández, A. Polleres, and J. Umbrich. Towards efficient archiving of dynamic linked open data. In Proceedings of the 1st DIACHRON Workshop, 2015.
[7] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2):158–182, 2005.
[8] M. F. Husain, L. Khan, M. Kantarcioglu, and B. Thuraisingham. Data intensive query processing for large RDF graphs using cloud computing tools. In Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, pages 1–10. IEEE, 2010.
[9] E. G. Kalayci, T. E. Kalayci, and D. Birant. An ant colony optimisation approach for optimising SPARQL queries by reordering triple patterns. Information Systems, 50:51–68, 2015.
[10] Z. Kaoudi, K. Kyzirakos, and M. Koubarakis. SPARQL query optimization on top of DHTs. In The Semantic Web–ISWC 2010, pages 418–435. Springer, 2010.
[11] A. Letelier, J. Pérez, R. Pichler, and S. Skritek. Static analysis and optimization of semantic web queries. ACM Transactions on Database Systems (TODS), 38(4):25, 2013.
[12] M. Meimaris, G. Papastefanatos, and C. Pateritsas. An archiving system for managing evolution in the data web. In Proceedings of the 1st DIACHRON Workshop, 2015.
[13] M. Meimaris, G. Papastefanatos, S. Viglas, Y. Stavrakas, and C. Pateritsas. A query language for multi-version data web archives. arXiv preprint arXiv:1504.01891, 2015.
[14] T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 984–994. IEEE, 2011.
[15] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF: Adaptive query processing on RDF data in the cloud. In Proceedings of the 21st International Conference Companion on World Wide Web, pages 397–400. ACM, 2012.
[16] M. Schmidt, O. Görlitz, P. Haase, G. Ladwig, A. Schwarte, and T. Tran. FedBench: A benchmark suite for federated semantic data query processing. In The Semantic Web–ISWC 2011, pages 585–600. Springer, 2011.
[17] M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP²Bench: A SPARQL performance benchmark. In ICDE, Shanghai, China, 2009.
[18] M. Schmidt, M. Meier, and G. Lausen. Foundations of SPARQL query optimization. In Proceedings of the 13th International Conference on Database Theory, pages 4–33. ACM, 2010.
[19] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th International Conference on World Wide Web, pages 595–604. ACM, 2008.