Reflections on: Triple Storage for Random-Access Versioned Querying of RDF Archives
Short version of article in Journal of Web Semantics
Ruben Taelman, Miel Vander Sande, Joachim Van Herwegen, Erik Mannens, Ruben Verborgh
IDLab, Department of Electronics and Information Systems, Ghent University – imec
Identifier: https://rdfostrich.github.io/article-iswc2019-journal-ostrich/

Abstract. In addition to their latest version, Linked Open Datasets on the Web can also contain useful information in or between previous versions. In order to exploit this information, we can maintain history in RDF archives. Existing approaches either require much storage space, or they do not meet sufficiently expressive querying demands. In this extended abstract, we discuss an RDF archive indexing technique that has a low storage overhead and adds metadata for reducing lookup times. We introduce algorithms based on this technique for efficiently evaluating versioned queries. Using the BEAR RDF archiving benchmark, we evaluate our implementation, called OSTRICH. Results show that OSTRICH introduces a new trade-off regarding storage space, ingestion time, and querying efficiency. By processing and storing more metadata at ingestion time, it significantly lowers the average lookup time for versioned queries. Our storage technique reduces query evaluation time through a preprocessing step during ingestion, which only in some cases increases storage space compared to other approaches. This allows data owners to store and query multiple versions of their dataset efficiently, lowering the barrier to historical dataset publication and analysis.

1. Introduction

In the area of data analysis, there is an ongoing need for maintaining the history of datasets. Such archives can be used for looking up data at certain points in time, for requesting evolving changes, or for checking the temporal validity of these data [1].
While the RDF data model itself is atemporal, Linked Datasets typically change over time [2] on the dataset, schema, and/or instance level [3]. Such changes can include additions, modifications, or deletions of complete datasets, ontologies, and separate facts. While some evolving datasets, such as DBpedia [4], are published as separate dumps per version, more direct and efficient access to prior versions is desired.

In 2015, a survey on archiving Linked Open Data [1] illustrated the need for improved versioning capabilities, as current approaches have scalability issues at Web-scale. They either perform well for versioned query evaluation (at the cost of large storage space requirements), or they require less storage space (at the cost of slower querying). Furthermore, no existing solution performs well for all existing versioned query types. An RDF archive solution should have a scalable storage model, efficient compression, and indexing methods that enable expressive versioned querying [1].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In this article, we argue that supporting both RDF archiving and SPARQL at once is difficult to scale due to their combined complexity. Instead, we propose an elementary but efficient versioned triple pattern index that can be used as a basis for query engines. We focus on the performance-critical features of stream-based results, query result offsets, and cardinality estimation, which allow more memory-efficient processing and more efficient query planning [5, 6]. We offer an open-source implementation of this approach, called OSTRICH, which we evaluate using an existing RDF archiving benchmark.

This article is structured as follows. We first introduce related work in Section 2. Next, we introduce our storage approach in Section 3, and our querying algorithms in Section 4.
After that, we present and discuss the evaluation of our implementation in Section 5. Finally, we present our conclusions in Section 6.

2. Related Work

In this section, we discuss existing solutions and techniques for indexing and compression in RDF storage without archiving support. Then, we compare different RDF archiving solutions. Finally, we discuss suitable benchmarks and different query types for RDF archives. This section does not contain an exhaustive list of all relevant solutions and techniques; instead, only those that are most relevant to this work are mentioned.

2.1. General RDF Indexing and Compression

RDF storage systems typically use indexing and compression techniques to reduce query times and storage space.

RDF-3X [6] is an RDF storage technique based on a clustered B+Tree with 18 indexes in which triples are sorted lexicographically. These indexes correspond to different triple component orders. A dictionary is used to compress common triple components. When evaluating SPARQL queries, optimal indexes can be selected based on the query's triple patterns. In our storage approach, we reuse the concept of multiple indexes and the encoding of triple components in a dictionary.

HDT [7] is a binary RDF representation that is highly compressed and provides indexing structures that enable efficient querying. Furthermore, it also uses a dictionary to reduce storage requirements. HDT archives are read-only, which leads to high efficiency and compressibility, but makes them unsuitable for cases where datasets change frequently. For these reasons, we reuse HDT snapshots as part of our storage solution.

2.2. RDF Archiving

Fernández et al. formally define an RDF archive [8] as follows: an RDF archive graph A is a set of version-annotated triples, where a version-annotated triple is defined as an RDF triple with a label representing the version in which this triple holds.
Furthermore, an RDF version of an RDF archive is the set of triples that exist at a given version in the archive.

Systems for archiving Linked Open Data are categorized into three storage strategies [1]:
The Independent Copies (IC) approach creates separate instantiations of the dataset for each change or set of changes. Example: SemVersion [9].
The Change-Based (CB) approach instead only stores change sets between versions. Example: R&WBase [10].
The Timestamp-Based (TB) approach stores the temporal validity of facts. Example: X-RDF-3X [11].
These storage strategies can also be combined into hybrid approaches, such as TailR [12], which combines the CB and IC approaches.

2.3. RDF Archiving Benchmarks

BEAR [8] is a benchmark for RDF archive systems. It offers three different categories of datasets with varying dataset sizes and temporal granularity. In addition, triple pattern queries for different versioned query types are provided. BEAR provides baseline RDF archive implementations based on HDT [7] and Jena's [13] TDB store for the IC, CB, and TB approaches, as well as hybrid IC/CB and TB/CB approaches. We use this benchmark for evaluating our approach.

2.4. Query Atoms

To cover the retrieval demands in RDF archiving, three foundational query types were introduced [8], which are referred to as query atoms:
1. Version materialization (VM) retrieves data using a query targeted at a single version. Example: Which books were present in the library yesterday?
2. Delta materialization (DM) retrieves a query's result change sets between two versions. Example: Which books were returned to or taken from the library between yesterday and now?
3. Version query (VQ) annotates a query's results with the versions (of RDF archive A) in which they are valid. Example: At what times was book X present in the library?
Typically, VM queries are efficient in storage solutions based on IC, DM queries are efficient in CB solutions, and VQ queries perform well in TB solutions.
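As a toy illustration of the three query atoms, consider an archive modeled as a mapping from triples to the set of versions in which they hold. This is a minimal sketch for intuition only; the names and data are illustrative and not part of any OSTRICH API.

```python
# Toy illustration of the three query atoms over an archive modeled as a
# mapping from triples to the set of versions in which they hold.
# All names and data here are illustrative, not part of any OSTRICH API.

archive = {
    ("bookX", "locatedIn", "library"): {0, 1},  # taken from the library after version 1
    ("bookY", "locatedIn", "library"): {1, 2},  # returned to the library in version 1
}

def vm(archive, version):
    """Version materialization: all triples valid in a single version."""
    return {t for t, versions in archive.items() if version in versions}

def dm(archive, v_start, v_end):
    """Delta materialization: additions ('+') and deletions ('-') between two versions."""
    start, end = vm(archive, v_start), vm(archive, v_end)
    return {**{t: "+" for t in end - start}, **{t: "-" for t in start - end}}

def vq(archive, triple):
    """Version query: annotate a triple with the versions in which it is valid."""
    return archive.get(triple, set())
```

For example, `dm(archive, 0, 2)` reports bookY as an addition and bookX as a deletion, matching the library examples above, while `vq` answers at which times a given book was present.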
However, these query types typically perform sub-optimally in the other approaches. With our solution, we aim to make all query types sufficiently efficient.

3. Hybrid Multiversion Storage Approach

In order to efficiently evaluate all query types, we introduce a hybrid IC/CB/TB storage solution. Our approach consists of an initial dataset snapshot, stored in HDT [7], followed by a delta chain. The delta chain uses multiple compressed B+Trees for a TB storage strategy (similar to X-RDF-3X [11]), applies dictionary encoding to triples, and stores additional metadata to improve lookup times.

Our storage technique is partially based on a hybrid IC/CB approach similar to the one followed by TailR [12]. This approach starts with a snapshot and stores changes as deltas that are relative to each other, as can be seen in Fig. 1. In order to avoid increasing reconstruction times, we construct the delta chain in an aggregated deltas [14] fashion: each delta is independent of the preceding deltas and relative to the closest preceding snapshot in the chain, as shown in Fig. 2. Hence, for any version, reconstruction in our approach requires at most one delta and one snapshot, as opposed to the whole delta chain needed by approaches like TailR.

Fig. 1: Delta chain in which deltas are relative to the previous delta or snapshot, as done by TailR [12].
Fig. 2: Delta chain in which deltas are relative to the snapshot at the start of the chain, as part of our approach.

In order to cope with the redundancies that this delta chain structure introduces, we use a delta storage method similar to the TB storage strategy, which is able to compress redundancies within consecutive deltas. In contrast to a regular TB approach, which stores plain timestamped triples, we store timestamped triples in separate stores for additions and deletions.
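The aggregated delta chain can be sketched as follows. This is a deliberate simplification of our storage model (plain in-memory sets instead of HDT and B+Trees; names are illustrative): every delta is stored relative to the snapshot at the head of the chain, so materializing any version touches at most one snapshot and one delta.

```python
# Sketch of version reconstruction with an aggregated delta chain
# (a simplification of our storage model; names and data are illustrative).

snapshot = {("a", "p", "x"), ("b", "p", "y")}  # version 0, stored as HDT

deltas = {  # one aggregated delta per version, each relative to the snapshot
    1: {"additions": {("c", "p", "z")}, "deletions": set()},
    2: {"additions": {("c", "p", "z")}, "deletions": {("a", "p", "x")}},
}

def materialize(version):
    """Reconstruct a full version from the snapshot and a single delta."""
    if version == 0:
        return set(snapshot)
    delta = deltas[version]
    return (snapshot - delta["deletions"]) | delta["additions"]
```

Note how the delta for version 2 repeats the addition from version 1: the timestamp-based addition and deletion stores compress exactly this redundancy between consecutive deltas.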
Each addition and deletion store uses three B+Tree indexes with different triple component orders (SPO, POS, and OSP) to improve lookup efficiency when querying. For deletions specifically, we also store the relative position of each triple inside the delta in the deletion stores. When querying, this speeds up the process of patching a snapshot's triple pattern subset for any given offset. This position information serves two purposes: 1) it allows the querying algorithm to exploit the offset capabilities of the snapshot store to resolve offsets for any triple pattern against any version; and 2) it allows deletion counts for any triple pattern and version to be determined efficiently. The use of the relative position during querying is explained further in Section 4.

4. Versioned Query Algorithms

In this section, we introduce algorithms for performing VM, DM, and VQ triple pattern queries based on the storage structure introduced in Section 3. Each of these querying algorithms is based on result streams, enabling efficient offsets and limits by exploiting the index structure from Section 3.

4.1. Version Materialization

Version materialization (VM) is the most straightforward versioned query type: it allows querying a dataset at a certain version. Our algorithm takes a triple pattern, a version, and a numerical offset as input, and returns a triple stream as output. Starting from the initial snapshot, it performs a sort-merge join with the deletions stream for the given version. At the end of the stream, all additions for the given version are appended. In order to apply the proper offset to the stream, we iteratively jump to the correct position in the snapshot and deletions streams based on the relative position that was stored as metadata with each triple.

4.2. Delta Materialization

The goal of delta materialization (DM) queries is to query the triple differences between two versions. Each triple in the result stream is annotated as either an addition or a deletion within the given version range. Within the scope of this work, we limit ourselves to delta materialization within a single snapshot and delta chain. Because of this, we distinguish between two cases for our DM algorithm, in which triple patterns can be queried between a start and an end version: the start version of the query either corresponds to the snapshot version, or it comes after it.

For the first case, where the start version corresponds to the snapshot version, the algorithm is straightforward. Since we always store our deltas relative to the snapshot, filtering the delta of the given end version based on the given triple pattern directly yields the desired result stream. Furthermore, we filter out local changes, as we are only interested in actual changes with respect to the snapshot.

For the second case, the start version does not correspond to the snapshot version. The algorithm iterates over the triple pattern iteration scope of the addition and deletion trees in a sort-merge join-like operation, and only emits the triples that have a different addition/deletion flag for the two versions.

4.3. Version Query

For version querying (VQ), the final query atom, we have to retrieve all triples across all versions, annotated with the versions in which they exist. In this work, we again focus on version queries for a single snapshot and delta chain. For multiple snapshots and delta chains, the following algorithm can simply be applied once per snapshot and delta chain. Our version querying algorithm is again based on a sort-merge join-like operation. We start by iterating over the snapshot for the given triple pattern. Each snapshot triple is queried within the deletion tree.
If such a deletion value is found, the version annotation contains all versions except those in which the given triple was deleted with respect to the given snapshot. If no such deletion value is found, the triple was never deleted, so the version annotation simply contains all versions of the store. Result stream offsetting can happen efficiently as long as the snapshot allows efficient offsets. When the snapshot iterator is finished, we iterate over the addition tree in a similar way. Each addition triple is again queried within the deletion tree, and the version annotation can be derived equivalently.

5. Evaluation

In this section, we evaluate our proposed storage technique and querying algorithms. We start by introducing OSTRICH, an implementation of our proposed solution. After that, we describe the setup of our experiments, followed by a presentation of our results. Finally, we discuss these results.

5.1. Implementation

OSTRICH stands for Offset-enabled STore for TRIple CHangesets, and it is a software implementation of the storage and querying techniques described in this article. It is implemented in C/C++ and available on GitHub (https://zenodo.org/record/883008) under an open license. In the scope of this work, OSTRICH currently supports a single snapshot and delta chain. OSTRICH uses HDT [7] as snapshot technology. Furthermore, for our indexes we use Kyoto Cabinet (http://fallabs.com/kyotocabinet/), which provides a highly efficient memory-mapped B+Tree implementation with compression support. For our dictionary, we use and extend HDT's dictionary implementation. We compress this delta dictionary with gzip, which requires decompression during querying and ingestion. We provide a developer-friendly C/C++ API for ingesting and querying data based on an OSTRICH store.
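To make the role of the relative deletion positions (Sections 3 and 4.1) concrete, the following toy sketch resolves a VM query offset against a patched snapshot. It is a simplified in-memory stand-in, not the actual C/C++ implementation: in OSTRICH the deletion counts below are precomputed relative-position metadata in the B+Trees, whereas here they are recomputed on the fly, and all data is illustrative.

```python
# Toy sketch of VM offset resolution (not the OSTRICH C/C++ code).
# The snapshot is an ordered triple list; the deletions and additions
# belong to the queried version. All data here is illustrative.

snapshot = [("a", "p", "1"), ("b", "p", "2"), ("c", "p", "3"), ("d", "p", "4")]
deletions = {("b", "p", "2"), ("c", "p", "3")}  # deleted in the queried version
additions = [("e", "p", "5")]                   # added in the queried version

def deletions_before(pos):
    """Number of deleted triples at snapshot positions < pos.
    (In OSTRICH, this count is stored as relative-position metadata.)"""
    return sum(1 for t in deletions if snapshot.index(t) < pos)

def vm(offset):
    """Materialize the queried version, starting at the given offset."""
    patched_size = len(snapshot) - len(deletions)
    if offset >= patched_size:
        # The offset falls past the patched snapshot: the remainder
        # of the offset is applied to the appended additions.
        return additions[offset - patched_size:]
    # Iteratively jump forward in the snapshot until the position
    # accounts for all deletions that were skipped over.
    pos = offset
    while pos != offset + deletions_before(pos):
        pos = offset + deletions_before(pos)
    # Stream the patched snapshot from there, then append the additions.
    return [t for t in snapshot[pos:] if t not in deletions] + additions
```

For example, `vm(1)` jumps directly past the first non-deleted snapshot triple instead of scanning the materialized version from the start, which is what makes offsets cheap in our result streams.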
Additionally, we provide command-line tools for ingesting data into an OSTRICH store, and for evaluating VM, DM, or VQ triple pattern queries for any given limit and offset against a store. Furthermore, we implemented Node JavaScript bindings (https://zenodo.org/record/883010) that expose the OSTRICH API for ingesting and querying to JavaScript applications such as Comunica [15]. We used these bindings to expose an OSTRICH store (http://versioned.linkeddatafragments.org/bear) containing a dataset with 30M triples in 10 versions using TPF [5] with the VTPF feature [16].

5.2. Experimental Setup

As mentioned in Section 2, we evaluate our approach using the BEAR benchmark. For a complete comparison with other approaches, we re-evaluated BEAR's Jena- and HDT-based RDF archive implementations. After that, we evaluated OSTRICH for the same queries and datasets. We were not able to extend this benchmark with other similar systems, such as X-RDF-3X, RDF-TX, and Dydra, because the source code of these systems was either not publicly available, or the system would require additional implementation work to support the required query interfaces. Our experiments were executed on a 64-bit Ubuntu 14.04 machine with 128 GB of memory and a 24-core 2.40 GHz CPU.

5.3. Results

In this section, we present the results of our evaluation. We report the ingestion results, compressibility, query evaluation times for all cases, and offset results. All raw results and the scripts that were used to process them are available on GitHub (https://github.com/rdfostrich/ostrich-bear-results/).

Figures 3, 4, and 5 show the query duration results for the BEAR-B queries on the complete BEAR-B-hourly dataset for all approaches. OSTRICH outperforms the Jena-based approaches in all cases. HDT-IC is faster than OSTRICH for VM queries, but HDT-CB is significantly slower, except for the first 100 versions.
For DM queries, OSTRICH is comparable to HDT-IC and faster than HDT-CB, except for the first 100 versions. Finally, OSTRICH outperforms all HDT-based approaches for VQ queries by almost an order of magnitude.

Fig. 3: Median BEAR-B-hourly VM query results for all triple patterns for all versions.
Fig. 4: Median BEAR-B-hourly DM query results for all triple patterns from version 0 to all other versions.
Fig. 5: Median BEAR-B-hourly VQ query results for all triple patterns.

5.4. Discussion

The results from the previous section show that query evaluation in OSTRICH is faster than in all Jena-based approaches, mostly faster than HDT-CB, and mostly slower than HDT-IC. VM queries in OSTRICH are always slower than in HDT-IC, because HDT can very efficiently query a single materialized snapshot in this case, while OSTRICH requires more operations for materializing. VM queries in OSTRICH are, however, always faster than in HDT-CB, because the latter has to reconstruct complete delta chains, while OSTRICH only has to reconstruct a single delta relative to the snapshot. For DM queries, OSTRICH is slower than or comparable to HDT-IC, and slower than HDT-CB for early versions but faster for later versions. This slowing down of HDT-CB for DM queries is again caused by the reconstruction of delta chains. For VQ queries, OSTRICH outperforms all other approaches for datasets with larger numbers of versions. For BEAR-A, which contains only 10 versions in our case, the HDT-based approaches are slightly faster because only a small number of versions needs to be iterated.

6. Conclusions

In this article, we introduced an RDF archive storage method with accompanying algorithms for evaluating versioned queries with efficient result offsets. By storing additional data during ingestion, we achieve a significant query efficiency improvement. With OSTRICH, we provide a technique for publishing and querying RDF archives at Web-scale.
With lookup times of 1 ms or less in most cases, OSTRICH is an ideal candidate for Web querying, as the network latency will typically be higher than that. At the cost of increased ingestion times, lookups are fast. Several opportunities exist for advancing this technique in future work, such as improving the ingestion efficiency, increasing the DM offset efficiency, and supporting dynamic snapshot creation. Furthermore, the branching and merging of different versions can be investigated. Our approach succeeds in reducing the cost of publishing RDF archives on the Web. It lowers the barrier towards intelligent clients that require evolving data, with the goal of time-sensitive querying over the ever-evolving Web of data.

References

1. Fernández, J.D., Polleres, A., Umbrich, J.: Towards efficient archiving of Dynamic Linked Open Data. In: Proceedings of the First DIACHRON Workshop on Managing the Evolution and Preservation of the Data Web
2. Umbrich, J., Decker, S., Hausenblas, M., Polleres, A., Hogan, A.: Towards dataset dynamics: Change frequency of Linked Open Data sources. In: 3rd International Workshop on Linked Data on the Web (LDOW) (2010).
3. Meimaris, M., Papastefanatos, G., Viglas, S., Stavrakas, Y., Pateritsas, C., Anagnostopoulos, I.: A Query Language for Multi-version Data Web Archives. Expert Systems. 33, 383–404 (2016).
4. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a Web of open data. In: The semantic web
5. Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P.: Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web. Journal of Web Semantics. (2016).
6. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment. 1, 647–659 (2008).
7.
Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.: Binary RDF Representation for Publication and Exchange (HDT). Web Semantics: Science, Services and Agents on the World Wide Web. 19, 22–41 (2013).
8. Fernández, J.D., Umbrich, J., Polleres, A., Knuth, M.: Evaluating Query and Storage Strategies for RDF Archives. Semantic Web Journal. (2018).
9. Volkel, M., Winkler, W., Sure, Y., Kruk, S.R., Synak, M.: SemVersion: A versioning system for RDF and ontologies. In: Second European Semantic Web Conference, May 29–June 1, 2005 (2005).
10. Vander Sande, M., Colpaert, P., Verborgh, R., Coppens, S., Mannens, E., Van de Walle, R.: R&WBase: git for triples. In: Proceedings of the 6th Workshop on Linked Data on the Web (2013).
11. Neumann, T., Weikum, G.: x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. Proceedings of the VLDB Endowment.
12. Meinhardt, P., Knuth, M., Sack, H.: TailR: a platform for preserving history on the web of data. In: Proceedings of the 11th International Conference on Semantic Systems. pp. 57–64. ACM (2015).
13. McBride, B.: Jena: A semantic web toolkit. IEEE Internet Computing. 6, (2002).
14. Im, D.-H., Lee, S.-W., Kim, H.-J.: A version management framework for RDF triple stores. International Journal of Software Engineering and Knowledge Engineering. 22, 85–106 (2012).
15. Taelman, R., Van Herwegen, J., Vander Sande, M., Verborgh, R.: Comunica: a Modular SPARQL Query Engine for the Web. In: Proceedings of the 17th International Semantic Web Conference (2018).
16. Taelman, R., Vander Sande, M., Verborgh, R., Mannens, E.: Versioned Triple Pattern Fragments: A Low-cost Linked Data Interface Feature for Web Archives. In: Proceedings of the 3rd Workshop on Managing the Evolution and Preservation of the Data Web (2017).