<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TrieDF: Eficient In-memory Indexing for Metadata-augmented RDF</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olivier Pelgrin</string-name>
          <email>olivier@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Gala´rraga</string-name>
          <email>luis.galarraga@inria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katja Hose</string-name>
          <email>khose@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalborg University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inria</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Jena rdflib TrieDF</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Metadata, such as provenance, versioning, temporal annotations, etc., is vital for the maintenance of RDF data. Despite its importance in the RDF ecosystem, support for metadata-augmented RDF remains limited. Some solutions focus on particular annotation types but no approach so far implements arbitrary levels of metadata in an application-agnostic way. We take a step to tackle this limitation and propose an in-memory tuple store architecture that can handle RDF data augmented with any type of metadata. Our approach, called TrieDF, builds upon the notion of tries to store the indexes and the dictionary of a metadataaugmented RDF dataset. Our experimental evaluation on three use cases shows that TrieDF outperforms state-of-the-art in-memory solutions for RDF in terms of main memory usage and retrieval time, while remaining application-agnostic.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>During the last 20 years, the Web has seen a proliferation of large collections of RDF
data, i.e., triples ⟨ subject, predicate, object ⟩ describing real-world concepts
ranging from common-sense to specialized domains. The triples are structured in what
we call an RDF graph or knowledge graph (KG). KGs find applications in multiple
AI-related tasks, such as question answering, information retrieval, smart assistants, etc.</p>
      <p>Building and maintaining a large-scale KG is a titanic efort. It does not only require
sophisticated RDF stores and collaborative tools, but also procedures and protocols to
extract, cleanse, and integrate data from potentially heterogeneous sources. This is true
regardless of whether the KG is manually or (semi-)automatically populated. A central
aspect in KG construction is metadata management. RDF metadata includes, but is not
limited to, provenance, validity intervals, spatial annotations, confidence statements,
and versioning. As existing KGs grow and new initiatives come to existence, the need
to manage statements about triples, i.e., RDF tuples, becomes more and more crucial.</p>
      <p>
        There exist multiple solutions to represent metadata about RDF triples. Popular
solutions are named graphs and reification. That said, these approaches are not free of
limitations. A large number of fine-grained RDF graphs can be a challenge for quad
stores [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Reification, on the other hand, quintuples the number of statements in a
dataset; not to mention the fact that it also complexifies the queries. For these reasons,
RDF engines support at most one level of metadata in an out-of-the-box fashion. This
means that current stores can model statements about triples, but not statements about
quads, i.e., 5-tuples. They can neither model a versioned collection of graphs nor RDF
statements originating from diferent sources with multiple validity intervals. Support
for higher levels of metadata – equivalent to n-ary relationships – remains limited to
very specific scenarios such as archiving [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>This work takes a step to tackle the aforementioned limitations and proposes TrieDF,
an in-memory RDF tuple store. TrieDF stores tuples of arbitrary length in a trie, an
inmemory prefix-based tree originally used for compact storage and eficient retrieval
of strings. TrieDF models everything as a trie, namely all indexes and the dictionary.
Our evaluation shows that such an architecture yields a significant speed-up in retrieval
with little penalty in memory consumption w.r.t. existing triple/quad stores. We also
illustrate the utility of TrieDF at handling 5-tuples in the context of provenance and
version management.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>RDF Graphs. An RDF graph  ⊆ (I ∪ B) × I × T is a set of triples ⟨ s, p, o ⟩,
with subject , predicate , and object  that model binary assertions about entities, for
example, ⟨ :Denmark, :locatedIn, :Europe ⟩. The sets I, B, and T are countably
infinite sets of IRIs, blank nodes, and RDF terms respectively, with T = I ∪ B ∪ L
and L the set of literals. IRIs are web-scoped identifiers for entities such as http://
dbpedia.org/resource/Denmark (abbreviated :Denmark for default prefix http://
dbpedia.org/resource/); blank nodes are anonymous file-scoped identifiers; literals
are non-referenceable data such as strings, numbers, and dates.</p>
      <p>Metadata-augmented RDF Graphs. Given an RDF graph , a metadata-augmented
graph Γ :  → T  is an injective function that annotates each triple of the graph with
a -tuple of RDF terms. We can also see Γ as a set of -tuples  =⟨ s, p, o, ... ⟩ with
 =  + 3. Metadata-augmented RDF graphs can model -ary relationships in contrast
to standard RDF graphs that can only model binary relationships.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>The need to store and manage metadata for RDF triples has given rise to a large
literature body that we survey in two stages. First, we survey diferent techniques to
encode metadata-augmented triples and store them in triple stores. In a second stage we
discuss existing solutions to manage additional components in triples.
3.1</p>
      <p>
        Encoding Metadata-augmented Triples
Reification is the process of encoding an n-ary statement through a set of binary
relationships. For instance, consider the versioned triple ⟨ :Aalborg, :cityIn, :Denmark,
3 ⟩ – that states that the triple is present in revision 3 of the graph. Under the standard
reification, the triple is assigned a surrogate IRI (or blank node)  that is linked to all
the components of the quad, resulting in 4 new triples: ⟨ u, :subject, :Aalborg ⟩, ⟨
u, :predicate, :cityIn ⟩, ⟨ u, :object, :Denmark ⟩, and ⟨ u, :version, 3 ⟩. Since
reification incurs a significant overhead both for storage and querying, other approaches
have proposed more compact encoding strategies. The authors of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] propose singleton
properties as unique keys for statements in a context. In our example, such a context could
be the graph revision where a triple occurs, e.g., ⟨ :Aalborg, cityIn#3, :Denmark ⟩. A
more flexible scheme, called companion properties [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], proposes singleton properties
per subject, for example, ⟨ :Aalborg, cityIn#3.si, :Denmark ⟩, where si is a local
identifier that can be used to model subject-level metadata.
      </p>
      <p>
        Reification and single properties have been used to encode one level of metadata,
e.g., versioning, probabilities, provenance, temporal validity, etc. However, porting these
strategies to scenarios with arbitrary levels of statement-centered metadata, e.g.,
provenance plus versioning, requires a careful application-dependent combination of the
diferent schemes. This need has motivated the development of RDF-star [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a data
model that treats triples as first-class citizens and allows for nested statements such
as ⟨ ⟨ :Aalborg, :cityIn, :Denmark ⟩, :version, 3 ⟩. Support for RDF-star is
gaining traction in current RDF engines. Some commercial solutions such as RDFox,
GraphDB, and Stardog can parse TriG, an RDF-star serialization text format. That said,
none of those solutions can so far handle arbitrary levels of nestedness.
3.2
      </p>
      <p>
        Beyond RDF Triples
RDF Named Graphs. Even though the RDF graph data model can be used to store
metadata for RDF statements “natively”, it was rather conceived as an analogy to
documents and database tables. A named RDF graph is associated to an IRI  , and
stores a presumably large collection of triples within a well-defined context, e.g., a
particular data source. Nevertheless, named graphs have been used to store more
finegrained metadata such as validity intervals, revision numbers, and changesets [
        <xref ref-type="bibr" rid="ref17 ref5">5, 17</xref>
        ].
This can represent a challenge for classical RDF graph stores, such as Jena, Virtuoso,
RDF4J, etc., that are not optimized for a large number of small RDF graphs. Since RDF
named graphs cannot model metadata for quads, a few approaches have proposed highly
specific solutions in the context of RDF/S inference [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Property Graphs. In this data model, both nodes (entities) and edges (relationships) can
be assigned attributes, such as labels, timestamps, probabilities, sources, etc. Despite
this flexibility, property graphs cannot store arbitrary levels of metadata for triples out
of the box. Like named graphs, they rather provide a generic solution to store metadata
about triples. This agnosticism has propelled adaptations of the graph model and existing
engines (e.g., Neo4J, GraphDB) to particular applications, such as version and history
management [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and workflow provenance [
        <xref ref-type="bibr" rid="ref12 ref6 ref8">6, 8, 12</xref>
        ]. As for named graphs, solutions are
application-dependent and not trivially portable to arbitrary settings.
      </p>
      <p>
        Relational Databases. Notable designs to store RDF in tables are the three-column
table and the entity-relationship model (a relation per class, a column per predicate).
Storing metadata about RDF in a relational setting does not require any extension to
the original data model, and standard engines provide eficient support for the most
common metadata types such as temporal metadata [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], version control [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and
validity intervals. That said, the relational model is not optimized for the schema-free
nature of RDF, which in the first place motivated the design of triple stores.
We now elaborate on TrieDF, our in-memory architecture to manage
metadataaugmented RDF. TrieDF stores RDF tuples of arbitrary length in diferent indexes.
The indexes do not store actual RDF terms but rather references to those terms in a
space eficient dictionary (Figure 1b).
      </p>
      <p>TrieDF borrows inspiration from tries – prefix-based trees for string storage. Nodes
in a trie store single characters and each string is associated to a path in the tree. Strings
with the same prefix, e.g., web, weave, weasel, share common nodes. TrieDF treats tuples
as “strings” of items. Those items are either references to RDF terms (for indexes) or IRI
chunks (for the dictionary). We elaborate on these use cases in the following sections.
4.1</p>
      <p>Trie-based Indexes
Consider a version annotated RDF graph with quads ⟨ s, p, o, v ⟩ such that  models
the versions where the triple ⟨ s, p, o ⟩ is present. Figure 1b depicts an SPOV index for
the quads ⟨ 1, 2, 3, 1 ⟩, ⟨ 1, 2, 3, 2 ⟩, and ⟨ 1, 5, 6, 2 ⟩ – the triples are encoded using a
dictionary. We can infer that the triple ⟨ 1, 5, 6 ⟩ was added in the second revision of the
graph. We now describe the implementation of TrieDF and the operations it supports.
Implementation. Logically, each element of a tuple is associated to a node in the tree.
Physically, nodes store only references to their children. Those references are organized
in a red-black tree keyed by the value of the child node. If | | is the number of nodes
in a trie  , an index in TrieDF has a space complexity of  (| |max), where max is the
highest out-degree of a node in  . This happens because red-black trees exhibit  ()
space complexity in the number of keys.</p>
      <p>Tries are optimized for tuple lookup and prefix-based retrieval. These operations, as
well as additions and deletions, are implemented as in standard tries. Therefore, looking
up a tuple  incurs a time complexity of  (||log(max)).</p>
      <p>
        Users of TrieDF can define indexes of arbitrary depth for all permutations of the
elements of a tuple. Moreover, TrieDF can also operate as a standard triple store. In
that case, the system stores triples in indexes SPO, POS, and OSP in line with standard
engines. The nodes in the indexes store integers, precisely the memory addresses of the
RDF terms in the dictionary [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is also implemented as a trie as explained next.
4.2
      </p>
      <p>Trie-based Dictionary Encoding
Dictionary encoding maps the terms of an RDF dataset to an integer space for the
sake of eficient space consumption and query processing. Dictionary encoding is often
implemented using two hash tables, i.e., one for encoding (string to integer) and one for
decoding (integer to string).</p>
      <p>
        Even though dictionary encoding reduces memory consumption drastically for RDF
engines, it does not tackle the inherent redundancy of RDF terms. For instance, common
prefixes in IRIs are still stored multiple times. Similarly to the work of Bazoobandi et
al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the dictionary for IRIs is stored in an trie. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a node is usually associated
to a single character, although multiple nodes can be compressed (fusioned) into a
single node if they form a single branch subtree. A drawback of such an approach is that
(a) A standard trie
      </p>
      <p>
        (b) Trie-based index and dictionary
updates may cause fusioned nodes to split. In that case the tree must be rearranged, which
increases update time. To make updates simpler – at the expense of some redundancy
– TrieDF compresses IRIs by automatically coalescing all the nodes that lie between
occurrences of the “/” character [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. In other words, each node is logically associated
to a chunk of an IRI as depicted in Figure 1b. If a node marks the end of an IRI, the
memory address of the node serves as the integer identifier used by the tuple indexes
described in the previous section.
      </p>
      <p>
        In practice dictionary tries are bidirectional. That is, nodes store references to their
children and parent. That way, a dictionary can retrieve the identifier associated to an
IRI (lookup) and vice versa (reverse lookup). (Figure 1b). Because the benefit of a
prefix-based representation is significantly less pronounced for literals [
        <xref ref-type="bibr" rid="ref2 ref20">2, 20</xref>
        ], they are
currently stored in one-node tries.
5
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We evaluate TrieDF along three dimensions: loading time, main-memory consumption,
and retrieval time (i.e., the time to return all the tuples that match a prefix). For this
purpose, we test it in three use cases and compare it to other relevant in-memory
solutions. Our scenarios cover a standard RDF graph, a versioned RDF graph (requiring
quads), and a collection of 5-tuples.</p>
      <p>TrieDF was written in C++. All experiments were run in a server with 256 GB of
RAM, a 16-core CPU (AMD EPYC 7281), and an 8 TB HDD. The source code and
experimental data is available at https://relweb.cs.aau.dk/triedf.
5.1</p>
      <p>
        TrieDF for Triples
We first assess the performance of TrieDF when used as a standard in-memory triple
store on DBpedia 2016-10 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], YAGO 3.1 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], YAGO 4 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and Wikidata [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For
      </p>
      <sec id="sec-4-1">
        <title>DBpedia we use the themes mapping-based objects and mapping-based literals. For</title>
        <p>YAGO 3.1 we consider the knowledge base’s core, namely, the themes facts, meta facts,
literal facts, date facts, and labels. For YAGO 4, we use the facts theme from the
English only distribution. In regards to Wikidata we chose the simple-statements file of
the 2016-08 RDF export3. Details about the dataset sizes are available in the following
table, in which we compare TrieDF with Jena4 and RDFLib5, two fully-fledged
inmemory platforms for RDF/SPARQL management. They are widely used in production
environments and ofer mature and well-tested implementations.</p>
        <sec id="sec-4-1-1">
          <title>Triples Size RDF terms</title>
          <p>Details about the experimental datasets for the triples evaluation.</p>
          <p>Loading time. The following table shows the loading times of Jena, RDFLib, and
TrieDF for the experimental datasets. Jena is consistently faster than the others systems,
which can be explained by (i) a fast dictionary, (ii) a highly optimized RDF parser, and
(iii) batching of insertions. Jena stores the dictionary in classical hash tables, which
allows for constant time lookup and insertion of terms, in contrast to TrieDF that incurs
a logarithmic lookup complexity. This, in contrast, optimizes for compactness (see
paragraph on memory consumption). That said, TrieDF ranks second close to Jena, and
outperforms RDFLib by a large margin.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Jena RDFLib TrieDF</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>DBpedia</title>
        <p>587.14
2816.16
727.20
YAGO 3.1
1281.65
7102.30
2105.37</p>
      </sec>
      <sec id="sec-4-3">
        <title>YAGO 4</title>
        <p>289.42
1626.85
358.20</p>
      </sec>
      <sec id="sec-4-4">
        <title>Wikidata</title>
        <p>1665.48
9587.51
2800.77</p>
        <sec id="sec-4-4-1">
          <title>Loading time of the triples evaluation in seconds.</title>
          <p>
            Retrieval time. We measure the average runtime of the diferent systems on single
triple pattern queries of the following shapes: (i) ⟨ s, p, ?o ⟩, (ii) ⟨ ?s, p, o ⟩, (iii) ⟨ s,
?p, ?o ⟩, (iv) ⟨ ?s, ?p, o ⟩, and (v) ⟨ ?s, p, ?o ⟩ (? denotes a variable). For each query
shape, we generated 200 queries by drawing the bound components randomly from
the datasets, and then adding variables to the unbounded components as in [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. The
queries are implemented as classical retrieval operations, e.g., ⟨ :Denmark, :capital,
?o ⟩ retrieves all tuples prefixed with ⟨ :Denmark, :capital ⟩ in the SPO trie index.
          </p>
          <p>Figure 2 depicts the query runtime of the tested systems on the diferent datasets. The
results show that TrieDF is at least one order of magnitude faster than the competitors
for queries with one variable. When there are two variables, TrieDF still exhibits lower
median runtimes, but its variance is large, specially for YAGO. This can be explained
by the fact that those queries may sometimes have a large number of results. This
phenomenon also afects Jena. While RDFLib is mostly insensitive to large result sets,
it lags behind the other systems in terms of average retrieval time.</p>
          <p>Memory consumption. Figure 3a shows the peek memory consumption of the diferent
systems during loading and query execution. We observe that TrieDF outperforms Jena
3 http://tools.wmflabs.org/wikidata-exports/rdf/index.html
4 https://jena.apache.org/
5 https://rdflib.readthedocs.org</p>
          <p>Jena
rdflib</p>
          <p>TrieDF
100 1 variable
80
)60
B
G
(
y
ro40
m
e
M
20</p>
          <p>Jena
rdflib</p>
          <p>TrieDF
100 1 variable
2 variables
2 variables
100 1 variable
2 variables
2 variables
(a) DBpedia
(b) Wikidata
(c) YAGO 3.1
(d) YAGO 4</p>
          <p>Jena
rdflib</p>
          <p>TrieDF
100 1 variable
0 DBpedia Wikidata Yago 3.1 Yago 4
(a) Triples
in all datasets, and all the competitors in YAGO 3. For the other datasets, our approach
uses at most 13% more memory than RDFLib. This happens for two reasons. First,
RDFLib’s indexes are of fixed depth, which allows for some space savings: leaves do
not need to accommodate for an additional pointer to its potential children. Second,
RDFLib does not implement explicit dictionary encoding but rather relies on Python’s
internal variable handling (Python variables are actually keys pointing to their actual
value in a hash table) to store references to RDF terms in the indexes. Relying on Python
incurs important memory savings for RDFLib, however, the system remains two orders
of magnitude slower than TrieDF at retrieval.
5.2</p>
          <p>
            TrieDF for Quads
In this section we evaluate TrieDF at managing RDF quads ⟨ s, p, o, v ⟩, that represent
triples annotated with a revision number  . Our evaluation is based on the BEAR [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]
benchmark datasets, and an archive consisting of the DBpedia versions from the 3.5
to the 2016-10 release, where one release is equivalent to one revision [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. For each
release, we use the same DBpedia themes as in our experiments with triples. As for
BEAR, we use both the BEAR-B and BEAR-C datasets. We omitted BEAR-A because
our competitors could not parse the input files due to formatting issues.
          </p>
          <p>
            In order to provide versioning capabilities to TrieDF, we store RDF quads in SPOV,
POSV, and OSPV indexes. We also compare TrieDF with the Jena in-memory models
and RDFLib. For both competitors, we use named graphs to store each revision. This
storage strategy, called independent copies, optimizes for data retrieval [
            <xref ref-type="bibr" rid="ref17 ref4">4, 17</xref>
            ] at the
expense of high memory consumption.
          </p>
          <p>Loading time. The table below shows the loading times of the evaluated systems.
No results are provided for Jena in DBpedia, since the system runs out of memory.
RDFLib lags behind Jena and TrieDF, with TrieDF being the fastest at loading the bulky
revisions of BEAR-C, and Jena being the fastest at loading the more granular revisions
of BEAR-B.</p>
          <p>
            Retrieval time. We measure the average runtime of the diferent systems on 100 single
triple pattern queries of diferent types on versioned RDF graphs, namely version
materialization (VM), delta materialization (DM), and version queries (VQ). The queries
were randomly generated according to the experimental setup in [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ].
Figure 4 shows the runtime of the evaluated systems for each type of query. We observe
that TrieDF outperforms all the competitors by far, although the performance gap can
vary drastically. In particular, TrieDF’s indexes are optimized for VQ queries, for which
they are 3 orders of magnitude faster than the competitors. Even though the independent
copies approach used in Jena and RDFLib is optimal for VM queries (and to a lesser
extent for DM queries), TrieDF is still one order of magnitude faster than the competitors.
Memory consumption. Figure 3b shows the peek memory consumption of the diferent
systems during loading and query execution. We first highlight that Jena uses much
more memory than the other systems. The reason is that Jena implements a classical
independent copies approach, where each revision is entirety stored in a diferent graph.
This leads to a lot of duplicated data. In contrast, RDFLib stores graphs (called contexts)
in separate hash tables that map triples to graphs and graphs to triples. This mitigates
redundancy. In the same vibe, TrieDF stores version identifiers in the last component of
each index, which reduces redundancy. While RDFLib showcases the lowest memory
consumption, TrieDF strikes the best trade-of with at most 28% more peek memory
usage than RDFLib in exchange for a speed-up of 3 orders of magnitude for retrieval.
In this section we evaluate TrieDF on a 5-tuples setup where triples are annotated with
provenance and version identifiers ,  , i.e., we store tuples ⟨ s, p, o, q, v ⟩. We use a
dump6 of 27M tuples of the NELL [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] dataset. The NELL extractors collect
metadataaugmented knowledge iteratively from the Web. This metadata includes, among other
ifelds, confidence scores for the extracted triples, extraction sources, extraction methods,
and the iteration of promotion, i.e., the iteration at which a triple is considered true and
“oficially” added to the knowledge base. We use the two latter fields in this evaluation.
Since RDF storage systems do not support 5-tuples, we compare TrieDF against
relational database systems with support for in-memory tables. After a comparison between
SQLite and MariaDB, we chose the former due to its good performance in our setting.
5-tuples in TrieDF are represented via SPOQV, POSQV, and OSPQV indexes. We test
SQLite with and without those indexes.
          </p>
          <p>Loading time. We report loading times of 36.55s, 16.98s, and 47.27s for TrieDF, SQLite,
and SQLite with indexes respectively. We observe that SQLite loads data significantly
faster when no indexes are built, however when indexing is enabled, TrieDF is faster.
Retrieval time. We tested both systems on 100 queries of the same types defined for the
quads evaluation with randomly bounded  and  . As suggested by Figure 4d, TrieDF
achieves similar retrieval performance than an indexed 5-column SQLite in-memory
table. The median runtime of TrieDF is better for VM and VQ queries.
Memory consumption. Figure 3c shows the peak memory usage of both SQLite
and TrieDF when loading and querying the NELL dataset. We observe that indexing
multiplies memory consumption by a factor of 3 in SQLite. TrieDF still uses 26% more
memory than indexed SQLite, however TrieDF cannot leverage its trie-based dictionary
to its full capacity. This happens because NELL does not use prefixed IRIs. Despite this
rather suboptimal setting, TrieDF still shows comparable performance to SQLite.
6</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have presented an in-memory architecture based on tries to index and access
annotated RDF triples eficiently. Our solution provides the user with a flexible architecture
to manage arbitrary RDF metadata in main memory. Our experimental evaluation has
shown that such as an approach strikes an interesting trade-of between retrieval time
and memory footprint: it can yield a speed-up of up to 3 orders of magnitude in retrieval
time in return to little (and sometimes no penalty) in memory usage. We believe that
TrieDF is a first step towards a holistic solution to manage knowledge beyond RDF
triples. As future work we envision to explore diferent strategies to reduce TrieDF’s
memory footprint. We also envision to couple our architecture with suitable in-disk
storage and provide SPARQL query support.
6 http://rtw.ml.cmu.edu/rtw/resources
Acknowledgements. This research was partially funded by the Danish Council for
Independent Research (DFF) under grant agreement no. DFF-8048-00051B and the
Poul Due Jensen Foundation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.G.</given-names>
          </string-name>
          :
          <article-title>DBpedia: A Nucleus for a Web of Open Data</article-title>
          . In: ISWC/ASWC. pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bazoobandi</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          , de Rooij,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Urbani</surname>
          </string-name>
          , J., ten
          <string-name>
            <surname>Teije</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bal</surname>
            ,
            <given-names>H.E.</given-names>
          </string-name>
          :
          <article-title>A Compact In-Memory Dictionary for RDF Data</article-title>
          . In: ESWC (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Erxleben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , Gu¨nther,
          <string-name>
            <surname>M.</surname>
          </string-name>
          , Kro¨tzsch,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Mendez</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          , Vrandecˇic´, D.:
          <article-title>Introducing Wikidata to the Linked Data Web</article-title>
          . In: ISWC. pp.
          <fpage>50</fpage>
          -
          <lpage>65</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Ferna´ndez, J.D.,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knuth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Evaluating query and storage strategies for RDF archives</article-title>
          .
          <source>In: SEMANTICS</source>
          . pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Gala´rraga, L.,
          <string-name>
            <surname>Jakobsen</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
          </string-name>
          , T.B.:
          <article-title>Answering Provenance-Aware Queries on RDF Data Cubes Under Memory Budgets</article-title>
          . In: ISWC (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Gala´rraga, L.,
          <string-name>
            <surname>Mathiassen</surname>
            ,
            <given-names>K.A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>QBOAirbase: The European Air Quality Database as an RDF Cube</article-title>
          .
          <article-title>In: ISWC (Posters, Demos &amp; Industry Tracks)</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <year>1963</year>
          .
          <article-title>CEUR-WS.org (</article-title>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Haeusler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trojer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kessler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farwick</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nowakowski</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breu</surname>
          </string-name>
          , R.:
          <article-title>ChronoGraph: A Versioned TinkerPop Graph Database</article-title>
          . In: DATA (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>E.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lissandrini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghose</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Løkke</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomsen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Transparent Integration and Sharing of Life Cycle Sustainability Data with Provenance</article-title>
          .
          <source>In: ISWC</source>
          . pp.
          <fpage>378</fpage>
          -
          <lpage>394</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Foundations of RDF★ and SPARQL★: An Alternative Approach to StatementLevel Metadata in RDF</article-title>
          . In: AMW (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Herna´ndez,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Riveros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Rojas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Zerega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.: Querying</given-names>
            <surname>Wikidata: Comparing</surname>
          </string-name>
          <string-name>
            <surname>SPARQL</surname>
          </string-name>
          , Relational and
          <string-name>
            <given-names>Graph</given-names>
            <surname>Databases</surname>
          </string-name>
          . In: ISWC (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elmore</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parameswaran</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          :
          <article-title>OrpheusDB: Bolt-on Versioning for Relational Databases</article-title>
          .
          <source>PVLDB</source>
          <volume>10</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1130</fpage>
          -
          <lpage>1141</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kashliev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Storage and Querying of Large Provenance Graphs Using NoSQL DSE</article-title>
          . In: BigDataSecurity/HPSC/IDS. pp.
          <fpage>260</fpage>
          -
          <lpage>262</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kaufmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>May</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kossmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Benchmarking Bitemporal Database Systems: Ready for the Future or Stuck in the Past? In: EDBT (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Mitchell,
          <string-name>
            <surname>T.M.</surname>
          </string-name>
          , et al.:
          <article-title>Never-Ending Learning</article-title>
          .
          <source>In: AAAI</source>
          . pp.
          <fpage>2302</fpage>
          -
          <lpage>2310</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodenreider</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          :
          <article-title>Don't like RDF reification?: making statements about statements using singleton property</article-title>
          .
          <source>In: WWW</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Pediaditis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flouris</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fundulaki</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christophides</surname>
          </string-name>
          , V.:
          <article-title>On Explicit Provenance Management in RDF/S Graphs</article-title>
          . In: TAPP (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pelgrin</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , Gala´rraga, L.,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Towards Fully-fledged Archiving for RDF Datasets</article-title>
          .
          <source>Semantic Web Journal</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          , G.:
          <article-title>YAGO: A Large Ontology from Wikipedia and WordNet</article-title>
          .
          <source>J. Web Semant</source>
          .
          <volume>6</volume>
          (
          <issue>3</issue>
          ),
          <fpage>203</fpage>
          -
          <lpage>217</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Tanon</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.:</given-names>
          </string-name>
          <article-title>YAGO 4: A Reason-able Knowledge Base</article-title>
          . In: ESWC. pp.
          <fpage>583</fpage>
          -
          <lpage>596</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karnstedt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Comparing data summaries for processing live queries over Linked Data</article-title>
          .
          <source>World Wide Web</source>
          <volume>14</volume>
          (
          <issue>5-6</issue>
          ),
          <fpage>495</fpage>
          -
          <lpage>544</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Liu,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            , Zhang, W.,
            <surname>Liu</surname>
          </string-name>
          , L.:
          <article-title>TripleBit: a Fast and Compact System for Large Scale RDF Data</article-title>
          .
          <source>PVLDB</source>
          <volume>6</volume>
          (
          <issue>7</issue>
          ),
          <fpage>517</fpage>
          -
          <lpage>528</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>