<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Wisent: An In-Memory Serialization Format for Leafy† Trees</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hubert Mohr-Daurat</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Holger Pirk</string-name>
        </contrib>
<aff>Imperial College London</aff>
      </contrib-group>
      <abstract>
<p>Efficient data exchange is one of the key ingredients for high-performance, composable data management systems. Efficient data exchange formats exist for “flat” (i.e., relational) data. For nested data, however, users must resort to formats such as XML, JSON or their binary counterparts such as BSON or CBJSON. These formats provide the flexibility to store metadata as well as actual data but lead to costly serialization and deserialization. This makes them unfit to represent flat data, forcing users to combine flat formats (Arrow, Parquet and even CSV) for data with JSON or XML documents for metadata. This practice, known as “Data Packages”, is error-prone, labor-intensive and increases system complexity. To address this problem, we propose Wisent, a new exchange format designed to represent nested data while efficiently encoding the “flat” sections of the tree. We call such trees “leafy”. To this end, Wisent serializes trees breadth-first rather than using the conventional depth-first serialization. We found that breadth-first serialization enables lazy decoding during tree navigation and in-place, conversion-free data processing of the flat sections using tight loops. We implemented a Wisent deserializer in C++, which outperforms state-of-the-art data serialization protocols by several orders of magnitude. Wisent is not just faster to encode and decode but also much simpler to implement: to demonstrate that aspect, we implemented Wisent decoders in Swift and Python and found that each can be implemented in roughly 100 lines of code while achieving performance that is dominated only by the overhead of the respective host language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
<p>† we use the term “leafy” to say that most nodes are leaves or nodes that only have a single child which, itself, is a leaf.</p>
    </sec>
    <sec id="sec-2">
<title>1. Introduction</title>
      <p>Processing such trees, however, is expensive, as parsers must assume that every node has children even if most of them are merely leaves and, therefore, without children.</p>
<p>Expensive parsers are acceptable when application performance is bound by interconnect bandwidth: when loading data through Ethernet or from slow disks, e.g., a parser need not process more than hundreds of megabytes per second. However, with fast interconnects like InfiniBand, NVMe or inter-process/inter-container communication becoming more commonplace, parsing is a likely bottleneck for many applications.</p>
<p>To address this problem, we propose Wisent, a new exchange format that allows the CPU-efficient encoding and, in particular, decoding of leafy trees. Wisent is designed with five objectives:
• it shall support the representation of nested data
• it shall minimize the storage overhead for leaves
• it shall allow efficient serialization, deserialization and in-place processing
• it shall support lazy/selective access to parts of the data without requiring the deserialization of the entire tree
• it shall be simple enough to not require the use of a parsing library but allow the implementation of an efficient parser in fewer than 100 lines of code in a modern programming language like Python or Swift</p>
    </sec>
    <sec id="sec-3">
<title>-</title>
      <p>Note that we intentionally do not define space efficiency as an objective. As the use cases for Wisent are those with high-bandwidth communication channels (many of which even support random access), space efficiency is secondary. In addition, as lazy access is one of Wisent’s objectives, space overhead potentially affects disk space and memory address space but only marginally affects physical RAM usage.</p>
<sec id="sec-3-1">
        <title>2. Data Serialization: State-of-the-Art</title>
        <sec id="sec-3-2-1">
          <title>Flat Data Serialization</title>
<p>
            A number of storage formats have been proposed for flat data serialization. While Apache Arrow [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] only allows in-process transfer, the project integrates the Feather [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] columnar serialization format, designed for short-term inter-process communication (such as network sockets or shared memory). Apache Parquet [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] and ORC [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] also use a decomposed storage layout but are designed for longer-term storage on disk. In contrast, Apache Avro [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] uses an n-ary storage layout to serialize data to disk.
          </p>
<p>None of these formats are designed to embed metadata and other custom nested data. For this use case, they require using another, nested-data-oriented serialization format side by side. The nested data commonly embeds filenames or URLs for the related flat data resources.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
<title>-</title>
      <p>LeanStore [7] uses a fragmented data layout and is implemented with intricate techniques to avoid the cost of serialization and deserialization from and to disk. However, the type of data handled by LeanStore’s serialization is limited to flat data and B-trees.</p>
      <sec id="sec-4-1">
        <title>Nested Data Serialization</title>
<p>
          JSON [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is the most common text-based format for nested data serialization. The text representation makes it straightforward to integrate into any programming language but comes with a high runtime cost for serialization and deserialization. Binary formats such as BSON [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], Protocol Buffers [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Thrift [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] are common performance-oriented alternatives. Protocol Buffers and Thrift efficiently serialize and deserialize custom nested data structures but require compiling the data structures into a target language’s code. All these serialization formats use a depth-first tree representation, which is ill-suited to the efficient serialization and deserialization of leafy trees, such as trees embedding flat data. With Wisent, we propose a breadth-first tree representation that is efficient for this use case.
        </p>
<p>
          Substrait [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is a format specification designed to serialize trees of relational operations; the reference serialization protocol is implemented on top of Protocol Buffers. Flat data, such as tables, is referenced rather than embedded in the nested data, as Wisent allows.
        </p>
        <sec id="sec-4-1-1">
          <title>3. Design Principles</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
<title>-</title>
      <p>Three key design principles distinguish Wisent from state-of-the-art formats like JSON, BSON or XML: breadth-first flattening (where existing formats perform depth-first flattening), decomposition of data and structure (where existing formats mix the two) and a focus on simplicity of parsing (where existing binary formats focus on compactness). Let us discuss those principles in the following.</p>
<p>Breadth-first Flattening. At the heart of every file format for hierarchical data (JSON, BSON, XML, ...) lies a transformation we call “flattening”: the transformation of a hierarchy of “nodes” into an equivalent one-dimensional array of data items describing both the content as well as the structure of the tree. To the best of our knowledge, every serialization format for hierarchical data prescribes a depth-first flattening of the tree. While this corresponds to the “natural” way of thinking about nested data (i.e., as documents), depth-first flattening, by definition, scatters the children of a node throughout the (serialized) output buffer. This leads to two related problems: first, it requires some form of list of “pointers/indices/offsets/markers” to represent the children of a node (in XML, the closing tag plays that role). This leads to substantial storage overhead for leafy trees. Second, processing all children of a node requires resolving those markers. While iterating through the children by looking for closing tags is particularly egregious, even if the children are represented as offsets in an in-memory array, the random memory accesses required to gather the leaf values are unnecessarily costly.</p>
      <p>[Figure 2: an example tree with nodes such as Document, Data, File, Id (17), Title (“Population”), Content and Name.]</p>
      <p>Breadth-first flattening, in contrast, places the children
of a node in a single contiguous memory area. This not
only eliminates the need for per-child pointers or markers
in favor of a pair of pointers (one to the first, one to the
last child). It also ensures that all children can be accessed
with optimal locality.</p>
<p>Decomposition of Data and Structure. State-of-the-art formats for hierarchical data mix the representation of the structure of the tree (opening/closing tags, sibling offsets, end-markers, etc.) with the data itself (attribute values, text nodes, labels, etc.). This leads to a format in which the location of any data item (inner node or leaf) depends on the structure and data that appear before it in the output buffer. It is, for all practical purposes, unpredictable. This unpredictability is what requires a JSON/XML parser to parse the entire subtree of a node only to access its right sibling.</p>
<p>In contrast, we propose to decompose the structural elements from the data elements of a tree: data elements/values are flattened (breadth-first) into a buffer without any indication of their exact position in the tree. The parent-child relationships are encoded in a separate buffer. While such separation leads to worse locality for positional/navigational access to inner nodes, it allows access to the children and siblings of a node without having to parse other parts of the tree. This, in turn, enables “lazy parsing”, i.e., the navigation to a specific node without having to parse any nodes other than those along the path.</p>
<p>Simplicity over Size. The primary objective of most binary serialization formats is the minimization of the size of the result. This focus encourages several sophisticated techniques, such as using terminators instead of size fields, the dense packing of values and even data compression. Implementing this arsenal of techniques makes parsers complicated and hard to write, which is why most users rely on external libraries for parsing. Such external libraries often constitute optimization boundaries for the compiler: to manage the complexity of the optimization process, the compiler optimizes these libraries internally but not in the context of the code they are used in. This leads to substantial runtime overhead.</p>
      <p>The design we propose, on the other hand, favors
simplicity over output size. It is simple enough to allow the
implementation of a (non-verifying) parser in fewer than
100 lines of code in a modern (i.e., reasonably succinct)
programming language. This allows users to integrate
the parser with the application code rather than using
an external library.</p>
      <sec id="sec-5-1">
        <title>4. The Wisent Data Format</title>
      </sec>
    </sec>
    <sec id="sec-6">
<title>-</title>
      <p>To illustrate the Wisent serialization format, consider the serialization of the example tree (Figure 2) in Figure 3.</p>
<p>The buffer/file is divided into four sections: an Argument Vector that holds the breadth-first flattened representation of every node in the input tree, a Type Bytefield encoding the types of the values in the Argument Vector, a Structure Vector that reflects the structure of the tree and a String Buffer that holds the content of every reference type (strings, node names, etc.). Let us, now, discuss each of these components in more detail.</p>
      <p>[Figure 3: the serialized example tree, showing the Argument Vector, Type (Tag) Vector, Structure Vector and String Buffer.]</p>
<p>In Wisent, data is serialized breadth-first: all nodes of a given level are stored in a consecutive memory region. The color-coded levels in Figures 2 and 3 illustrate how all nodes in a level are stored in a contiguous memory region. Value nodes (ints, floats) are stored directly in the Argument Vector, aligned to CPU words (we discuss sub-word alignment later in this section). Reference types (currently only strings) are represented by an offset into the String Buffer. Expressions (i.e., inner nodes) are represented as offsets in the Expression Vector.</p>
      <p>For illustrative purposes, argument types are represented as labels in Figure 3: integers as int, floats as flt and inner nodes as inr. In the implementation, however, types are represented in a Type Vector that is separate from but aligned with the Argument Vector (i.e., the value at position i in the type array encodes the type of the argument at position i in the Argument Vector). To simplify alignment, we currently use entire CPU words to encode types but might reduce this to 1-byte words in the future.</p>
      <p>The Structure Vector is an array of triples of offsets: one pointing into the string buffer indicating the label of the node, one pointing to the first child in the Argument Vector and one pointing to the last child. As all children of a node are stored in a consecutive memory region, no pointers to individual children are required.</p>
      <p>The String Buffer is a mere sequence of (null-terminated) strings.</p>
      <p>As in Depth-First Serialization, Breadth-First Serialization starts from the tree’s root and traverses it towards the leaves. Rather than descending a single path to each of the leaves one after the other, however, Breadth-First Serialization traverses all paths simultaneously. This requires maintaining a list of references to all nodes at a level while traversing the tree. The size of this “serialization frontier” is bounded by the breadth of the tree, i.e., the maximum number of nodes at any level (not counting leaves). While this can be large in theory, the leafy trees we focus on have many leaves but few inner nodes. This makes the size of the serialization frontier a minor concern.</p>
      <p>4.1. Opportunistic Type Run-Length-Encoding</p>
      <p>While Wisent, as described so far, encodes leafy trees in a succinct and efficient representation, it suffers from a substantial performance hazard: it represents the type of every argument individually, i.e., without exploiting the likelihood of siblings having the same type. The array of floats in the leaf of Figure 3, e.g., would be accompanied by a Type Vector of equal size.</p>
      <p>While the waste of space is only a minor concern, outweighed by the advantages of the representation (fast type resolution and good locality of the values), the runtime cost of determining every argument type individually is a major concern. Most importantly, it prevents simple, efficient processing of the arguments (e.g., for(int i=0; i&lt;size; i++) sum+=input[i]).</p>
      <p>To address this problem, we propose to opportunistically Run-Length-Encode the types as follows: if a node has five or more (adjacent) children of the same type, the type tag of the first value has its most significant bit set (a bit that is otherwise unused). Further, the type tags of the following four values are interpreted as a single 32-bit integer encoding the length of the run of identically typed arguments.</p>
      <p>Opportunistically encoding a run of types like this does not impact the space efficiency of the format. It does, however, allow the efficient processing of the argument values without the need to determine the type of every argument individually – a major performance gain.</p>
      <p>5. Wisent Encoding and Decoding</p>
      <p>Like most data exchange formats, Wisent should be easy and efficient to write and, more importantly, read. We focus on two distinct scenarios for encoding and one for decoding: encoding fragmented trees (i.e., in-memory data structures made up from values and pointers), encoding from depth-first trees (i.e., converting existing JSON documents), and lazily decoding for the purpose of data analytics.</p>
      <p>5.1. Encoding from Fragmented Trees</p>
      <p>For the purpose of this paper, we aimed to encode all four fragments into a single buffer. While this is not a strict requirement of the format, it simplifies memory management in a shared-memory scenario and allows serializing into a file if required. To this end, the Wisent encoder makes two passes over the input tree: one to calculate the number of nodes and inner nodes (which determine the sizes of the Argument, Type and Expression Vectors) and one to perform the serialization itself. As Wisent is optimized for leafy trees, the first traversal is an insignificant cost factor.</p>
      <p>While the number of nodes in the input tree, and thereby the size of the first three sections, can be determined efficiently, determining the size of the string buffer is costly: it requires determining the length of every string node (i.e., searching for its null-terminator). While finding the null terminator is inevitable when copying strings into the buffer, it is expensive and unnecessary to calculate the length twice (once for allocation and once when filling the buffer). Instead, we use the Operating System’s ability to re-allocate, i.e., increase the size of a previous memory allocation.</p>
<p>5.2. Encoding from Depth-First Trees</p>
      <p>Unlike fragmented trees, depth-first flattened trees provide no direct access to entire tree elements: they are intended to be parsed by scanning the entire structure from the root to the leaves. Accessing a node child other than the first requires scanning the whole structure from the current to the target position. A naïve implementation of the serialization would require either a full copy or redundant access to the tree structure. To efficiently process the tree without this overhead, serialization is instead implemented with a two-pass approach.</p>
      <p>The first pass is not only necessary for calculating the size to allocate the buffer but also for the starting position of the nodes at each level. During this parsing, the number of nodes at each level is calculated. Next, the per-level node count of the arguments is prefix-summed to deduce the starting position at each level.</p>
      <p>In the second pass, the arguments are again scanned in depth-first order and scattered into their appropriate position in the Argument Vector. Equally, the expressions are stored in breadth-first order in the Expression Vector.</p>
      <p>Both passes are computationally efficient because they are sequential scans of the input tree’s memory buffer. In terms of memory usage, they require a minimal state: the first pass requires constructing a list whose size is the depth of the tree; the second pass requires two more stacks of the same size (one for the argument’s next position and one for the expression child’s current position) and two more values (one to keep track of the current level and one for the current position in the Expression Vector).</p>
      <p>6. Evaluation</p>
      <p>6.1. Experimental Setup</p>
      <p>To assess the benefits of the Wisent serialization format, we evaluate the performance of various deserializer implementations for Wisent and compare them with JSON and BSON deserializers. We also compare a C++ serializer implementation with JSON and BSON serializers.</p>
      <p>Wisent Deserializers. We implemented three deserializer backends: C++, Swift and Python. The C++ deserializer is naturally implemented with pointer access to the buffer and casts to the underlying types. Traversing an expression’s arguments is implemented with standard iterators, which makes them compatible with any std functions. The Swift deserializer implementation uses UnsafePointer&lt;Int8&gt; and manual pointer arithmetic to extract pointers to sections of the buffer, Swift-native types and structs to access elements of sections and (lazily generated) Sequences to access arguments. The Python deserializer implements the argument traversal using the generator pattern. Raw buffer data is extracted to the underlying types with the help of the Python modules struct and ctypes.</p>
      <p>Wisent Server. To demonstrate inter-process communication with the Wisent format, a Wisent server application implements the serialization of JSON files and CSV files into the Wisent format and stores the data in shared memory. The client benchmark application reads the data from shared memory and performs the deserialization.</p>
<p>Baseline Serialization Formats. We compare the Wisent serialization format with three other methods: first, reading data directly from raw JSON and CSV files (i.e., no serialization); second, embedding both metadata (from JSON) and columnar data (from CSV) into a single JSON document; third, embedding metadata and columnar data into JSON and then serializing the JSON object into the BSON format, i.e., a binary format alternative to JSON.</p>
<p>
        Baseline Deserializers. Besides the serialization formats, we integrated several baselines for the implementation of the JSON deserialization: nlohmann JSON version v3.11.2 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], RapidJSON version v1.1.0 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and simdjson version v3.2.0 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In addition, we used RapidCSV version v8.75 for loading data from CSV files.
      </p>
<p>Hardware. All experiments are performed on a server with two Intel Xeon Silver 4114 2.20 GHz CPUs, each with 10 physical cores, a 14 MB LLC and 196 GB of memory. We use Ubuntu 22.04 with Linux kernel 4.15.0-212-generic and compile all code with Clang version 14 using the compiler flag -O2.</p>
<p>5.3. Lazy Decoding</p>
<p>Wisent documents are designed to allow lazy decoding, i.e., to minimize the amount of unnecessary data that needs to be decoded to access a specific node (identified by its path from the root). Navigating to a specific node starts at the root, i.e., the first argument in the Argument Vector (virtually always an inner node). Descending to one of its children requires dereferencing its value (an offset into the Structure Vector) to get its first child. Accessing a child by its position is a simple arithmetic operation (i.e., adding the position to the index of the first child).</p>
<p>Descending one level, thus, requires only two memory accesses (plus one to verify the node type as “inner” if required). If the child is specified by name rather than position, descending one level requires a loop over the children. However, as the children offsets, the structure triples and the strings designating their names are located in (respective) contiguous memory regions, the data accesses within that loop have optimal locality.</p>
<p>Once a node is found, iterating over its children (e.g., to operate on its leaves) is as simple as running an iterator from the first to the last offset. If the types are Run-Length-Encoded (see Section 4.1), there is no need to check children’s types, making the processing as simple as iterating over a plain C-array in memory.</p>
<p>6.2. Deserialization Performance</p>
      <p>To assess the performance of the Wisent deserializer implementation, in this experiment we vary the dataset size from scale 1/256 (i.e., 951 KB) to scale 8 (i.e., 2 GB) and run an aggregation on one column of floating-point values.</p>
      <p>We measure both the runtime and the memory usage.</p>
<p>Runtime. The result in Figure 4 shows that the Wisent deserializer outperforms the baselines by two to four orders of magnitude; the breadth-first approach in Wisent, which keeps data contiguous in memory, together with the RLE optimization, brings significant runtime performance benefits.</p>
<p>simdjson and RapidJSON show the next best performance due to ad-hoc optimizations for traversing JSON data.</p>
<p>6.3. Benefits of Lazy Decoding</p>
<p>To assess the impact of partial data reading on the performance of the Wisent deserializer implementation, in this experiment the aggregated column is filtered with a predicate on another column’s data. We vary the selectivity from 0.4% to 100%. To avoid side effects from the column data fitting into the CPU cache, the dataset is fixed to the largest size at scale 32 (i.e., 8 GB). We measure both the runtime and the memory usage. For this experiment, we choose RapidJSON as the baseline, which is the fastest implementation supporting the largest dataset (simdjson does not support such large JSON documents).</p>
<p>Runtime. The result in Figure 6 shows that Wisent benefits from columnar data stored contiguously in the buffer to improve data access, with a 20% performance benefit at the lowest selectivity compared with the maximum selectivity. Due to their different memory access patterns, the other implementations do not gain from low selectivity.</p>
    </sec>
    <sec id="sec-7">
<title>-</title>
      <p>
        Workload. All the experiments are run using the OPSD Weather dataset [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], modified to reduce the number of columns from 256 to 100 (to assess the scalability with a significant number of rows without being limited by a large number of columns). This dataset uses the Data Package format: the main file to load is a datapackage.json file, which contains metadata and the path to a CSV file for loading the columnar data. We also generate smaller dataset sizes (by iteratively removing every other row) and larger ones (by duplicating all the dataset rows). We, therefore, experimented with a range of data sizes from 951 KB to 8 GB.
      </p>
<p>Memory Usage. Figure 5 shows that the Wisent deserializer uses four to ten times less memory than the baselines for the smallest datasets. This gap drastically increases with larger datasets, with Wisent up to a thousand times more compact. At first glance, this result might seem surprising since Wisent is not implemented for compactness. However, Wisent’s support for lazy decoding (and in-place deserialization) explains why the physical memory usage is negligible even when the shared memory usage is not. In contrast, the other implementations require storing all data in temporary memory for reading, causing higher memory usage.</p>
<p>[Table 1: deserializer lines of code per language. Swift: 93, Python: 135.]</p>
    </sec>
    <sec id="sec-8">
<title>-</title>
      <p>Memory Usage. The result in Figure 7 shows that memory usage with Wisent is lower for lower selectivity due to lazy decoding. RapidJSON requires all data to be copied into temporary memory; its memory usage is two orders of magnitude higher regardless of the selectivity.</p>
      <p>6.4. Deserialization Implementation with High-Level Languages</p>
    </sec>
    <sec id="sec-9">
<title>-</title>
      <p>To assess the ease of implementing a Wisent deserializer in various languages and the ability to benefit from built-in abstractions, we implemented a deserializer in Swift and Python in addition to the C++ implementation. We also measured their performance compared to the C++ implementation.</p>
      <p>Ease of implementation. As shown in Table 1, only 93 and 135 lines of code are required to implement a deserializer in Swift and Python, respectively. This shows that there is no obstacle to implementing a Wisent deserializer in high-level languages that are not primarily intended for low-level data manipulation code.</p>
      <p>Reusability of built-in language abstractions. The helpers required to implement such a deserializer are standard and built into the three language implementations, as described in Section 6.1. This shows that, despite the code’s simplicity, a Wisent deserializer can readily benefit from the language abstractions, making the implementation composable and replaceable.</p>
      <p>Runtime Performance. Figure 8 shows the results of executing the aggregation query on the dataset at scale 1 (i.e., 250 MB). Swift is five times slower, and Python is 1400x slower than the C++ implementation. For Swift, this is due to missed opportunities to compile code as efficient as C++ (e.g., vectorization) and Swift’s current implementation of LazySequence, which adds unnecessary virtual function calls. For Python, this is due to the runtime overhead of interpretation and dynamic typing.</p>
      <p>6.5. Serialization Performance</p>
      <p>To assess the performance of Wisent serialization, in this experiment we run the serialization of the OPSD Weather dataset at scale 8 (i.e., 2 GB). We compare Wisent with the serialization of CSV data into a JSON file and the serialization of the JSON file into BSON (which includes the serialization from CSV to JSON as well). We also show the overhead of loading the CSV file separately since this cost is factored into all the implementations.</p>
      <p>[Figure 9: Serialization Runtime Performance, comparing CSV, Wisent, JSON and BSON.]</p>
      <p>As shown in Figure 9, Wisent is faster to serialize than JSON and BSON. The cost of Wisent serialization is primarily due to the CSV parsing. Without including the CSV parsing cost, Wisent outperforms JSON and BSON by factors of 1.4x and 3.7x, respectively. This result shows that there is no prohibitive cost to serializing into the Wisent format.</p>
      <p>7. Conclusion and Future Work</p>
      <p>We introduce a novel approach to the serialization of trees, specifically focusing on the efficient exchange of “leafy” trees through high-throughput channels like shared memory, RDMA or non-volatile memory. We propose to serialize such trees breadth-first rather than depth-first, which is the de-facto standard. We demonstrate that, when combined with the decomposed storage of data, structure and variable-sized datatypes, breadth-first serialization enables lazy decoding during tree navigation and in-place data processing without the need for parsing or conversion.</p>
      <p>While we outline the key ideas of Wisent in this paper, we still consider this effort a work in progress. Moving forward, we plan to address some technical issues with Wisent, most importantly by supporting the sub-word alignment of types (which is required to, e.g., efficiently process single-precision floating point values). Further, we will study Wisent in a wider context, e.g., a distributed processing scenario. Finally, we plan to integrate Wisent into existing systems. As virtually all high-performance systems store in-memory data in C-arrays, this integration will be possible without conversion overhead.</p>
    </sec>
  </body>
  <back>
  </back>
</article>