<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deduplication in Multi-Tenant Database Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Technische Universität Dresden</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jannis Kowalick</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Krause</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dirk Habich</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang Lehner</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Improving the memory density of in-memory columnar databases is a necessary step towards more eficient sharing of hardware in multi-tenant systems. Many general-purpose tools attempt to achieve this by scanning systems for duplicate memory regions, and deduplicating these redundancies. In this paper, we show that existing tools sacrifice eficiency for general-applicability, and outline the possibilities unlocked by tapping into the architectural guarantees, and existing metadata for in-memory databases. Based on our previous work on deduplication, where we used a hardware-accelerated delta mechanism to increase memory savings, we detail a process for semantics-based steering of this mechanism. This significantly reduces the search space for duplicates, by only comparing database columns which exhibit similar features. Furthermore, we gracefully handle cases where steering causes the comparison of mismatched columns. We benchmark our solution against the widely available Linux Kernel Samepage Merging (KSM) tool, evaluating both based on their achieved memory savings and execution speed for deduplicating the Internet Movie Data Base (IMDB) and Star Schema Benchmark (SSB) datasets. Our solution manages to perform an order of magnitude faster than KSM, while detecting close to the same amount of duplicate data.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-tenancy</kwd>
        <kwd>in-memory databases</kwd>
        <kwd>OLAP</kwd>
        <kwd>Data Streaming Accelerator</kwd>
        <kwd>memory-deduplication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Many years ago, hosting and administrating databases was a necessary requirement for organizations
building their digital infrastructure. Today, this work can be delegated via Database as a Service (DBaaS)
oferings, or indirectly through Software as a Service (SaaS). Backing these systems,
multi-tenant
databases are often used, consolidating many customers (tenants) into the same database instance, while
upkeeping isolation guarantees. This improves the overall resource utilization, since each tenant does
not need to provision separate hardware for their own maximum load [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the case of in-memory
analytical database systems, this eficient resource utilization is not necessarily a given, since memory
capacity cannot be multiplexed across tenants like processing power can [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Hence, there is a hard
limit to how many tenants can be co-located on a single machine.
      </p>
      <p>Data De-Duplication (DeDup) is a promising strategy to save memory in such systems. If the same
object of data is redundantly stored by multiple tenants, or in multiple locations by the same tenant,
it can be collated into a single one, freeing up space. Even though each tenant may have their own
completely unique schema, there are practical situations in which such redundancies appear:
• Multiple tenants can own copies of the same common datasets, such as currency/country
information and date dimensions.
• Some tenants set up development and testing copies of their live database.
• Engineers create snapshots of tables for retrospective analysis.
• Intermediate stages of ETL processes include copies of the same columns with each processing
step.
Germany</p>
      <p>CEUR
Workshop</p>
      <p>ISSN1613-0073
a
t
a
d
a
t
e
M
a
t
a
D
n
m
u
l
o
C</p>
      <p>
        In some of these situations, exact copies of tables or columns are redundantly stored, while in others,
data may only be similar or have been slightly updated over time. To efectively deal with both of these
scenarios, our prior work introduced Delta De-Duplication (De2Dup) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. By storing the columns of one
dataset as a delta in reference to a base dataset, we showed that even tables which diverged from their
initial identical state could still be eficiently deduplicated. At the core of this strategy is the De 2Dup
manager, a new database component tasked with orchestrating the delta deduplication and instructing
a specialized hardware accelerator to perform the bulk work.
      </p>
      <p>Until now, we have not specified how the De2Dup manager chooses which data to attempt
deduplication on. In this paper, we therefore propose a concrete strategy for steering delta deduplication
based on data similarity in in-memory analytical databases. As Figure 1 illustrates, the strategy exploits
the database’s common characteristics such as (1) per-column metadata to detect similarity, and (2)
logical data ordering to decide how to steer the accelerator towards (3) a simple pairwise comparison.
As a result, we are able to significantly limit the search-space for duplicates, and speed up the scanning
process by an order of magnitude compared to a general-purpose approach. Our contributions are:
• We analyze related work in the area of memory deduplication, by evaluating it under the lens of
in-memory database systems.
• We propose a metadata-based strategy for column similarity-detection.
• We benchmark the performance-benefits of our steered approach by directly comparing it to an
existing general-purpose deduplication solution.</p>
      <p>
        The paper is structured as follows: First, we summarize our own previous work in the area of
deduplication, which this paper seeks to expand. Then in Section 3, we characterize the state of literature
in the area of in-memory data deduplication, and assess its usefulness for in-memory databases. Then,
our proposed structure of a steered duplicate search in columnar databases is detailed, pointing out the
unique properties of this setting, which allow for significantly faster deduplication. Subsequently, in
Section 5 we describe the setup that is used to measure the concrete deduplication performance that
our steered approach enables, and interpret the benchmark results. Lastly, we conclude with a short
summary and propose possible future research.
2. De2Dup
This section recaps our previous work in the space of data deduplication [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and contextualize our
current contribution. Our past work is based on the same premise as Section 1, in that multi-tenant
databases may contain a significant amount of similar data, e.g., from copies of the same base-dataset
being edited by diferent tenants. To reduce these redundancies, we proposed an extension to current
database architectures, adding the De2Dup manager component. Its task is to choose columns with
potentially similar data, and to start the background-execution of their deduplication. Columns were
chosen as the granularity for orchestration, since they contain continuous ranges of memory, and have
significant metadata associated with them.
      </p>
      <p>
        As the deduplication executor, we used the Data Streaming Accelerator (DSA), an on-chip
hardwareaccelerator included in Intel’s recent Xeon server chips [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It supports a specialized create_delta
operation, generating a delta between two areas of memory. Using this operation, we showed that
significant memory savings are possible when delta deduplicating similar but not identical data,
compared to deduplication methods based on binary-equality. Furthermore, by ofloading the operation to
a hardware-accelerator, CPU resources are freed up for query execution.
      </p>
      <p>Subsequently, we used the delta-deduplicated data to measure its impact on query processing
performance. We compared two delta-reconstruction methods: Page Reconstruction (PR) re-materializes the
delta deduplicated column on the granularity of memory pages, by applying the deltas onto a copy of the
base page, and then uses this recreated page in normal query execution. Through benchmarks, we saw
that this model sufers from scalability issues with an increasing number of page reconstructions, even
if there are few changes per page. Our second method is Delta Weaving (DW), in which we redesign
our physical query operators to allow processing the base column and the corresponding deltas at the
same time, by alternating between them. For reasonable numbers of deltas, this approach performed
very well, with leeway to improve in the future.</p>
      <p>
        We want to note that, since our last contribution, the delta functionalities of the DSA have been
deprecated by Intel, and even disabled through a microcode update due to potential system instabilities in
its current implementation. Nonetheless, we see a huge importance in continuing this line of research,
since the most recent chips from Intel incorporate even more accelerators along the DSA (e.g.,
InMemory Analytics Accelerator (IAA), Quick Assist Technology (QAT), Dynamic Load Balancer (DLB)),
that hold potential for accelerating such workloads [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, by presenting our results, we
hope to explore which features the next generation of hardware accelerators could support.
      </p>
      <p>Based on our past results, we see De2Dup as a promising mechanism for memory-eficient databases,
with many more avenues to explore in the future. One such area left for future research is defining the
concrete workings of the De2Dup manager, and how it can find column pairs to deduplicate. This will
be the focus of our current contribution.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Related Work</title>
      <p>In this section, we highlight existing approaches to in-memory deduplication, first specifically in the
context of databases, and then for general-purpose solutions. With this, we will show how current
solutions do not match the requirements for in-memory database deduplication for analytical workloads.</p>
      <p>
        For many years, columnar databases have employed delta-encoding for sorted columns, encoding
each value as the numeric diference to the previous value [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Combined with bit-packing techniques,
this can be used to efectively compress columnar data in a lightweight way [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Note that these ”deltas”
are distinct from those referred to in the rest of this paper, since they are the logical result of numeric
subtraction, instead of a byte-by-byte diference. Though this technique is able to reduce redundancies,
it is limited to a local scope, and unable to deduplicate across column or schema boundaries. In
dbDedup [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], hash fingerprints are used to find similar data in a NoSQL database, and deduplicate it by
calculating a byte-by-byte delta. In this case, the stored values are mostly very large strings (e.g., the
content of a Wikipedia article), and workloads focused on point-queries, so applying the deduplication
on the sub-value-granularity proved efective. For our use-case of analytical databases, where there
tend to be many smaller values and queries require large table scans, this granularity would prove to be
problematic. We believe that a delta on the granularity of the value being stored (e.g. 64-bit integers) is
a much better fit for such use-cases.
      </p>
      <p>
        Beyond the area of databases, memory deduplication is a prominent research topic in the virtualization
space, with many focusing on consolidating more Virtual Machines (VMs) onto the same hardware [
        <xref ref-type="bibr" rid="ref8">8,
9, 10</xref>
        ]. The basic premise is that there are arbitrary ”guest” processes running in a virtualized sandbox
on a host machine. These processes are anonymous, and the host has no semantic knowledge about
the content of their memory allocations. Therefore, techniques must be developed to eficiently find
redundancies despite this lack of semantic knowledge. We see this research as being highly relevant to
our use-case, since it acts as the baseline for what potential gains are achievable through deduplication
in a general-purpose setting.
      </p>
      <p>In [9], Veni and Bhanu broadly classify VM deduplication techniques into three categories:
Conventional OS techniques, Para-virtualization-based techniques, and Memory scanning-based techniques.
Conventional OS techniques build upon concepts from the Operating System space, such as semantic
sharing and forks. Based on a common lineage of data, these enable multiple processes to share read-only
references to the same physical memory [11]. These concepts have been applied to VMs, leveraging
knowledge about the used base-image to ”fork” of multiple versions of a VM [ 12]. In the space of
databases, these concepts are used by Dageville et al. [13] and Aguilar Saborit et al. [14] to eficiently
clone existing database tables and schemas by only copying the metadata referencing the actual stored
data. However, since these concepts only work if data has a common lineage, they are not a good fit for
our scenario of isolated tenants. Furthermore, any later changes to said data will never be deduplicated,
even if redundancies exist, making it less efective for long-lived database processes.</p>
      <p>Para-virtualization-based techniques include any systems which require modification of the guest OS
to function. As a consequence, the guest OS is able to communicate semantic information about its
memory to the host, leading to more deduplication possibilities [10, 15]. However, having to modify
a (possibly closed-source) guest OS or application limits its applicability. For our work, the issue of
communicating semantics between tenant’s schemas and the De2Dup manager does not exist, since it
has full access to table metadata. This enables possibilities for steering the De2Dup process, which are
impossible in the generic VM space.</p>
      <p>
        Memory scanning-based techniques rely neither on common lineage nor communicated semantics. By
periodically scanning all VMs memory regions, they are able to not only find redundancies with no
further semantic knowledge about their origin, but also adapt to data changes over time. The Kernel
Samepage Merging (KSM) feature of the Linux kernel [16] is an example of such a scanning-based
technique. As a background process, it searches a subset of the system’s anonymous memory pages
for duplicates, and replaces any pair of matches with a single Copy on Write (CoW) page. To keep
the number of page comparisons low, KSM sorts all scanned pages into a tree data structure based on
their content. For any subsequent page, a simple look-up can confirm if duplicate content has already
been inserted. Though being completely application-agnostic makes KSM a very versatile tool, it also
adds a lot of complexity in its implementation. KSM must keep track of all mergeable pages in central
data structures, make sure pages aren’t frequently changing, and perform frequent rescans to detect
new redundancies. Nevertheless, we choose KSM as the baseline deduplication implementation, since
it is the most widely accessible deduplication tool. MDedup++ [9] extends KSM by steering it away
from pages which frequently change, leading to fewer CoW pages being copied. Furthermore, when
a matching page-pair is deduplicated, it prioritizes the comparison of the two logical next pages in
continuous memory, leading to much faster scan times depending on the types of allocations of the
workload. Especially for columnar databases, this exploitation of the spatial ordering of pages is highly
relevant, since large bufers of continuous values are common. Diference Engine [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] similarly keeps
track of frequently changing pages through heuristics, and applies diferent deduplication techniques
accordingly. Pages which were not recently written to are considered for sharing, while those not
recently read can be either delta-deduplicated or compressed. For delta-deduplication to be efective, a
similarity-based search process based on hash fingerprinting is introduced. Through this, the system is
able to eficiently deduplicate near-identical pages, further lowering memory usage. Lastly, XLH [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
focuses on improving KSM’s reaction time to new deduplication opportunities. By hooking into the
virtual file system layer of the VM, any memory pages which recently read from a file can be identified
and prioritized by KSM. As a result, short-lived sharing opportunities are detected more quickly.
      </p>
      <p>From all of these approaches, we can conclude that: (1) general purpose scanning approaches
need to deal with frequently changing data, (2) deduplication can be efectively steered based on data
ifngerprinting and semantic metadata, and (3) depending on the workload, scanning continuous memory
space can lower deduplication time. Based on these conclusions about VM deduplication, we argue
that in-memory columnar databases can benefit greatly from the existing research. Since columnar
databases often exhibit immutability guarantees in regard to their stored data, contain a plethora of
metadata about all tables, and extensively use continuous memory regions, many of the mentioned
heuristics are not necessary. Instead, a deeply integrated module can be designed, which exploits
these architectural guarantees and metadata to perform a fast, low-overhead scan. To the best of our
knowledge, this makes our deduplication approach the first to exploit this semantic knowledge for
deduplication in databases.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Steered Duplicate Search</title>
      <p>In this section, we want to introduce a concrete design for the De2Dup manager component, tasked
with steering our delta deduplication towards similar data.</p>
      <p>We assume the following setting: Consider an in-memory columnar database hosting a large number
of tables owned by diferent tenants, with a subset of those tables containing partially duplicate data.
When data is ingested, it is organized into a columnar storage format, which is potentially further
subdivided into a set of partitions, as is common in modern columnar databases [17]. The expressions
used for partitioning are explicitly set during table creation. Additionally, lightweight statistics about the
values contained in each column and partition are collected, including min/max values and
value/nullcounts. The De2Dup manager has access to all information regarding each tenant’s schema, table
ordering/partitioning, statistics, and modification timestamps via a central metadata store (visualized in
Figure 1).</p>
      <p>Based on this setting, our overarching goal is to eficiently find pairs of columns from diferent tables
which are as similar as possible — no matter if they originate from diferent tenant’s schemas, or from
the same schema. Generating such pairings based on database schemas and the data they include is
already a broad research topic called Schema Matching (SM) [18]. Some of the main application areas of
SM are the mapping of columns from a data source to those of an existing data warehouse structure (data
integration), and autofilling forms in web-browsers [ 18]. For these use cases, many approaches have
been proposed, utilizing similarity metrics based on table properties, the semantics of column names
(e.g., via word embeddings), min/max values, and statistical distributions [19, 18, 20]. However, we
believe our use-case to be distinct from existing solutions for several reasons. Firstly, in SM approaches,
it is often desired to find a single definitive mapping between two schemas, since this is what is needed
in most of its use-cases [21]. In contrast, we are attempting to find as many similar columns as possible
across a large range of schemas. Moreover, our setting exhibits a large tolerance to false-positives, since
a mismatch will merely cause a quickly aborted table scan, instead of incorrect data mappings. Lastly,
SM approaches deal with the dificulties of mapping schemas from completely diferent systems, while
ours is more homogenous. For example, usually naming schemes might be completely diferent across
systems, leading to the need to match columns like postal_code and ZIP together based on semantics.
In our case, many examples of duplicate data stem from a common dataset (e.g., database table/dump,
parquet file), which contain consistent table/column names. These names can of course be changed,
e.g., during data integration, or to align with internal naming standards, though this will often be
purely syntactical. For instance, the table currencies may be ingested as dim_currency, snapshotted
as currencies_20250715, or further processed into currencies_cleaned. For these reasons, we opt
to employ an adapted three-step process to find similar column pairs.</p>
      <p>Column Search First, for each column in our database, we generate a ranked list of  similar columns.
To achieve this, each column is represented by its features in the following structure:
(fully_qualiifed_column_name, count_values, count_nulls, min, max) . A distance between two of these structures is
calculated using a weighted sum of all the individual distances between their elements. For string types,
we employ the Levenshtein distance to capture syntactic similarity, and for numeric types simply the
absolute diference. To find the  closest matching columns, we enumerate the distances to all other
columns of the same data-type and save the  best results.</p>
      <p>Partition Alignment Next, for each pair of similar columns, we further match up all partitions from
both in pairs, based on their partition keys. Splitting the process up by partition ensures that insertions
in a subset of partitions (e.g., appending new data for a certain month) do not misalign the further
comparison process. However, a column could also consist of a single partition, which allows skipping
this step, since the alignment is inherently given.</p>
      <p>
        Delta Scan Lastly, for each pair of partitions found, we submit a set of delta operations to the DSA to
perform the comparison on a granularity of 4 KiB sized memory pages. These operations will encounter
one of three scenarios during the page scan: In the first case, the found pages are exactly equal, in which
case no deltas are generated, one of the pages becomes dereferenced and is freed. On the other end of the
spectrum, the found pages could be a complete mismatch. This can be due to the column search yielding
a false-positive, or because the page has been significantly updated from its original form. In this case,
the submitted delta operation returns early as soon as it reaches a configured threshold of deltas, and
the page pair is not deduplicated. Using this early return as a heuristic, we can additionally abort any
such false-positive matching after the first few pages of the partition, saving additional processing
power. Lastly, if the pages are mostly equivalent, with only a few diferences, the resulting deltas are
stored, and the underlying page is freed. At this point, when a delta deduplicated page is touched by a
query, the content needs to be reconstructed by either PR or DW, which happens transparently for the
user [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To decide which page should be used as the baseline, and which should be encoded as deltas,
we choose whichever partition has remained unedited for longer based on modification timestamps. We
argue that more ”stable” pages are a better suited candidate to serve as a baseline for delta generation,
since any change to the base page would force all deltas referencing it to be updated.
Data Changes In the case of KSM, pages are marked as CoW, i.e. a single update nullifies the saving
efect of a complete page. In our case, we can approach this issue threefold. One possibility is to employ
PR one any update, insert the update and consequently perform a new run of De²Dup. However, this
approach is rather expensive, as it requires significant computation time from the DSA. The second
approach is analogous to DW: We probe the original page at the updated position and if the updated
value difers from the original value, we create a new delta or update a delta, if one already existed for
this position. Thirdly, we could reconstruct the page on the first update and only apply De 2Dup after a
certain time, e.g., after a transaction ends or after a certain amount of time has passed.
Parameter Selection Our approach is configurable in three ways. The number of most similar
columns  to find for each candidate column can be set based on the amount of computing power we
are willing to spend on delta deduplication. With  = 1 , we attempt delta deduplication with only the
best matching column, while with higher  , more and more opportunities can be explored. Next, the
matching itself works based on a weighted sum of the distances between all features of two columns. By
setting these weights  , we are able to grant higher importance to matching statistics, or to matching
naming schemes. Concrete weights will depend on the properties of the given schema, and could best
be determined based on tests with a real-world dataset. Such a dataset is currently not available to
us, so the weights used in the following evaluation can be chosen freely, since matching column pairs
will be exactly equal on all features. Lastly, the maximum number of deltas which should be generated
before the DSA aborts the operation can be configured. As we have shown before, above a threshold of
80% of two pages being diferent, the generated deltas consume more space than the original page [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
due to the overhead of indexes stored in the delta representation. Furthermore, processing large deltas
does hurt query performance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], so a threshold of 25% deltas is likely a good starting point. Again, this
storage savings/performance tradeof should be made based on real-world data the system will run on.
      </p>
    </sec>
    <sec id="sec-4">
      <title>5. Evaluation</title>
      <p>This section evaluates the applicability of our steered approach and how our tailored column selection
performs against KSM. Since KSM is a kernel background task, we cannot measure distinct components
like the identification of matching pages or the isolated deduplication task. Hence, we evaluate both
KSM and our approach based on their efective savings and runtime in an end-to-end experiment.</p>
      <p>
        Our steering approach is supplied with pre-existing matching features for all tenant’s columns
in our test datasets. This includes all semantic information and statistics listed in Section 4. Based
on this data, each column in the dataset is matched up with another column deemed most similar.
Subsequently, work-descriptors are dispatched to the DSA, generating deltas for the two columns [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
The total memory savings of this method are measured as the number of pages processed by it, minus
the additional space used to store delta records. Our time measurements begin with the start of the
matching process, and end when the last page has been scanned by the DSA.
      </p>
      <p>To evaluate KSM, we configure our environment in such a way, that the only bufers available
for KSM to scan are our datasets, and no other system processes exposes any memory marked as
”mergeable”. Additionally, we set KSM’s configuration variables to utilize 100% of a single CPU core
(advisor_max_cpu = 100), not sleep in-between scans (sleep_millisecs = 0), and process all pages of
the dataset during a single scan (e.g., pages_to_scan = 2700000 for a 10 GiB dataset)1. Note that the
convergence speed of KSM is heavily influenced by these settings, and the reported numbers should
thus be considered as a possible lower bound of its runtime. In real world scenarios, it would likely be
higher due to its nature as an unobtrusive background process. We can measure the time KSM took to
process the entire dataset between enabling the tool, and it’s reporting of no pages left to scan. With
these steps, we are able to treat KSM as if it were an on-line foreground process running on a single
core. Its achieved memory savings are reported through a Linux sysfs interface, which we monitor.</p>
      <p>Next, we specify the setting both approaches will execute in. In our testing-setup, one of two
datasets is loaded into memory: The Internet Movie Data Base2 (IMDB) dataset, used in the Join Order
Benchmark (JOB) [22], consists of 21 tables comprised of a total of 109 columns. Secondly, the Star
Schema Benchmark (SSB) dataset, with a scale-factor of 10 [23], which contains 5 tables and 58 columns.
Both dataset’s columns are separately encoded in either the Apache Arrow [24], or Apache Parquet
format [25]. This choice was made to represent data encoded in a typical format for in-memory databases
(Arrow), as well as a format representing compressed in-memory data (Parquet). Consequently, we
have four possible setups: IMDB Arrow (6.0 GiB), IMDB Parquet (857.8 MiB), SSB Arrow (9.5 GiB), and
SSB Parquet (1.3 GiB). To emulate another tenant with duplicate data being present, we additionally
create an exact copy of the given dataset, and register a second set of column features for our matching
algorithm. To show the efect of slowly diverging datasets on both methods, we additionally have the
option of injecting the copied dataset with an increasing amount of (completely) modified pages.</p>
      <sec id="sec-4-1">
        <title>Hardware and Software</title>
        <p>All benchmarks are executed on an Intel Xeon w5-3425 CPU, running Ubuntu 24.04.2 LTS with kernel
version 6.8.0-60-generic and microcode version 0x2b000639.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Schema Matching</title>
        <p>First, we present the runtime of our column matching algorithm in an isolated experiment. Figure 3
shows the runtime for matching performed on our duplicated tenant schemas. As outlined in Section 4,
we compare fully qualified column names (FQCN) and column statistics to determine possible candidates,
which ultimately leads to an ( 2) complexity, as every column is probed against every other. However,
given the small number of columns that must be compared when either dataset is present in isolation,
the runtime is suficiently small to be considered negligible. The scaling becomes an obvious issue if
1https://docs.kernel.org/admin-guide/mm/ksm.html
2https://imdb.com
s200k
e
g
a
P
ed100k
v
a
S</p>
        <p>0k
both datasets are ingested and all resulting FQCNs are checked across dataset boundaries. This issue
stands despite the actual runtime being still quite low with approx, 30 ms. Potential optimizations are,
e.g., using data-structures optimized for eficient kNN queries in many dimensions, or more aggressive
pruning of matches based on very large statistical diferences.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Steered Delta Deduplication</title>
        <p>Next, we investigate the runtime behavior of both KSM and our steered approach. Figure 2a visualizes
the number of duplicate pages found over the scan’s runtime in the two copies of the IMDB Parquet
dataset. The steered scan approach exhibits a very short and consistent execution, quickly detecting
all 857 MiB of duplicate pages across both tenants. KSM behaves much diferently during its
scanprocess, clearly showing when it performs its initial scan (flat section), and when it subsequently starts
deduplicating in the second pass. The same behavior is also visible in the larger SSB Parquet dataset
in Figure 2b, with the steered scanning approach finishing 36x faster than KSM.</p>
        <p>The same scans were also executed on the IMDB Arrow and SSB Arrow dataset, visualized in Figure 4a
and Figure 4b. Though KSM still takes significantly longer to fully scan the memory region, it is able to
ifnd 21.4% (IMDB) and 14.2% (SSB) more redundant pages than our steered scan. This indicates that the
read-optimized Apache Arrow format contains a lot of intra-table redundancies, which KSM can detect
by considering all combinations of pages as potential sharing partners. This experiment underlines
the diferent nature of both approaches. We only consider corresponding columns to perform delta
deduplication, at a small cost to the deduplication ratio. On the other hand, KSM performs a thorough
scan of the system, which manages to reach the maximum possible deduplication ratio for each dataset.
One possible enhancement is embracing the holistic strategies of other general-purpose deduplication
solutions, like content-based similarity-fingerprinting of pages. This might allow for savings beyond
the strict pairwise column search, while still benefitting from the steerability of our matching.
3M
s
eg2M
a
P
d
ve1M
a
S
0 0 0 0 0 0 0 0 0 0 0
1 2 3 4 5 6 7 8 9 0</p>
        <p>1
% of Pages Changed
False positive FQCN matches can occur if two columns exhibit syntactically similar but semantically
diferent names or if the content of one column has been heavily modified, in relation to its origin. In
these cases, the work needing to be performed to attempt delta deduplication of a column is higher
than for exact matches, since more deltas are written. In Figure 5, the impact of increasingly more
mismatching pages on KSM and De2Dup is shown. Clearly, the extra work performed by the DSA
manifests itself in slightly slower execution: in the worst case, it finishes the scan 9% slower than
deduplicating a good match. However, this slowdown can be completely negated by configuring the
delta operations to short-circuit if two pages difer too much, in this case 50%. Conversely to our
approach, KSM benefits from very few matching pages in the dataset, likely due to fewer page merges
having to be performed, and improving lookup-times for its, now smaller, tree-based data-structures.
Ultimately, this experiment highlights the resilience of delta-creation towards false-positive matches.
Despite KSM improving its own performance when there are fewer savings opportunities, in absolute
terms, it still finishes 24.5x slower than the steered scan.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion and Future Work</title>
      <p>
        In this paper, we expanded on our previous work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] by introducing a strategy for finding potentially
redundant data in multi-tenant database systems. We showed that using a steered search, we are able
to outperform existing general-purpose solutions in terms of scanning speed, even in cases where
steering yields false-positive matches. There are many more optimizations to be made, such as more
scalable column-matching, better heuristics for early returns on the column level, and more resilience
to diverse types of column edits besides updates and appends. Taking inspiration from related work,
the use of hash fingerprinting to detect similar content should be investigated as well [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However,
to further test our steering approach, a suitable dataset representing tables in multi-tenant databases
must first be created. Furthermore, due to the deprecation of the DSA’s delta operations, future work
may focus on creating an independent delta format, optimized for columnar storage. In terms of
hardware-acceleration, Intel has shown that it’s on-chip integration of specialized accelerators opens
up possibilities for low-latency, energy-eficient ofloading of workloads [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Especially worth further
exploring is the In-Memory Analytics Accelerator (IAA), with initial research suggesting its usefulness
in columnar databases [26].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partly funded by (1) the German Research Foundation (DFG) priority program SPP 2377
under grant no. LE 1416/30-1, (2) the European Union’s Horizon research and innovation program
under grant agreement no. 101189551 (CHORYS), and (3) the German Research Foundation (DFG) via a
Reinhart Koselleck-Project (LE-1416/28-1).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.
ference (USENIX ATC 13), 2013, pp. 279–290. URL: https://www.usenix.org/conference/atc13/
technical-sessions/presentation/miller.
[9] T. Veni, S. M. S. Bhanu, MDedup++: Exploiting Temporal and Spatial Page-Sharing
Behaviors for Memory Deduplication Enhancement, The Computer Journal 59 (2016) 353–370.
URL: https://academic.oup.com/comjnl/article-lookup/doi/10.1093/comjnl/bxu149. doi:10.1093/
comjnl/bxu149.
[10] G. Miłós, D. G. Murray, S. Hand, M. A. Fetterman, Satori: Enlightened page sharing, in:
Proceedings of the 2009 Conference on USENIX Annual Technical Conference, USENIX’09,
USENIX Association, USA, 2009, p. 1. URL: https://dl.acm.org/doi/10.5555/1855807.1855808.
doi:10.5555/1855807.1855808.
[11] A. Silberschatz, P. B. Galvin, G. Gagne, Operating System Concepts, 10th edition ed., Wiley,</p>
      <p>Hoboken, NJ, 2018.
[12] M. Vrable, J. Ma, J. Chen, D. Moore, E. Vandekieft, A. C. Snoeren, G. M. Voelker, S. Savage,
Scalability, fidelity, and containment in the potemkin virtual honeyfarm, SIGOPS Oper. Syst. Rev.
39 (2005) 148–162. URL: https://doi.org/10.1145/1095809.1095825. doi:10.1145/1095809.1095825.
[13] B. Dageville, T. Cruanes, M. Zukowski, V. Antonov, A. Avanes, J. Bock, J. Claybaugh, D. Engovatov,
M. Hentschel, J. Huang, A. W. Lee, A. Motivala, A. Q. Munir, S. Pelley, P. Povinec, G. Rahn,
S. Triantafyllis, P. Unterbrunner, The Snowflake Elastic Data Warehouse, in: Proceedings of the
2016 International Conference on Management of Data, SIGMOD ’16, Association for Computing
Machinery, New York, NY, USA, 2016, pp. 215–226. URL: https://dl.acm.org/doi/10.1145/2882903.
2903741. doi:10.1145/2882903.2903741.
[14] J. Aguilar Saborit, R. Ramakrishnan, K. Bocksrocker, A. Halverson, K. Kosinsky, R. O’Connor,
N. Poliakova, M. Shafiei, H. M. Ansari, B. Crivat, C. Cunningham, T. Kim, P. Kon Kim, I. R.
Madan, B. Matuszyk, M. Miles, S. Mohanan, C. Petculescu, E. R. Wirshing, E. Yousefi Amin Abadi,
Extending Polaris to Support Transactions, in: Companion of the 2024 International Conference
on Management of Data, SIGMOD ’24, Association for Computing Machinery, New York, NY, USA,
2024, pp. 321–333. URL: https://dl.acm.org/doi/10.1145/3626246.3653392. doi:10.1145/3626246.
3653392.
[15] W. Qiu, M. Copik, Y. Wang, A. Calotoiu, T. Hoefler, User-guided Page Merging for Memory
Deduplication in Serverless Systems, in: 2023 IEEE International Conference on Big Data (BigData),
2023, pp. 159–169. URL: https://ieeexplore.ieee.org/abstract/document/10386487. doi:10.1109/
BigData59044.2023.10386487.
[16] A. Arcangeli, I. Eidus, C. Wright, Increasing memory density by using KSM, Proceedings of the</p>
      <p>Linux Symposium (2009). URL: https://www.kernel.org/doc/ols/2009/ols2009-pages-19-28.pdf.
[17] L. Sun, M. J. Franklin, S. Krishnan, R. S. Xin, Fine-grained partitioning for aggressive data skipping,
in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 1115–1126.</p>
      <p>URL: https://dl.acm.org/doi/10.1145/2588555.2610515. doi:10.1145/2588555.2610515.
[18] P. A. Bernstein, J. Madhavan, E. Rahm, Generic schema matching, ten years later, Proc. VLDB
Endow. 4 (2011) 695–701. URL: https://dl.acm.org/doi/10.14778/3402707.3402710. doi:10.14778/
3402707.3402710.
[19] R. Castro Fernandez, E. Mansour, A. A. Qahtan, A. Elmagarmid, I. Ilyas, S. Madden, M. Ouzzani,
M. Stonebraker, N. Tang, Seeping Semantics: Linking Datasets Using Word Embeddings for Data
Discovery, in: 2018 IEEE 34th International Conference on Data Engineering (ICDE), IEEE, Paris,
2018. URL: https://ieeexplore.ieee.org/document/8509314/. doi:10.1109/icde.2018.00093.
[20] P. A. Bernstein, S. Melnik, J. E. Churchill, Incremental schema matching, in: Proceedings of the
32nd International Conference on Very Large Data Bases, VLDB ’06, VLDB Endowment, Seoul,
Korea, 2006, pp. 1167–1170. URL: https://dl.acm.org/doi/10.5555/1182635.1164235.
[21] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, The VLDB
Journal 10 (2001) 334–350. URL: http://link.springer.com/10.1007/s007780100057. doi:10.1007/
s007780100057.
[22] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, T. Neumann, How good are query
optimizers, really?, Proc. VLDB Endow. 9 (2015) 204–215. URL: https://dl.acm.org/doi/10.14778/
2850583.2850594. doi:10.14778/2850583.2850594.
[23] P. O’Neil, B. O’Neil, X. Chen, The Star Schema Benchmark (SSB) (2009).
[24] Apache Software Foundation, Apache Arrow, 2025. URL: https://arrow.apache.org, accessed:
202507-05.
[25] Apache Software Foundation, Apache Parquet, 2025. URL: https://parquet.apache.org, accessed:
2025-07-05.
[26] A. Baumstark, L. Martins, K.-U. Sattler, Uncore your Queries: Towards CPU-less Query Processing,
in: Proceedings of the 21st International Workshop on Data Management on New Hardware,
DaMoN ’25, Association for Computing Machinery, New York, NY, USA, 2025, pp. 1–10. URL:
https://dl.acm.org/doi/10.1145/3736227.3736243. doi:10.1145/3736227.3736243.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Aulbach,</surname>
          </string-name>
          <article-title>Ruminations on Multi-Tenant Databases</article-title>
          , in: Datenbanksysteme in Business,
          <source>Technologie und Web (BTW</source>
          <year>2007</year>
          )
          <article-title>- 12. Fachtagung des GI-Fachbereichs ”Datenbanken und Informationssysteme” (DBIS), Gesellschaft für Informatik e</article-title>
          . V.,
          <year>2007</year>
          , pp.
          <fpage>514</fpage>
          -
          <lpage>521</lpage>
          . URL: https: //dl.gi.de/handle/20.500.12116/31821.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vrable</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Snoeren</surname>
          </string-name>
          , G. Varghese,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Voelker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vahdat</surname>
          </string-name>
          , Diference engine:
          <article-title>Harnessing memory redundancy in virtual machines</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>53</volume>
          (
          <year>2010</year>
          )
          <fpage>85</fpage>
          -
          <lpage>93</lpage>
          . URL: https://dl.acm.org/doi/10.1145/1831407.1831429. doi:
          <volume>10</volume>
          .1145/1831407.1831429.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kowalick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pietrzyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Habich</surname>
          </string-name>
          , W. Lehner,
          <article-title>De2Dup: Extended Deduplication for Multi-Tenant Databases</article-title>
          ,
          <source>in: Proceedings of the 21st International Workshop on Data Management on New Hardware</source>
          , DaMoN '25,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3736227.3736236. doi:
          <volume>10</volume>
          .1145/3736227.3736236.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ranganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lantz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanjeepan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cabrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kwatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sankaran</surname>
          </string-name>
          , I. Jeong,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <source>Intel Accelerators Ecosystem: An SoC-Oriented Perspective : Industry Product, in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>848</fpage>
          -
          <lpage>862</lpage>
          . URL: https://ieeexplore.ieee.org/abstract/document/10609705. doi:
          <volume>10</volume>
          .1109/ISCA59077.
          <year>2024</year>
          .
          <volume>00066</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherniack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. O</given-names>
            <surname>'Neil</surname>
            , P. O'Neil
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zdonik</surname>
          </string-name>
          ,
          <article-title>C-store: A column-oriented DBMS</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05</source>
          ,
          <string-name>
            <given-names>VLDB</given-names>
            <surname>Endowment</surname>
          </string-name>
          , Trondheim, Norway,
          <year>2005</year>
          , pp.
          <fpage>553</fpage>
          -
          <lpage>564</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lemire</surname>
          </string-name>
          , L. Boytsov,
          <article-title>Decoding billions of integers per second through vectorization</article-title>
          ,
          <source>Software: Practice and Experience</source>
          <volume>45</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          . URL: http://arxiv.org/abs/1209.2137. doi:
          <volume>10</volume>
          .1002/spe. 2203. arXiv:
          <volume>1209</volume>
          .
          <fpage>2137</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pavlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Ganger</surname>
          </string-name>
          , Online Deduplication for Databases,
          <source>in: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>1355</fpage>
          -
          <lpage>1368</lpage>
          . URL: https://dl.acm.org/doi/10. 1145/3035918.3035938. doi:
          <volume>10</volume>
          .1145/3035918.3035938.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Franz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rittinghaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hillenbrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bellosa</surname>
          </string-name>
          , {XLH}:
          <article-title>More Efective Memory Deduplication Scanners Through Cross-layer Hints</article-title>
          , in: 2013 USENIX
          <string-name>
            <given-names>Annual</given-names>
            <surname>Technical</surname>
          </string-name>
          Con-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>