<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BabelMR: A Polyglot Framework for Serverless MapReduce</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabian Mahling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Rößler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Bodner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tilmann Rabl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso Plattner Institute, University of Potsdam</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The MapReduce programming model and its open-source implementation Hadoop have democratized large-scale data processing by providing ease-of-use and scalability. Subsequently, systems such as Spark have dramatically improved efficiency. However, for a large number of users and applications, using these frameworks remains challenging, because they typically restrict users to specific programming languages or require cluster management expertise. In this paper, we present BabelMR, a data processing framework that provides the MapReduce programming model to arbitrary containerized applications to be executed on serverless cloud infrastructure. Users provide application logic in Map and Reduce functions that read and write their inputs and outputs to the ephemeral filesystem of a serverless function container. BabelMR orchestrates the data-parallel programs across stages of concurrent cloud function executions and efficiently integrates with serverless storage systems and columnar storage formats. Our evaluation shows that BabelMR reduces the entry hurdle to analyzing data in a distributed serverless environment in terms of development effort. BabelMR's I/O and data shuffle building blocks outperform handwritten Python and C# code, and BabelMR is competitive with state-of-the-art serverless MapReduce systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data processing workflows are becoming increasingly
complex. Interdisciplinary teams of data engineers,
analysts, and scientists build data pipelines that combine
different programming languages, runtime environments,
and computing frameworks [1]. The resulting multi-language
programs are commonly integrated via three
different approaches. First, relational database systems
are extended to support user-defined functions (UDFs).
UDFs are constrained to be written in specific language
dialects because of security or performance concerns of
the database vendors and do not allow to express complex
analytical algorithms. Second, MapReduce systems are
more flexible and support general analytical dataflows
[2, 3]. They prioritize a few popular languages (for
example, Python [4] and SQL [5]) and provide hooks for
other languages in their individual OS environment [6].
Third, OS-level containers are employed to encapsulate
arbitrary applications along with their dependencies [7].
Redshift, BigQuery, and Snowflake have integrated UDFs
as containers [8, 9, 10]. SAP HANA, Databricks, and
ClickHouse have containerized their entire architectures
[11, 12, 13]. The user code and system components are
coupled loosely via REST APIs, which greatly simplifies
integration but impedes the performance optimizations
available to the prior approaches.</p>
      <p>Another entry barrier for users of distributed data
processing systems is the management of the underlying
compute infrastructure. Conventional compute services
based on virtual servers require decisions on instance
types, cluster sizes, and pricing models [14, 15]. Virtual
containers need orchestration and observability [16, 17].
In our view, these hurdles are not inherent issues
of scalable data processing, and recent developments in
serverless computing can help to alleviate them.</p>
      <p>Multiple cloud providers have introduced serverless
compute services that allocate resources based on
consumption, removing the need for manual provisioning
and scaling [18, 19, 20]. The services initially supported
lightweight functions and now also run full-fledged
containers [21]. Recent research has indicated
the potential of data processing systems built on
serverless functions. PyWren [22], Starling [23], and Lambada
[24] have demonstrated performance and scalability that
are competitive with systems on regular virtual machines.</p>
      <p>Building on this work, we present the BabelMR
framework for multi-language, data-intensive applications.
Users of BabelMR package applications in containers
and declare them to be mappers or reducers. Collocated
with the user’s application artifacts in a container is the
BabelMR execution engine, which efficiently integrates
with cloud storage and standard file formats. Between the
BabelMR engine and the user application, data is passed
locally through the container’s ephemeral filesystem.
The distributed computation of the containers within
and across map and reduce stages is orchestrated by the
BabelMR coordinator and carried out on the serverless
compute service AWS Lambda. This enables users to
write their applications in any language and run them in
a data-parallel fashion on serverless cloud infrastructure
with no operational overhead.</p>
      <p>Joint Workshops at 49th International Conference on Very Large Data
Bases (VLDBW’23) - Workshop on Serverless Data Analytics (SDA’23),
August 28 - September 1, 2023, Vancouver, Canada.
* These authors contributed equally.
fabian.mahling@student.hpi.uni-potsdam.de (F. Mahling);
paul.roessler@student.hpi.uni-potsdam.de (P. Rößler);
thomas.bodner@hpi.de (T. Bodner); tilmann.rabl@hpi.de (T. Rabl)
© 2023 Copyright for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)</p>
      <p>[Figure 1: Overview of a BabelMR job. Map-stage workers import
unpartitioned S3 objects, run the user code, and export partitioned
S3 objects (map0.parquet, map1.parquet, ...); reduce-stage workers
import and repartition these pre-aggregated outputs, run the user
code, and export aggregated S3 objects (reduce0.parquet). The
BabelMR engine and the user code exchange data via the local
ephemeral filesystem; inputs and outputs reside in shared object
storage.]</p>
      <sec id="sec-1-12">
        <title>Local Ephemeral Filesystem</title>
        <p>AWS Lambda handles concurrent invocations
automatically, by spinning up multiple instances of VMs that
run the user code. The inherent elasticity this provides
is ideal for the sporadic nature of the workloads that
serverless data processing targets.</p>
        <p>Amazon S3 [29] is an object storage service providing
elastic scalability. Objects are stored in buckets and are
identified by a unique key. Compared to other serverless
storage systems (e.g., key-value stores or filesystems),
storage costs are low and network bandwidth is high.
This renders S3 most suitable for big data processing.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Overview of BabelMR</title>
      <p>In this section, we present the most relevant serverless
compute and storage services of AWS and introduce the
architecture of BabelMR. The presented concepts apply
to the services of other cloud providers [25, 26] as well.</p>
      <p>• We use queries from the TPC-H and TPCx-BB
benchmarks in multiple languages to evaluate the
development efficiency and the performance of
BabelMR, as compared to state-of-the-art
serverless data processors.</p>
      <sec id="sec-2-0">
        <title>2.2. System Architecture</title>
        <p>In BabelMR, data pipelines are run in stages. Per stage,
Lambda-based workers execute the map or reduce user
functions concurrently, as shown in Figure 1. We
co-deploy the BabelMR execution engine and the user code
on the Lambda functions. During a function invocation,
the BabelMR engine imports a partition of the dataset into
the local filesystem. Then, user code is executed, which
reads the locally stored files containing key-value batches.
The user code stores its output back to disk. Finally, the
results are read by the BabelMR engine, partitioned, and
written back to shared storage.</p>
      </sec>
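      <p>The per-invocation dataflow just described - engine import, local user-code execution, engine-side partitioning and export - can be sketched as a toy simulation (all names here are illustrative, not BabelMR’s actual engine code):</p>
      <p>
```python
import json
import os
import tempfile
import zlib

def run_worker(engine_input, user_fn, num_partitions):
    """Simulate one worker invocation: data flows via local files."""
    workdir = tempfile.mkdtemp()
    in_path = os.path.join(workdir, "input.json")
    out_path = os.path.join(workdir, "output.json")
    # 1. The engine imports a partition of the dataset to local disk.
    with open(in_path, "w") as f:
        json.dump(engine_input, f)
    # 2. User code reads the local file and stores its output back to disk.
    with open(in_path) as f:
        batch = json.load(f)
    with open(out_path, "w") as f:
        json.dump(user_fn(batch), f)
    # 3. The engine reads the result, partitions it by key, and would
    #    export each partition to shared storage (returned here instead).
    with open(out_path) as f:
        result = json.load(f)
    partitions = [[] for _ in range(num_partitions)]
    for key, value in result:
        h = zlib.crc32(str(key).encode())  # stable across workers
        partitions[h % num_partitions].append([key, value])
    return partitions

# Word-count-style mapper emitting (word, 1) pairs.
parts = run_worker(["a", "b", "a"], lambda ws: [[w, 1] for w in ws], 2)
assert sum(len(p) for p in parts) == 3
```
      </p>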
      <sec id="sec-2-1">
        <title>2.1. Serverless Cloud Infrastructure</title>
        <p>AWS Lambda [18] is a serverless, event-driven compute
service. Users deploy applications in Lambda as so-called
functions. The user’s application code is embedded into
the Lambda Runtime, which is executed on lightweight,
stateless virtual machines [27, 28]. For each function, the
memory size can be configured from 128 MB to 10 GB.
Compute power scales linearly with the memory size up
to 6 vCPUs. Each function has ephemeral (local) storage
that can be configured between 512 MB and 10 GB.</p>
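        <p>These ranges can be checked before deployment; a minimal sketch (the helper is ours, only the limits come from the text above):</p>
        <p>
```python
# AWS Lambda configuration ranges stated above (in MB).
MEMORY_RANGE = (128, 10240)     # 128 MB to 10 GB
EPHEMERAL_RANGE = (512, 10240)  # 512 MB to 10 GB

def valid_lambda_config(memory_mb, ephemeral_mb):
    """Check a requested function configuration against Lambda's limits."""
    return (MEMORY_RANGE[0] <= memory_mb <= MEMORY_RANGE[1]
            and EPHEMERAL_RANGE[0] <= ephemeral_mb <= EPHEMERAL_RANGE[1])

assert valid_lambda_config(2048, 512)        # a typical worker size
assert not valid_lambda_config(64, 512)      # below the 128 MB minimum
assert not valid_lambda_config(2048, 20480)  # above the 10 GB ephemeral cap
```
        </p>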
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Programming Model and Interface</title>
        <p>BabelMR exposes a MapReduce-style interface. A job in
BabelMR is configured with map and reduce function
binaries and object storage locations for the input and
output data. We allow the specification of multiple inputs
(e.g., different tables) for map jobs. This facilitates the
implementation of map-side joins (see Code Listing 2).</p>
        <p>Additionally, the key and value attributes are specified,
which are used for data partitioning and distribution.</p>
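        <p>A job configuration along these lines might look as follows. This is a hypothetical sketch; the paper does not show BabelMR’s actual configuration format, so all field names here are illustrative:</p>
        <p>
```python
# Illustrative job specification: map/reduce container images, input and
# output locations, and the key/value attributes used for partitioning.
job = {
    "map_image": "ecr/my-mapper:latest",
    "reduce_image": "ecr/my-reducer:latest",
    # Multiple inputs (e.g., different tables) enable map-side joins.
    "inputs": ["s3://bucket/store_sales/", "s3://bucket/item/"],
    "output": "s3://bucket/q1-result/",
    "key_attributes": ["item_sk"],
    "value_attributes": ["quantity"],
    "num_reducers": 32,
}

def validate_job(job):
    """Check that all required fields of the (hypothetical) spec are set."""
    required = {"map_image", "reduce_image", "inputs", "output",
                "key_attributes", "value_attributes", "num_reducers"}
    missing = required - job.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not job["inputs"]:
        raise ValueError("at least one input location is required")
    return True
```
        </p>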
        <p>[Figure 2: Function initialization over package size for Zip-,
S3-, and container image-based deployments; throughput (MB/s,
log scale).]</p>
        <p>Runtime Customization. We use a custom Lambda
runtime to orchestrate user function binaries written in
any language, and thus are not limited to the languages
supported by the standard runtimes [30]. Our file-based,
inter-process communication spares the implementation
of language-specific adapters. Within a stage and worker,
the BabelMR engine loads a partition of the input data
and writes a file to the local filesystem, containing key-value
pairs. The user code must load and process the file
contents and output a file that again contains key-value
pairs together with a schema description. The BabelMR
engine exports the output data to remote shared storage.</p>
        <p>Container Integration. To deploy their applications to
AWS Lambda, users directly upload zipped binaries to
their functions, upload the zipped binary to S3, or upload
a container image [31] to Amazon Elastic Container
Registry (ECR) [32]. The uncompressed sizes of the Zip files
and container images are limited to 250 MB and 10 GB,
respectively. In BabelMR, the user’s application is co-deployed
with its runtime environment and the BabelMR engine. This
may quickly exceed the size limitation of Zip-based
Lambda functions. Container image-based functions do
not entail such a strict limit and have worked for all of
our use cases, ranging from simple scalar UDFs to
full-fledged, in-memory database engines [33]. To evaluate
the impact of these deployment methods on function
startup times, we build different function packages, each
containing a dynamically sized BLOB and a minimal
running function. Then, we measure the time it takes for a
function to be initialized and operational.</p>
        <p>Figure 2 shows that container images deployed via
ECR impose no initialization overhead. Instead,
initialization times decrease compared to Zip-based uploads
and stay constant for different package sizes. This is due
to Lambda employing lazy loading for images [21]. Thus,
files within the images are only loaded on the running
function when they are accessed/read - in this case, only
the minimal function. When reading the BLOB, we get
a throughput of roughly 150 MB/s for lazily loading files
from ECR for 50 concurrently invoked Lambda functions.
After the BLOB is read for the first time, subsequent read
speeds did not differ based on the deployment variant.</p>
        <p>Deploying the code packages as a Docker image
overcomes potential size limitations. It also simplifies the
user’s integration with BabelMR, as we can provide a
base image which contains all our dependencies. Based
on this image, the MapReduce executable and the
execution environment are added (in our tests, the Dockerfiles
were less than 10 lines long, see Listing 1). Using
image-based Lambda functions is, therefore, the selected variant
in our system.</p>
        <p># Base image containing BabelMR engine
FROM hpides/babelmr:latest
# Installation of runtime environment
RUN yum install python3 -y
COPY requirements.txt ./
RUN python3 -m pip install -r /var/task/requirements.txt
COPY *.py ./
# Will be executed after BabelMR import
ENV MR_CMD="python3 /var/task/app.py"</p>
        <p>Code Listing 1: Dockerfile to install the user’s execution
environment and code on top of our provided base image.</p>
        <p># Read only specified row groups of file.
def read_partitioned(fd, columns, opener):
    dfs = []
    inputfile = fd["bucket"] + "/" + fd["key"]
    partitions = fd["partitions"]
    pf = ParquetFile(inputfile, open_with=opener)
    for i, rg in enumerate(pf):
        # Check if current partition is specified.
        if partitions == [] or i in partitions:
            dfs.append(rg.to_pandas(columns=columns))
    return pd.concat(dfs)

def read_partitioned_parallel(columns, opener,
                              file_descriptions):
    # Spin up multiple threads for parallel read.
    with ThreadPoolExecutor() as executor:
        result_dfs = list(
            executor.map(
                lambda fd: read_partitioned(
                    fd, columns, opener
                ),
                file_descriptions))
    return pd.concat(result_dfs)</p>
        <p>Code Listing 2: Python code for the first map function of
TPCx-BB Q1 in BabelMR.</p>
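        <p>The thread fan-out used in the listing above can be exercised without S3 or Parquet by swapping in a stub reader (purely illustrative; read_stub stands in for the per-file Parquet import):</p>
        <p>
```python
from concurrent.futures import ThreadPoolExecutor

def read_stub(fd):
    """Stand-in for read_partitioned: pretend each file yields its rows."""
    return [f'{fd["key"]}:{i}' for i in range(fd["rows"])]

def read_parallel(file_descriptions):
    # One thread per file description; executor.map keeps input order.
    with ThreadPoolExecutor() as executor:
        per_file = list(executor.map(read_stub, file_descriptions))
    # Concatenate the per-file results, mirroring pd.concat above.
    return [row for rows in per_file for row in rows]

rows = read_parallel([{"key": "map0.parquet", "rows": 2},
                      {"key": "map1.parquet", "rows": 1}])
assert rows == ["map0.parquet:0", "map0.parquet:1", "map1.parquet:0"]
```
        </p>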
        <p>Code Listing 3: Parallel import of Parquet files written
in Python for the BabelMR variant All Custom.</p>
        <sec id="sec-2-4">
          <title>2.4. Execution</title>
          <p>In a MapReduce system, the I/O-bound tasks of remote
data transfer and data shuffling are often the bottleneck
and focus of optimization. In BabelMR, the worker-local
communication between the engine process and the user
application’s process also needs to be efficient.</p>
          <p>Distributed Communication. BabelMR employs scan
and shuffle operators that are optimized for processing on
serverless infrastructure. These operators largely follow
the design of the state-of-the-art Starling and Lambada
systems. They encapsulate efficient reading, writing, and
format conversions between CSV, Apache Parquet [34],
and ORC [35]. Also, they are fine-tuned to the CPU and
network characteristics of Lambda and S3.</p>
          <p>Inter-process Communication. BabelMR transfers
data between its engine process and the application’s
process via the local filesystem. Using shared memory
or message passing may yield performance superior to
filesystem I/O [36]. However, it requires the user code
to implement complex read and write operations or use
libraries, such as Arrow Flight [37], which are not widely
supported across programming languages. Conversely,
using the filesystem yields high usability, as file access
is supported by every language. To evaluate whether
the filesystem achieves acceptable performance under
this usability trade-off, we measure the throughput of
Lambda’s local ephemeral storage.</p>
          <p>To measure I/O throughput, we write 1 GiB of data in
chunks of 16 MB into the /tmp/ directory of the Lambda
instance with a single thread. We flush the file streams
and then read the files serially. We executed the benchmark
with 10 repetitions on 32 concurrently invoked
Lambda functions. Figure 3 shows the median read and
write speeds and the memory used with respect to the
function’s RAM. The measured write speed increases
with the function size from 32 MB/s to 512 MB/s. The
read speed for a 128 MB function is 34 MB/s and goes up
to 9.9 GB/s for the 4 GB function. This is likely due to
file accesses being cached in memory - as also indicated
by the memory used.</p>
          <p>Thus, when RAM is sufficiently available, using the
filesystem as intermediate storage is reasonable. Our
experiments indicate that execution times of big data
pipelines benefit from larger workers, in particular,
because CPU resources scale accordingly. Additionally, we
do not import more than 340 MB of data per function
invocation, as the network throughput diminishes after
an initial network burst budget. Consequently, we do not
process larger-than-memory data on a single function.</p>
          <p>Optimizations can be applied to further improve data
exchange. Instead of storing data as CSV files in S3,
choosing more efficient file formats like Parquet [34]
or ORC [35] yields better performance. Less data needs
to be read from S3, and partitioned data can be stored
in one single file. As the BabelMR engine can convert
between standard formats, user code can operate on the
preferred format independent of the format used in S3.</p>
          <p>Programming languages may entail long initialization
times (e.g., for dependency loading). In a regular Lambda
runtime, the execution environment is cached when an
invocation finishes. The initialization overhead occurs once
per function execution lifecycle [38]. In BabelMR, the
application is newly initialized per invocation, because
its process is not kept alive by the underlying custom
runtime. For many languages, it is possible to employ
wrappers that keep application state cached, eliminating
repeated initialization. This, however, is not supported
out-of-the-box and requires per-language effort.</p>
        </sec>
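        <p>The single-threaded throughput measurement described above can be reproduced in miniature (a sketch with a much smaller volume and a temporary file, so it runs outside Lambda):</p>
        <p>
```python
import os
import tempfile
import time

def write_throughput_mbps(total_bytes, chunk_bytes):
    """Write total_bytes in chunk_bytes pieces, flush, and return MB/s."""
    chunk = b"\0" * chunk_bytes
    with tempfile.NamedTemporaryFile() as f:
        start = time.perf_counter()
        written = 0
        while written < total_bytes:
            f.write(chunk)
            written += chunk_bytes
        f.flush()
        os.fsync(f.fileno())  # flush the file stream, as in the benchmark
        elapsed = time.perf_counter() - start
    return (written / 1e6) / elapsed

# 16 MiB in 1 MiB chunks; the paper writes 1 GiB in 16 MB chunks to /tmp/.
mbps = write_throughput_mbps(16 * 2**20, 2**20)
assert mbps > 0
```
        </p>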
        <p>Additionally, we implement TPC-H Q1 in the
serverless MapReduce frameworks Corral [42] and PyWren as
well as with PySpark [4] and Ray [43]. We execute the
latter with AWS Elastic MapReduce (EMR) [44] and AWS
Glue [45]. We compare them against BabelMR (Go) and
BabelMR (Python), where we use Go and Python for the
application code. Here, we process CSV data, as Parquet
is not supported by Corral.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.1. Development Efficiency</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>By using BabelMR, the required lines of code are reduced
by 19% and 53% compared to All Custom (see Table 1).</p>
      <p>To evaluate the usability and performance of BabelMR,
we write BabelMR applications for the relational query
TPC-H Q1 [39] and the MapReduce job TPCx-BB Q1 [40,
41]. We implement the applications in Python and C# in
three variants. The source code can be found on GitHub
under https://github.com/hpides/babelmr-applications.</p>
      <p>All Custom implements all functionalities a MapReduce
system requires. The complete workflow includes S3
access (partitioned import/export), the application logic
(map/reduce), and repartitioning methods. The mappers
export partitioned Parquet files and the reducers import
individual row groups from multiple map outputs. We use
established dataframe libraries for the application logic,
the AWS SDK for S3 access, and implement repartitioning
and parallelized reads to the best of our knowledge.</p>
      <p>We do not have to implement any S3 access, partitioning,
or shuffling logic. In fact, in BabelMR, the application
code does not use the AWS SDK at all.
Listing 2 shows the Python code for the first batched
map job of TPCx-BB Q1. Here, we can utilize well-known
APIs, as data is stored in the filesystem. Import and export
code (lines 1-2 and 23) become simple and short. The
majority of the file (lines 4-20) contains application logic.
Likewise, we implement the reduce jobs of the query.
With the Dockerfile in Listing 1, we install dependencies
and code on top of the BabelMR base image.</p>
      <p>For All Custom and System-side Shuffle, we implement
S3 access ourselves. Listing 3 shows the implementation
of a parallelized import from S3.</p>
      <p>All of the MapReduce frameworks abstract from the
distribution logic. BabelMR applications are similar in
length to programs in Corral and PyWren. PySpark and
Ray programs are shorter due to the mature integration
into both dataframe and cloud service libraries.</p>
      <p>System-side Shuffle takes care of data shuffling. As
partitioning strategies are independent of the application
logic, a generic intermediate pipeline is supplied, which
shuffles between map and reduce stages. To configure
the intermediate pipeline, we specify the attributes that
should be available on each reducer (i.e., the values), the
attributes the data should be partitioned on (i.e., the key),
and the number of reducers. The intermediate pipeline
reads the map outputs and yields output files with
partitions, which are consumed by subsequent reducers.</p>
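      <p>This key-based repartitioning amounts to hashing the key attribute into one of the reducer partitions; a generic sketch (attribute names illustrative, not System-side Shuffle’s actual code):</p>
      <p>
```python
import zlib

def repartition(rows, key_attr, num_reducers):
    """Assign each row to one of num_reducers partitions by its key."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # Stable hash, so every mapper sends a key to the same reducer.
        h = zlib.crc32(str(row[key_attr]).encode())
        partitions[h % num_reducers].append(row)
    return partitions

rows = [{"item_sk": i, "quantity": i % 3} for i in range(10)]
parts = repartition(rows, "item_sk", 4)
assert sum(len(p) for p in parts) == 10  # each row lands in one partition
```
      </p>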
      <sec id="sec-3-1">
        <title>3.2. Performance</title>
        <p>[Figure 4: Execution times for TPC-H Q1 (Python) and
TPCx-BB Q1 (C#) of the BabelMR variants, including System-side
Shuffle, over scale factors 1, 10, 100, and 1,000.]</p>
        <p>
CSV data for TPC-H Q1 only, where we have 5 files and
use ⌈2.5⌉ mappers. For AWS EMR, elastic and static
clusters may utilize one driver (8 vCPUs/16 GB RAM)
and 1, 4, 40, or 400 workers (16 vCPUs, 32 GB RAM each)
for scale factors 1 to 1,000. For Ray clusters, we use Z.2x
machines for the workers. That way, the clusters have the
same amount of resources at their disposal as the serverless
systems using Lambda workers.</p>
        <p>TPC-H Q1. For our experiments with TPC-H Q1 (see
Figure 4), System-side Shuffle has the fastest runtimes.</p>
        <p>This is due to the map stage generating very small results
(≈  · 8 KB). These small files are read quickly in parallel by
the intermediate shuffle stage. The shuffle outputs one
file that can be easily consumed by the reducer.</p>
        <p>While BabelMR also benefits from the optimized read
during the reduce stage, System-side Shuffle executes
roughly 4 s faster. This is because of a shorter map stage,
where every worker works on exactly one input file. Here,
invoking BabelMR to materialize the Parquet file into the
ephemeral storage leads to an overhead, compared to
Python directly loading from S3.</p>
        <p>While All Custom’s parallelized import (see Listing 3)
is roughly 10 times faster than importing sequentially, it
still scales badly. At scale factor 1,000, the reduce function
needs to resolve 1,000 imports, resulting in around 60
seconds being spent on imports. This leads to ~3.6 times
slower end-to-end runtimes compared to BabelMR and
System-side Shufle at scale factor 1,000.</p>
        <p>TPCx-BB Q1. Figure 5 breaks down the execution times
of individual functions for TPCx-BB Q1. The query
consists of three stages (and two additional shuffle stages for
System-side Shuffle). Note the difference for workers of
System-side Shuffle and All Custom in Stage 2 and
Stage 3. The consumed data is the same, but workers in
All Custom need to execute 50 or 10 imports, respectively.</p>
        <p>Workers in System-side Shuffle only have to import two
files, due to the upstream repartitioning and combining
of the intermediate shuffle stage (two blue lines).</p>
        <p>Stages 2 and 3 in BabelMR profit from the same
principle. Additionally, the import of BabelMR in the first stage
outperforms the other variants. Only 3 of the 23 columns
of the store_sales table are required. The Parquet import
of BabelMR is better optimized to read only the required
columns than the custom implementation in C#.</p>
        <p>At large scale factors, BabelMR beats All Custom with
regards to usability and performance on both TPC-H Q1
and TPCx-BB Q1. Compared to the System-side Shufle,
performance is similar and usability is better. In contrast
to System-side Shufle, BabelMR materializes data once
more and has a larger deployable, leading to overheads.</p>
        <p>[Figure 5: Runtime (seconds) of individual functions per stage
for TPCx-BB Q1, broken down into BabelMR engine and user code.]</p>
        <p>
PyWren. As PyWren is unmaintained, we had to rebuild
the underlying Python runtime and apply minor bug fixes
to PyWren. Still, compared to 2017 [22], setup times have
decreased from 14.2 s to 7 s and Lambda start latencies
from 9.7 s to at most 1 s (see Figure 2).</p>
        <p>We execute the same Python map and reduce functions
for TPC-H Q1 in PyWren and BabelMR. As we measure
end-to-end runtimes, serialization of the Python
functions and their upload to S3 is also measured. BabelMR
outperforms PyWren by approximately 14 s for scale
factors 1 and 10. This originates from the fact that setup
costs in PyWren are doubled, once for the map and once
for the reduce phase [22]. At scale factor 100, BabelMR
is on average 17.5 s faster than PyWren. At start-up,
PyWren creates S3 objects for each worker’s input, which
leads to this additional overhead at high scale factors.
Corral. Corral exposes the traditional tuple-based
MapReduce interface without the support of combiners.</p>
        <p>Figure 6: Execution times for TPC-H Q1 on PyWren, Corral,
PySpark, Ray, and BabelMR. The data was stored in S3 and
partitioned into multiple CSV files. * denotes a failed run;
× denotes a run omitted due to AWS quota constraints.</p>
        <p>As a result, the outputs of the map stage are significantly
larger (≈  · 60 MB) than those of BabelMR Go. Additionally,
the map stage produces only four distinct keys.
Therefore, execution times do not scale properly, as four
reducers have to process the entire map output (see
Figure 6). For scale factor 100, the input for one reducer is
larger than its memory, which is not supported by Corral
and results in a failing pipeline. Corral performs slightly
better compared to BabelMR Go for scale factor 1. This
is partly due to the materialization overhead discussed
earlier. Additionally, Corral distributes the rows of the
five input files evenly among workers, whereas BabelMR
assigns CSV input on a per-file granularity.</p>
        <p>PySpark on EMR. BabelMR outperforms PySpark on
all scale factors on TPC-H Q1. PySpark exposes the
high-level Spark [5] API to the user, so underlying execution
details are handled by Spark. Thus, three instead of two
stages are executed, where in the first stage Spark reads
metadata from input files. This takes 2 s at scale factor
1 and 20 s at scale factor 1,000. Execution time is also
spent on managing the cluster and allocating the required
executors. At scale factor 1,000, PySpark runs 28 s until
all 240 executors work on jobs in the static cluster, and
even longer on the elastic cluster.</p>
        <p>Ray on Glue. Ray handles distribution and scheduling
of jobs, utilizing its high-level dataset API. While Ray
yields good usability and is well integrated in AWS Glue,
execution times are higher than those of the introduced
systems working on AWS Lambda. Start-up latencies are
one reason for this. For scale factor 10, it takes 35 s to
start all workers.</p>
        <p>In summary, BabelMR is performance-competitive with
other MapReduce systems running applications written
in the languages that they support, despite the more
generic multi-language approach.</p>
        <p>4. Related Work</p>
        <p>Most closely related to our work are data processing
systems that build on serverless infrastructure to relieve
their users of provisioning, management, and scaling
of resources. There are the MapReduce-style systems
PyWren [22], Corral [42], Flint [46], and Qubole [47], which
each offer an interface for a popular programming
language, e.g., Python, Go, or Java. In addition, there has
been work on serverless SQL query processors with
Starling [23] and Lambada [24], and our own prior work on
Skyrise [48]. They all integrate efficiently with cloud
storage options for storage access and data shuffling. They,
however, all target only a single programming language.</p>
        <p>Saur et al. acknowledge the demand for custom code
executing arbitrary computation written in any
programming language [36]. They suggest containerized UDFs,
enabling dependency management, portability, and
encapsulation. They exchange input and output tuples
between the running containers and the database. While
Saur et al. assume a database server communicating with
the containers, in our serverless environment data is
stored in S3. We propose to collocate BabelMR in the
container, which handles the S3 access and communicates
data to the user code via the filesystem.</p>
        <p>While not a full-fledged data processing system, the
distributed map abstraction of AWS Step Functions
[49] allows orchestrating large-scale parallel workloads
of containerized applications. In contrast to BabelMR,
it cannot execute reduce jobs, and thus is limited to
embarrassingly parallel applications.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>[30] Amazon, Lambda Runtimes, https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html, 2023.</p>
      <p>[31] D. Merkel, Docker: Lightweight Linux Containers for Consistent Development and Deployment, Linux Journal 2014 (2014) 2.</p>
      <p>[32] Amazon, Amazon Elastic Container Registry, https://aws.amazon.com/ecr/, 2023.</p>
      <p>[33] M. Dreseler, J. Kossmann, M. Boissier, S. Klauck, M. Uflacker, H. Plattner, Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management, in: EDBT, 2019, pp. 313–324.</p>
      <p>[34] Apache Software Foundation, Apache Parquet, https://parquet.apache.org/, 2023.</p>
      <p>[35] Apache Software Foundation, Apache ORC - High-Performance Columnar Storage for Hadoop, https://orc.apache.org/, 2023.</p>
      <p>[36] K. Saur, T. Mirmira, K. Karanasos, J. Camacho-Rodríguez, Containerized Execution of UDFs: An Experimental Evaluation, PVLDB 15 (2022) 3158–3171.</p>
      <p>[37] Apache Arrow contributors, Introducing Apache Arrow Flight: A Framework for Fast Data Transport, https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/, 2023.</p>
      <p>[38] Amazon, Understanding the Lambda Execution Environment, https://docs.aws.amazon.com/lambda/latest/operatorguide/execution-environment.html, 2023.</p>
      <p>[39] Transaction Processing Performance Council, TPC-H Benchmark, https://www.tpc.org/tpch/, 2023.</p>
      <p>[40] P. Cao, B. Gowda, S. Lakshmi, C. Narasimhadevara, P. Nguyen, J. Poelman, M. Poess, T. Rabl, From BigBench to TPCx-BB: Standardization of a Big Data Benchmark, in: TPCTC, volume 10080 of Lecture Notes in Computer Science, 2016, pp. 24–44.</p>
      <p>[41] Transaction Processing Performance Council, TPCx-BB Benchmark, https://www.tpc.org/tpcx-bb/, 2023.</p>
      <p>[42] B. Congdon, Corral: A Serverless MapReduce Framework Written for AWS Lambda, https://github.com/bcongdon/corral/, 2023.</p>
      <p>[43] Anyscale, Inc., Ray, https://www.ray.io/, 2023.</p>
      <p>[44] Amazon, Amazon EMR, https://aws.amazon.com/emr/, 2023.</p>
      <p>[45] Amazon, AWS Glue, https://aws.amazon.com/glue/, 2023.</p>
      <p>[46] Y. Kim, J. Lin, Serverless Data Analytics with Flint, in: IEEE CLOUD, 2018, pp. 451–455.</p>
      <p>[47] Qubole, Qubole: Apache Spark on AWS Lambda, https://github.com/qubole/spark-on-lambda/, 2023.</p>
      <p>[48] T. Bodner, Elastic Query Processing on Function-as-a-Service Platforms, in: VLDB PhD Workshop, 2020.</p>
      <p>[49] Amazon, Step Functions: Using Map State in Distributed Mode, https://docs.aws.amazon.com/step-functions/latest/dg/concepts-asl-use-map-state-distributed.html, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>