=Paper= {{Paper |id=Vol-3785/paper107 |storemode=property |title=Performance Analysis on DNA Alignment Workload with Intel SGX Multithreading |pdfUrl=https://ceur-ws.org/Vol-3785/paper107.pdf |volume=Vol-3785 |authors=Lorenzo Brescia,Iacopo Colonnelli,Marco Aldinucci |dblpUrl=https://dblp.org/rec/conf/itadata/BresciaCA24 }} ==Performance Analysis on DNA Alignment Workload with Intel SGX Multithreading== https://ceur-ws.org/Vol-3785/paper107.pdf
                         Performance Analysis on DNA Alignment Workload with
                         Intel SGX Multithreading
                         Lorenzo Brescia1,* , Iacopo Colonnelli1 and Marco Aldinucci1
                         1
                             University of Turin, Computer Science Department, Alpha research group


                                        Abstract
                                        Data confidentiality is a critical issue in the digital age, impacting interactions between users and public services
                                        and between scientific computing organizations and Cloud and HPC providers. Performance in parallel computing
                                        is essential, yet techniques for establishing Trusted Execution Environments (TEEs) to ensure privacy in remote
                                        environments often negatively impact execution time. This paper aims to analyze the performance of a parallel
                                        bioinformatics workload for DNA alignment (Bowtie2) executed within the confidential enclaves of Intel SGX
                                        processors. The results provide encouraging insights regarding the feasibility of using SGX-based TEEs for
                                        parallel computing on large datasets. The findings indicate that, under conditions of high parallelization and
                                        with twice as many threads, workloads executed within SGX enclaves perform, on average, 15% faster than
                                        non-confidential execution. This empirical demonstration supports the potential of SGX-based TEEs to effectively
                                        balance the need for privacy with the demands of high-performance computing.

                                        Keywords
                                        Confidential computing, Parallel computing, Intel SGX, Gramine, Occlum




                         1. Introduction
                         In recent years, the awareness of the need for privacy has gained significant prominence. In the digital
                         age, where information is predominantly stored and transmitted electronically, concerns regarding
                         the protection of sensitive data have become increasingly prevalent. This confidential information
                         can be extracted and reused without the knowledge or consent of the data owner, posing severe
privacy risks. This issue is not confined to the interaction between individuals and digital services; it
                         extends across various fields of scientific computing where data confidentiality is indispensable. Notable
                         examples include bioinformatics, which processes DNA and genomic data; medical research that handles
                         patient health records; epidemiology, particularly highlighted during the recent COVID-19 pandemic;
                         and social sciences that address sensitive topics such as mental health, income levels, and political
                         polarization. Economic considerations also drive the imperative to safeguard sensitive information. For
                         instance, in economics, processing financial data for trading purposes necessitates stringent privacy
                         measures. Similarly, in chemoinformatics, the discovery of drugs and molecular simulations, which
                         possess significant commercial value, require robust data protection to prevent unauthorized access
                         and exploitation.
                            For these reasons, it is imperative to adopt techniques that protect sensitive data at all stages.
                         In scientific computing, private organizations often lack the computational power to perform their
                         calculations. The simplest and most commonly used solution is outsourcing computation to a remote
                         location by renting the necessary hardware resources. A typical example of this is cloud computing,
                         where resources are allocated on demand, and an ecosystem exists to facilitate the execution of workloads
                         seamlessly. Data protection is typically considered in two primary contexts: at rest (in storage) and in
                         transit (during transmission over the network). However, it is less common to consider the vulnerability

                         BigHPC2024: Special Track on Big Data and High-Performance Computing, co-located with the 3rd Italian Conference on Big Data
                         and Data Science, ITADATA2024, September 17 – 19, 2024, Pisa, Italy.
* Corresponding author.
Email: lorenzo.brescia@unito.it (L. Brescia); iacopo.colonneli@unito.it (I. Colonnelli); marco.aldinucci@unito.it (M. Aldinucci)
URL: https://alpha.di.unito.it/lorenzo-brescia/ (L. Brescia); https://alpha.di.unito.it/iacopo-colonnelli/ (I. Colonnelli); https://alpha.di.unito.it/marco-aldinucci/ (M. Aldinucci)
ORCID: 0009-0005-1147-496X (L. Brescia); 0000-0001-9290-2017 (I. Colonnelli); 0000-0001-8788-0829 (M. Aldinucci)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

of data during computation. Once a program starts executing on a remote machine, such as in cloud
computing, there is often no control or protection over the data in the main memory. Confidential
computing addresses this issue using trusted hardware to ensure data protection during execution.
This approach breaks the chain of trust between the user and the external provider by introducing
an additional entity in the trust process, the hardware manufacturer. This indirection step helps
safeguard data while it is being processed, enhancing overall data security in outsourced computational
environments. Figure 1 illustrates the entities involved and their relationships when a general user
utilizes a provider's remote resources.


Figure 1: Remote computing scheme with or without confidential computing involved


Without implementing confidential computing, the user transfers
the computation to the provider. Even if the sensitive data is encrypted during transmission and on
storage, it becomes vulnerable once it is decrypted for execution in the main memory. This exposure
occurs because the data is no longer encrypted during processing, making it susceptible to risks in a
multitenant environment, where potentially malicious workloads from other users may exist or if the
provider is compromised or has malicious intent. In such scenarios, the user has no choice but to
trust unquestioningly a provider that is inherently untrusted. Confidential computing changes this
dynamic by breaking the direct trust relationship between the user and the provider. Trusted hardware
components, designed by the hardware manufacturer (e.g., CPU or GPU), incorporate specific features
that ensure the confidentiality and integrity of the user’s program during execution. This enables the
user to establish an indirect trust relationship with the provider. Instead of trusting the provider directly,
the user trusts the hardware manufacturer, which in turn supplies trusted components to the provider.
This approach ensures that the user’s data remains secure while being processed on the provider’s
infrastructure.
   The purpose of this paper is to conduct a performance analysis on the use of Intel SGX processors as
trusted hardware. The study is performed on Bowtie2, a bioinformatics application for DNA alignment.
Section 2 explains all the necessary background: what Intel SGX CPUs are and how they
can be exploited with Gramine and Occlum to facilitate their use. Furthermore, some reasons are
given for the choice of Bowtie2 as the workload to assess performance. Section 3 discusses related work
considering other SGX frameworks besides Gramine and Occlum. In addition, an overview of previous



SGX performance studies in the High-Performance Computing (HPC) domain is provided. In Section 4,
the configurations implemented to execute Bowtie2 in native and within SGX enclaves are explained.
In Section 5, the results of the previously configured environment are illustrated, and finally, in Section
6, conclusions and possible future works are presented.


2. Background
2.1. Intel SGX
Intel Software Guard Extensions (SGX) [1] is a technology implemented in Intel processors designed
to protect processes during execution by ensuring confidentiality and integrity of the main memory.
Intel SGX extends the Instruction Set Architecture (ISA) with instructions that enable the creation of
Trusted Execution Environments (TEEs) [2], referred to as enclaves in Intel's terminology. These enclaves
are secure memory regions that provide protection even against privileged system software, such as
operating systems or hypervisors. Activating SGX features involves a non-trivial process. There are
primarily two approaches to achieve this:
Rewriting application code involves modifying the application code using the libraries provided by
    Intel’s Software Development Kit (SDK) [3] to manage enclaves. While this approach allows for
    granular control over what should be protected - down to the level of a single instruction - the
    effort required for the porting is considerable.
Using frameworks to execute existing applications aims to simplify application deployment by
     allowing them to run entirely within an enclave without significant rewriting. Several frameworks
     support this method, including Gramine and Occlum Library Operating System (LibOS), which
     facilitate the execution of legacy applications within enclaves.
   Intel SGX has evolved, and the community recognizes two main versions: SGXv1 and SGXv2. These
versions differ primarily in efficiency improvements and enclave size capacities, with SGXv2 supporting
enclaves up to 512 GB (compared to 128 MB in SGXv1) and introducing Enclave Dynamic Memory Management
(EDMM) [4]. EDMM allows dynamic allocation of Enclave Page Cache (EPC) pages as needed, rather than requiring
a predefined enclave size at startup time, although this feature can be complex and inefficient to
implement. A notable capability of Intel SGX processors is the concurrent execution of the same enclave
code using multiple threads. Each thread is associated with an EPC page of type Thread Control Structure
(TCS); this requires prior knowledge of the number of threads to ensure sufficient EPC allocation.
This requirement is alleviated when EDMM is enabled, since EPC pages can be
allocated after the enclave's creation. Another key feature of Intel SGX is remote attestation, which allows
a remote user to verify the correct instantiation of an enclave on an SGX processor. This is not the
focus of our work; in short, the remote attestation process verifies the hash of the enclave and relies on
Intel’s certificates as the root of trust. There are principally two attestation schemes for SGX: Enhanced
Privacy ID (EPID) [5] and Data Center Attestation Primitives (DCAP) [6].
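As a practical aside, whether a Linux host exposes SGX at all can be probed before any of the above is attempted. The sketch below is a hedged convenience check, not part of the paper's setup: the feature flag lives in /proc/cpuinfo, and the device node names follow the upstream Linux SGX driver (kernel >= 5.11) and may differ with out-of-tree drivers. Such a check is no substitute for remote attestation.

```shell
# Quick host-side probe for SGX availability (a sketch; device node names
# follow the in-kernel Linux SGX driver and may differ elsewhere).
if grep -qw sgx /proc/cpuinfo 2>/dev/null; then
    echo "CPU advertises the sgx feature flag"
else
    echo "No sgx feature flag reported"
fi
# The in-kernel driver exposes these device nodes when SGX is usable:
for dev in /dev/sgx_enclave /dev/sgx_provision; do
    if [ -e "$dev" ]; then
        echo "found $dev"
    fi
done
```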

2.2. Gramine
Gramine [7], initially known as Graphene [8], is a LibOS designed to enable unmodified Linux binaries
to run within Intel SGX enclaves. The core purpose of a LibOS is to intercept system calls from an
application and resolve them directly within user space whenever possible. Gramine extends this
capability by integrating support for SGX, ensuring that the entire application, including the LibOS
itself, operates within an SGX enclave transparently to the user. To execute an application with Gramine,
the required effort is minimal and involves writing a manifest in a declarative manner. This manifest
specifies all options necessary for the execution and customization of SGX features. Once the manifest is
prepared, the workload can be executed using a set of commands from the Gramine toolchain. Although
this LibOS was one of the first to support SGX, it remains highly competitive and continuously evolves
to incorporate new SGX features, such as EDMM of SGXv2.



   One of Gramine’s most notable properties is its support for multiprocessing and related system
calls, such as fork, vfork, clone, and execve. This support allows multiprocessing to be handled
transparently, much like in non-SGX environments. For example, when a fork occurs, a second enclave
is created, and the content is copied using message passing. Before this, a local attestation procedure is
conducted between the enclaves, establishing a TLS secure channel for future communications. This
method of handling multiprocessing is known as Enclave-Isolated Processes (EIP) (Figure 2a), where
each enclave contains an instance of LibOS.
   The EIP approach is inherently expensive in terms of execution time. Creating a process within an
enclave is costly, and inter-enclave communication requires exchanging encrypted messages over a
secure TLS channel. However, despite these disadvantages, the EIP method has significant advantages.
The primary purpose of a LibOS with SGX integration is to facilitate the transition of workloads from
an unsafe environment to an enclave. By supporting system calls like fork and adopting EIP for
multiprocessing, Gramine allows applications that use multiple processes to be deployed quickly, with
no additional effort than single-process applications. This ease of deployment is crucial for transitioning
existing applications to secure SGX environments.




Figure 2: Different ways of handling parallelism using SGX enclaves. (a) Enclave-Isolated Processes (EIP), where
each enclave is a separate process. (b) SFI-Isolated Processes (SIP), where a single enclave is used, and tasks are
executed within a single address space.



2.3. Occlum
Occlum [9] is a toolchain that includes a LibOS designed to run applications inside SGX enclaves.
To facilitate the transition of existing applications, the Occlum toolchain provides various utilities to
prepare all necessary configurations for the building and running phases. Occlum aims to implement
a LibOS that efficiently handles multitasking, a generic term referring to the parallel execution of
multiple tasks. Occlum achieves this through a Software Fault Isolation (SFI) scheme called MPX-based,
Multi-Domain SFI (MMDSFI). In the MMDSFI scheme, each process resides alongside the LibOS within
the single address space of an enclave. This approach, known as SFI-Isolated Processes (SIPs) (Figure 2b),
contrasts with the EIP scheme used by other LibOSes such as Gramine. The term "process" in the SIP
scheme is somewhat misleading because the enclave maintains a single address space. Consequently,
traditional process creation using the fork system call is not feasible, as it would require duplicating the
parent's address space at the same virtual addresses. Instead, Occlum creates processes using the spawn system call,
mapping each process to an SGX thread. This limitation means that applications relying on fork-like
system calls cannot run within Occlum’s LibOS without modification. However, the SIP scheme offers
significant advantages, such as reducing the cost of setting up new enclaves (creation, local attestation,
and duplication of the parent process state) and lowering the communication cost between enclaves.
The primary disadvantage of the SIP scheme is the reduced portability of existing applications that
utilize fork. To address this, intermediate work - potentially nontrivial, or even impossible - may be



required to replace fork calls with spawn. This additional effort can be a barrier for some applications,
but the overall benefits of the SIP scheme can make it a worthwhile trade-off for some use cases.

2.4. Bowtie2: DNA alignment
Bowtie2 1 ([10], [11] and [12]) is a tool used for aligning sequencing reads to large genomes. During the
alignment, the DNA sequences are compared to identify regions of similarity. This process is crucial for
various applications, such as identifying genetic variations. Bowtie2 was selected as the performance
evaluation workload in this paper for several logical considerations:

Memory-Intensive Application Bowtie2 is memory-intensive, making it an ideal candidate for evalu-
   ating the overhead associated with SGX, which aims to secure the main memory using encryption
   techniques.

Sensitive Data Analysis DNA sequence analysis involves highly sensitive data that must be protected,
     especially in remote environments like cloud providers. Using Bowtie2 helps assess the
     effectiveness of SGX in safeguarding this data.

Multithreading Performance Bowtie2’s performance can be tuned through multithreading. While
     using multiple threads typically enhances performance, evaluating this in the context of SGX
     threads is particularly insightful, as the benefits may not be as straightforward due to the additional
     overhead and security constraints imposed by SGX.


3. Related work
3.1. Other SGX technologies
Besides Gramine and Occlum, there are other technologies whose purpose is to make it easy to run
existing applications inside SGX enclaves:

       • Haven [13] is one of the pioneering approaches to execute an entire LibOS within an SGX enclave,
         enabling the execution of unmodified Windows binaries securely.
       • SCONE [14] ensures the confidentiality and integrity of containerized applications by leveraging
         SGX. Unlike LibOS, SCONE uses a thinner shielding layer to protect the application from the
         untrusted host OS. This means there is no entire LibOS within the enclave, but only some considerably
         lighter shielding modules.
       • Panoply [15] is another approach that tries to minimize the amount of code that needs to reside
         inside an SGX enclave. It introduces the concept of a micro-container, which encapsulates units
         of code and data isolated within SGX enclaves.
       • SGX-LKL [16] enables Linux binaries to run inside SGX enclaves, similar to a LibOS approach but
         based on the Linux Kernel Library (LKL). It combines the flexibility of Linux with the security
         benefits of SGX, providing a lightweight solution for running Linux-based applications securely
         within enclaves.
       • Ryoan [17] leverages SGX to process sensitive data securely in environments considered untrusted,
         both in terms of the application to run and the platform itself.

3.2. SGX performance analysis
Performance represents a significant concern in the realm of confidential computing. Although the goal
is to achieve privacy, it is crucial not to compromise the execution time in chasing it. The study [18]
conducted a performance evaluation using HPC benchmarks within SGX enclaves. The work included a
comparison of performance between Gramine and Occlum, even if this comparison is inherently limited
1
    https://github.com/BenLangmead/bowtie2



due to Occlum’s lack of support for multiprocessing, which is particularly relevant in HPC contexts. To
address this limitation, our work focuses on evaluating a single real-world multithreaded workload
rather than synthetic benchmarks. This approach ensures a fair comparison between Gramine and
Occlum, providing valuable insights into their performance.
  Another performance study [19] compares TEEs based on Intel SGX and AMD Secure Encrypted Virtualization
(SEV). Specifically, SCONE is employed to execute on SGX. HPC benchmarks have been
used, encompassing traditional scientific computing, machine learning tasks, and graph analytics.
  In our own recent work [20], the reference workload focused on the initial two steps of the Next
Generation Sequencing (NGS) variant calling pipeline, which has been fully migrated to a cloud-based
HPC environment [21]. Specifically, one of these steps involves the execution of Bowtie2 using Gramine.


4. Methods
This section outlines the setup of execution environments for the Bowtie2 DNA alignment bioinformatics
workload. The configurations were designed to ensure fairness across the different LibOS environments
(Gramine and Occlum). Only crucial aspects of the configuration files are presented for each setup.
Both LibOSes were established using Dockerfiles, created based on the existing Docker images provided
by the respective maintainers. A public GitHub repository2 was established to provide insight into
the configurations implemented for running in various environments. However, due to confidentiality
concerns, it was not possible to publish the DNA reads input data.

4.1. Bare-metal
To use Bowtie2 on a native system, one can easily rely on package managers such as Bioconda3,
which provides a distribution of bioinformatics software as a channel for the versatile Conda4 package
manager. However, in this study, the executables were built directly from the downloaded sources to
facilitate fair comparisons between all execution environments (bare-metal, Gramine, and Occlum). In
order to run Bowtie2, it is necessary to specify the basename of the index for the reference genome
and the two files containing the paired-end reads (short DNA sequences). An example command for
performing the alignment against the human hg38 genome is:
bowtie2 -S "out.sam" -x "Homo_sapiens_assembly38" \
        -1 "sample.r_1_val_1.fq.gz" -2 "sample.r_2_val_2.fq.gz" \
        -p num_of_threads
   In this command, the -x option is used to specify the reference genome. The -S option designates
the output file in .sam (Sequence Alignment/Map) format, and the -1 and -2 options are for the
compressed paired-end reads in .fq (FASTQ) format. The -p option specifies the number of parallel
threads to be used for searching; each thread runs on a different core, enabling all threads to find
alignments in parallel.
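The -x basename refers to a prebuilt index of the reference genome. For completeness, the following is a hedged sketch of how such an index is typically produced with the bowtie2-build tool that ships with Bowtie2; the FASTA file name is an assumption, and the command is guarded so the sketch is inert where Bowtie2 is not installed.

```shell
# Build the index files (Homo_sapiens_assembly38.*.bt2) consumed by bowtie2 -x.
# The input FASTA name is an assumption; adjust to the actual reference file.
if command -v bowtie2-build >/dev/null 2>&1; then
    bowtie2-build --threads 8 Homo_sapiens_assembly38.fasta Homo_sapiens_assembly38
else
    echo "bowtie2-build not found; command shown for reference only"
fi
```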

4.2. Gramine
A manifest must be compiled to run an unmodified Linux binary inside an SGX enclave using Gramine.
This manifest contains all the configuration information about the LibOS and the SGX enclave. In
the Gramine toolchain, the gramine-manifest executable processes a manifest template, which can
include Jinja5 syntax for customization. Using this template simplifies the creation of the manifest and
allows for more flexible configuration. To streamline the process of creating the manifest required to
run Bowtie2 (bow.manifest), a Makefile was written that also includes the recipe below:

2
  https://github.com/lorenzobrescia/performance-SGX-Bowtie2
3
  https://bioconda.github.io
4
  https://docs.conda.io/en/latest/
5
  https://jinja.palletsprojects.com



bow.manifest: manifest.template
	gramine-manifest -Dthreads=num_of_threads $< > $@
   As can be observed from the previous recipe, a manifest.template must be prepared in order
to generate bow.manifest. In the template file, all the arguments needed for execution are passed as
environment variables in the following Gramine option:
loader.argv = ["/bowtie2-align-s", "-S", "/out.sam",
"-x", "/Homo_sapiens_assembly38", "-1", "/sample.r_1.fq.gz",
"-2", "/sample.r_2.fq.gz", "-p", "{{ threads }}"]
   The options specified in the manifest.template are self-explanatory in relation to the bare-metal
execution of Bowtie2. It is important to note that the bowtie2-align-s binary is run directly, rather
than the bowtie2 command itself: the latter is a Perl wrapper that selects the appropriate aligner to use.
The wrapper is bypassed to simplify the process and ensure a smoother comparison with Occlum.
Consequently, bowtie2-align-s is set as the
LibOS entry point in the manifest.template, meaning it is the code executed immediately after the
enclave is ready:
libos.entrypoint = "/bowtie2-align-s"
   For handling the EDMM feature, Jinja syntax was used, still within manifest.template. If the
environment variable edmm is set to 1, the feature is enabled; otherwise, it is not. This configuration
also allows specifying the size of the enclave and the number of threads available inside the enclave.
The semantics of these configurations differ depending on whether the EDMM function is enabled.
With EDMM enabled, sgx.enclave_size refers to the maximum size the enclave can reach, and
sgx.max_threads represents the number of TCS EPCs allocated before execution. If more threads
are required during execution, additional TCS pages will be created on demand. If EDMM is dis-
abled, the options are straightforward: sgx.enclave_size sets the fixed size of the enclave, and
sgx.max_threads specifies the total number of threads that can be used, both set at the time of
enclave creation. The following snippet implements what has just been described:
{% if env.get('edmm', 0) == '1' %}
    sgx.edmm_enable = true
    sgx.enclave_size = "max_enclave_size"
    sgx.max_threads = number_of_preallocated_threads
{% else %}
    sgx.edmm_enable = false
    sgx.enclave_size = "enclave_size"
    sgx.max_threads = max_number_of_threads
{% endif %}
   Once the bow.manifest is obtained from the Makefile, the SGX manifest (bow.manifest.sgx) is
also created using the Gramine toolchain, and finally, the application is run simply with the command:
gramine-sgx bow
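Put together, the full flow can be sketched as follows. This is a hedged outline rather than a definitive recipe: the thread count is an example value, an enclave signing key is expected to exist (e.g. one generated with gramine-sgx-gen-private-key), and the commands are guarded so the sketch is inert on machines without the Gramine toolchain.

```shell
# Template -> manifest -> signed SGX manifest -> enclave execution.
# Guarded so the sketch does nothing where Gramine is not installed.
if command -v gramine-manifest >/dev/null 2>&1; then
    gramine-manifest -Dthreads=32 manifest.template bow.manifest
    gramine-sgx-sign --manifest bow.manifest --output bow.manifest.sgx
    gramine-sgx bow
else
    echo "Gramine toolchain not installed; commands shown for reference only"
fi
```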


4.3. Occlum
To launch a Linux executable inside Occlum, it is necessary to create a workspace that includes the LibOS
image that will host the executable inside the enclave. Occlum provides a comprehensive toolchain
to facilitate the deployment of this instance. First, the workspace is created using the occlum init
command. Subsequently, the file system inside the LibOS must be configured. This configuration is
achieved using the copy_bom tool, where an input file bow.yaml specifies that the bowtie2-align-s
executable is to be mounted inside the /bin folder. This process ensures the executable is correctly
placed within the LibOS image for execution inside the SGX enclave. To achieve what has just been
described, the file bow.yaml must contain the following configuration:



    Table 1
    Experiment configurations. Both LibOSes have been created starting from the authors' public Docker im-
    ages: Gramine (gramineproject/gramine:stable-focal) and Occlum (occlum/occlum:0.30.0-ubuntu20.04).
    Some options are N/A because they cannot be specified in the referenced LibOS.

                                              Gramine [Docker]    Occlum [Docker]
                  No EDMM
                    Enclave size                    8 GB               8 GB
                    Number of threads                38                 38
                  EDMM
                    Init enclave size                N/A              512 MB
                    Max enclave size                32 GB              32 GB
                    Init number of threads           32                 32
                    Max number of threads            N/A                38

targets:
  - target: /bin
    copy:
      - files:
          - bowtie2-align-s
Next, it is necessary to configure the Occlum.json file, which describes all the characteristics of the
SGX enclave. This configuration includes essential information. In cases where EDMM is not active, it
is possible to specify the enclave size and the maximum number of threads in this way:
" resource_limits ": {
     " user_space_size " : " enclave_size " ,
     " max_ n u m _ o f _ t h r e a d s " : n u m _ m a x _ o f _ t h r e a d s
}
Instead, the following options should be additionally specified to configure EDMM:
" resource_limits ": { . . .
     " user_space_max_size " : " enclave_max_size " ,
     " init_num_of_threads " :   num_of_preallocated_threads
}
   Thus, each Occlum.json file fixes whether EDMM is enabled or not. Consequently, two different
.json configuration files were created to delineate the desired features for the experiments. The
occlum build command is used to construct the Occlum SGX enclave and generate its associated file
system image according to the specifications in the Occlum.json configuration file. Finally, to run
Bowtie2, the following command must be executed, specifying all the necessary options:
occlum run bowtie2-align-s -x "/Homo_sapiens_assembly38" \
       -1 "sample.r_1_val_1.fq.gz" -2 "sample.r_2_val_2.fq.gz" \
       -S "out.sam" -p num_of_threads
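The whole Occlum sequence described in this subsection can be sketched end to end as below. The workspace name is an assumption, the copy_bom template directory is the usual location inside the Occlum Docker image, and the commands are guarded so the sketch is inert without the toolchain.

```shell
# Init workspace -> populate LibOS image -> build enclave -> run inside it.
if command -v occlum >/dev/null 2>&1; then
    mkdir -p occlum_workspace && cd occlum_workspace
    occlum init
    copy_bom -f ../bow.yaml --root image --include-dir /opt/occlum/etc/template
    occlum build
    occlum run /bin/bowtie2-align-s -x "/Homo_sapiens_assembly38" \
        -1 "sample.r_1_val_1.fq.gz" -2 "sample.r_2_val_2.fq.gz" \
        -S "out.sam" -p 32
else
    echo "Occlum toolchain not installed; commands shown for reference only"
fi
```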


5. Results
To assess the performance of Bowtie2 across various environments, we utilized the configurations
detailed in Table 1. The experimental setup involved a machine powered by an Intel Xeon Gold 6346
CPU operating at 3.10 GHz, with an available memory capacity of approximately 400 GB RAM.
   Figure 3 illustrates the execution times of Bowtie2 under various configurations for both small and
large input sizes. Each experiment was performed 10 times. Since no significant variance or outliers
were observed, the mean value was considered representative of the configurations. A small input
size refers to aligning approximately 10,000 reads, while a large input size involves aligning about



3 million reads. As shown in Figure 3a, native execution completes rapidly within seconds for small
workloads. However, both LibOSes exhibit poor performance in this scenario, although Occlum
noticeably outperforms Gramine. Furthermore, there is a lack of scalability: increasing the
number of threads does not significantly reduce execution time, even in the bare-metal configuration.
Enabling EDMM generally leads to a stable and acceptable increase in execution times across most
cases, except when Bowtie2 necessitates 32 threads on Gramine. The excessive overhead observed may
result from the dynamic management of the TCS enclave pages. As indicated in Table 1, Gramine’s
EDMM configuration preallocates 32 threads. Although Bowtie2 operates with exactly 32 threads,
Gramine requires at least three additional threads for managing inter-process communication (IPC),
asynchronous tasks, and secure TLS communication within the LibOS, and the overhead likely arises
from the effort needed to allocate these supplementary threads. These findings discourage the adoption
of SGX technologies due to the unacceptable overhead compared to the native case and the absence of
scalability. However, it is worth noting that scalability is also lacking in the native case. Consequently,
the experiment was repeated with the same configurations detailed in Table 1 but applied to a much
larger number of sequences, and the results are depicted in Figure 3b. Some patterns evident in the small
input size scenario are also observed here. For instance, in the bare-metal environment, execution times
are significantly faster compared to those in the LibOSes, and Gramine’s dynamic thread management
severely impacts performance when Bowtie2 uses 32 threads. However, unlike the small input size case,
the plot indicates some scalability. All configurations exhibit good scaling, with Occlum performing
slightly better than Gramine again.
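The thread-budget mismatch behind Gramine's 32-thread overhead can be made concrete. The sketch below is illustrative only: the figure of three helper threads (IPC, asynchronous tasks, TLS) comes from the discussion above, the exact count is Gramine-version-dependent, and the function names are hypothetical.

```python
# Enclave TCS (Thread Control Structure) budget sketch.
# Gramine's internal helper threads each need an enclave thread slot
# in addition to the application's worker threads.
GRAMINE_HELPER_THREADS = 3  # IPC, async tasks, TLS; version-dependent

def required_tcs_slots(app_threads, helpers=GRAMINE_HELPER_THREADS):
    """Minimum enclave thread slots needed to run `app_threads` workers."""
    return app_threads + helpers

def needs_dynamic_allocation(app_threads, preallocated_slots):
    """True if the preallocated pool is exceeded, so extra TCS pages
    must be allocated dynamically (via EDMM)."""
    return required_tcs_slots(app_threads) > preallocated_slots

# With 32 preallocated slots, a 32-thread Bowtie2 run overflows the
# pool (32 + 3 > 32), forcing dynamic TCS page management.
```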
The critical consideration is whether the use of trusted hardware techniques such as Intel SGX is justified. Although
SGX provides privacy guarantees, it also significantly increases execution times. In the case illustrated
in Figure 3a, the technology appears infeasible due to the uneven trade-off between overhead and
privacy. Conversely, Figure 3b suggests that if the application is parallelizable, good scaling can be achieved
even with SGX computations as the number of threads increases. In detail, empirical evidence indicates
that running Bowtie2 on bare metal and then re-running the same workload on SGX with twice as
many threads often increases performance. This effect is further highlighted in Figure 4, which presents




Figure 3: Comparison of execution times of Bowtie2 on different environments. The results are the mean of
10 executions. (a) The input for Bowtie2 consists of 9,997 reads (small input size). (b) The input for Bowtie2
consists of 2,886,533 reads (big input size).

scalability comparison plots. Figure 4a demonstrates the performance gains of bare-metal execution
when the number of threads is doubled. Figures 4b and 4c provide comparisons between bare-metal and
Gramine, and between bare-metal and Occlum, respectively, under the same conditions. As observed,
doubling the threads under SGX often results in a performance gain compared to the native case. In some
instances, the gain can be substantial; for example, Occlum with two threads shows a 38.96% performance
increase over single-threaded bare-metal execution. Nevertheless, performance gains
are not always achievable, particularly when approaching the scalability limits of the problem. For
example, in this bioinformatics workload, the performance gain from 16 to 32 threads is marginal, even
in the native case, yielding just a 67% improvement against the average 95% increase observed at lower
thread counts. Specifically,



when comparing 16 native threads to 32 threads in Gramine, there is a performance decrease of 46%,
while Occlum shows a decrease of 23% in the same conditions. However, excluding the latter case,
SGX with twice as many threads not only eliminates the overhead compared to non-confidential native
execution but also achieves, on average, a 15% performance gain.
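The percentage gains quoted throughout this section follow from a simple relative-speedup computation. The sketch below shows the formula with hypothetical timings; the numbers are illustrative and are not the measured data behind Figure 4.

```python
def performance_gain(baseline_seconds, candidate_seconds):
    """Percentage improvement of `candidate` over `baseline`.

    Positive values mean the candidate configuration is faster;
    negative values mean it is slower.
    """
    return (baseline_seconds - candidate_seconds) / baseline_seconds * 100.0

# Hypothetical example: bare metal with t threads vs. an SGX LibOS
# with 2t threads. The timings are made up for illustration.
bare_metal_16_threads = 100.0  # seconds, hypothetical
occlum_32_threads = 85.0       # seconds, hypothetical
print(f"gain: {performance_gain(bare_metal_16_threads, occlum_32_threads):.1f}%")
```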




Figure 4: Bowtie2 with big input size: comparison of the scalability of different environments with respect
to bare-metal execution. (a) Performance gain of bare metal vs. bare metal with double the threads. (b)
Performance gain of Gramine with double the threads vs. bare metal. (c) Performance gain of Occlum with
double the threads vs. bare metal.

  A final consideration that emerges from the experiments is that Occlum generally outperformed
Gramine in terms of execution time and scalability. However, it is essential to note that Gramine
supports multiprocess applications, unlike Occlum, which makes Gramine particularly attractive for
the portability of legacy workloads. A similar argument applies to the EDMM feature. Although, on
average, EDMM increases execution time, it simplifies the configuration of LibOSes by eliminating the
need to estimate the memory footprint, thus facilitating the portability of existing applications.


6. Conclusion and future work
This study provides a foundational performance analysis of Intel SGX for parallel executions. The
empirical observations obtained in our study offer essential insights into the feasibility
of employing SGX for this kind of execution. The results indicate that doubling the threads almost
invariably improves performance compared to an environment without hardware encryption techniques.
This scenario is entirely plausible in remote environments managed by external providers, as users
typically offload computations to remote systems due to insufficient local computational resources. In
addition, these findings suggest that SGX could effectively mitigate the inherent overhead associated
with encryption, thereby preserving privacy at runtime.
   Future work may expand this performance analysis in two directions. The first direction involves a
deeper exploration of SGX technologies as highlighted in Section 3.1, and a broader examination of
other types of hardware that enable the establishment of a TEE, such as AMD SEV or Intel Trust Domain
Extensions (TDX). The second direction focuses on an extensive analysis of multiprocess applications
designed for HPC centers, extending beyond bioinformatics to encompass more general applications. By
pursuing these two avenues, future research can provide a more comprehensive understanding of the
capabilities and limitations of various hardware-based security technologies in different computational
environments.


Acknowledgments
This work was supported by the Spoke 1 “FutureHPC & BigData” of ICSC - Centro Nazionale di Ricerca
in High-Performance Computing, Big Data and Quantum Computing, funded by European Union -
NextGenerationEU.



