         Monitoring Thread-Related Resource Demands of a
      Multi-Tenant In-Memory Database in a Cloud Environment
                           Dominik Paluch                                                          Johannes Rank
                    Chair for Information Systems                                          Chair for Information Systems
                    Technical University of Munich                                         Technical University of Munich
                         Garching, Germany                                                      Garching, Germany
                      dominik.paluch@in.tum.de                                               johannes.rank@in.tum.de

                          Harald Kienegger                                                        Helmut Krcmar
                   Chair for Information Systems                                          Chair for Information Systems
                   Technical University of Munich                                         Technical University of Munich
                        Garching, Germany                                                      Garching, Germany
                    harald.kienegger@in.tum.de                                                  krcmar@in.tum.de

ABSTRACT
Estimating the resource demand of a highly configurable software system like an in-memory database is a difficult task. Many factors such as the workload, flexible resource allocation, multi-tenancy and various configuration settings influence the actual performance behavior of such systems. Cloud providers offering Database-as-a-Service applications need to monitor and account for these factors in order to utilize their systems in an efficient and cost-effective manner. However, only observing the CPU utilization of the database's processes, as done by traditional performance approaches, is not sufficient to accomplish this task. This is especially relevant for environments with multiple active database tenants, which add another level of complexity to the thread handling on multiple layers such as the database management system and the operating system. In this paper, we propose a fine-grained monitoring setup that allows us to analyze the performance of virtualized multi-tenant databases. Our focus is on extensively collecting and analyzing performance data on a thread level. We utilize this setup to show the performance influence of varying database configuration settings, different workload characteristics, multi-tenancy and virtualization features. To this end, we conducted several experiments by applying the TPC-H benchmark, generating OLAP workload on a SAP HANA database in a virtualized environment. In our experiments, we show a strong dependency between the specific type of workload and the performance. Furthermore, we analyze the workload-dependent performance improvements and the performance degradation when changing the runtime configuration.

CCS CONCEPTS
• General and reference → Measurement; Performance; • Information systems → Main memory engines; Database performance evaluation;

KEYWORDS
In-memory Database; Performance Analysis; Cloud Computing; Multi-Tenancy; SAP HANA

31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019-14.06.2019, Saarburg, Germany.
Copyright is held by the author/owner(s).

1 INTRODUCTION
Cloud-based services are becoming more and more attractive to enterprises by offering technical and financial advantages over in-house data centers. Furthermore, the demand for database services in the cloud, also referred to as Database-as-a-Service (DaaS), has been increasing during recent years. The growing density of memory chips combined with decreasing prices permits the operation of databases holding an enterprise application's complete operational data set in memory. These so-called in-memory databases have advantages over traditional disk-based databases regarding performance when performing big data analytics or processing analytical data in general. For this reason, they are optimally suited for online analytical processing (OLAP) workload. In these cases, parallel computing methods are used to optimize the performance of OLAP workload, resulting in the generation of multiple parallel threads [3].
   Utilizing virtualization and multi-tenancy features allows cloud providers to decrease their operational costs, such as hardware, energy and software licensing costs. Furthermore, these features help them to reduce administration efforts in order to provide their services in a cost-efficient and scalable fashion [1]. However, it is crucial to understand the impact of different influence factors and setups on the performance behavior of in-memory databases when utilizing these features [2]. Recent work has shown that the efficient usage of threads is an important performance aspect for all in-memory databases that process OLAP workload [7, 11]. Current research either puts a focus on the operation of a multi-tenant database system itself [9, 12] or puts a more generic focus on the underlying virtualization layer [4, 10]. However, cloud providers often utilize both concepts. They use virtualization in order to increase the efficiency and improve the flexibility of their hardware usage, and multi-tenancy features on the database layer to allow further reductions in the maintenance effort and in the overhead of multiple virtual machines (VMs). Thus, it is expedient to consider not only the performance of a multi-tenant database or the virtualization layer, but both concepts in combination.
   Furthermore, cloud providers often utilize only high-level monitoring, which does not allow them to specifically monitor the utilization of threads by a process. In our work, we noticed a fully utilized CPU even in scenarios with less intense workload.
Thus, monitoring only the CPU utilization is not enough to draw conclusions on the current workload intensity. In order to identify critical workload scenarios, a more fine-grained monitoring is necessary to capture all performance-relevant aspects.
   This article is an extension of [11] and addresses the challenge of examining the performance behavior of the in-memory database SAP HANA in the context of a cloud provider's virtualized environment while considering various configuration changes. These include changes regarding the database system, the virtualization layer, the workload parametrization and the multi-tenancy features. Furthermore, a focus is set on thread efficiency, taking into account the high parallelization when processing OLAP workload. Thus, the following aspects were considered when designing our setup:
   Workload Characteristics: Regarding the workload, we consider the workload intensity and the definition of a user in the context of our benchmark setup. In this paper, we extend previous work by varying the type and intensity of the workload.
   Multi-Tenancy: We measure the threading characteristics of the single statements to get additional insights into threading efficiency. We also set various limits on the number of concurrently active threads in the HANA database to show the performance impact on the response time of the individual statements. In addition, we vary the number of concurrently active tenant databases to consider multi-tenant scenarios as well as single-tenant scenarios.
   Virtualization Aspects: When considering threading, it is important to include aspects of the virtualization layer. We consider the dynamic assignment of processing resources to a VM running a multi-tenant in-memory database. Since the simultaneous multithreading (SMT) technology has an influence on the processing time of threads [5], we also perform benchmarks to quantify this impact.

2 METHODOLOGY

2.1 Hardware and Software Setup
We used the established OLAP benchmark suite TPC-H on a SAP HANA database for our experiments [13]. To conduct the benchmarks, we used HANA version 2.0 SP 2 in a multi-tenant configuration. In total, five tenants have been configured. We filled the created database tenants with data sets generated with a scale factor of 30, as proposed by the benchmark guidelines. This resulted in tenant sizes of 30 GB each. To avoid unwanted performance interferences between the tenants, we created individual data sets for each tenant.
   We chose SUSE Linux Enterprise Server (SLES) 12 SP2 as the underlying operating system. Our experiments were conducted on two VMs on an IBM Power E870 server with four CPU sockets populated with Power8 CPUs. In total, the server offered 40 physical CPU cores operating at a clock frequency of 4.19 GHz. To operate the VMs, the server utilized the firmware-based hypervisor platform IBM PowerVM. In this context, the VMs are also referred to as logical partitions (LPARs). The server was equipped with 4096 GB RAM, of which we assigned 256 GB to the LPAR running the HANA database. Furthermore, we utilized different CPU assignments and varied between two and four CPU cores. Through the utilization of the IBM tool ppc64_cpu, we modified the configuration of the SLES operating system and varied the SMT configuration. We conducted benchmark runs with SMT turned off, SMT-2, SMT-4 and SMT-8. The different SMT configurations denote a hardware-based multithreading feature of the Power8 CPUs. In this context, the digit of the identifier equals the number of threads per core, i.e. SMT-4 equals four threads per core. In order to minimize the performance impact of our benchmark setup on the database, we utilized another LPAR on the same server for our benchmark driver. In this setup, both LPARs are connected via a virtual switch, which allowed us to exclude any network-related performance impact, since this aspect is not within the scope of this paper. To perform the benchmarks, we utilized customized shell scripts as the benchmark driver.
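For illustration, the following is a minimal sketch, not the authors' actual shell scripts, of how the SMT mode of the database LPAR could be switched and verified between benchmark runs with the IBM ppc64_cpu tool. The loop over the modes and the helper names are illustrative assumptions; running ppc64_cpu requires root privileges on the LPAR.

```python
import subprocess

SMT_MODES = ["off", "2", "4", "8"]   # SMT disabled, SMT-2, SMT-4, SMT-8


def set_smt_mode(mode: str) -> None:
    """Set the number of hardware threads per core (e.g. '4' -> SMT-4)."""
    subprocess.run(["ppc64_cpu", f"--smt={mode}"], check=True)


def current_smt_mode() -> str:
    """Return the SMT setting currently reported by ppc64_cpu."""
    result = subprocess.run(["ppc64_cpu", "--smt"], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    for mode in SMT_MODES:
        set_smt_mode(mode)
        print("SMT setting:", current_smt_mode())
        # ... one complete TPC-H benchmark pass would run here ...
```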
Table 1: Layers of our monitoring solution

  Layer   Description
  l1      The resource consumption of the jobworker threads
  l2      The resource consumption of the indexserver processes of the individual tenant databases
  l3      The total resource consumption of the LPAR

2.2 Experimental Design
In previous work, we could show that the performance of database tenants is strongly dependent on thread-related parameters [11]. These parameters have a major impact on the performance behavior. Thus, we created our experimental design with the objective to extensively collect thread-related metrics influencing the CPU-related resource demand.

2.2.1 Monitoring setup. Our monitoring setup aims to collect fine-grained information about the thread usage of the HANA database. We utilized the virtual file system /proc to collect performance data regarding relevant database processes and their thread utilization. The database processes an OLAP query via the so-called indexserver process, which is part of every database tenant. In a first step, the indexserver process invokes an SQL executor thread, which performs query optimizations, prepares the execution plan and identifies the query as an OLAP query. After the query has been identified as a complex OLAP query, it is delegated to the job executor thread. For parallel processing, the job executor assigns the query to multiple idle threads from a predefined thread pool. These threads are utilized as jobworker threads. Based on these processing steps, we decided to focus on the three layers described in Table 1 when processing the raw data from our script-based monitoring solution. Monitoring l1 allowed us to achieve fine-grained insights into the thread usage while processing OLAP queries. The jobworker threads consume most of the system resources when processing an OLAP query. Thus, we have put a clear focus on analyzing the resource usage of these jobworker threads. However, in order to consider the resource consumption of other threads, we decided to analyze the monitoring data from l2 and l3 in addition.
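To make the l1 monitoring concrete, the following sketch samples the thread list of one tenant's indexserver process via /proc and records the name, scheduling state, accumulated CPU time and last CPU core of every matching thread. It is a simplified illustration rather than the authors' shell scripts; in particular, the jobworker thread-name filter ("JobWrk") and the hard-coded PID are assumptions.

```python
import os
import time


def sample_threads(pid: int, name_filter: str = "JobWrk"):
    """Take one monitoring sample for all threads of a process whose thread
    name matches name_filter (the jobworker naming is an assumption)."""
    samples = []
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/comm") as f:
                name = f.read().strip()
            if name_filter not in name:
                continue
            with open(f"{task_dir}/{tid}/stat") as f:
                raw = f.read()
        except FileNotFoundError:
            continue  # the thread exited between listing and reading
        # Everything after the last ')' is a fixed field sequence (see proc(5)),
        # beginning with the scheduling state (field 3).
        fields = raw.rsplit(")", 1)[1].split()
        samples.append({
            "tid": int(tid),
            "name": name,
            "state": fields[0],                              # 'R' running, 'S' sleeping
            "cpu_ticks": int(fields[11]) + int(fields[12]),  # utime + stime (fields 14, 15)
            "cpu": int(fields[36]),                          # CPU core last run on (field 39)
        })
    return samples


if __name__ == "__main__":
    indexserver_pid = 4242   # placeholder: PID of one tenant's indexserver
    interval = 0.008         # close to the ~0.0079 s average resolution used later
    while True:
        print(time.time(), sample_threads(indexserver_pid))
        time.sleep(interval)
```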
   After conducting our benchmarks, we noticed an identical memory usage pattern in the database's various tenants. Furthermore, an in-memory database is generally designed to keep its operational data set in memory. Thus, we decided to exclude memory usage from our work and to put a focus on CPU utilization in our resource demand analysis. We decided to extract the following performance-relevant metrics from our raw data:
   query: Each TPC-H query reflects individual performance-relevant aspects of the database processing OLAP workload due to differences in the execution plan. Thus, each query has an individual performance behavior and needs to be monitored separately.
   response time: We utilized this metric as the main indicator for the performance behavior of an individual query. We defined this metric as the time from which the query was sent to the database until a result has been returned.
   processing time: In addition to the response time, we also considered the actual processing time of a query.
   active jobworker threads: We noticed that the jobworker threads were most relevant for processing an OLAP query. Thus, we counted the number of jobworker threads which have been involved when processing a query.
   rs ratio: To get further insights into the CPU utilization of the jobworker threads, we also analyzed the ratio between the total count of jobworker threads in the status running and those in the status sleeping.
   rs jumps: In our analysis, we also counted the number of times the jobworker threads changed their status from running to sleeping. This is important, since it allows us to draw conclusions about the thread utilization of a query.
   jw cpu: Furthermore, we considered the total CPU time consumed by the jobworker threads.
   total cpu: We included this metric to compare the CPU time consumed by the jobworker threads with the CPU time consumed by the whole system to ensure that no major interference by another thread has occurred.
   cpu jumps: Threads get assigned to different CPU cores by the operating system. However, the utilization of different cores by one thread increases the processing time of the thread. Thus, we counted the number of times the OS assigned a jobworker thread to a different CPU core.
   context switches: This metric is only available for the whole system. It describes the switching of the CPU from one thread to another. The system has to perform multiple time-consuming steps when performing a context switch. Thus, this metric indicates a negative impact on the performance.
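The sketch below shows how the thread-related metrics above could be derived from a sequence of /proc samples. The per-sample dictionary layout is the one assumed in the sampling sketch of Section 2.2.1; it illustrates the bookkeeping only and is not the authors' evaluation script.

```python
def derive_metrics(samples_over_time):
    """Derive rs ratio, rs jumps, cpu jumps and jw cpu from monitoring samples.

    samples_over_time -- one entry per monitoring interval; each entry is a list
    of dicts with 'tid', 'state', 'cpu' and 'cpu_ticks' (layout assumed from the
    sampling sketch above).
    """
    running = sleeping = rs_jumps = cpu_jumps = 0
    last_state, last_cpu = {}, {}
    first_ticks, last_ticks = {}, {}

    for sample in samples_over_time:
        for t in sample:
            tid = t["tid"]
            if t["state"] == "R":
                running += 1
            elif t["state"] == "S":
                sleeping += 1
            # rs jumps: the thread went from running to sleeping since the last sample
            if last_state.get(tid) == "R" and t["state"] == "S":
                rs_jumps += 1
            # cpu jumps: the OS moved the thread to a different core
            if tid in last_cpu and t["cpu"] != last_cpu[tid]:
                cpu_jumps += 1
            first_ticks.setdefault(tid, t["cpu_ticks"])
            last_ticks[tid] = t["cpu_ticks"]
            last_state[tid] = t["state"]
            last_cpu[tid] = t["cpu"]

    jw_cpu_ticks = sum(last_ticks[tid] - first_ticks[tid] for tid in last_ticks)
    rs_ratio = running / sleeping if sleeping else float("inf")
    return {"rs_ratio": rs_ratio, "rs_jumps": rs_jumps,
            "cpu_jumps": cpu_jumps, "jw_cpu_ticks": jw_cpu_ticks}


def system_context_switches() -> int:
    """Total context switches of the whole system (the 'ctxt' line of /proc/stat);
    the difference between two readings brackets one query execution."""
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt"):
                return int(line.split()[1])
    raise RuntimeError("ctxt line not found in /proc/stat")
```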
Table 2: Parameters of the experiment

  Parameter   Description
  x1          Single Query User / Multi Query User
  x2          Number of active Users
  x3          Number of active Tenants
  x4          Thread-Limit On / Off
  x5          Number of assigned CPU cores
  x6          Number of Threads / Core (SMT)

2.2.2 Benchmark design. We designed our experiments in order to get fine-grained information about the thread usage of the database tenants in the virtualized environment of a cloud provider. The parameters of the experiment are listed in Table 2. In our first benchmark run, we aimed at getting fine-grained information about the performance behavior of the individual TPC-H queries. We assigned two CPU cores to our database LPAR for the first benchmark run and executed all 22 queries consecutively on a single tenant. Since we wanted to exclude caching effects from our results, we executed the queries as regular, non-prepared statements. To avoid any interferences between the queries, we configured a waiting time of 10 seconds after each query execution. Furthermore, we were interested in monitoring data with a very high resolution in order to consider all relevant processing phases. Thus, we configured our monitoring setup to collect data at intervals of 0.0079 seconds on average. Lower monitoring resolutions would result in a lack of accuracy and would not allow us to identify the exact resource demand for each query. This aspect is especially relevant for queries which have only a short runtime. In later benchmark runs, we decided to decrease the monitoring resolution to 0.179 seconds to reduce the size of our raw data sets. In scenarios with higher loads, the query runtime increased, which allowed us to decrease the monitoring resolution but still collect enough data for a fine-grained analysis.
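For illustration, a minimal Python sketch of the first benchmark run described above; the actual driver consisted of customized shell scripts. Each of the 22 TPC-H queries is executed as a regular, non-prepared statement, its response time is taken from sending the query until the result has been returned, and a 10 second pause separates the executions. The hdbcli connection parameters and the query file layout are assumptions.

```python
import time
from pathlib import Path

from hdbcli import dbapi  # SAP HANA Python client

# Placeholder connection data for one tenant database.
conn = dbapi.connect(address="hana-tenant1.example.org", port=30015,
                     user="BENCH", password="***")


def run_query(sql: str) -> float:
    """Execute one TPC-H query and return its response time in seconds."""
    cursor = conn.cursor()
    start = time.perf_counter()
    cursor.execute(sql)     # regular, non-prepared execution
    cursor.fetchall()       # wait until the complete result has been returned
    elapsed = time.perf_counter() - start
    cursor.close()
    return elapsed


if __name__ == "__main__":
    for i in range(1, 23):                                # the 22 TPC-H queries
        sql = Path(f"queries/q{i:02d}.sql").read_text()   # assumed file layout
        print(f"Q{i:02d}: {run_query(sql):.3f} s")
        time.sleep(10)   # waiting time to avoid interference between queries
```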
   In order to show the influence of workload characteristics on the performance behavior, we decided to compare benchmark runs in single-tenant scenarios with those in multi-tenant scenarios. In [11], we could show that performance differences between single- and multi-tenancy scenarios exist and that the largest performance differences occur in scenarios with high load. However, we demonstrated that the performance impact is much smaller when we additionally increase the number of active tenants in our multi-tenant setup. Thus, we decided to compare only the performance of scenarios with five active tenants to the performance with only one active tenant in this paper. This helped us to limit the number of long-running benchmark runs. We varied the number of users in the performed benchmark runs in both scenarios. For a low load scenario we utilized 5 concurrent users, for a medium load scenario 20 concurrent users and for a high load scenario 50 concurrent users. For further performance insights, we decided to vary parameter x1 and thus utilized two different user definitions. With the first definition, we did not conduct the TPC-H benchmark as intended: we ran each TPC-H query in isolation and defined a user as one of several concurrently active executions of the same query. In the second definition, we defined the TPC-H user as intended. Thus, each user represented a different set of queries in a specified sequence, which was unique for each user. Summarizing, we varied parameters x1, x2 and x3 in this second set of benchmark runs.
   SAP HANA is a highly configurable software system. Hence, it offers various parameters to optimize the performance of the database. The parameter max_concurrency limits the maximum number of jobworker threads a database tenant can utilize. SAP recommends setting the parameter to a value equal to the number of available CPU cores divided by the number of tenant databases. In this third set of benchmark runs, we varied parameters x2, x3 and x4. We set parameter x4 either to three, which limited the number of jobworker threads, or to no value, which did not set any limit.
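A small sketch related to the thread limit used in this third set of runs. The helper applies the recommendation cited above (available CPU cores divided by the number of tenant databases); the ini file and layer named in the SQL string are assumptions and should be verified against the SAP documentation for the HANA revision in use.

```python
def recommended_max_concurrency(cpu_cores: int, tenant_databases: int) -> int:
    """Recommendation cited above: available CPU cores divided by tenants."""
    return max(1, cpu_cores // tenant_databases)


# Hypothetical statement for limiting a tenant to 3 jobworker threads, as in the
# thread-limited runs; the ini file ('global.ini') and layer ('SYSTEM') are assumptions.
SET_LIMIT_SQL = ("ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') "
                 "SET ('execution', 'max_concurrency') = '3' WITH RECONFIGURE")

if __name__ == "__main__":
    # e.g. all 40 physical cores of the host shared by the five tenants
    print(recommended_max_concurrency(cpu_cores=40, tenant_databases=5))  # -> 8
    print(SET_LIMIT_SQL)
```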
Furthermore, we set parameter x1 to the user definition according to the TPC-H benchmark for this run and all following runs.
   In the fourth set of benchmark runs, we aimed to analyze the performance behavior in a virtualized cloud environment. In this setup, it is possible for administrators to assign more CPU resources to a VM dynamically. For this reason, we increased the CPU assignment from two to four CPU cores during this benchmark run. It is also possible to migrate the VM to a server with a different CPU, which for example offers different capabilities regarding SMT. Thus, we considered the performance impact of simultaneous multithreading in this paper. Summarizing, we decided to vary the parameters x2, x3, x5 and x6. Parameter x4 has not been set during these runs to avoid performance restrictions.

3 RESULTS

3.1 Analyzing the resource demand of the individual TPC-H queries
In this section, we analyze the thread-related resource demand of the individual TPC-H queries utilizing our collected monitoring data. Figure 1 shows the thread-related resource demand through the previously described performance metrics in a single-tenant environment. We analyzed each query individually, formed groups and pointed out any anomalies.

[Figure 1: Resource demand. Six panels, one value per TPC-H query class (1-22): (a) Response Time in seconds, (b) Normalized CPU Time Utilized by Jobworkers, (c) Number of Active Jobworkers, (d) Normalized Total of Threads Set to Sleep, (e) Normalized Number of CPU Assignments, (f) Normalized Number of Context Switches.]

   grp1 (Query 1, 18): Both queries stand out due to their high response time. Furthermore, both queries are rather CPU intensive. The status of the utilized jobworker threads is comparatively rarely set from running to sleeping, which enhances the efficiency. These threads are also rarely assigned to a different CPU core. The number of context switches performed during the execution of these queries is also comparatively low. However, query 18 utilizes a much higher number of jobworker threads, indicating better parallelizability.
   grp2 (Query 9, 13, 21): The resource utilization of these queries is similar to the previous set of queries. In this case, all queries utilize a high number of jobworker threads. In addition, the processing phase of these threads is interrupted only rarely by sleeping phases. The number of context switches is higher than in the previous set.
   grp3 (Query 15, 16, 22): The major differences to the resource demand of the previous query set are the low response times. The number of sleeping phases and the number of context switches rise to a value slightly below the average.
   grp4 (Query 6, 13): In contrast to the previous set, these queries show a much lower utilization of jobworker threads. The number of sleeping phases also increases.
   grp5 (Query 4, 14, 17, 19, 20): Compared to the previous set, these queries show only an average CPU utilization. All queries utilize only a low number of jobworker threads. Except for queries 4 and 14, the processing phases are often interrupted by sleeping phases. This also results in a higher number of different assignments to the CPU cores.
   grp6 (Query 5, 10, 12): The resource demand of these queries is very similar to the previous query set. However, they utilize a higher number of jobworker threads and are less often assigned to different CPU cores.
   grp7 (Query 2, 3, 7, 8): These queries show the lowest utilization of the CPU. In this case, this also results in a high number of context switches. Query 2 shows a very high variance regarding the assignment of CPU cores, the number of sleeping phases and the CPU time.
   In most cases, the response time is very close to the actual processing time in this scenario with only one active user. However, queries 2, 3, 12 and 21 show differences between these times. In conclusion, our fine-grained monitoring solution allows cloud providers to examine the resource demand of the specific workload in detail.

3.2 Performance behavior in different workload scenarios
In this section, we analyze the thread-related resource demand of the individual queries when changing the workload scenarios. Figure 2a shows the effects of workload changes on the query performance when a user is running all queries in a predefined sequence compared to the repeated execution of only a single query. It is noticeable that only CPU-intensive queries (i.e. grp1 and 2) can benefit from the new workload scenario.
However, a low CPU utilization does not necessarily result in a large loss of performance, as for example query 3 shows. The number of sleeping phases also affects the performance. In most cases, the effect is stronger in high load scenarios with 50 active users. In multi-tenant environments, the effect is also stronger than in single-tenant environments. For CPU-intensive queries, changing the user definition results in a decreased probability of the CPU being blocked by another CPU-intensive query. Thus, these queries benefit from the workload change. However, for less CPU-intensive queries (i.e. grp7), the probability of the CPU being blocked by a more CPU-intensive query increases. In conclusion, these benchmark results show the importance of performance predictions in the cloud context. Simple changes in the workload, which can occur for example when the usage profile of one tenant changes, can result in major performance losses depending on the specific type of workload.

[Figure 2: Resulting performance changes. Average performance change in percent per TPC-H query: (a) impact of workload changes on the performance in a single-tenant environment, (b) limited number of jobworker threads in a single-tenant environment, (c) limited number of jobworker threads in a multi-tenant environment and (d) assignment of more CPU resources in a single-tenant environment, each for 5, 20 and 50 users; (e) performance improvement through SMT in a single-tenant environment with 5 active users and (f) with 50 active users, for SMT-2, SMT-4 and SMT-8 compared to SMT turned off.]

3.3 Performance behavior in different runtime configurations
In this section, we describe the performance influence of runtime environment factors. In a first experiment, we limited the number of jobworker threads the database can utilize and compared the results to the setup with unlimited jobworker threads.
   Figure 2b shows mixed results in the environment with only one active tenant. In general, queries with an increased number of sleeping phases (i.e. grp5) seem to benefit from the parameter, especially in scenarios with only a low load. Through the static parameter max_concurrency, a single database tenant can no longer fully utilize the CPU resources in many workload scenarios. This results in decreased performance for most OLAP queries. In multi-tenant environments, Figure 2c shows a much clearer picture of the performance behavior with the limiting parameter enabled. In scenarios with low load, queries with a higher number of sleeping phases benefit clearly from the parameter change. However, CPU-intensive queries with fewer sleeping phases (i.e. grp1 and 2) show a performance loss. Under a less intense workload, the probability of the CPU resources being blocked by a CPU-intensive query decreases. Thus, the duration of the sleeping phases can be decreased. In scenarios with higher loads, the effect does not continue, as there are only slight differences in the performance behavior in these cases. Additionally, we noticed a significantly lower difference between the response time and the processing time during the benchmark runs with a limited number of jobworker threads. This can be explained by an increased resource availability for other relevant database threads.
   In order to analyze the performance improvement through the assignment of more CPU resources, we changed the CPU assignment of the LPAR from two to four cores for the next benchmark runs. Figure 2d shows the performance improvement through the additional CPU resources. In general, queries with more sleeping phases especially profit from the additional resources in the scenario with only five active users. In these scenarios, the huge difference between the performance improvements of the individual queries is noticeable. The effect is much less intense in high load scenarios. The performance improvement is slightly higher in multi-tenant scenarios. This performance behavior can be explained by a decreasing duration of the sleeping phases: threads in the sleeping status can be assigned faster to a processing unit due to the increased CPU resources.
   To further analyze the performance improvement through hardware multithreading, we changed the setup of our database LPAR to utilize no SMT at all. Afterwards, we set the LPAR to utilize SMT-2, SMT-4 and SMT-8. Figure 2e and Figure 2f show the resulting performance improvements of the different SMT settings compared to the benchmark run with SMT disabled. It is noticeable that queries with a high CPU demand combined with a high count of sleeping phases and a relatively low number of active jobworker threads (i.e. grp5) usually benefit from SMT. This effect is very intense in low-load scenarios, as Figure 2e shows. In multi-tenant scenarios, this effect further increases.
Queries with a low CPU demand and a high number of active jobworker threads (i.e. grp6 and 7) show almost no benefit with SMT-2 enabled. They also show a lower performance with SMT-8 compared to SMT-4. This is the result of these queries not being able to benefit from the additional SMT capabilities. Increasing complexity regarding the access of the CPU cache results in a slight performance decrease in such cases. Figure 2f shows the performance improvements in high load scenarios. In general, the differences between the individual queries regarding their performance improvement through SMT are much lower than in low load scenarios. Additionally, all queries benefit from SMT-4 and SMT-8 under high workload. This performance behavior can be explained by the higher resource demand related to this workload scenario.
   In conclusion, these results show the dependency of the performance on multiple workload-related aspects. Depending on the workload, identical changes in the database configuration can either improve the performance or result in performance losses. Our monitoring solution allows cloud providers to analyze their workload more closely. In order to operate the databases in an efficient and cost-effective manner, this analysis is crucial. Detailed knowledge about the resource usage of the specific workload allows cloud providers to deploy database tenants efficiently. Furthermore, valuable resources can be assigned dynamically where they are needed. Assigning too many or too few resources is disadvantageous, since either performance goals are not met or unneeded resources are assigned. With detailed knowledge about the workload, cloud providers can avoid both situations. Changing CPU-related resources, for example by migrating the database to a more powerful server or by assigning more CPU resources in a virtualized environment, also results in differing degrees of success depending on the specific workload scenario.

4 RELATED WORK
In [12], the author provides performance insights into the in-memory database SAP HANA in a multi-tenant configuration. However, he only considers the database and the applied workload as a black box and gives no further insights into performance-relevant factors. Furthermore, he does not consider the efficiency of thread usage in his work. In his experiments, only small-sized tenants are used, which is unlikely in a real-world scenario.
   The authors in [6] provide more fine-grained performance insights into SAP HANA in a multi-tenant configuration, considering amongst other factors differently sized tenant databases, a varying workload and different CPU assignments. In [8], they extend their work by providing new models for the prediction of memory occupancy. In [7], they further extend their work and provide insights into the usage of threads. However, they only measure the average number of utilized CPU cores to obtain the thread usage of the queries. Furthermore, they utilize a lower monitoring resolution, resulting in a lower accuracy when considering the resource demand of the individual queries. In this paper, we could show the limitations of this approach by performing benchmarks in various scenarios. Considering only the utilization of CPU cores is not sufficient to explain the performance behavior of the TPC-H queries in our scenarios. Thus, we extended their work by providing more fine-grained insights into thread usage in varying hardware environments.

5 CONCLUSION AND FUTURE WORK
In our work, we provided fine-grained performance insights into the in-memory database SAP HANA. We have built a monitoring setup allowing us to perform a detailed analysis of the thread utilization of the database. Our setup is also capable of collecting data with a very high resolution, preventing any losses through inaccurate monitoring data. We have shown the dependency of several metrics on the performance behavior in multiple scenarios. Furthermore, our monitoring setup allowed us to group the TPC-H queries according to their resource demand. The fine-grained analysis of the resource demand of different queries allowed us to explain anomalies when observing their performance behavior in different workload scenarios as well as in different runtime environments.
   In further work, we plan to create a fine-grained performance prediction model allowing us to simulate the performance behavior in different scenarios.

REFERENCES
[1] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2010. A View of Cloud Computing. Commun. ACM 53, 4 (April 2010), 50–58. https://doi.org/10.1145/1721654.1721672
[2] Andreas Brunnert, Christian Vögele, Alexandru Danciu, Matthias Pfaff, Manuel Mayer, and Helmut Krcmar. 2014. Performance Management Work. Business & Information Systems Engineering 6, 3 (June 2014), 177–179. https://doi.org/10.1007/s12599-014-0323-7
[3] F. Dehne, Q. Kong, A. Rau-Chaplin, H. Zaboli, and R. Zhou. 2015. Scalable real-time OLAP on cloud architectures. J. Parallel and Distrib. Comput. 79-80 (2015), 31–41. https://doi.org/10.1016/j.jpdc.2014.08.006 Special Issue on Scalable Systems for Big Data Management and Analytics.
[4] Martin Grund, Jan Schaffner, Jens Krueger, Jan Brunnert, and Alexander Zeier. 2010. The Effects of Virtualization on Main Memory Systems. In Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN '10). ACM, New York, NY, USA, 41–46. https://doi.org/10.1145/1869389.1869395
[5] J. L. Lo, L. A. Barroso, S. J. Eggers, K. Gharachorloo, H. M. Levy, and S. S. Parekh. 1998. An analysis of database workload performance on simultaneous multithreaded processors. In Proceedings of the 25th Annual International Symposium on Computer Architecture. 39–50. https://doi.org/10.1109/ISCA.1998.694761
[6] Karsten Molka and Giuliano Casale. 2015. Experiments or simulation? A characterization of evaluation methods for in-memory databases. In 11th International Conference on Network and Service Management (CNSM 2015). IEEE, 201–209. https://doi.org/10.1109/CNSM.2015.7367360
[7] Karsten Molka and Giuliano Casale. 2016. Contention-Aware Workload Placement for In-Memory Databases in Cloud Environments. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS) 2, 1, Article 1 (Sept. 2016), 29 pages. https://doi.org/10.1145/2961888
[8] K. Molka and G. Casale. 2016. Efficient Memory Occupancy Models for In-memory Databases. In 2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS). 430–432. https://doi.org/10.1109/MASCOTS.2016.56
[9] K. Molka and G. Casale. 2017. Energy-efficient resource allocation and provisioning for in-memory database clusters. In 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). 19–27. https://doi.org/10.23919/INM.2017.7987260
[10] Tobias Mühlbauer, Wolf Rödiger, Andreas Kipf, Alfons Kemper, and Thomas Neumann. 2015. High-Performance Main-Memory Database Systems and Modern Virtualization: Friends or Foes?. In Proceedings of the Fourth Workshop on Data Analytics in the Cloud (DanaC '15). ACM, New York, NY, USA, Article 4, 4 pages. https://doi.org/10.1145/2799562.2799643
[11] Dominik Paluch, Harald Kienegger, and Helmut Krcmar. 2018. A Workload-Dependent Performance Analysis of an In-Memory Database in a Multi-Tenant Configuration. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering (ICPE '18). ACM, New York, NY, USA, 131–134. https://doi.org/10.1145/3185768.3186290
[12] Jan Schaffner. 2014. Multi Tenancy for Cloud-Based In-Memory Column Databases: Workload Management and Data Placement. Springer International Publishing, Heidelberg. https://doi.org/10.1007/978-3-319-00497-6_1
[13] Transaction Processing Performance Council. 2018. TPC-H benchmark specification. http://www.tpc.org/tpch/.