Measuring the Performance Impact of Branching Instructions

Lukas Beierlieb1, Lukas Iffländer2, Aleksandar Milenkoski3, Thomas Prantl1 and Samuel Kounev1

1 University of Würzburg, Am Hubland, Würzburg, 97074, Germany
2 Deutsches Zentrum für Schienenverkehrsforschung, August-Bebel-Straße 10, Dresden, 01069, Germany
3 Cybereason, Theresienhöhe 28, München, 80339, Germany

Abstract

With the continuing rise of cloud technology, hypervisors play a vital role in the performance and reliability of current services. Hypervisors implement interfaces providing call-based connectivity to hosted virtualization-aware virtual machines. One of them is the hypercall interface, which allows virtual machines to request hypervisor services. A hypercall injection tool measuring hypercall execution times must minimize internal overhead. Among other things, limiting logging to strictly required information is crucial. However, checking which values to log for every injection requires executing many branch instructions. We assess the performance difference between using and avoiding such branching, measured by the hypercall throughput of the injector tool.

Keywords

measurement, performance, branching

1. Introduction

Today, hypervisors are virtually omnipresent. They are widespread throughout data centers, functioning as the backbone of cloud computing [1] by allowing for server consolidation with huge benefits in efficiency and flexibility. Hypervisors are also prevalent in modern desktop and workstation infrastructures [2]. This extends to Microsoft shipping their Hyper-V hypervisor directly with many versions of the Windows Operating System (OS) and, in some cases, even activating it by default [3].

Virtualization makes it possible to create virtual instances of physical devices, called Virtual Machines (VMs). In a virtualized environment governed by a hypervisor, VMs share resources. Hypervisors implement interfaces providing call-based connectivity to virtualization-aware hosted VMs. One of them is the hypercall interface, which allows VMs to request services from the hypervisor. Hypercalls are software traps from a VM to the hypervisor. Before the introduction of x86 hardware virtualization in 2006, hypercalls were one solution for running virtualized OSs. Nowadays, while technically no longer required, hypercalls are still a common utility to improve efficiency or offer additional features.

The crucial role that hypervisors play in today's infrastructure requires robustness and high performance, among other properties. We proposed a framework to help test these qualities [4].
On the one hand, the framework supports the logging of values and execution times if required; on the other hand, it should be able to inject calls at a high rate, which requires low overhead. In this paper, we investigate whether there is a performance penalty to evaluating multiple if statements for optional logging. Other works concentrate on reducing the cost of branches that cannot be removed [5, 6, 7].

The remainder of this paper is structured as follows: Section 2 presents background knowledge about virtualization, Hyper-V, hypercalls, and branching in processors. Next, Section 3 introduces the measurement approach and the environmental conditions during its realization; Section 4 then presents and discusses the measurements. Finally, Section 5 concludes the paper.

2. Background

In a non-virtualized scenario, an OS manages physical hardware (i.e., processor, memory, and IO devices) and provides and schedules physical resource accesses for applications running on top. Virtualization describes the concept of introducing an abstraction layer above the hardware. That layer, called the hypervisor or Virtual Machine Monitor (VMM), provides a set of virtual resources, which can form multiple virtual machines managed by independent OSs. One way to classify hypervisors is by whether they directly control the hardware or run on top of an OS [8]. The former approach is called a Type-1 or bare-metal hypervisor and can utilize its full control for increased performance. Type-2 or hosted hypervisors, on the other hand, can reduce their complexity by relying on the OS to take care of most of the hardware management.

Para-virtualization applies changes to the source code of the OSs themselves. These modifications allow the hypervisor and VMs to interact more efficiently, e.g., by using abstract IO interfaces instead of emulating existing physical devices, reducing overhead and improving performance [9].

Hyper-V is an x86_64 hypervisor developed by Microsoft [10]. It is a Type-1 hypervisor; thus, it directly controls the hardware. However, to avoid limiting it to specific hardware configurations or bloating the code base with countless device drivers, Hyper-V uses a microkernel-based architecture. A specialized VM called the root partition always runs an instance of Windows on top of Hyper-V to provide management features and device drivers. Guest VMs (also called guest partitions) can run para-virtualized if they support it, but can also use unmodified OSs, in which case Hyper-V provides emulated devices.

Hyper-V offers various interfaces for VM-hypervisor and VM-VM communication [11]: privileged register and memory accesses, emulation of privileged instructions, IO ports, inter-VM communication via the VMBus, and hypercalls. Similar to applications requesting services from the OS by issuing system calls, guest OSs can call into the hypervisor with hypercalls. Hypercalls are triggered by special processor instructions that transfer execution control from the VM to the hypervisor. Hyper-V expects a call code to be present in a specified register beforehand. Also, parameters for the call can be placed in registers and memory, depending on the calling convention. After processing a hypercall, Hyper-V returns a result value, which indicates either a successful execution or an error code (e.g., missing privilege, out of memory).
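To make this calling convention concrete, the following is a minimal sketch of issuing a 64-bit Hyper-V hypercall as documented in the TLFS [11]; it is our illustration, not code from the injector tool. The call code occupies the low 16 bits of the control value passed in RCX, the guest-physical addresses of the input and output pages go into RDX and R8, and the status is returned in RAX. The identifiers hypercall_page and issue_hypercall are our assumptions; guest OSs perform the actual transition by calling into a hypercall page that Hyper-V maps into the guest.

    #include <stdint.h>

    // Pointer to the guest-mapped hypercall page (assumption). Invoking it
    // through the Microsoft x64 ABI places the three arguments in RCX, RDX,
    // and R8, exactly where the hypercall convention expects them.
    typedef uint64_t (*hypercall_fn)(uint64_t control, uint64_t input_gpa,
                                     uint64_t output_gpa);
    extern hypercall_fn hypercall_page;

    static uint64_t issue_hypercall(uint64_t call_code, uint64_t input_gpa,
                                    uint64_t output_gpa)
    {
        // The low 16 bits of the control value hold the call code; the other
        // fields (rep count, fast-call flag, ...) stay zero in this sketch.
        uint64_t control = call_code & 0xFFFFu;
        return hypercall_page(control, input_gpa, output_gpa); // status (RAX)
    }

An invalid call code, as used in the measurements of Section 3, makes Hyper-V return an error status almost immediately.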
Apart from virtualization terms, some background knowledge about CPU execution is required to understand the goal of this paper. Modern processors try to execute as many instructions as possible at a time. Branching instructions are problematic because the processor does not know which instructions follow until the branch is fully executed. CPUs deploy branch prediction to guess and speculatively execute further instructions, which helps to keep the performance penalty low for well-predictable branches [6].

3. Methodology

The hypercall injector consists of three parts. First, there has to be a description of the hypercall workload, called the campaign file. It consists of binary data describing which hypercalls with which parameters should be executed, and whether any delays should be inserted between calls. Second, a Windows kernel driver, the so-called injector module, executes the campaign. It loads the campaign into memory, executes the call and delay instructions one by one, logs values in between if required, and stores them to a file at the end. These two parts are connected by the third, a desktop application. It takes the path to the campaign file as an argument, along with the selection of values to log. Then, it prepares and loads the injector kernel module and passes on all the information required to execute the campaign correctly.

We support the logging of timestamp pairs for each action (hypercall, delay), of execution times (the difference between the timestamps), as well as of result values and all output values. Hyper-V stores output values on a dedicated memory page. Saving these values therefore requires storing a full memory page (4 KB) per call, which is slow. Thus, always storing all values is not viable; the slowdown would be too big for a test campaign such as a stress test, which ignores log values and injects hypercalls as fast as possible. That is why the required log values are passed to the desktop application. Yet, there are two different ways of handling optional log values during execution.

The first and probably more natural approach is to place if statements in the main injection loop. Listing 1 shows pseudo-C-code of this approach. The loop works through the actions of the campaign. We are interested in issuing as many calls as possible, so there must not be any delays in the workload; accordingly, we can skip the handling of delays. Without any logging, a hypercall only needs its parameters prepared in memory and in the processor registers before the call is invoked. Additionally, depending on the requested log values, the loop optionally has to take timestamps, calculate execution times, and store values to the log buffer. These if statements introduce a lot of branches. Note, however, that they always take the same branch, as the requested log values cannot change during a campaign. The speculative execution engine of the processor should have no problem getting these branches right.

The alternative approach is to implement the loop separately for every combination of log values, which removes all ifs from the loop body. Branching on the log values is still required, but only once at the start. As a drawback, this approach causes lots of duplicate code, which increases the driver size (not really an issue at this scale) but also drastically hurts maintainability. A sketch of how such specialized loops can be generated is shown below.
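The paper's driver duplicates the specialized loops literally; as a minimal sketch (our illustration, with all identifiers hypothetical), such per-combination loops can also be generated with a macro whose conditions are compile-time constants. In optimized builds, the compiler folds these constant ifs away, so each generated loop contains no branches on the log configuration; literally copy-pasted loops achieve the same even without optimization. Only two log flags are shown; the real tool distinguishes more (timestamps, execution times, result values, output values).

    #include <stdint.h>

    typedef struct campaign campaign_t;           // opaque campaign state (hypothetical)
    typedef void (*inject_loop_fn)(campaign_t *);

    #define LOG_TIMES  0x1u
    #define LOG_RESULT 0x2u

    // Hypothetical helpers standing in for the driver's actual routines.
    int  campaign_has_next(const campaign_t *c);
    void campaign_prepare_next(campaign_t *c);
    void issue_next_hypercall(campaign_t *c);
    void take_start_time(campaign_t *c);
    void take_end_time_and_store(campaign_t *c);
    void store_result_value(campaign_t *c);

    // Stamp out one loop per flag combination. `flags` is a compile-time
    // constant in every expansion, so an optimizing compiler removes the ifs.
    #define DEFINE_LOOP(name, flags)                                    \
        static void name(campaign_t *c)                                 \
        {                                                               \
            while (campaign_has_next(c)) {                              \
                campaign_prepare_next(c);                               \
                if ((flags) & LOG_TIMES) take_start_time(c);            \
                issue_next_hypercall(c);                                \
                if ((flags) & LOG_TIMES) take_end_time_and_store(c);    \
                if ((flags) & LOG_RESULT) store_result_value(c);        \
            }                                                           \
        }

    DEFINE_LOOP(loop_plain,        0)
    DEFINE_LOOP(loop_times,        LOG_TIMES)
    DEFINE_LOOP(loop_result,       LOG_RESULT)
    DEFINE_LOOP(loop_times_result, LOG_TIMES | LOG_RESULT)

    // The only branch on the log values happens here, once per campaign.
    static void run_campaign(campaign_t *c, unsigned log_flags)
    {
        static const inject_loop_fn loops[4] = {
            loop_plain, loop_times, loop_result, loop_times_result
        };
        loops[log_flags & 3u](c);
    }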
    // prepare
    while (/* more to execute */) {
        switch (/* type */) {
        case TYPE_WAIT:
            // sleep and maybe log times, details left out
            break;
        case TYPE_CALL:
            // prepare memory for call
            if (/* timestamps or execution time requested */)
                // take start time
            // issue hypercall
            if (/* timestamps or execution time requested */)
                // take end time
            if (/* timestamps requested */)
                // store timestamps
            if (/* execution time requested */)
                // calculate execution time and store
            if (/* result value requested */)
                // store result value
            if (/* output values requested */)
                // store output memory page
        }
    }

Listing 1: Main injection loop with if statements

To test whether the chaos of copy-pasted injection loops is worth it, we want to determine the maximum hypercall throughput of both configurations. We focused the workload on spending as much time as possible injecting hypercalls in order to expose any existing performance difference. By choosing an invalid hypercall code (which the hypervisor detects quickly, returning an error) and passing no parameters, the time spent on other tasks is reduced as much as possible.

The measurements are performed on a Lenovo ThinkPad P1 with an Intel i7-9850H CPU and 2 × 16 GB of 2667 MHz DDR4 RAM. Unfortunately, the laptop form factor means that the CPU will probably throttle thermally under high load, i.e., during rapid hypercall injection. Both the maximum throughput during short bursts and the continuously sustained throughput under high load are of interest, so we chose the following methodology: A single test campaign contains 50 million invalid hypercalls. Each configuration starts with a processor that has been idling for at least five minutes, to ensure fair starting temperatures in the heat sink. The campaign is then executed 35 times in a row without breaks. The desktop application is adapted to log the execution time the kernel driver requires to perform the workload. The tested configurations are, of course, the branching and branchless variants, each compiled once as a debug executable and once with release optimizations by Microsoft's C++ compiler. The next section presents the measurement results.

[Figure 1: Comparison of different injection variants; 35 successive execution times for 50 million invalid hypercalls each. x-axis: iteration; y-axis: exectime [s]; configurations: branch_debug, branch_release, nobranch_debug, nobranch_release.]

4. Results

Figure 1 shows the results of the measurements. The x-axis shows the iteration of the 50-million-call campaign; the y-axis shows how long it took to execute the particular workload. Each configuration has its own line and color. The first few iterations show the peak performance numbers, with a non-throttled CPU. The debug compile of the branch variant managed 3.17 million calls per second (mcps) at best. The release version actually performed worse, with only 2.93 mcps. Both versions of the no-branch implementation performed significantly better, within measurement tolerance of each other at around 3.6 mcps. (For reference, injecting 50 million calls at 3.17 mcps takes roughly 15.8 s per iteration; at 3.6 mcps, roughly 13.9 s.) Except for the debug compile without branches, all the lines showed similar throttling behavior: minor penalties after around five runs, and yet more slowdown at around 15 repetitions. Interestingly, the branchless debug build was the least affected by thermal conditions.

Overall, the original research question can be clearly answered. Removing the branches yields a significant benefit in this scenario, even though no mispredictions should occur.
5. Conclusion

In this work, we shared our findings regarding the cost of including if statements in the main loop of a hypercall injector tool. The results showed that even though the same branches are always taken, the penalty is high enough to make a significant difference in hypercall throughput. However, it should be stated that maintaining dozens of slight variations of the same loop is not practical if the code changes occasionally. A compromise between maintainability and performance should nevertheless be achievable: In performance-critical cases that require no logging or only time measurements, a dedicated, branch-free loop can be implemented. For other scenarios, which are more concerned with the values that hypercalls return, performance is hindered by logging anyway and usually not of great importance; here, a branching implementation can save lots of duplicate code.

Acknowledgments

This work was written in cooperation with SPEC RG Security. Lukas Iffländer made his initial contributions as group leader while at the University of Würzburg.

References

[1] S. Srivastava, S. Singh, A survey on virtualization and hypervisor-based technology in cloud computing environment, International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) 5 (2016).
[2] K. Miller, M. Pegah, Virtualization: virtually at the desktop, in: Proceedings of the 35th Annual ACM SIGUCCS Fall Conference, ACM, 2007, pp. 255–260.
[3] Virtualization-Based Security: Enabled by Default, https://techcommunity.microsoft.com/t5/Virtualization/Virtualization-Based-Security-Enabled-by-Default/ba-p/890167, accessed: 2019-10-27.
[4] L. Beierlieb, L. Iffländer, A. Milenkoski, S. Kounev, Towards Testing the Performance Influence of Hypervisor Hypercall Interface Behavior, 2019.
[5] W. Hwu, T. M. Conte, P. P. Chang, Comparing software and hardware schemes for reducing the cost of branches, ACM SIGARCH Computer Architecture News 17 (1989) 224–233.
[6] S. McFarling, J. Hennessy, Reducing the cost of branches, ACM SIGARCH Computer Architecture News 14 (1986) 396–403.
[7] H. Kim, J. A. Joao, O. Mutlu, C. J. Lee, Y. N. Patt, R. Cohn, VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization, ACM SIGARCH Computer Architecture News 35 (2007) 424–435.
[8] Z. Gu, Q. Zhao, A State-of-the-art Survey on Real-time Issues in Embedded Systems Virtualization, Journal of Software Engineering and Applications 5 (2012) 277–290. doi:10.4236/jsea.2012.54033.
[9] H. Fayyad-Kazan, L. Perneel, M. Timmerman, Full and para-virtualization with Xen: a performance comparison, Journal of Emerging Trends in Computing and Information Sciences 4 (2013) 719–727.
[10] H. Fayyad-Kazan, L. Perneel, M. Timmerman, Benchmarking the performance of Microsoft Hyper-V Server, VMware ESXi and Xen hypervisors, Journal of Emerging Trends in Computing and Information Sciences 4 (2013) 922–933.
[11] Hypervisor Top Level Functional Specification, 2019. URL: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs, [Online; accessed 5 May 2021].