Overhead Comparison of OpenTelemetry, inspectIT and Kieker

David Georg Reichelt¹, Stefan Kühne¹ and Wilhelm Hasselbring²
¹ University Computing Centre, Research and Development, Universität Leipzig
² Software Engineering Group, Christian-Albrechts-Universität zu Kiel

SSP'21: Symposium on Software Performance, November 09–10, 2021, Leipzig, Germany

Abstract
A low performance overhead when monitoring the performance is crucial for exact measurements. Especially when trying to identify performance changes at code level, the performance overhead needs to be as low as possible. Since monitoring frameworks change, performance benchmarks need regular updates, and since virtual machines, operating systems and hardware environments change, performance benchmarking results also need regular updates. Therefore, we describe an extension of the benchmark MooBench that adds the emerging monitoring framework OpenTelemetry, and the results of its execution on a Raspberry Pi 4 in this paper. We find that Kieker creates slightly less overhead than inspectIT and OpenTelemetry when processing traces.

Keywords
performance measurement, performance monitoring, performance benchmarking, software performance engineering

1. Introduction

To assure that performance requirements are met, the performance of parts of a system needs to be measured under real conditions. This measurement in live operation is called monitoring [1, p. 45]. Monitoring data can be used to identify performance issues, to extract performance models or for online capacity management. To measure the performance, monitoring tools add monitoring probes, i.e. pieces of code capable of measuring the resource usage, into the monitored system and serialize the monitoring records. The instrumentation, the measurement itself and the serialization cause monitoring overhead. Especially for the identification of performance changes at code level [2], the monitoring overhead needs to be as low as possible.

Benchmarking compares different methods, techniques and tools and is widely used to compare the performance of different implementations [3]. To compare the performance overhead of different monitoring frameworks, the MooBench benchmark has been introduced [4]. Originally, MooBench was able to measure the performance of the performance monitoring frameworks Kieker [5], inspectIT (https://www.inspectit.rocks/) and SPASS-meter [6]. Recently, the monitoring framework OpenTelemetry (https://opentelemetry.io/) emerged. It provides monitoring support for a variety of languages and frameworks, and can therefore be used in different contexts. This paper presents an extension of MooBench that enables measuring the overhead of OpenTelemetry in addition to Kieker, and results of the execution of the extended MooBench.

In the remainder of this paper, we first describe the benchmark MooBench and our extension of MooBench for measuring OpenTelemetry in Section 2. Afterwards, we describe the measurement results of our extended MooBench version in Section 3. In Section 4, we discuss related work. Finally, we summarize this paper and give an outlook to future work in Section 5.

2. Supporting OpenTelemetry in MooBench

This section first gives an overview of the benchmark MooBench and describes our extension of MooBench afterwards.

2.1. MooBench

MooBench is a benchmark for measuring the overhead of monitoring frameworks [4]. Performance measurement in the JVM is influenced by non-deterministic effects such as just-in-time compilation (JIT), garbage collection and memory fragmentation. Therefore, performance measurements need to be repeated inside one started JVM, called VM in the remainder. Since warmup may end up in different steady states, multiple VMs need to be started and their results need to be analyzed by statistical tests such as the t-test [7].

MooBench provides a basic Java application that repeatedly executes a call tree of a given recursion depth in which every node has exactly one child and only the leaf node performs busy waiting. For every monitoring framework, a Bash script automates the VM starts of the benchmark for the corresponding framework configurations. For each framework, the configurations contain (at least) the baseline (no instrumentation) and regular monitoring with serialization of the results. They may also contain deactivated monitoring (but enabled instrumentation) and different monitoring configurations (e.g. writing the results as generic text or as binary in Kieker). The measurement result of each VM run is saved as CSV into a result directory with a name denoting the VM configuration.
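Listing 1 sketches the shape of this workload. It is a simplified illustration based on the description above, not the actual MooBench code; the class and method names as well as the busy-wait duration are chosen for illustration only.

Listing 1: Simplified sketch of a MooBench-style workload (illustrative)

    public class MonitoredTree {

        // Each call descends one level in the single-child call tree;
        // only the leaf node performs busy waiting.
        public static long monitoredMethod(long busyWaitNanos, int remainingDepth) {
            if (remainingDepth <= 1) {
                return busyWait(busyWaitNanos);
            }
            return monitoredMethod(busyWaitNanos, remainingDepth - 1);
        }

        // Busy waiting: spin until the requested time has elapsed.
        private static long busyWait(long nanos) {
            long start = System.nanoTime();
            long iterations = 0;
            while (System.nanoTime() - start < nanos) {
                iterations++;
            }
            return iterations;
        }

        public static void main(String[] args) {
            int recursionDepth = 10;     // call tree depth, as used in Section 3.1
            int totalCalls = 2_000_000;  // calls per VM start, as used in Section 3.1
            for (int i = 0; i < totalCalls; i++) {
                long start = System.nanoTime();
                monitoredMethod(500_000L, recursionDepth); // busy-wait duration is illustrative
                long duration = System.nanoTime() - start;
                // MooBench writes the measured duration to a CSV file; this sketch discards it.
            }
        }
    }

A monitoring framework under test instruments monitoredMethod, so the number of monitored executions per benchmark call equals the recursion depth.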
2.2. Extension of MooBench

To make OpenTelemetry runnable with MooBench, we implemented the instrumentation of the benchmark with OpenTelemetry. Additionally, the existing benchmarks for inspectIT were updated.

For OpenTelemetry, we added a script that provides a configuration corresponding to each Kieker configuration, i.e. (1) the baseline (no instrumentation), (2) monitoring with disabled sampling via otel.traces.sampler=always_off, i.e. no measurement is done but the instrumentation is still present, like Kieker with a deactivated probe, (3) logging all method executions to standard output, like Kieker writing to hard disc, (4) logging all method executions (spans) to Zipkin, and (5) logging metrics of executions to Prometheus, like Kieker writing to TCP. The measurement with Prometheus is disabled on 32-bit systems, since Prometheus only runs on 64-bit systems. Since results are saved in the MooBench CSV format, existing R scripts can be used for data analysis. Our forked version of MooBench is available on GitHub (https://github.com/DaGeRe/moobench-fork/).
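In MooBench, this instrumentation and configuration is applied via scripts and the frameworks' own mechanisms. Listing 2 is only a conceptual sketch of what configurations (2) and (4) add around a monitored method when the OpenTelemetry Java SDK is used directly; the Zipkin endpoint and the tracer name are placeholders, and the actual benchmark may configure OpenTelemetry differently (e.g. through the javaagent and system properties such as otel.traces.sampler=always_off).

Listing 2: Conceptual sketch of OpenTelemetry span instrumentation (illustrative)

    import io.opentelemetry.api.OpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.Tracer;
    import io.opentelemetry.context.Scope;
    import io.opentelemetry.exporter.zipkin.ZipkinSpanExporter;
    import io.opentelemetry.sdk.OpenTelemetrySdk;
    import io.opentelemetry.sdk.trace.SdkTracerProvider;
    import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
    import io.opentelemetry.sdk.trace.samplers.Sampler;

    public class OpenTelemetryProbeSketch {

        // Variant (4): export every monitored execution as a span to Zipkin.
        static OpenTelemetry initWithZipkin() {
            ZipkinSpanExporter exporter = ZipkinSpanExporter.builder()
                    .setEndpoint("http://localhost:9411/api/v2/spans") // placeholder endpoint
                    .build();
            SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                    .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                    .build();
            return OpenTelemetrySdk.builder().setTracerProvider(tracerProvider).build();
        }

        // Variant (2): instrumentation present, but sampling disabled, so nothing is recorded.
        static OpenTelemetry initDeactivated() {
            SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                    .setSampler(Sampler.alwaysOff())
                    .build();
            return OpenTelemetrySdk.builder().setTracerProvider(tracerProvider).build();
        }

        // What the instrumentation conceptually adds around a monitored method.
        static void monitoredCall(OpenTelemetry otel, Runnable monitoredMethod) {
            Tracer tracer = otel.getTracer("moobench-sketch"); // placeholder instrumentation name
            Span span = tracer.spanBuilder("monitoredMethod").startSpan();
            try (Scope scope = span.makeCurrent()) {
                monitoredMethod.run();
            } finally {
                span.end();
            }
        }
    }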
3. Measurement Results

To facilitate reproducibility of our results, we decided to run our benchmarks on a Raspberry Pi like Knoche and Eichelberger [8]. We used the latest Raspberry Pi 4 running Raspberry Pi OS (formerly known as Raspbian) 5.10.17-v7l+ in the 32-bit version and OpenJDK 11.0.11. Since Raspberry Pi OS runs 32-bit by default, the Jaeger serialization could not be used. To compare the Raspberry Pi results to a regular desktop system, we also measured the same benchmarks on an i7-4770 CPU @ 3.40 GHz with 16 GB RAM, running Ubuntu 20.04 and OpenJDK 11.0.11. We executed the benchmark with call tree depth 10 (like Knoche and Eichelberger [8]) and with growing call tree depth.

3.1. Call Tree Depth 10

Our results of the execution of 2 000 000 calls with recursion depth 10 and 10 VM starts are depicted in Table 1 and Table 2. On the Raspberry Pi and the i7-4770, all relations stay equal, e.g. OpenTelemetry with deactivated probe is slower than Kieker with deactivated probe on both systems. Nevertheless, we see that the ratio of execution times changes, e.g. baseline execution is approximately 25 times faster on the i7-4770 than on the Raspberry Pi, but execution with deactivated OpenTelemetry instrumentation is only approximately 5 times faster. Due to the limited instruction set of ARM processors, benchmark results on the Raspberry Pi probably do not correspond to benchmark results on desktop or server systems. Therefore, the Raspberry Pi might not be suitable hardware for benchmarking in every use case.

Table 1: Measurement Results for Kieker (durations in µs)

    Variant                     Raspberry Pi                  i7-4770
                                95 % CI          σ            95 % CI          σ
    Baseline                    [1.5;1.5]        0.1          [0.057;0.058]    0.026
    Kieker Deactivated Probe    [4.1;4.1]        7.5          [0.4;0.4]        7.1
    Kieker DumpWriter           [51.9;52.0]      14.6         [8.5;8.5]        12.2
    Kieker Logging (Text)       [743.3;799.4]    14315.8      [103.0;103.3]    56.4
    Kieker Logging (Binary)     [59.8;87.8]      7149.4       [3.4;3.4]        15.8
    Kieker TCP                  [45.6;45.7]      14.6         [4.6;4.7]        10.4

Table 2: Measurement Results for OpenTelemetry and inspectIT (durations in µs)

    Variant                            Raspberry Pi 4               i7-4770
                                       95 % CI          σ           95 % CI        σ
    OpenTelemetry Deactivated Probe    [26.8;26.9]      20.4        [4.9;5.0]      4.1
    OpenTelemetry StdOut               [483.0;508.1]    6408.7      [56.9;57.7]    222.5
    OpenTelemetry Zipkin               [53.4;53.6]      46.7        [6.8;6.9]      8.5
    OpenTelemetry Prometheus           [44.4;44.5]      25.2        [6.9;6.9]      4.9
    inspectIT Deactivated Probe        [9.9;9.9]        10.5        [1.3;1.4]      8.2
    inspectIT DumpWriter               [87.2;87.5]      78.7        [10.3;10.4]    17.4
    inspectIT Zipkin                   [97.2;97.8]      149.6       [10.9;11.2]    57.4
    inspectIT Prometheus               [32.3;32.4]      16.6        [4.0;4.0]      4.1

For the configurations, we see the following results:

Baseline: The Raspberry Pi 4 with the current software environment shows (as expected) slight improvements over the measurement values from Knoche and Eichelberger [8] on the Raspberry Pi 3.

Deactivated: The execution of Kieker with deactivated probe creates less overhead than the deactivated execution of OpenTelemetry and inspectIT.

Logging to hard disc: With activated logging to the file system, Kieker with binary logging is faster than OpenTelemetry. Regular text logging is slower with Kieker.

External data processing: With external data processing, i.e. Zipkin for OpenTelemetry and inspectIT and TCP sending for Kieker, Kieker is slightly (but significantly) faster than OpenTelemetry and inspectIT. inspectIT is faster than OpenTelemetry when metrics are processed by Prometheus.

3.2. Growing Call Tree Depth

Figure 1 shows the average of the warmed-up measured durations of all VMs with growing call tree depth. Even though call tree depths are discrete values, we chose to draw lines for better visibility.

Figure 1: Growing Call Tree Depth with i7-4770 (method execution duration in µs over call tree depth for Baseline, Kieker (TCP), inspectIT (Zipkin) and OpenTelemetry (Zipkin))

The figure shows two things: (1) The overhead increases linearly with growing call tree depth, which equals a growing count of instrumented methods. Therefore, instrumenting the whole application will always create big overhead. (2) The relations from call tree depth 10 persist: Sending the trace to Zipkin from OpenTelemetry or inspectIT creates more overhead than sending the trace using Kieker.
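The linear growth in Figure 1 can be summarized informally as

    $T_{\mathrm{mon}}(d) \approx T_{\mathrm{base}}(d) + d \cdot o$

where $T_{\mathrm{mon}}(d)$ is the mean duration of one benchmark call with monitoring at call tree depth $d$, $T_{\mathrm{base}}(d)$ the corresponding baseline duration, and $o$ the average overhead per monitored method execution. This is our reading of the plotted data rather than a model provided by MooBench; since each benchmark call executes exactly $d$ instrumented methods, $o$ corresponds to the slope difference between a monitored configuration and the baseline.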
4. Related Work

Benchmarking is widely used for testing the performance of software in continuous integration [9]. To measure the performance, benchmarking harnesses like the Java Microbenchmark Harness JMH (http://openjdk.java.net/projects/code-tools/jmh/) provide an execution environment for workload specification and measurement. According to a study by Stefan et al. [9], only 3.4 % of all open source projects continuously benchmark their software performance. Widespread frameworks like Hadoop [10] or Java itself (https://www.spec.org/jvm2008/) contain benchmarks for continuous performance evaluation. Besides application monitoring overhead, other benchmarks cover other system classes such as stream processing engines [11] or ML systems [12]. In contrast to these works, we examined the performance overhead of application performance monitoring.

Ahmed et al. [13] compare different APM tools by executing a load test on different systems and researching whether a performance regression could be identified. Afterwards, they check whether performance issues could be detected by thresholds in the commercial APM tools New Relic, AppDynamics and Dynatrace, and the open source tool Pinpoint (https://pinpoint-apm.github.io/pinpoint/). They did not research the overhead of the tools, but their suitability for the identification of performance changes.

In contrast to this work, MooBench [4] measures the overhead of performance monitoring tools. It is used continuously for measuring the performance overhead of Kieker. MooBench has been extended and used for testing the replicability of performance measurements on the Raspberry Pi by Knoche and Eichelberger [8, 14]. They used different benchmarks to assess the replicability of performance measurements on the Raspberry Pi and find that the Raspberry Pi is capable of providing an infrastructure for replicable benchmark execution. In contrast to our work, they did not consider OpenTelemetry and used a Raspberry Pi 3. OpenTelemetry itself maintains continuous performance benchmarks for its Python implementation (https://open-telemetry.github.io/opentelemetry-python/benchmarks/index.html). While users of OpenTelemetry do occasional overhead measurements (https://github.com/open-telemetry/opentelemetry-java-instrumentation/discussions/2104), no continuous benchmarking or benchmarking against other frameworks is done. Hence, a comparison of the monitoring overhead of OpenTelemetry in Java and Kieker has not been done so far.

Waller and Hasselbring [15] research the effects of the activation of processor cores and multithreading on the monitoring overhead. They find that using one processor core with hyperthreading yields the lowest overhead in their configuration, since synchronization overhead between different processor cores increases the monitoring overhead. In contrast to their work, this work focuses on the comparison of different monitoring frameworks.

5. Summary and Outlook

We compared the monitoring overhead of OpenTelemetry and Kieker. To this end, we extended the MooBench benchmark. By executing the benchmarks on a Raspberry Pi 4 and a regular desktop PC, we found that Kieker has better performance with binary serialization to hard disc and with processing the results via TCP. This relation also persists with growing call tree depth. We also see that the ratios between execution durations on the Raspberry Pi 4 and the regular desktop PC vary for different benchmark configurations. Therefore, the Raspberry Pi might not be suitable hardware for benchmark execution in every use case.

In the future, benchmarks are required that cover real-world usages of application monitoring frameworks better. Therefore, the following extensions are necessary from our point of view: (1) Real-world programs are not built out of single-child trees with workload only in one leaf node. The current tree structure leads to a regular execution order consisting of a constant number of monitored method executions and one busy wait. More complex trees containing a more complex distribution of the workload would make it possible to measure more realistic overhead. In a binary tree with busy wait in every leaf, the count of executions before a leaf node is called would vary (see the sketch in Listing 3 below). (2) Real-world monitoring overhead is also caused by monitoring of specific frameworks, e.g. Jersey, CXF and Spring. To benchmark the overhead created by the probes for these frameworks, separate benchmarked applications would need to be created (or adapted for this use case) and maintained. (3) Monitoring overhead is only one measurable property of monitoring. For practical purposes, like root cause analysis for performance problems or anomaly detection, accuracy is also a main property. Accuracy could be checked by how well certain root cause analysis algorithms perform with the examined monitoring framework, like Ahmed et al. [13].
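Listing 3 sketches one possible shape of the more complex workload proposed in (1): a binary call tree in which every leaf performs busy waiting, so the number of monitored executions preceding each busy wait varies. This is an illustration of the idea only and not an existing MooBench workload; names and parameters are chosen for illustration.

Listing 3: Possible binary-tree workload with busy waiting in every leaf (illustrative)

    public class BinaryTreeWorkload {

        // Every inner node calls two children; every leaf performs busy waiting.
        // The number of monitored calls executed before each busy wait therefore varies,
        // and the total number of calls grows exponentially with the depth.
        public static void monitoredMethod(long busyWaitNanos, int remainingDepth) {
            if (remainingDepth <= 1) {
                busyWait(busyWaitNanos);
                return;
            }
            monitoredMethod(busyWaitNanos, remainingDepth - 1); // left subtree
            monitoredMethod(busyWaitNanos, remainingDepth - 1); // right subtree
        }

        // Busy waiting: spin until the requested time has elapsed.
        private static void busyWait(long nanos) {
            long start = System.nanoTime();
            while (System.nanoTime() - start < nanos) {
                // spin
            }
        }
    }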
Acknowledgments

This work is funded by the German Federal Ministry of Education and Research within the project "Performance Überwachung Effizient Integriert" (PermanEnt, BMBF 01IS20032D).

References

[1] J. Waller, Performance Benchmarking of Application Monitoring Frameworks, BoD – Books on Demand, 2015.
[2] D. G. Reichelt, S. Kühne, W. Hasselbring, PeASS: A Tool for Identifying Performance Changes at Code Level, in: Proceedings of the 33rd ACM/IEEE ASE, ACM, 2019.
[3] W. Hasselbring, Benchmarking as Empirical Standard in Software Engineering Research, CoRR abs/2105.00272 (2021). URL: https://arxiv.org/abs/2105.00272. arXiv:2105.00272.
[4] J. Waller, N. C. Ehmke, W. Hasselbring, Including Performance Benchmarks into Continuous Integration to Enable DevOps, ACM SIGSOFT Software Engineering Notes 40 (2015) 1–4. URL: http://eprints.uni-kiel.de/28433/. doi:10.1145/2735399.2735416.
[5] W. Hasselbring, A. van Hoorn, Kieker: A monitoring framework for software engineering research, Software Impacts 5 (2020) 100019. doi:10.1016/j.simpa.2020.100019.
[6] H. Eichelberger, K. Schmid, Flexible resource monitoring of Java programs, Journal of Systems and Software 93 (2014) 163–186.
[7] A. Georges, D. Buytaert, L. Eeckhout, Statistically Rigorous Java Performance Evaluation, ACM SIGPLAN Notices 42 (2007) 57–76.
[8] H. Knoche, H. Eichelberger, The Raspberry Pi: A Platform for Replicable Performance Benchmarks?, Softwaretechnik-Trends 37 (2017) 14–16.
[9] P. Stefan, V. Horky, L. Bulej, P. Tuma, Unit Testing Performance in Java Projects: Are We There Yet?, in: Proceedings of ACM/SPEC ICPE 2017, ACM, 2017, pp. 401–412.
[10] S. Huang, J. Huang, Y. Liu, L. Yi, J. Dai, HiBench: A Representative and Comprehensive Hadoop Benchmark Suite, in: Proc. ICDE Workshops, 2010, pp. 41–51.
[11] S. Henning, W. Hasselbring, Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines in Microservice Architectures, Big Data Research 25 (2021) 100209.
[12] P. Mattson, V. J. Reddi, C. Cheng, C. Coleman, G. Diamos, D. Kanter, P. Micikevicius, D. Patterson, G. Schmuelling, H. Tang, et al., MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance, IEEE Micro 40 (2020) 8–16.
[13] T. M. Ahmed, C.-P. Bezemer, T.-H. Chen, A. E. Hassan, W. Shang, Studying the Effectiveness of Application Performance Management (APM) Tools for Detecting Performance Regressions for Web Applications: An Experience Report, in: IEEE/ACM MSR, IEEE, 2016.
[14] H. Knoche, H. Eichelberger, Using the Raspberry Pi and Docker for Replicable Performance Experiments: Experience Paper, in: Proceedings of the 2018 ICPE, 2018, pp. 305–316.
[15] J. Waller, W. Hasselbring, A Comparison of the Influence of Different Multi-Core Processors on the Runtime Overhead for Application-Level Monitoring, in: ICMSEPT, Springer, 2012.