<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Empirical Software Engineering</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/s10664-019-09681-1</article-id>
      <title-group>
        <article-title>Evaluating Benchmark Quality: a Mutation-Testing-Based Methodology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Lochbaum</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillermo Polito</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Lille, Inria</institution>
          ,
          <addr-line>Centrale Lille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>24</volume>
      <issue>2019</issue>
      <fpage>83</fpage>
      <lpage>99</lpage>
      <abstract>
<p>Performance benchmarking is a crucial tool for evaluating software efficiency. Unlike behavioral tests, where mutation testing and test coverage provide metrics to measure test quality, there are no methodologies for evaluating the quality of benchmarks. Coverage provides insights into execution but does not necessarily correlate with performance bugs. In this paper, we propose to assess the effectiveness of benchmarks by measuring their capacity to find performance issues. We explore a methodology that evaluates the quality of benchmarks based on mutation testing, where artificial performance bugs are introduced into programs and the benchmark's ability to detect them is measured. We present a series of experiments in which we measure the sensitivity of benchmarks to artificially introduced bugs, providing a systematic approach to validate their effectiveness in finding performance issues. We contribute a deeper understanding of benchmark quality and offer insights into improving benchmark measurement.</p>
      </abstract>
      <kwd-group>
        <kwd>Test Cases</kwd>
        <kwd>Benchmark Quality</kwd>
        <kwd>Performance Evaluation</kwd>
        <kwd>Code Coverage</kwd>
        <kwd>Mutation Testing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Performance evaluation is crucial to understand resource consumption both in academic and industrial
settings [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ]. Such evaluations are often made by means of benchmarking, i.e., systematic
measurement and analysis [
        <xref ref-type="bibr" rid="ref1 ref6 ref7 ref8">1, 6, 7, 8</xref>
        ]. Benchmarking is commonly related to speed benchmarking, to
analyse the time taken to perform a task. However, benchmarking techniques can be used to evaluate
other metrics such as energy consumption [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or memory usage [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. Benchmarks are typically
written as collections of programs called benchmark suites that exercise the application under evaluation
to detect performance variations.
      </p>
      <p>
        Unlike functional testing, which has well-established methodologies for evaluating test quality,
benchmarking lacks systematic methodologies to assess its effectiveness. A benchmark provides
performance metrics, but how well it detects performance issues remains unclear. Existing work
proposes methodologies to design and select benchmark programs [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] or statistical frameworks to
obtain precise results [14]. However, to the best of our knowledge, no previous work has explored the idea of
evaluating benchmark quality.
      </p>
      <p>Traditionally, test quality metrics such as test coverage and mutation testing have been used in
software testing to evaluate test behavior. Test coverage measures the extent to which a test suite
exercises a program [15]. On the other hand, mutation testing assesses test effectiveness by introducing
controlled modifications and observing whether tests detect them [16]. However, these techniques
primarily focus on correctness rather than performance, raising the question of whether similar
methodologies can be adapted to assess performance benchmark quality.</p>
      <p>In this paper, we introduce a general methodology to measure benchmark performance quality.
For this purpose, we explore a mutation-testing-based approach to evaluate benchmark effectiveness.
Traditional mutation testing introduces controlled bugs into a program and leverages the self-validating
property of tests to assess whether a mutation was detected. Our approach adapts mutation testing
to performance benchmarks with two ideas. First, we introduce performance mutation operators that
inject artificial performance degradations. Second, we introduce a performance-specific oracle
and measure a benchmark's sensitivity to these changes. On top of these, leveraging mutation testing
principles, we define a measure to quantify how well benchmarks detect performance regressions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Motivation</title>
      <p>Although any program can be used as a benchmark program, writing effective benchmarks remains
complex under the current state of the art.</p>
      <p>
        High costs incurred by benchmarking. Sound benchmarking is challenging because it requires the
selection of representative benchmarks, the study of varying workloads, and statistical rigour to
cope with non-determinism [17, 18, 19]. This requires a combination of application-specific knowledge,
statistical knowledge, and system design knowledge [20], and is often found to be a barrier to wider
adoption [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. On the one hand, benchmarks should be representative of the execution of the application
in normal conditions. On the other hand, benchmark results should consider the noise that is inherent
in existing hardware and software systems. There is an agreement in the community on the need for
tooling to reason about performance and automate the detection of regressions [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Non-representative benchmark programs waste resources</title>
        <p>
          The choice of benchmark programs is crucial in performance analysis because they guide, and may misguide, optimisation decisions. To
avoid wasting resources, benchmark programs should be representative of normal application executions.
This is particularly a problem of microbenchmarks [
          <xref ref-type="bibr" rid="ref21 ref4">21, 4</xref>
          ] and synthetic benchmarks [22, 23]. Obtaining
representative benchmark programs requires a deep understanding of the application, the
programming language, its implementation and the underlying hardware. We find an example
in the programming language community. The DaCapo benchmarks [18] proposed in 2006 a suite of
programs representing typical Java applications. This suite inspired others for Scala [24], JavaScript,
and WebAssembly [25, 26]. Although currently in use by language implementation researchers, these
suites are still not considered representative of realistic executions, producing biases in research results.
The representativity problem affects even modern just-in-time compilers and production-ready garbage
collectors supported by strong companies such as Google and Oracle. State-of-the-art virtual machines
such as V8 deprecated the usage of synthetic benchmarks in favour of real applications [27]. Recently,
the JVM's garbage collectors in production since 2018 (Shenandoah and ZGC) have been observed to suffer
performance losses on realistic workloads [28].
        </p>
        <p>Benchmark datasets count as much as benchmark programs. The representativity of
benchmarks depends not only on the chosen programs: application behaviour and performance vary
depending on their workload, i.e., the size and shape of their inputs. Regarding size, small datasets
put less pressure on memory accesses, while large datasets execute for longer times and unveil more
profile-guided optimisation opportunities to modern JIT compilers. In addition, data that varies a lot in
its data types may prevent optimisations such as procedure inlining. Benchmark programs
thus need to execute on different dataset configurations to detect these variations, since performance
regressions (and improvements) may be present on certain workloads and not on others.</p>
        <p>Non-determinism demands complex methodologies. Benchmarks also suffer from
non-determinism arising from hardware, operating systems, and programming language
implementations [29, 30, 21, 31, 32, 33, 34], commonly referred to as noise. This problem is worsened because
benchmark results are mostly based on unstable metrics: metrics that are subject to noise and vary
from one measurement to another, such as wall-clock time. Two wall-clock time measurements of the
same program will vary depending on factors such as garbage collection, thread/process scheduling,
and even CPU temperature. This raises major issues with the reproducibility and statistical significance
of the results. Nowadays, such instabilities are approached by methodological means [35, 36, 17, 19].
Notably, Georges et al. [37] proposed in 2007 a statistically rigorous methodology based on (a) improving
statistical significance with repeated executions and (b) automatically determining when a benchmark
reaches a steady state by analyzing the variation of measurements. Recently [38, 39], the language
implementation community discovered that such methodologies were based on the false assumption
that benchmarks always stabilize [40].</p>
        <p>Goal: How can we measure benchmark quality? We aim to identify the most suitable
benchmarks for a given problem while minimising exploration time and maximising reachability. The absence
of methodologies for evaluating benchmark quality presents a significant challenge to assessing
benchmark effectiveness. Without a structured approach to measuring benchmark effectiveness, performance
testing remains ad hoc and potentially unreliable. Supposing we introduce a well-known bug that
significantly reduces the performance of a system, we cannot guarantee that our ad hoc test cases will
find it. With a methodology for evaluating benchmarks, we can correlate them with their capacity to
detect potential performance issues.</p>
        <p>In this paper, we propose a methodology for evaluating the effectiveness of benchmarks in detecting
performance bugs. Instead of focusing on isolated case studies or specific performance scenarios, we
aim to develop a structured evaluation framework that can be applied across different benchmarks and
software test suites. Moreover, we intend this methodology to be malleable enough to be adapted to
any set of benchmarks.</p>
        <p>We contribute to a more rigorous understanding of benchmarking quality and provide insights into
improving benchmarking methodologies for performance software.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Challenges Defining a Benchmark Evaluation Methodology</title>
      <p>This section presents the main challenges of establishing a benchmark evaluation methodology. The
first is to define what benchmark quality is and to select which benchmark properties we want to
assess. We also need to define an oracle that says when a benchmark detects or does not detect a
performance issue.</p>
      <sec id="sec-3-1">
        <title>3.1. Mutation Testing and Test Quality</title>
        <p>Mutation testing [41] is a technique for evaluating the quality of a test suite. It introduces mutations
to a target program and validates whether the existing tests can detect these mutants. If a test suite
fails when run against a mutant, the mutant is said to be killed, otherwise, it survives. The goal is to
simulate common programming errors and assess if the test suite is robust enough to catch them.</p>
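        <p>The kill/survive mechanism described above can be sketched in a few lines. The following is a toy Python example with an invented max-like function and test suite, not the authors' tooling:</p>
        <preformat>
```python
# Toy illustration of mutation testing: a mutant is "killed" when at least
# one test fails on it; otherwise it "survives". All names are invented.

def original_max(a, b):
    return a if a > b else b

def mutant_max(a, b):
    # Simulated programming error: comparison mutated from > to <
    return a if a < b else b

def suite_passes(impl):
    # Run the whole test suite against the given implementation.
    try:
        assert impl(2, 1) == 2
        assert impl(1, 3) == 3
        return True
    except AssertionError:
        return False

def is_killed(mutant):
    # The mutant is killed when the suite no longer passes.
    return not suite_passes(mutant)

print(is_killed(mutant_max))    # True: the suite detects the mutant
print(is_killed(original_max))  # False: the original program passes
```
        </preformat>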
        <p>Over the past forty years, mutation testing has been widely adopted in industry thanks to advances
in computing performance. This increased the interest of the research community, which studied new
methods to improve mutation performance, its application to a large number of programming languages
and frameworks, and even its usage in areas like security [42].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Performance Mutation Testing</title>
        <p>Our proposed methodology is illustrated in Figure 1. It is structured as a mechanism where a benchmark
is executed twice per evaluation. The first execution, against the base application, is used by the
oracle as the baseline. The second is forced to run over a mutated
application, where the benchmark should catch a performance perturbation. The oracle uses both executions to
assess how well the benchmark behaves.</p>
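        <p>This double-execution scheme can be sketched as follows. This is an illustrative Python sketch with invented workloads and an invented detection threshold, not the actual implementation:</p>
        <preformat>
```python
import time

# Sketch of the two-execution evaluation of Figure 1: the benchmark runs
# once against the base application (baseline) and once against a mutated
# one; the oracle compares both results.

def run_benchmark(workload):
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start

def evaluate(benchmark, base_app, mutated_app, oracle):
    baseline = run_benchmark(lambda: benchmark(base_app))
    perturbed = run_benchmark(lambda: benchmark(mutated_app))
    return oracle(baseline, perturbed)

# Toy instantiation: the mutated application carries a sleep-based bug, and
# the oracle flags the mutant as detected when the run is noticeably slower.
base_app = lambda: sum(range(1000))
mutated_app = lambda: (time.sleep(0.01), sum(range(1000)))
benchmark = lambda app: [app() for _ in range(5)]
oracle = lambda baseline, perturbed: perturbed > baseline * 2

print(evaluate(benchmark, base_app, mutated_app, oracle))  # True: detected
```
        </preformat>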
        <p>Benchmark quality. We propose to treat each benchmark as a test case that is evaluated against
a target application. A benchmark usually provides performance metrics as a result, such as execution
time, but there is no linear relationship between those metrics and performance issues. The nature of
benchmarks does not fit finding bugs but rather measuring performance: even if the benchmark can
identify a performance problem, it cannot tell which part of the code it comes from. We propose to use
the elapsed time of a test case execution with respect to a defined baseline as our metric.</p>
        <sec id="sec-3-2-1">
          <title>Performance bug introduction.</title>
          <p>We augment mutation testing by introducing artificial performance
bugs and validating how many of these mutants are detected. We want to use a benchmark as a
black-box program, accepting any benchmark in our methodology. We then modify the benchmark
program input by introducing controlled mutants that significantly degrade the program's
performance [43]. This means that if the benchmark is capable of catching the performance bug, running
it with a performance-mutated input must yield a time measurement worse than without the
mutation.</p>
          <p>Oracle. To verify whether the current benchmark successfully detected a performance perturbation,
we suggest using an oracle. An oracle knows how to compare the execution time of a benchmark
run with a defined baseline. The baseline we propose is the result of executing
the same benchmark with a non-perturbed input. Besides the baseline, the oracle establishes a reliability
threshold that defines when, given the benchmark result, it considers that the benchmark detects the
mutant.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Methodology Requirements</title>
        <p>To guide our methodology in our experiments and to assess the effectiveness of benchmarks in detecting
performance bugs, we seek to explore the following research questions:
RQ1: Representative performance issues. What represents a potential performance issue or bug?
RQ2: Artificial bugs. Can we introduce artificial performance issues to assess benchmark quality?
RQ3: Measurement sensitivity. What is the degree of confidence in these measures?</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimentation</title>
      <p>Our objective is to provide an example and instantiate this methodology with a real use case. In this
section, we instantiate the methodology presented above using mutants that introduce
performance perturbations with sleep statements, and an oracle that determines performance variations
by comparing run-time averages. To evaluate this concrete setup, we analysed a set of auto-generated
benchmarks for the Pharo regular expressions library.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimentation details</title>
        <p>Use case: Regexp. We selected as a use case the validation of a set of benchmarks for regular
expressions. This stresses the implementation of the matches: method. Benchmarks are generated using
a fuzzer strategy.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Benchmark generation with Monte Carlo tree search</title>
        <p>We use a Monte Carlo tree search grammar fuzzer over a regular expression grammar to generate benchmark tests. We implemented a
fuzzer to generate regexes [44]. We guide the fuzzer by the execution time taken for the generated
regex to perform the matches: method with the minimal acceptable input.</p>
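        <p>The idea of time-guided generation can be sketched as follows. This Python sketch uses plain random search over an invented toy grammar as a stand-in for the Monte Carlo tree search fuzzer; grammar, sizes, and seed are illustrative:</p>
        <preformat>
```python
import random
import re
import time

# Candidate regexes are generated from a tiny grammar and ranked by the
# time matching takes, keeping the ones that stress the engine the most.

ATOMS = ["a", "b", "(a|b)", "a*", "(ab)*"]

def random_regex(rng, length=4):
    return "".join(rng.choice(ATOMS) for _ in range(length))

def match_time(pattern, subject):
    start = time.perf_counter()
    re.match(pattern, subject)
    return time.perf_counter() - start

def generate_benchmarks(n, seed=42):
    rng = random.Random(seed)
    subject = "ab" * 20
    candidates = [random_regex(rng) for _ in range(n)]
    # Slowest first: these candidates exercise the engine the most.
    return sorted(candidates, key=lambda p: match_time(p, subject), reverse=True)

print(generate_benchmarks(50)[0])  # the slowest generated regex
```
        </preformat>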
      </sec>
      <sec id="sec-4-3">
        <title>4.2. Instantiating the Methodology</title>
        <p>Mutants: sleep statements. We use MuTalk (https://github.com/pharo-contributions/mutalk), a mutation
testing framework for Pharo, to introduce artificial performance bugs. We define a series of performance
mutation operators that introduce sleep statements (e.g., an n-millisecond wait) in every program
sequence node. Each operator generates a mutant at the beginning of each sequence node. This way, not
only do methods contain performance perturbations, but also control flow statements, and more
specifically, loops.</p>
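        <p>A Python stand-in for such an operator can be written as an AST transformer. The real operators target Pharo sequence nodes via MuTalk; the names and the 10 ms delay below are illustrative:</p>
        <preformat>
```python
import ast
import textwrap
import time

# Sleep-statement mutation operator sketch: an n-millisecond wait is
# inserted at the beginning of function bodies and loop bodies.

class SleepInjector(ast.NodeTransformer):
    def __init__(self, millis):
        self.millis = millis

    def inject(self, body):
        wait = ast.parse(f"time.sleep({self.millis} / 1000)").body[0]
        return [wait] + body

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.body = self.inject(node.body)
        return node

    def visit_For(self, node):
        self.generic_visit(node)
        node.body = self.inject(node.body)
        return node

source = textwrap.dedent("""
    def work(n):
        total = 0
        for i in range(n):
            total += i
        return total
""")

# Parse, mutate, and load the perturbed version of work.
tree = SleepInjector(millis=10).visit(ast.parse(source))
namespace = {"time": time}
exec(compile(ast.fix_missing_locations(tree), "<mutant>", "exec"), namespace)

start = time.perf_counter()
result = namespace["work"](3)  # sleeps once per call plus once per iteration
elapsed = time.perf_counter() - start
print(result, elapsed)
```
        </preformat>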
        <p>Oracle: Comparing Runtime Means. Our oracle uses a baseline to compare each benchmark result,
once for each applied mutation. For each benchmark, we compare its run time with and without mutant
perturbations. To cope with the noise and non-determinism of performance measurements, we compute
baseline values by running a benchmark multiple times and taking the average and standard
deviation. For these experiments, we compute the baseline with 30 iterations. We consider that a mutant
is killed (i.e., that the performance perturbation is detected) if the run time of the perturbed benchmark
lies beyond one standard deviation of the baseline average.</p>
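        <p>This oracle can be sketched in a few lines. The baseline measurements below are invented numbers, used only to illustrate the mean-plus-one-standard-deviation decision rule:</p>
        <preformat>
```python
import statistics

# Runtime-means oracle sketch: the baseline is the mean and standard
# deviation of repeated runs (30 iterations in the experiments); a mutant
# is killed when the perturbed run time exceeds mean + one stdev.

def build_baseline(samples):
    return statistics.mean(samples), statistics.stdev(samples)

def is_killed(perturbed_runtime, mean, stdev):
    return perturbed_runtime > mean + stdev

# 30 invented baseline measurements, in milliseconds.
baseline_samples = [300 + (i % 5) for i in range(30)]
mean, stdev = build_baseline(baseline_samples)

print(is_killed(350.0, mean, stdev))  # True: clear perturbation
print(is_killed(301.0, mean, stdev))  # False: within baseline noise
```
        </preformat>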
      </sec>
      <sec id="sec-4-4">
        <title>4.3. Results</title>
        <p>As a first approach using this methodology, we executed 100 benchmarks. Each benchmark is configured
to run n iterations, where n forces the baseline to run at least 300 ms, thus minimising noise. We introduce
62 mutants per mutant operator, and we execute every benchmark once per mutant.</p>
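        <p>The per-benchmark mutation score reported below is the fraction of executed mutants a benchmark kills, computed separately per mutation operator. A minimal sketch, with invented verdict data:</p>
        <preformat>
```python
# Mutation score aggregation sketch (verdict data is invented).

def mutation_score(verdicts):
    # verdicts: one boolean per executed mutant (True = killed).
    return sum(verdicts) / len(verdicts)

# Invented verdicts for one benchmark against the 62 mutants of one operator.
verdicts = [i % 3 != 0 for i in range(62)]
print(round(mutation_score(verdicts), 2))
```
        </preformat>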
        <p>Figure 2 presents the preliminary results of these experiments using a dispersion chart. The dispersion
chart allows us to understand the mutation score detection distribution per benchmark and operator.
Figure 3 presents the stacked values between the average and the standard deviation for each benchmark
baseline. Table 1 shows the ratio between the stdevs and averages of every benchmark baseline.</p>
        <sec id="sec-4-4-1">
          <title>Statistics</title>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption><p>Ratio between the standard deviation and the average of each benchmark baseline.</p></caption>
            <table>
              <thead>
                <tr><th>Statistic</th><th>Value</th></tr>
              </thead>
              <tbody>
                <tr><td>Average</td><td>0.4284428074</td></tr>
                <tr><td>Max</td><td>1.0785776</td></tr>
                <tr><td>Min</td><td>0.2107902808</td></tr>
                <tr><td>Q1</td><td>0.3633576572</td></tr>
                <tr><td>Q2</td><td>0.3861501</td></tr>
                <tr><td>Q3</td><td>0.4055622512</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.4. Analysis</title>
        <p>At first glance, we can observe in Figure 2 that there is no large difference between the points of different
operators. This leads us to think that small perturbations work almost as well as huge perturbations do.
In only 23 benchmarks does the 500 ms perturbation reach better results, with an average difference of 2.6%
compared to the 10 ms perturbation. We also observe that on average, the mutation score per benchmark is
above 50%, with some benchmarks reaching almost 100%.</p>
        <p>Moreover, we observe in Figure 3 and Table 1 that the standard deviations of the benchmark baselines,
on average, do not exceed 50% of the average, with a maximum ratio of 107% of the average and a
minimum of 21%. Furthermore, even at the third quartile, benchmarks do not exceed 40%
of the baseline average, which makes the sample reliable.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Lessons Learned and Future Work</title>
      <p>Initially, we expected to see a large difference in the mutation score for mutant operators with huge
perturbations. However, we noted that small perturbations work as well as huge ones.</p>
      <p>Test variance. We observed that some benchmark test execution times have a large standard deviation.
Therefore, we cannot use them for stable measurements, and we need to filter them out in a preliminary step.
In this sense, we need to study better benchmark selection techniques that allow us to reduce the
number of benchmarks that introduce noise into our experiment samples.</p>
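      <p>Such a filtering step could be sketched as follows. This is an assumption-laden illustration: the 0.5 cut-off and the sample data are invented, and the ratio used is the stdev-to-mean ratio reported in Table 1:</p>
      <preformat>
```python
import statistics

# Pre-filtering sketch: benchmarks whose baseline standard-deviation-to-mean
# ratio is too high are excluded before the experiment. Threshold invented.

def stable_benchmarks(baselines, max_ratio=0.5):
    kept = {}
    for name, samples in baselines.items():
        ratio = statistics.stdev(samples) / statistics.mean(samples)
        if ratio <= max_ratio:
            kept[name] = ratio
    return kept

baselines = {
    "bench-a": [300, 305, 298, 302],   # stable baseline: kept
    "bench-b": [300, 900, 120, 1500],  # noisy baseline: filtered out
}
print(sorted(stable_benchmarks(baselines)))  # ['bench-a']
```
      </preformat>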
      <p>Small Benchmarks. Having very low test execution times makes them too sensitive to external
disturbances. For this reason, it is necessary to consider a minimum noise tolerance to perform tests
with very short execution times.</p>
      <sec id="sec-5-1">
        <title>Runtime perturbation, Garbage Collector and JIT Compilation</title>
        <p>A recognisable cause of noise is the execution of the garbage collector, which cannot be disabled. This means
we need to guarantee somehow that the garbage collector won't introduce performance noise during
our experiments. For example, we need to study techniques that ensure the Virtual Machine
executes benchmarks under the same conditions, minimizing external noise.</p>
        <p>Elapsed time as quality metric. Using wall-clock time as a performance metric is very unstable
and requires many executions to stabilise. This leads us to carry out executions that take a long
time and are therefore expensive. Consequently, we need to study alternative metrics, such as
time estimation based on static metrics like the number of messages sent or memory accesses.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we propose a systematic methodology to evaluate the effectiveness of performance
benchmarks. We propose a method to introduce artificial performance bugs by extending mutation
testing. We use an oracle to assess the effectiveness of a benchmark against a generated baseline. Finally,
we define a benchmark quality measure and instantiate the framework in a real setting. We present
preliminary results of the experiment, and we perform an analysis to suggest future improvements.</p>
      <p>The results demonstrate that the proposed methodology provides enough information to compare
benchmark effectiveness. However, we find limitations regarding the measure used to decide a mutant
kill. Either we need to stabilise the baseline by running each benchmark a higher number of times, or
we need to address the false negatives/positives caused by performance noise.</p>
      <p>Another interesting approach could be to study other metrics that allow us to measure the effectiveness
of a benchmark with higher precision, or to propose another mutant kill detection technique.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This project is financed by the ANR JCJC project convention ANR-25-CE25-0002-01.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-9">
      <title>References (continued)</title>
      <p>[13] S. Marr, B. Daloze, H. Mössenböck, Cross-language compiler benchmarking: are we fast yet?, ACM SIGPLAN Notices 52 (2016) 120–131.</p>
      <p>[14] A. Georges, D. Buytaert, L. Eeckhout, Statistically rigorous java performance evaluation, ACM SIGPLAN Notices 42 (2007) 57–76.</p>
      <p>[15] M. Böhme, L. Szekeres, J. Metzman, On the reliability of coverage-based fuzzer benchmarking, in: Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 1621–1633.</p>
      <p>[16] R. A. DeMillo, R. J. Lipton, F. G. Sayward, Hints on test data selection: Help for the practicing programmer, Computer 11 (1978) 34–41.</p>
      <p>[17] S. M. Blackburn, K. S. McKinley, R. Garner, C. Hoffmann, A. M. Khan, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, B. Wiedermann, Wake up and smell the coffee: Evaluation methodology for the 21st century, Commun. ACM 51 (2008). URL: https://doi.org/10.1145/1378704.1378723. doi:10.1145/1378704.1378723.</p>
      <p>[18] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, B. Wiedermann, The DaCapo benchmarks: Java benchmarking development and analysis, in: Object-Oriented Programming Systems, Languages, and Applications, OOPSLA '06, Association for Computing Machinery, New York, NY, USA, 2006, pp. 169–190. URL: https://doi.org/10.1145/1167473.1167488. doi:10.1145/1167473.1167488.</p>
      <p>[19] E. Weyuker, F. Vokolos, Experience with performance testing of software systems: issues, an approach, and case study, IEEE Transactions on Software Engineering 26 (2000) 1147–1156. doi:10.1109/32.888628.</p>
      <p>[20] J. v. Kistowski, J. A. Arnold, K. Huppler, K.-D. Lange, J. L. Henning, P. Cao, How to build a benchmark, in: ICPE '15, ACM, 2015. URL: https://doi.org/10.1145/2668930.2688819. doi:10.1145/2668930.2688819.</p>
      <p>[21] C. Laaber, P. Leitner, An evaluation of open-source software microbenchmark suites for continuous performance assessment, in: International Conference on Mining Software Repositories, MSR '18, 2018. URL: https://doi.org/10.1145/3196398.3196407. doi:10.1145/3196398.3196407.</p>
      <p>[22] A. Sarimbekov, L. Stadler, L. Bulej, A. Sewe, A. Podzimek, Y. Zheng, W. Binder, Workload characterization of jvm languages, Software: Practice and Experience 46 (2016) 1053–1089. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2337. doi:10.1002/spe.2337.</p>
      <p>[23] T. Ogasawara, Workload characterization of server-side javascript, in: 2014 IEEE International Symposium on Workload Characterization (IISWC), 2014, pp. 13–21. doi:10.1109/IISWC.2014.6983035.</p>
      <p>[24] A. Sewe, M. Mezini, A. Sarimbekov, W. Binder, Da capo con scala: Design and analysis of a scala benchmark suite for the java virtual machine, in: Object Oriented Programming Systems Languages and Applications, OOPSLA '11, 2011. URL: https://doi.org/10.1145/2048066.2048118. doi:10.1145/2048066.2048118.</p>
      <p>[25] F. Pizlo, JetStream benchmark suite. URL: https://browserbench.org/JetStream/, retrieved June 7, 2022.</p>
      <p>[26] S. Cazzulani, Octane: The JavaScript benchmark suite for the modern web. URL: https://blog.chromium.org/2012/08/octane-javascript-benchmark-suite-for.html, retrieved June 7, 2022.</p>
      <p>[27] Retiring Octane. URL: https://v8.dev/blog/retiring-octane, retrieved June 7, 2022.</p>
      <p>[28] Z. Cai, S. M. Blackburn, M. D. Bond, M. Maas, Distilling the real cost of production garbage collectors, CoRR abs/2112.07880 (2021). URL: https://arxiv.org/abs/2112.07880. arXiv:2112.07880.</p>
      <p>[29] D. Costa, C.-P. Bezemer, P. Leitner, A. Andrzejak, What's wrong with my benchmark results? Studying bad practices in JMH benchmarks, IEEE Transactions on Software Engineering (2021). doi:10.1109/TSE.2019.2925345.</p>
      <p>[30] C. Laaber, J. Scheuner, P. Leitner, Software microbenchmarking in the cloud. How bad is it really?,</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <article-title>A qualitative study on performance bugs</article-title>
          ,
          <source>in: Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR '12</source>
          , IEEE Press,
          <year>2012</year>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Summarizing evolutionary trajectory by grouping and aggregating relevant code changes</article-title>
          ,
          <source>in: International Conference on Software Analysis, Evolution, and Reengineering</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.-P.</given-names>
            <surname>Bezemer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eismann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ferme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grohmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jamshidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>van Hoorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villavicencio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Walter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Willnecker</surname>
          </string-name>
          ,
          <article-title>How is performance addressed in devops?</article-title>
          ,
          <source>in: ACM/SPEC International Conference on Performance Engineering, ICPE '19</source>
          , Association for Computing Machinery,
          <year>2019</year>
          . URL: https://doi.org/10.1145/3297663.3309672. doi:10.1145/3297663.3309672.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Leitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-P.</given-names>
            <surname>Bezemer</surname>
          </string-name>
          ,
          <article-title>An exploratory study of the state of practice of performance testing in java-based open source projects</article-title>
          ,
          <source>in: International Conference on Performance Engineering, ICPE '17</source>
          ,
          <year>2017</year>
          . URL: https://doi.org/10.1145/3030207.3030213. doi:10.1145/3030207.3030213.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Horky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bulej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tuma</surname>
          </string-name>
          ,
          <article-title>Unit testing performance in java projects: Are we there yet?</article-title>
          ,
          <source>in: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering</source>
          , ICPE '17, Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          . URL: https://doi.org/10.1145/3030207.3030226. doi:10.1145/3030207.3030226.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B. J.</given-names>
            <surname>Teoh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Identifying recurrent and unknown performance issues</article-title>
          ,
          <source>in: 2014 IEEE International Conference on Data Mining</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>320</fpage>
          -
          <lpage>329</lpage>
          . doi:10.1109/ICDM.2014.96.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nistor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Discovering, reporting, and fixing performance bugs</article-title>
          ,
          <source>in: 2013 10th Working Conference on Mining Software Repositories (MSR)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>246</lpage>
          . doi:10.1109/MSR.2013.6624035.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Horký</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Libič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Steinhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tůma</surname>
          </string-name>
          ,
          <article-title>Utilizing performance unit tests to increase performance awareness</article-title>
          ,
          <source>in: International Conference on Performance Engineering, ICPE '15</source>
          ,
          <year>2015</year>
          . URL: https://doi.org/10.1145/2668930.2688051. doi:10.1145/2668930.2688051.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ournani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Belgaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rouvoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Penhoat</surname>
          </string-name>
          ,
          <article-title>Evaluating the Impact of Java Virtual Machines on Energy Consumption</article-title>
          ,
          <source>in: 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)</source>
          , Bari, Italy,
          <year>2021</year>
          . URL: https://hal.inria.fr/hal-03275286.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Agadakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Williams-King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Kemerlis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Portokalidis</surname>
          </string-name>
          ,
          <article-title>Nibbler: Debloating binary shared libraries</article-title>
          ,
          <source>in: Proceedings of the 35th Annual Computer Security Applications Conference</source>
          , ACSAC '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>83</lpage>
          . URL: https://doi.org/10.1145/3359789.3359823. doi:10.1145/3359789.3359823.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Polito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fabresse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bouraqadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ducasse</surname>
          </string-name>
          ,
          <article-title>Run-fail-grow: Creating tailored object-oriented runtimes</article-title>
          ,
          <source>The Journal of Object Technology</source>
          <volume>16</volume>
          (
          <year>2017</year>
          ) 2:
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          . URL: https://hal.archives-ouvertes.fr/hal-01609295. doi:10.5381/jot.2017.16.3.a2.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Blackburn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Khang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>McKinley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bentzur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Diwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Feinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Frampton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Guyer</surname>
          </string-name>
          , et al.,
          <article-title>The DaCapo benchmarks: Java benchmarking development and analysis</article-title>
          ,
          <source>in: Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>