<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Isabelle Community Benchmark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabian Huch</string-name>
          <email>huch@in.tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Bode</string-name>
          <email>vincent.bode@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität München</institution>
          ,
          <addr-line>Boltzmannstraße 3, 85748 Garching</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Choosing hardware for theorem proving is no simple task: automated provers are highly complex and optimized programs, often utilizing a parallel computation model, and there is little prior research on the hardware impact on prover performance. To alleviate the problem for Isabelle, we initiated a community benchmark where the build time of HOL-Analysis is measured. On 54 distinct CPUs, a total of 669 runs with different Isabelle configurations were reported by Isabelle users. Results range from 107 s to over 11 h. We found that current consumer CPUs performed best, with an optimal number of 8 to 16 threads, largely independent of heap memory. As for hardware parameters, CPU base clock affected multi-threaded execution most with a linear correlation of 0.37, whereas boost frequency was the most influential parameter for single-threaded runs (correlation coefficient 0.55); cache size played no significant role. When comparing our benchmark scores with popular high-performance computing benchmarks, we found a strong linear relationship with Dolfyn (R² = 0.79) in the single-threaded scenario. Using data from the 3DMark CPU Profile consumer benchmark, we created a linear model for optimal (multi-threaded) Isabelle performance. When validating, the model has an average R²-score of 0.87; the mean absolute error in the final model corresponds to a wall-clock time of 46.6 s. With a dataset of true median values for the 3DMark, the error improves to 37.1 s.</p>
        <p>Keywords: Isabelle, theorem proving, user benchmark, run-time performance, performance prediction</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Choosing appropriate hardware and tuning configuration parameters is a common task when
one wants to run software optimally. For a complex and truly parallel interactive proof assistant
such as Isabelle, many factors influence run-time performance: The prover needs a Java and
a Meta Language (ML) run-time, the number of threads is variable, as is the amount of heap
memory – which in turn (in combination with the CPU architecture family) dictates which
ML platform and hence Poly/ML backend may be used. On a hardware level, CPU specs, the
memory hierarchy, and interconnects all influence how well the software components perform
and how the system as a whole behaves. The parallel efficiency of Isabelle (i.e., the speedup
over sequential execution divided by the number of parallel units) decays according to
a non-linear characteristic [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as is the case in most parallel systems. As a result, there is no
single hardware or software characteristic that dominates the observed performance behavior.
      </p>
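      <p>In standard notation, with sequential time \(T_1\), time \(T_p\) on \(p\) parallel units, and speedup \(S(p) = T_1 / T_p\), this is:</p>
      <p>\[ E(p) = \frac{S(p)}{p} = \frac{T_1}{p \cdot T_p} \]</p>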
      <p>In Isabelle, performance is important both in the interactive mode (such that processing
changes and running solvers is faster) and in a batch build mode, where sessions (i.e., collections
of formalizations) can be processed. Independent sessions can even be run in parallel with
multiple ML processes.</p>
      <p>However, making informed decisions on hardware is no trivial task. Members of the Isabelle
community have to rely on word of mouth to determine which processors and memory to use,
and configuration parameters (such as the number of threads or heap sizes) are largely folk
knowledge – backed by experience collected over the years, ad-hoc experiments, and sometimes
intuition. While there is some performance data available, it is not helpful in that regard as it
only covers a very small number of machines.</p>
      <p>With new and exciting hardware being developed at a fast pace, one can often be overwhelmed
by the sheer variety of hardware options available. Hence, the question of which hardware
to recommend for Isabelle can often not be answered exhaustively or satisfactorily. This is
relevant both for individuals working with Isabelle, and for the larger-scale server infrastructure
maintained for the continuous integration of Isabelle and the Archive of Formal Proofs.</p>
      <p>To alleviate this problem, a solid database of performance benchmark results for a wide
variety of involved hardware and configurations is needed. Not only would that directly answer
the question of optimal configurations for a given system and allow one to compare the hardware
on the market, but such a collection of data (if large enough, and kept up to date) would also
allow one to predict performance of other hardware for which no Isabelle data is available yet.</p>
      <p>In this paper, we outline our Isabelle community benchmark, discuss the immediate results
and findings, and derive a model to predict the Isabelle performance of unknown CPUs with
the help of widely used benchmarks for which more data is retrievable. Our source code and
data are made available publicly (2022-paper folder at https://isabelle.systems/benchmark).
Section 2 covers related work; we explain our benchmark set-up in Section 3, and discuss the
results in Section 4. In Section 5, we conclude and discuss future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Parallel run-time performance was first analyzed for Isabelle when parallelism was
introduced by Wenzel in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Benchmarks for multiple different sessions on a single test machine
already showed that the speedup (in terms of run-time) peaked at three worker threads with
a factor of 3.0, and slightly decreased for four cores. Matthews and Wenzel described the
adaptations to the Poly/ML run-time that were necessary for introducing parallelism,
and analyzed the resulting bottlenecks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. They found that the parallelization model for Isabelle
sometimes failed to fully utilize all worker threads. Moreover, the synchronization model that
uses a single signal across all threads for guarded access was identified (but not analyzed) as
a potential bottleneck. Finally, it was observed that the single-threaded garbage collection is
responsible for up to 30 % of the CPU time at 16 threads. Overall, a maximum speedup of 5.0 to 6.2
could be achieved using 8 threads.
      </p>
      <p>
        In automatic theorem provers, run-time is an important factor, since it can dictate whether a
goal can be proven within the given cut-off time. As a result, much research includes analysis
of the run-time performance of provers or individual prover components. Typically, only a
single hardware configuration is used, which is reasonable for the analysis of single-threaded
systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, since performing such analysis on a wide range of different hardware is
often impractical, run-time performance analysis of parallel approaches is frequently carried
out on single systems or clusters [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
          ]. These results do not always generalize, because the
hardware used can have a significant impact on the observed results.
      </p>
      <p>
        In contrast, results for the Isabelle Sledgehammer proof-finder tool show that when running
multiple automatic provers to solve a goal, run-time becomes less important: In their Judgement
Day study [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Böhme and Nipkow found that running three different automated theorem
provers for five seconds each solved as many goals as running the most effective one for 120 s.
Subsequently, in direct follow-up work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], run-time was not analyzed.
      </p>
      <p>
        For automatic provers, a large range of benchmarks exists to judge their effectiveness on a
given set of problems. One of these is the widely known TPTP library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, there is
not much work investigating the effect of hardware in the field of automated reasoning. To the
best of our knowledge, there exists no other benchmark comparing the hardware impact on
run-time performance of any theorem prover, and this is the first work that analyzes this effect
on a wide range of different hardware.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Benchmarking Methodology</title>
      <p>The benchmark has to fulfill multiple requirements: It needs to capture typical computations
found in Isabelle (mostly symbolic computation), have a reasonable run-time for the end
user, and motivate users to want to see how their machines perform (i.e., results should be
self-evident). We settled on a clean build of the HOL-Analysis session: It is a typical Isabelle
formalization which runs in approximately five minutes on typical consumer machines. Many
Isabelle users have likely run a similar workload for their own work in the past.</p>
      <p>While users can easily contribute results for their favourite Isabelle configuration, we supplied
a small script to run a comparable set of configurations automatically (documentation and code
at https://isabelle.systems/benchmark). This way, the whole benchmark can be run with a single
command (assuming a working installation of Isabelle 2021-1) on any platform. We vary the
underlying ML platform between 64_32 (64-bit mode with 32-bit values) and true 64-bit mode,
the heap sizes of both the ML and JVM process (set to the same value to reduce the number of
combinations, as early benchmark results indicated they play only a minor role here), and the
number of worker threads for parallel proof checking.</p>
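      <p>As a minimal sketch of such a sweep (not the actual benchmark script, which is linked above): the threads option of isabelle build is standard, whereas the heap settings via ML_OPTIONS and ISABELLE_TOOL_JAVA_OPTIONS are assumptions that may differ between Isabelle versions:</p>
      <preformat>
import itertools
import os
import subprocess
import time

THREADS = [1, 2, 4, 8, 16, 32]   # worker threads for parallel proof checking
HEAPS_GB = [4, 8, 16, 32]        # ML and JVM heap, set to the same value

for threads, heap in itertools.product(THREADS, HEAPS_GB):
    env = dict(os.environ)
    # Assumed heap configuration for the Poly/ML and JVM processes.
    env["ML_OPTIONS"] = "--minheap {0}G --maxheap {0}G".format(heap)
    env["ISABELLE_TOOL_JAVA_OPTIONS"] = "-Xms{0}g -Xmx{0}g".format(heap)
    start = time.monotonic()
    # Clean build of HOL-Analysis with the given number of threads.
    subprocess.run(["isabelle", "build", "-c",
                    "-o", "threads={0}".format(threads),
                    "HOL-Analysis"], check=True, env=env)
    print(threads, heap, time.monotonic() - start)
      </preformat>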
      <p>Results are collected in a collaborative spreadsheet (https://docs.google.com/spreadsheets/d/12GhEwSNSopowDBq5gSem3u39fliiIcoTIZHMnX4RE3A)
with automatically updated figures for the fastest CPUs and parallel efficiency. The benchmark
is not intended as a one-shot experiment, but rather as a continuous community effort to
maintain an overview of the Isabelle computing landscape as new hardware emerges. It is being
kept open for new results, and will be maintained for future Isabelle versions.</p>
      <sec id="sec-3-1">
        <title>2Documentation and code at https://isabelle.systems/benchmark 3https://docs.google.com/spreadsheets/d/12GhEwSNSopowDBq5gSem3u39fliiIcoTIZHMnX4RE3A</title>
        <sec id="sec-3-1-1">
          <title>3.1. Benchmark Score (Isascore)</title>
          <p>For the benchmark results, we use the wall-clock build time as an intuitive metric. Together with
the well-known HOL-Analysis build target, the metric immediately gives a good understanding
of Isabelle performance. However, it is not well suited to compare to other metrics such as
throughput, because the relationship between time to solution and throughput is inverse. To
still allow using simple linear models such as the Pearson correlation, we introduce a benchmark
score that we call Isascore. It reflects the number of HOL-Analysis runs one could complete in a
day, i.e.:</p>
          <p>\[ \text{Isascore} = \frac{1\,\text{d}}{\text{wall-clock time}} \tag{1} \]</p>
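          <p>A minimal computation of this score from a measured build time (in seconds) could look as follows; the 107 s example value is the fastest result reported in this paper:</p>
          <preformat>
SECONDS_PER_DAY = 86400

def isascore(wall_clock_seconds):
    # Number of HOL-Analysis builds that fit into one day, Equation (1).
    return SECONDS_PER_DAY / wall_clock_seconds

print(isascore(107))  # fastest reported run: about 807.5 builds/day
print(isascore(300))  # a typical five-minute build: 288.0 builds/day
          </preformat>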
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Threats to Validity</title>
          <p>The experiments discussed in this paper could not be performed in a controlled environment,
since they were run by members of the Isabelle community rather than exclusively by the authors
of this paper. This means that various outside factors may have influenced the reported results,
though it seems reasonable to assume that those factors should usually be constant between
different configurations of the same benchmark run. For hardware where we received several
independent measurements, the effect of machine-local anomalies can be mitigated with
statistical techniques. Furthermore, for reasons of practicality in orchestrating data collection,
extended system specifics beyond the CPU model, OS, and memory configuration were not
recorded. There is a possibility that relevant parameters may have been missed. Therefore, like
all performance benchmarks, these results represent upper bounds of what might be achieved
with a given system configuration. Lastly, while the benchmark was posted on the Isabelle-users
mailing list, the data entry was in principle open to the public and could have been misused.</p>
        </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>At the time of writing this paper, 669 results for a total of 594 unique configurations have been
reported, utilizing 54 distinct CPUs. Those include Intel desktop and server CPUs from Sandy
Bridge to Alder Lake, AMD Ryzen Zen2 to Zen4 processors as well as Epyc and Threadripper
server systems, a Fujitsu A64FX, and Apple M1 processors.</p>
      <p>[Table 1 (the five CPUs with the lowest time to solution): 12th Gen Intel(R) Core(TM) i7-12700K,
Apple M1 Pro (native arm64), AMD Ryzen 9 5900X, Intel(R) Core(TM) i9-9900K, Apple M1 Pro]</p>
      <p>Table 1 shows the five CPUs with the lowest time to solution, using the median value as
an aggregate for multiple runs of the same configuration. Older Intel and AMD consumer
hardware is surpassed by the Apple M1 Pro chip; only the most recent Intel core line performs
better. Due to the nature of the benchmark, server and high performance hardware does not
rank highly, with the best performing system (2x AMD Epyc 7742) clocking in at 184 s.</p>
      <p>In the following, we analyze how Isabelle configuration influences performance, investigate
the impact of hardware parameters, and then compare our results to other computational
benchmarks. Where individual CPUs were concerned, we filtered out special system configurations
(e.g., overclocked hardware, dual-CPU systems, power-saving mode). We also encountered a
small number of extreme outliers where Isabelle run-time was much longer than expected. For
two of those, we could identify the user and investigate; in both cases, the system configuration
was at fault (excessive swapping, UEFI set to “silent mode”) and when corrected, results were
much closer to the rest of the data. We could not investigate the third extreme outlier but
excluded it from the following, since it is likely to stem from a similar cause.</p>
      <sec id="sec-4-1">
        <title>4.1. Multi-Threaded Performance</title>
        <p>
          The number of threads used plays a major role in the overall performance. Figure 1 illustrates
how the wall clock time and CPU time compare from a single thread to up to 128 threads. The
optimal wall-clock time is achieved with 8 to 16 threads, depending on the hardware; beyond
that, run-time increases substantially. This is typical behavior for a strong-scaling benchmark
like ours, where the relative impact of communication increases with the number of
threads used.
The underlying limitations of the parallel computing model – the single-threaded garbage
collection of the Poly/ML run-time and worker starvation after parallelization is saturated –
were already discussed in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], although those tests were run on a machine with 32 cores. It might be a
surprise that the scalability is so low when distributing across more threads. In contrast, the CPU
time divided by number of threads (which is not an ideal metric, but the only feasible solution
due to the nature of the benchmark) flattens out at eight threads. In small-scale experiments,
we found that the JVM process takes up a constant amount of CPU time independent of the
number of threads (about 26 % in single-core mode). This means that there is not too much
computation overhead, but the hardware cannot be properly utilized by the ML process, most
likely due to the single-threaded garbage collection that stops all threads when running. This
is an inherently sequential task, which means that Amdahl’s Law (the speedup is limited by
1/(1 − parallelizable portion) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]) limits the achievable speedup for this problem.
        </p>
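        <p>As an illustration of this limit (with an assumed parallelizable fraction, not a measured Isabelle figure):</p>
        <preformat>
def amdahl_speedup(f, p):
    # Amdahl's Law [10]: speedup with parallelizable fraction f on p units.
    return 1.0 / ((1.0 - f) + f / p)

# With f = 0.9, the speedup is bounded by 1 / (1 - 0.9) = 10,
# no matter how many threads are added.
for p in (8, 16, 32, 128):
    print(p, round(amdahl_speedup(0.9, p), 2))  # 4.71, 6.4, 7.8, 9.34
        </preformat>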
        <p>The parallel efficiency paints a similar picture in Figure 2, decreasing almost linearly (on the
logarithmic x-axis) up to 32 threads, at which point it is at a median of 0.065. With the number of
threads tending to the limit of 128, it approaches 0.002. There is an outlier where the parallel
efficiency is over one: super-linear speedup is unusual but can appear in practice because of
caching effects or resource contention in the measured system.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Impact of Heap Memory</title>
        <p>As preliminary results indicated that heap memory (as long as sufficient) only plays a minor
role in performance, we keep the JVM and Poly/ML processes at the same heap size. We know
from experience that a few gigabytes of memory suffice for HOL-Analysis; however, increased
parallelism requires more memory in principle due to memory overhead. Hence, the range of
examined heap sizes depends on the number of threads used. Figure 3 shows the change in
run-time for different heap settings relative to the minimal setting. The boxes capture the 25th
and 75th percentiles as height and sample size as width; whiskers correspond to the extreme
values. The results show that performance is not affected very much by heap size. Following
the line of medians, wall-clock time slightly increases above 16 GB (where the 64-bit Poly/ML
backend needs to be used, as the more efficient 64_32 mode does not allow more than 16 GB),
as well as for very large values. We observed a single outlier for 64 threads and 128 GB heap
memory at a relative factor of 0.66.</p>
        <p>[Figure 3: relative run-time against heap memory (log scale, 8 GB to 512 GB) for 8, 16, 32,
and 64 threads; y-axis: relative run-time (0.8 to 1.2), medians marked]</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Influence of Hardware Characteristics</title>
        <p>Based on folk knowledge about Isabelle performance, we suspected that cache size would be a
major factor; it was debated whether boost clock speed would be relevant. To test the hypotheses,
we analyzed the impact of size of the L3-cache, base clock speed, and maximal (boost) clock
speed (ignoring power-save cores where applicable) on Isabelle performance. Table 2 shows the
correlation between Isascore and those parameters (APA notation as explained in caption). At
our significance level of 0.05, we did not find cache size to impact performance significantly.
Base frequency is weakly correlated with the Isascore for a single thread (though at the edge
of significance) and a bit more strongly (and much more significantly) in the multi-threaded
scenarios. Finally, boost frequency has a significant medium correlation for all modes, which is
strongest in the single-threaded configuration with a value of 0.55.</p>
        <p>A possible explanation is that boost frequency can only be sustained for a single core in most
CPUs, hence single-threaded performance profits from it a lot; in the multi-threaded scenario,
the actual core frequency is much closer to the base frequency and thus its impact is larger.</p>
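        <p>As a sketch of the underlying test (assuming per-CPU arrays of Isascore values and a hardware parameter such as boost clock), the correlations and p-values can be computed with a standard Pearson test:</p>
        <preformat>
from scipy.stats import pearsonr

def correlate(isascores, boost_clock_ghz):
    # Pearson's r and two-sided p-value; the paper uses a significance
    # level of 0.05 to decide whether a parameter matters.
    r, p_value = pearsonr(isascores, boost_clock_ghz)
    return r, p_value
        </preformat>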
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Comparison to Computational Benchmarks</title>
        <p>Performance benchmarks exist for many applications; additionally, synthetic benchmarks are
often used to evaluate hardware performance. They can roughly be categorized into scientific
computing versus consumer benchmarks. In the following, we compare the results of our Isabelle
community benchmark with a number of publicly available datasets for such benchmarks. For
the comparison, we selected results with matching processors, and matched the benchmarks’
multi-thread setting (e.g., specific thread count, or all cores). To obtain sufficiently large datasets,
we selected some of the most popular benchmarks.</p>
      <sec id="sec-4-4-1">
        <title>4.4.1. Benchmarks in High-Performance Computing</title>
        <p>The first analysis we wish to conduct is a comparison of Isabelle performance with some
scientific programs. For this analysis, we chose to import data from the High Performance
Computing suite on OpenBenchmarking. We selected the three benchmarks that had the most
public results available (in their primary configuration) at the time of writing: Himeno, NAMD,
and Dolfyn. Himeno (results from https://openbenchmarking.org/test/pts/himeno) is an
incompressible fluid analysis code written in Fortran and C [11]. While a distributed-memory
parallel version exists (using MPI with Fortran VPP), we concern ourselves with the sequential
implementation. NAMD (results from https://openbenchmarking.org/test/pts/namd) is a
shared-memory parallel molecular dynamics code based on C++ and Charm++ [12]. The data
we use stems from machine-wide parallel trials. Finally, Dolfyn (results from
https://openbenchmarking.org/test/pts/dolfyn) is a sequential computational fluid dynamics
code based on Fortran [13].</p>
        <p>Figure 4 shows the results when correlating each of the high-performance computing
benchmarks with Isabelle performance. Himeno reports performance in terms of work done over
time (where higher is better), while NAMD and Dolfyn measure time (per simulated ns, and
to solution, respectively; lower is better). For Himeno, we therefore compare against the Isascore,
while for NAMD and Dolfyn we compare against our observed wall-clock time.</p>
        <p>NAMD, as the only benchmark of these three that scales well with parallel resources, has
no significant correlation with single-threaded Isabelle time. However, it has a strong linear
relation with multi-threaded time. The two less scalable benchmarks correlate much closer
with Isabelle single-thread performance, where Dolfyn has a particularly nice correlation that
holds well for the most performant processors. In both cases, correlation with multi-threaded
Isabelle results is much worse (R²-values: Himeno 0.46, Dolfyn 0.52). For both the Isabelle
benchmark and Dolfyn, the top processor that was tested is the same (Intel i7-12700K), and
on both benchmarks it has a clear margin over the runners-up. This is also visible in the Himeno
benchmark, where the 12700K produces the highest floating point throughput of all tested
processors. However, it is not a highly parallel processor, which is why its NAMD results are
less favorable. This again shows that Isabelle performance is significantly impacted by the
single-thread performance of the underlying processor.</p>
      </sec>
      <sec id="sec-4-4-2">
        <title>4.4.2. Consumer CPU Benchmarks</title>
        <p>For our second comparison, we chose some of the most common consumer benchmarks:
PassMark CPU Mark (results from https://www.cpubenchmark.net/CPU_mega_page.html),
Geekbench 5 (results from https://browser.geekbench.com/processor-benchmarks.json),
Cinebench R15 (results from https://us.rebusfarm.net/en/tempbench), and 3DMark CPU Profile
(results from https://www.3dmark.com/search, median over the top-100 values).
For sequential performance, Figure 5 shows the scatter plots of Isascore against consumer
benchmark scores, which are normalized to a [0; 1] range so that the plots can be compared.
A strong positive relationship can be observed for all benchmarks, with R²-values in the range 0.63 to
        <sec id="sec-4-4-1">
          <title>7Results from https://www.cpubenchmark.net/CPU_mega_page.html</title>
          <p>8Results from https://browser.geekbench.com/processor-benchmarks.json
9Results from https://us.rebusfarm.net/en/tempbench
10Results from https://www.3dmark.com/search, median over the top-100 values
0.82. A few moderate outliers are present (possibly due to system configuration). All in all, the
Isabelle benchmark seems quite similar to those consumer benchmarks for a single thread.</p>
          <p>[Figure 5: Isascore against normalized consumer benchmark scores; panels for Cinebench R15,
Geekbench 5 (R² = 0.63), PassMark CPU Mark, and 3DMark CPU Profile]</p>
          <p>This gives rise to predicting Isabelle performance, which would allow one to judge hardware
on which Isabelle has not been executed before. However, single-threaded results are not
meaningful for real-world performance, and scaling them according to the average parallel
efficiency did not yield helpful results (R²-values: Cinebench 0.59, Geekbench 0.70, PassMark
0.68, 3DMark 0.58). Not many datasets for consumer benchmarks report results for different
numbers of threads; most report only a single “multi-core” value where all threads are utilized.
An exception to that is the 3DMark CPU Profile benchmark, where results are reported for 1 to
16 threads individually (in steps of powers of two). This allows us to create a better correlation,
because all consumer benchmarks tested had a far better parallel efficiency in the limit and
were hence not suited for direct prediction.</p>
          <p>When using 8 and 16 threads in both the 3DMark and Isabelle benchmark, score and Isascore
are strongly to moderately correlated, with individual R²-values of 0.77 and 0.61, respectively.
This makes the 3DMark well suited for performance prediction. Since the optimal number
of threads lies in between, we use the average of its 8-thread and 16-thread results to create a
linear model for performance prediction (tuning for a non-uniform split did not yield better
results). Using ten times ten-fold cross-validation (i.e., averaging results over multiple iterations,
splitting the data into ten parts, and using each part as a test set and the remainder as training
set), the linear regression has an average R²-value of 0.87. Figure 6 shows the final model
(R² = 0.84) and the resulting predictor for wall-clock time, which has a mean absolute error
(MAE) of 46.6 s. However, that error is somewhat exaggerated by the data collection method:
the public 3DMark data shows only the top-100 results, from which we use the median
values. The regression improves slightly when the (non-public) true medians are used, and the
MAE decreases to 37.1 s.</p>
          <p>[Figure 6: final linear model (R² = 0.84) and resulting wall-clock-time predictor against the
averaged 8/16-thread 3DMark CPU Profile score]</p>
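          <p>A sketch of this modeling step (with placeholder data standing in for the real per-CPU measurements; scikit-learn is our assumed tool choice here, not necessarily what was used for the paper):</p>
          <preformat>
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder data; replace with the real per-CPU benchmark results.
rng = np.random.default_rng(0)
score_3dmark = rng.uniform(1000, 9000, size=54)          # avg. 8/16-thread score
isascore = 0.08 * score_3dmark + rng.normal(0, 40, 54)   # measured Isascore

# Ten times ten-fold cross-validation of a simple linear model.
X = score_3dmark.reshape(-1, 1)
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
r2_scores = cross_val_score(LinearRegression(), X, isascore, cv=cv, scoring="r2")
print(r2_scores.mean())  # the paper reports an average R2 of 0.87
          </preformat>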
          <p>The residual plot displayed in Figure 7 has no noticeable patterns and the residual distribution
roughly follows a normal distribution. All in all, the model simplicity and good fit indicate that
this linear model is quite well suited for performance prediction, as long as the other system
parameters are kept within reasonable bounds and the configuration is well tuned.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work answers our questions on Isabelle performance for the 2021-1 version. The Isabelle
community benchmark that we initiated saw lively participation, and hundreds of results were
reported for a total of 54 distinct CPUs. The results form a solid database for tuning Isabelle
configurations; when not constrained, the optimal configuration is at 8 to 16 threads with 16 GB
heap memory for both the Java and ML process (at least for HOL-Analysis; larger sessions might
require more).</p>
      <p>When buying new hardware, the benchmark results give a good indication of which processor
is desirable for Isabelle. Individual CPU parameters are not as important: clock speeds are
only correlated with medium strength (though boost clock more than base clock), and cache
size not significantly at all. Instead, for hardware that has not yet been tested with Isabelle,
other benchmarks can greatly help in judging performance: while the single-threaded Isabelle
benchmark score is strongly correlated with many benchmarks (most strongly with the Dolfyn
high-performance benchmark and the PassMark CPU Mark), the multi-threaded scenario was
more difficult to model. In the end, we found a good predictor by using 3DMark CPU Profile
scores from 8 and 16 threads, with a final mean absolute error of 46.56 s. The model has a good
fit, and one can assume that it is fairly future-proof, given that hardware from the last ten years
is properly predicted.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>If the Isabelle computation model does not change, there is not much left to be done on the
topic of performance prediction: The benchmark from which Isabelle performance is predicted
is widely popular and data for new hardware is usually quickly added, and the fit is about as
good as one can hope for. Still, the model should be validated after a few years, and we are
curious to see whether future hardware characteristics will be able to break the trend.</p>
      <p>One other aspect that we did not touch on is running multiple independent sessions in parallel,
which is often possible on automated large-scale Isabelle builds, e.g., for the Archive of Formal
Proofs. This can be done with multiple processes that run independently of each other and greatly
increases the usefulness of larger server CPUs with many cores; then, other parameters such as
memory bandwidth might be more important and have to be analyzed. However, given the large
cost of such machines, it would be much more economical to instead distribute the build on
multiple cheap but fast desktop CPUs (especially when latency is a concern, not throughput).</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Thanks to all the countless individuals who participated in the benchmark. Parts of the
performance results have been obtained on systems in the test environment BEAST (Bavarian
Energy Architecture &amp; Software Testbed) at the Leibniz Supercomputing Centre. NAMD was
developed by the Theoretical and Computational Biophysics Group in the Beckman Institute
for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. Many
thanks to Phoronix Media (OpenBenchmarking), PassMark Software (PassMark), RebusFarm
GmbH (Cinebench), Primate Labs Inc. (Geekbench) and especially UL Benchmarks (3DMark),
for providing their benchmark data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wenzel</surname>
          </string-name>
          , Parallel Proof Checking in Isabelle/Isar,
          <source>Proceedings of the ACM SIGSAM 2009 Internaional Workshop on Programming Languages for Mechanized Mathematics Systems (PLLMS</source>
          <year>2009</year>
          )
          <article-title>(</article-title>
          <year>2009</year>
          )
          <fpage>13</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Matthews</surname>
          </string-name>
          , M. Wenzel,
          <article-title>Eficient parallel programming in Poly/ML and Isabelle/ML, in:</article-title>
          <source>DAMP'10 - Proceedings of the 2010 ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming</source>
          , ACM Press, New York, New York, USA,
          <year>2010</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          . doi:
          <volume>10</volume>
          .1145/1708046.1708058.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Möhrmann</surname>
          </string-name>
          ,
          <article-title>Performance of clause selection heuristics for saturation-based theorem proving</article-title>
          ,
          <source>in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , volume
          <volume>9706</volume>
          , Springer Verlag,
          <year>2016</year>
          , pp.
          <fpage>330</fpage>
          -
          <lpage>345</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -40229-1_
          <fpage>23</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ertel</surname>
          </string-name>
          ,
          <article-title>Performance Analysis of Competitive OR-Parallel Theorem Proving</article-title>
          ,
          <source>Technical Report, Technische Universität München</source>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jindal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Overbeek</surname>
          </string-name>
          , W. C.
          <article-title>Kabat, Exploitation of parallel processing for implementing high-performance deduction systems</article-title>
          ,
          <source>Journal of Automated Reasoning</source>
          <volume>8</volume>
          (
          <year>1992</year>
          )
          <fpage>23</fpage>
          -
          <lpage>38</lpage>
          . doi:
          <volume>10</volume>
          .1007/BF00263447.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Parallelization of a hyper-linking-based theorem prover</article-title>
          ,
          <source>Journal of Automated Reasoning</source>
          <volume>26</volume>
          (
          <year>2001</year>
          )
          <fpage>67</fpage>
          -
          <lpage>106</lpage>
          . doi:
          <volume>10</volume>
          .1023/A:
          <fpage>1006469202251</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Böhme</surname>
          </string-name>
          , T. Nipkow, Sledgehammer: Judgement day, in: J.
          <string-name>
            <surname>Giesl</surname>
          </string-name>
          , R. Hähnle (Eds.),
          <source>Automated Reasoning (IJCAR</source>
          <year>2010</year>
          ), volume
          <volume>6173</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2010</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Blanchette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Böhme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Paulson</surname>
          </string-name>
          ,
          <article-title>Extending sledgehammer with SMT solvers</article-title>
          ,
          <source>in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , volume
          <volume>6803</volume>
          LNAI, Springer, Berlin, Heidelberg,
          <year>2011</year>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>130</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>642</fpage>
          -22438-6_
          <fpage>11</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sutclife</surname>
          </string-name>
          ,
          <article-title>The TPTP problem library and associated infrastructure: The FOF</article-title>
          and CNF Parts,
          <year>v3</year>
          .
          <fpage>5</fpage>
          .0,
          <source>Journal of Automated Reasoning</source>
          <volume>43</volume>
          (
          <year>2009</year>
          )
          <fpage>337</fpage>
          -
          <lpage>362</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10817- 009-9143-8.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>M. D. Hill</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          <string-name>
            <surname>Marty</surname>
          </string-name>
          ,
          <article-title>Amdahl's law in the multicore era</article-title>
          ,
          <source>Computer</source>
          <volume>41</volume>
          (
          <year>2008</year>
          )
          <fpage>33</fpage>
          -
          <lpage>38</lpage>
          . doi:
          <volume>10</volume>
          .1109/
          <string-name>
            <surname>MC</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <volume>209</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>