The Isabelle Community Benchmark

Fabian Huch, Vincent Bode
Technische Universität München, Boltzmannstraße 3, 85748 Garching, Germany

PAAR'22: 8th Workshop on Practical Aspects of Automated Reasoning, August 11–12, 2022, Haifa, Israel
Email: huch@in.tum.de (F. Huch), vincent.bode@tum.de (V. Bode)
Homepage: https://home.in.tum.de/~huch (F. Huch), https://home.in.tum.de/~bodev (V. Bode)
ORCID: 0000-0002-9418-1580 (F. Huch), 0000-0003-2353-389X (V. Bode)

Abstract

Choosing hardware for theorem proving is no simple task: automated provers are highly complex and optimized programs, often utilizing a parallel computation model, and there is little prior research on the hardware impact on prover performance. To alleviate the problem for Isabelle, we initiated a community benchmark where the build time of HOL-Analysis is measured. On 54 distinct CPUs, a total of 669 runs with different Isabelle configurations were reported by Isabelle users. Results range from 107 s to over 11 h. We found that current consumer CPUs performed best, with an optimal number of 8 to 16 threads, largely independent of heap memory. As for hardware parameters, CPU base clock affected multi-threaded execution most with a linear correlation of 0.37, whereas boost frequency was the most influential parameter for single-threaded runs (correlation coefficient 0.55); cache size played no significant role. When comparing our benchmark scores with popular high-performance computing benchmarks, we found a strong linear relationship with Dolfyn (𝑅² = 0.79) in the single-threaded scenario. Using data from the 3DMark CPU Profile consumer benchmark, we created a linear model for optimal (multi-threaded) Isabelle performance. When validating, the model has an average 𝑅² score of 0.87; the mean absolute error in the final model corresponds to a wall-clock time of 46.6 s. With a dataset of true median values for the 3DMark, the error improves to 37.1 s.

Keywords

Isabelle, theorem proving, user benchmark, run-time performance, performance prediction

1. Introduction

Choosing appropriate hardware and tuning configuration parameters is a common task when one wants to run software optimally. For a complex and truly parallel interactive proof assistant such as Isabelle, many factors influence run-time performance: The prover needs a Java and a Meta Language (ML) run-time, the number of threads is variable, as is the amount of heap memory – which in turn (in combination with the CPU architecture family) dictates which ML platform and hence Poly/ML backend may be used. On a hardware level, CPU specs, the memory hierarchy, and interconnects all influence how well the software components perform and how the system as a whole behaves. The parallel efficiency of Isabelle (i.e., the ratio of sequential time to actual run-time, divided by the number of parallel units) decays according to a non-linear characteristic [1], as is the case in most parallel systems. As a result, there is no single hardware or software characteristic that dominates the observed performance behavior.

In Isabelle, performance is important both in the interactive mode (such that processing changes and running solvers is faster) and in a batch build mode, where sessions (i.e., collections of formalizations) can be processed.
Independent sessions can even be run in parallel with multiple ML processes. However, making informed decisions on hardware is no trivial task. Members of the Isabelle community have to rely on word of mouth to determine which processors and memory to use, and configuration parameters (such as the number of threads or heap sizes) are largely folk knowledge – backed by experience collected over the years, ad-hoc experiments, and sometimes intuition. While there is some performance data available, it is not helpful in that regard as it only covers a very small number of machines. With new and exciting hardware being developed at a fast pace, one can often be overwhelmed by the sheer variety of hardware options available. Hence, the question of which hardware to recommend for Isabelle can often not be answered exhaustively or satisfactorily. This is relevant both for individuals working with Isabelle, and for the larger-scale server infrastructure maintained for continuous integration of Isabelle and the Archive of Formal Proofs.

To alleviate this problem, a solid database with performance benchmark results for a wide variety of involved hardware and configurations is needed. Not only would that directly answer the question of optimal configurations for a given system and allow one to compare the hardware on the market, but such a collection of data (if large enough, and kept up to date) would also allow one to predict performance of other hardware for which no Isabelle data is available yet.

In this paper, we outline our Isabelle community benchmark, discuss the immediate results and findings, and derive a model to predict the Isabelle performance of unknown CPUs with the help of widely used benchmarks for which more data is retrievable. Our source code and data are made available publicly¹. Section 2 covers related work; we explain our benchmark set-up in Section 3, and discuss the results in Section 4. In Section 5, we conclude and discuss future work.

¹ 2022-paper folder in https://isabelle.systems/benchmark

2. Related Work

Parallel run-time performance was first analyzed for Isabelle when parallelism was introduced by Wenzel in [1]. Benchmarks for multiple different sessions on a single test machine already showed that the speedup (in terms of run-time) peaked at three worker threads with a factor of 3.0, and slightly decreased for four cores. Matthews and Wenzel described the adaptations to the Poly/ML run-time that were necessary for introducing parallelism, and analyzed the resulting bottlenecks [2]. They found that the parallelization model for Isabelle sometimes failed to fully utilize all worker threads. Moreover, the synchronization model that uses a single signal across all threads for guarded access was identified (but not analyzed) as a potential bottleneck. Finally, it was observed that the single-threaded garbage collection is responsible for up to 30 % of the CPU time at 16 threads. Overall, a maximum speedup of 5.0 to 6.2 could be achieved using 8 threads.

In automatic theorem provers, run-time is an important factor, since it can dictate whether a goal can be proven within the given cut-off time. As a result, much research includes analysis of the run-time performance of provers or individual prover components. Typically, only a single hardware configuration is used, which is reasonable for the analysis of single-threaded systems [3].
However, since performing such analysis on a wide range of different hardware is often impractical, run-time performance analysis of parallel approaches is frequently carried out on single systems or clusters [4, 5, 6]. These results don't always generalize, because the hardware used can have a significant impact on the observed results. In contrast, results for the Isabelle sledgehammer proof-finder tool show that when running multiple automatic provers to solve a goal, run-time becomes less important: In their judgement day study [7], Böhme and Nipkow found that running three different Automated Theorem Provers for five seconds each solved as many goals as running the most effective one for 120 s. Subsequently, in direct follow-up work [8], run-time was not analyzed.

For automatic provers, a large range of benchmarks exists to judge their effectiveness on a given set of problems. One of these is the widely known TPTP library [9]. However, there is not much work investigating the effect of hardware in the field of automated reasoning. To the best of our knowledge, there exists no other benchmark comparing the hardware impact on run-time performance of any theorem prover, and this is the first work that analyzes this effect on a wide range of different hardware.

3. Benchmarking Methodology

The benchmark has to fulfill multiple requirements: It needs to capture typical computations found in Isabelle (mostly symbolic computation), have a reasonable run-time for the end user, and motivate users to want to see how their machines perform (i.e., results should be self-evident). We settled on a clean build of the HOL-Analysis session: It is a typical Isabelle formalization which runs in approximately five minutes on typical consumer machines. Many Isabelle users have likely run a similar workload for their own work in the past.

While users can easily contribute results for their favourite Isabelle configuration, we supplied a small script to run a comparable set of configurations automatically². This way, the whole benchmark can be run with a single command (assuming a working installation of Isabelle 2021-1) on any platform. We vary the underlying ML platform between 64_32 (64-bit mode with 32-bit values) and true 64-bit mode, heap sizes of both the ML and JVM process (set to the same value to reduce the number of linear combinations, as early benchmark results indicated they play only a minor role here), and the number of worker threads for parallel proof checking. Results are collected in a collaborative spreadsheet³ with automatically updated figures for fastest CPUs and parallel efficiency. The benchmark is not intended as a one-shot experiment, but rather as a continuous community effort to maintain an overview of the Isabelle computing landscape as new hardware emerges. It is being kept open for new results, and will be maintained for future Isabelle versions.

² Documentation and code at https://isabelle.systems/benchmark
³ https://docs.google.com/spreadsheets/d/12GhEwSNSopowDBq5gSem3u39fliiIcoTIZHMnX4RE3A

3.1. Benchmark Score (Isascore)

For the benchmark results, we use the wall-clock build time as an intuitive metric. Together with the well-known HOL-Analysis build target, the metric immediately gives a good understanding of Isabelle performance. However, it is not well suited for comparison with other metrics such as throughput, because the relationship between time to solution and throughput is inverse. To still allow using simple linear models such as the Pearson correlation, we introduce a benchmark score that we call Isascore. It reflects the number of HOL-Analysis runs one could complete in a day, i.e.:

    Isascore = 1 d / wall-clock time    (1)
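To make the definition concrete, the following small Python sketch (our own illustration; the function names are not part of the benchmark tooling) converts between wall-clock build time and Isascore according to Eq. (1).

```python
SECONDS_PER_DAY = 86_400  # 1 d expressed in seconds

def isascore(wall_clock_seconds: float) -> float:
    """Number of HOL-Analysis builds that fit into one day, Eq. (1)."""
    return SECONDS_PER_DAY / wall_clock_seconds

def wall_clock_from_isascore(score: float) -> float:
    """Invert Eq. (1) to recover the wall-clock build time in seconds."""
    return SECONDS_PER_DAY / score

# Example: a build time of 108 s (the fastest reported result) gives an Isascore of 800.
print(isascore(108))  # -> 800.0
```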
3.2. Threats to Validity

The experiments discussed in this paper could not be performed in a controlled environment, since they were run by members of the Isabelle community rather than exclusively by the authors of this paper. This means that various outside factors may have influenced the reported results, though it seems reasonable to assume that those factors should usually be constant between different configurations of the same benchmark run. For hardware where we received several independent measurements, the effect of machine-local anomalies can be mitigated with statistical techniques. Furthermore, for practical reasons in orchestrating the data collection, extended system specifics beyond the CPU model, OS, and memory configuration were not recorded. There is a possibility that relevant parameters may have been missed. Therefore, like all performance benchmarks, these results represent upper bounds of what might be achieved with a given system configuration. Lastly, while the benchmark was posted on the Isabelle-users mailing list, in principle the data entry was open to the public and could have been misused.

4. Results

At the time of writing this paper, 669 results for a total of 594 unique configurations have been reported, utilizing 54 distinct CPUs. Those include Intel Desktop/Server CPUs from Sandy Bridge to Alder Lake, AMD Ryzen Zen2 to Zen4 processors as well as Epyc and Threadripper server systems, a Fujitsu A64FX, and Apple M1 processors.

Table 1
The five processors with the lowest time to solution, using each processor's optimal configuration (median value taken for multiple runs).

Processor Model                      | Total Cores | Base Clock | Wall-Clock Time
12th Gen Intel(R) Core(TM) i7-12700K | 12          | 3.6 GHz    | 108 s
Apple M1 Pro (native arm64)          | 10          | 2.1 GHz    | 137 s
AMD Ryzen 9 5900X                    | 12          | 3.7 GHz    | 143 s
Intel(R) Core(TM) i9-9900K           | 8           | 3.6 GHz    | 163 s
Apple M1 Pro                         | 10          | 2.1 GHz    | 164 s

Table 1 shows the five CPUs with the lowest time to solution, using the median value as an aggregate for multiple runs of the same configuration. Older Intel and AMD consumer hardware is surpassed by the Apple M1 Pro chip; only the most recent Intel Core line performs better. Due to the nature of the benchmark, server and high-performance hardware does not rank highly, with the best performing system (2x AMD Epyc 7742) clocking in at 184 s.

In the following, we analyze how Isabelle configuration influences performance, investigate the impact of hardware parameters, and then compare our results to other computational benchmarks. Where individual CPUs were concerned, we filtered out special system configurations (e.g., overclocked hardware, dual-CPU systems, power-saving mode). We also encountered a small number of extreme outliers where Isabelle run-time was much longer than expected. For two of those, we could identify the user and investigate; in both cases, the system configuration was at fault (excessive swapping, UEFI set to "silent mode") and when corrected, results were much closer to the rest of the data. We could not investigate the third extreme outlier but excluded it from the following, since it is likely to stem from a similar cause.
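As an illustration of the aggregation used for Table 1 and throughout this section, the following Python sketch (pandas assumed as an analysis dependency; CPU names, column names, and values are made up for illustration) takes the median over repeated runs of the same configuration and then picks each CPU's optimal configuration.

```python
import pandas as pd

# Hypothetical excerpt of reported results.
runs = pd.DataFrame({
    "cpu":          ["CPU A", "CPU A", "CPU A", "CPU B"],
    "threads":      [8, 8, 16, 16],
    "heap_gb":      [16, 16, 16, 16],
    "wall_clock_s": [110, 108, 121, 143],
})

# Median over repeated runs of the same configuration ...
per_config = (runs.groupby(["cpu", "threads", "heap_gb"], as_index=False)
                  ["wall_clock_s"].median())

# ... then the optimal (fastest) configuration per CPU, as in Table 1.
best = per_config.loc[per_config.groupby("cpu")["wall_clock_s"].idxmin()]
print(best.sort_values("wall_clock_s"))
```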
4.1. Multi-Threaded Performance

The number of threads used plays a major role in the overall performance. Figure 1 illustrates how the wall-clock time and CPU time compare from a single thread up to 128 threads. The optimal wall-clock time is achieved with 8 to 16 threads, depending on the hardware. This is typical behavior for a strong-scaling benchmark like ours, where the relative impact of communication increases with the number of threads used. For more than the optimal number of threads, the run-time increases substantially. The underlying limitations of the parallel computing model – the single-threaded garbage collection of the Poly/ML run-time and worker starvation after parallelization is saturated – were already discussed in [2], albeit with tests run on a machine with 32 cores.

It might be a surprise that the scalability is so low when distributing across more threads. In contrast, the CPU time divided by the number of threads (which is not an ideal metric, but the only feasible solution due to the nature of the benchmark) flattens out at eight threads. In small-scale experiments, we found that the JVM process takes up a constant amount of CPU time independent of the number of threads (about 26 % in single-core mode). This means that there is not too much computation overhead, but the hardware cannot be properly utilized by the ML process, most likely due to the single-threaded garbage collection that stops all threads when running. This is an inherently sequential task, which means that Amdahl's Law (the speedup is limited by 1/(1 − parallelizable portion) [10]) limits the achievable speedup for this problem.

The parallel efficiency paints a similar picture in Figure 2, decreasing almost linearly (on the logarithmic x-axis) up to 32 threads, at which point the median is 0.065. With the number of threads tending to the limit of 128, it approaches 0.002. There is an outlier where the parallel efficiency is over one – super-linear speedup is unusual but can appear in practice because of caching effects or resource contention in the measured system.

Figure 1: Run-time by number of threads (log scale): Wall-clock time on the left, CPU time per thread on the right (both in minutes, for 1 to 128 threads).

Figure 2: Parallel efficiency by number of threads. The HOL-Analysis build does not scale past a low number of threads, with parallel efficiency already down by half when using eight threads.
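To make these notions concrete, the following Python sketch (our own illustration, with assumed example numbers) computes the parallel efficiency plotted in Figure 2 and the Amdahl-law bound referenced above.

```python
def speedup(t_sequential_s: float, t_parallel_s: float) -> float:
    return t_sequential_s / t_parallel_s

def parallel_efficiency(t_sequential_s: float, t_parallel_s: float, threads: int) -> float:
    """Speedup divided by the number of parallel units (as in Figure 2)."""
    return speedup(t_sequential_s, t_parallel_s) / threads

def amdahl_speedup_limit(parallelizable_portion: float) -> float:
    """Upper bound on achievable speedup, 1 / (1 - parallelizable portion) [10]."""
    return 1.0 / (1.0 - parallelizable_portion)

# Assumed example: a 55-minute single-threaded build that takes 11 minutes on 8 threads
# has a speedup of 5 and a parallel efficiency of 0.625.
print(parallel_efficiency(55 * 60, 11 * 60, 8))   # -> 0.625
# If, say, 85 % of the work were parallelizable, no thread count could push the speedup past ~6.7.
print(round(amdahl_speedup_limit(0.85), 1))       # -> 6.7
```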
4.2. Performance Impact of Heap Memory

As preliminary results indicated that heap memory (as long as sufficient) only plays a minor role in performance, we keep the JVM and Poly/ML processes at the same heap size. We know from experience that a few gigabytes of memory suffice for HOL-Analysis; however, increased parallelism requires more memory in principle due to memory overhead. Hence, the range of examined heap sizes depends on the number of threads used. Figure 3 shows the change in run-time for different heap settings relative to the minimal setting. The boxes capture the 25th and 75th percentiles as height and the sample size as width; whiskers correspond to the extreme values. The results show that performance is not affected very much by heap size. Following the line of medians, wall-clock time slightly increases above 16 GB (where the 64-bit Poly/ML backend needs to be used, as the more efficient 64_32 mode does not allow more than 16 GB), as well as for very large values. We observed a single outlier for 64 threads and 128 GB heap memory at a relative factor of 0.66.

Figure 3: Relative run-time changes compared to the minimal memory configuration, by heap memory (panels for 8, 16, 32, and 64 threads; heap memory from 8 GB to 512 GB on a log scale). The available heap memory has only a marginal impact on the run-time, with large amounts of available heap memory degrading performance independently of the number of threads used.

4.3. Influence of Hardware Characteristics

Based on folk knowledge about Isabelle performance, we suspected that cache size would be a major factor; it was debated whether boost clock speed would be relevant. To test these hypotheses, we analyzed the impact of L3 cache size, base clock speed, and maximal (boost) clock speed (ignoring power-save cores where applicable) on Isabelle performance. Table 2 shows the correlation between Isascore and those parameters (APA notation as explained in the caption). At our significance level of 0.05, we did not find cache size to impact performance significantly. Base frequency is weakly correlated with the Isascore for a single thread (though at the edge of significance) and a bit more strongly (and much more significantly) in the multi-threaded scenarios. Finally, boost frequency has a significant medium correlation for all modes, which is strongest in the single-threaded configuration with a value of 0.55.

Table 2
Pearson correlation of hardware parameters with Isascore in APA notation: degrees of freedom (number of observations minus two), 𝑟 (strength) and 𝑝 (significance). Cache size is left out as there is no significant correlation.

Mode           | Base Frequency         | Maximum Boost Frequency
1 thread       | 𝑟(51) = .27, 𝑝 = .050 | 𝑟(51) = .55, 𝑝 < .001
8 threads      | 𝑟(47) = .39, 𝑝 = .006 | 𝑟(47) = .43, 𝑝 = .002
optimal config | 𝑟(52) = .37, 𝑝 = .006 | 𝑟(52) = .50, 𝑝 < .001

A possible explanation is that boost frequency can only be sustained on a single core in most CPUs, hence single-threaded performance profits from it a lot; in the multi-threaded scenario, the actual core frequency is much closer to the base frequency, and thus the base frequency's impact is larger.
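The correlations in Table 2 are plain Pearson correlations and can be reproduced with a few lines of Python; the sketch below (SciPy assumed as an analysis dependency, with made-up example data) prints 𝑟 and 𝑝 in the APA form used above.

```python
from scipy.stats import pearsonr

# Made-up example observations: boost clock in GHz and single-threaded Isascore.
boost_ghz   = [4.9, 5.0, 4.4, 4.2, 3.8, 4.7, 3.6, 4.3]
isascore_1t = [150, 148, 120, 110,  90, 135,  82, 118]

r, p = pearsonr(boost_ghz, isascore_1t)
dof = len(boost_ghz) - 2  # degrees of freedom: number of observations minus two
print(f"r({dof}) = {r:.2f}, p = {p:.3f}")
```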
4.4. Comparison to Computational Benchmarks

Performance benchmarks exist for many applications; additionally, synthetic benchmarks are often used to evaluate hardware performance. They can roughly be categorized into scientific computing versus consumer benchmarks. In the following, we compare the results of our Isabelle community benchmark with a number of publicly available datasets for such benchmarks. For the comparison, we selected results with matching processors, and matched the benchmarks' multi-thread setting (e.g., specific thread count, or all cores). To obtain sufficiently large datasets, we selected some of the most popular benchmarks.

4.4.1. Benchmarks in High-Performance Computing

The first analysis we wish to conduct is a comparison of Isabelle performance with some scientific programs. For this analysis, we chose to import data from the High Performance Computing suite on OpenBenchmarking. We selected the three benchmarks that had the most public results available (in their primary configuration) at the time of writing: Himeno, NAMD, and Dolfyn. Himeno⁴ is an incompressible fluid analysis code written in Fortran and C [11]. While a distributed-memory parallel version exists (using MPI with Fortran VPP), we concern ourselves with the sequential implementation. NAMD⁵ is a shared-memory parallel molecular dynamics code based on C++ and Charm++ [12]. The data we use stems from machine-wide parallel trials. Finally, Dolfyn⁶ is a sequential computational fluid dynamics code based on Fortran [13].

⁴ Results from https://openbenchmarking.org/test/pts/himeno
⁵ Results from https://openbenchmarking.org/test/pts/namd
⁶ Results from https://openbenchmarking.org/test/pts/dolfyn

Figure 4 shows the results when correlating each of the high-performance computing benchmarks with Isabelle performance. Himeno reports performance in terms of work done over time (where higher is better), while NAMD and Dolfyn measure time (per simulated ns, and to solution, respectively; lower is better). For Himeno, we therefore compare against Isascore, while with NAMD and Dolfyn we compare against our observed wall-clock time.

Figure 4: Comparison of single-threaded (Himeno and Dolfyn) and multi-threaded (NAMD) Isabelle results with different high-performance computing benchmarks, with linear regression line (𝑅² = 0.62 for Himeno, 0.79 for Dolfyn, and 0.73 for NAMD).

NAMD, as the only benchmark of these three that scales well with parallel resources, has no significant correlation with single-threaded Isabelle time. However, it has a strong linear relation with multi-threaded time. The two less scalable benchmarks correlate much more closely with Isabelle single-thread performance, where Dolfyn has a particularly strong correlation that holds well for the most performant processors. In both cases, correlation with multi-threaded Isabelle results is much worse (𝑅² values: Himeno 0.46, Dolfyn 0.52). For both the Isabelle benchmark and Dolfyn, the top processor that was tested is the same (Intel i7-12700K), and on both benchmarks it has a clear margin over the runners-up. This is also visible in the Himeno benchmark, where the 12700K produces the highest floating-point throughput of all tested processors. However, it is not a highly parallel processor, which is why its NAMD results are less favorable. This again shows that Isabelle performance is significantly impacted by the single-thread performance of the underlying processor.

4.4.2. Consumer CPU Benchmarks

For our second comparison, we chose some of the most common consumer benchmarks to compare to: PassMark CPU Mark⁷, Geekbench 5⁸, Cinebench R15⁹, and 3DMark CPU Profile¹⁰.

⁷ Results from https://www.cpubenchmark.net/CPU_mega_page.html
⁸ Results from https://browser.geekbench.com/processor-benchmarks.json
⁹ Results from https://us.rebusfarm.net/en/tempbench
¹⁰ Results from https://www.3dmark.com/search, median over the top-100 values

For sequential performance, Figure 5 shows the scatter plots of Isascore against consumer benchmark scores, which are normalized to a [0, 1] range so that the plots can be compared. A strong positive relationship can be observed for all benchmarks, with 𝑅² values in the range 0.63 to 0.82. A few moderate outliers are present (possibly due to system configuration).
All in all, the Isabelle benchmark seems quite similar to those consumer benchmarks for a single thread.

Figure 5: Comparison of single-threaded results with different consumer benchmarks, with linear regression line. Benchmark scores are normalized to the [0, 1] range (𝑅² = 0.63 for Cinebench R15, 0.81 for Geekbench 5, 0.82 for PassMark CPU Mark, and 0.73 for 3DMark CPU Profile).

This gives rise to the prediction of Isabelle performance, which would allow one to judge hardware on which Isabelle has not been executed before. However, single-threaded results are not meaningful for real-world performance, and scaling them according to the average parallel efficiency did not yield helpful results (𝑅² values: Cinebench 0.59, Geekbench 0.70, PassMark 0.68, 3DMark 0.58). Not many datasets for consumer benchmarks report results for different numbers of threads; most report only a single "multi-core" value where all threads are utilized. An exception to that is the 3DMark CPU Profile benchmark, where results are reported for 1 to 16 threads individually (in powers of two). This allows us to create a better correlation, because all consumer benchmarks tested had a far better parallel efficiency in the limit and were hence not suited for direct prediction. When using 8 and 16 threads in both the 3DMark and Isabelle benchmark, score and Isascore are strongly to moderately correlated, with individual 𝑅² values of 0.77 and 0.61, respectively. This makes the 3DMark well suited for performance prediction.

Since the optimal number of threads is in between, we use the average of its 8-thread and 16-thread results to create a linear model for performance prediction (tuning for a non-uniform split did not yield better results). Using ten times ten-fold cross-validation (i.e., averaging results over multiple iterations, splitting the data into ten parts, and using each part as a test set and the remainder as training set), the linear regression has an average 𝑅² value of 0.87. Figure 6 shows the final model (𝑅² = 0.84) and the resulting predictor for wall-clock time, which has a mean absolute error (MAE) of 46.6 s. However, that error is somewhat exaggerated by the data collection method: The public 3DMark data shows only the top-100 results, from which we use the median values. The regression improves slightly when the (non-public) true medians are used, and the MAE decreases to 37.1 s.

Figure 6: Scatter plots for the final model of Isabelle performance (𝑅² = 0.84, MAE = 46.6 s). Linear model on the Isascore (46.8 + 0.063 ⋅ 𝑥) on the left, corresponding wall-clock run-times on the right, both over the averaged 8/16-thread 3DMark CPU Profile score.

The residual plot displayed in Figure 7 has no noticeable patterns and the residual distribution roughly follows a normal distribution. All in all, the model simplicity and good fit indicate that this linear model is quite well suited for performance prediction, as long as the other system parameters are kept within reasonable bounds and the configuration is well tuned.
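Stated as code, the resulting predictor is compact. The sketch below is our own illustration using the coefficients shown in Figure 6 (Isascore ≈ 46.8 + 0.063 ⋅ 𝑥 over the averaged 8/16-thread score), so the numbers are approximate read-offs rather than the exact fitted parameters.

```python
SECONDS_PER_DAY = 86_400

def predict_wall_clock_s(score_8_threads: float, score_16_threads: float) -> float:
    """Predict HOL-Analysis wall-clock time (in seconds) from 3DMark CPU Profile scores."""
    averaged_score = (score_8_threads + score_16_threads) / 2
    predicted_isascore = 46.8 + 0.063 * averaged_score  # linear model, coefficients from Figure 6
    return SECONDS_PER_DAY / predicted_isascore         # invert Eq. (1)

# Assumed example: a CPU scoring 7000 in both the 8- and 16-thread 3DMark CPU Profile runs
# would be predicted to build HOL-Analysis in roughly 177 s.
print(round(predict_wall_clock_s(7000, 7000)))  # -> 177
```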
5. Conclusion

This work resolves our questions on Isabelle performance for the 2021-1 version. The Isabelle community benchmark that we initiated saw lively participation, and hundreds of results were reported for a total of 54 distinct CPUs. The results form a solid database for tuning Isabelle configurations; when not constrained, the optimal configuration is at 8 to 16 threads with 16 GB heap memory for both the Java and ML process (at least for HOL-Analysis, larger sessions might require more).

Figure 7: Residual plot for the final linear model, with histogram. The residuals show no apparent patterns and roughly follow a normal distribution.

When buying new hardware, the benchmark results give a good indication of which processor is desirable for Isabelle. Individual CPU parameters are not as important: clock speeds are only correlated with medium strength (boost clock more so than base clock), and cache size not significantly at all. Instead, for hardware that has not yet been tested with Isabelle, other benchmarks can greatly help in judging performance: While the single-threaded Isabelle benchmark score is strongly correlated with many benchmarks (most strongly with the Dolfyn high-performance benchmark and the PassMark CPU Mark), the multi-threaded scenario was more difficult to model; in the end, we found a good predictor by using 3DMark CPU Profile scores from 8 and 16 threads, with a final mean absolute error of 46.56 s. The model has a good fit, and one can assume that it is fairly future-proof given that hardware from the last ten years is properly predicted.

6. Future Work

If the Isabelle computation model does not change, there is not much left to be done on the topic of performance prediction: The benchmark from which Isabelle performance is predicted is widely popular and data for new hardware is usually quickly added, and the fit is about as good as one can hope for. Still, the model should be validated after a few years, and we are curious to see whether future hardware characteristics will be able to break the trend.

One other aspect that we did not touch on is running multiple independent sessions in parallel, which is often possible in automated large-scale Isabelle builds, e.g., for the Archive of Formal Proofs. This can be done with multiple processes that run independently of each other and greatly increases the usefulness of larger server CPUs with many cores; then, other parameters such as memory bandwidth might be more important and have to be analyzed. However, given the large cost of such machines, it would be much more economical to instead distribute the build over multiple cheap but fast desktop CPUs (especially when latency is a concern, not throughput).

Acknowledgments

Thanks to all the countless individuals who participated in the benchmark. Parts of the performance results have been obtained on systems in the test environment BEAST (Bavarian Energy Architecture & Software Testbed) at the Leibniz Supercomputing Centre. NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. Many thanks to Phoronix Media (OpenBenchmarking), PassMark Software (PassMark), RebusFarm GmbH (Cinebench), Primate Labs Inc. (Geekbench) and especially UL Benchmarks (3DMark), for providing their benchmark data.

References

[1] M. Wenzel, Parallel Proof Checking in Isabelle/Isar, Proceedings of the ACM SIGSAM 2009 International Workshop on Programming Languages for Mechanized Mathematics Systems (PLMMS 2009) (2009) 13–21.
[2] D. C. Matthews, M. Wenzel, Efficient parallel programming in Poly/ML and Isabelle/ML, in: DAMP'10 – Proceedings of the 2010 ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming, ACM Press, New York, New York, USA, 2010, pp. 53–62. doi:10.1145/1708046.1708058.
[3] S. Schulz, M. Möhrmann, Performance of clause selection heuristics for saturation-based theorem proving, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 9706, Springer Verlag, 2016, pp. 330–345. doi:10.1007/978-3-319-40229-1_23.
[4] W. Ertel, Performance Analysis of Competitive OR-Parallel Theorem Proving, Technical Report, Technische Universität München, 1991.
[5] A. Jindal, R. Overbeek, W. C. Kabat, Exploitation of parallel processing for implementing high-performance deduction systems, Journal of Automated Reasoning 8 (1992) 23–38. doi:10.1007/BF00263447.
[6] C. H. Wu, S. J. Lee, Parallelization of a hyper-linking-based theorem prover, Journal of Automated Reasoning 26 (2001) 67–106. doi:10.1023/A:1006469202251.
[7] S. Böhme, T. Nipkow, Sledgehammer: Judgement day, in: J. Giesl, R. Hähnle (Eds.), Automated Reasoning (IJCAR 2010), volume 6173 of LNCS, Springer, 2010, pp. 107–121.
[8] J. C. Blanchette, S. Böhme, L. C. Paulson, Extending Sledgehammer with SMT solvers, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 6803 LNAI, Springer, Berlin, Heidelberg, 2011, pp. 116–130. doi:10.1007/978-3-642-22438-6_11.
[9] G. Sutcliffe, The TPTP problem library and associated infrastructure: The FOF and CNF parts, v3.5.0, Journal of Automated Reasoning 43 (2009) 337–362. doi:10.1007/s10817-009-9143-8.
[10] M. D. Hill, M. R. Marty, Amdahl's law in the multicore era, Computer 41 (2008) 33–38. doi:10.1109/MC.2008.209.
[11] RIKEN, Introduction to the Himeno benchmark, 2001. URL: https://i.riken.jp/en/supercom/documents/himenobmt/.
[12] J. C. Phillips, D. J. Hardy, J. D. C. Maia, J. E. Stone, J. V. Ribeiro, R. C. Bernardi, R. Buch, G. Fiorin, J. Hénin, W. Jiang, R. McGreevy, M. C. R. Melo, B. K. Radak, R. D. Skeel, A. Singharoy, Y. Wang, B. Roux, A. Aksimentiev, Z. Luthey-Schulten, L. V. Kalé, K. Schulten, C. Chipot, E. Tajkhorshid, Scalable molecular dynamics on CPU and GPU architectures with NAMD, The Journal of Chemical Physics 153 (2020) 044130. doi:10.1063/5.0014475.
[13] Dolfyn, Open-Source CFD Solver, 2005. URL: https://www.dolfyn.net/.