Characteristic Subsets of SMT-LIB Benchmarks

Jan Jakubův (1,2), Mikoláš Janota (2), Andrew Reynolds (3)
1 University of Innsbruck, Innsbruck, Austria
2 Czech Technical University in Prague, Prague, Czechia
3 The University of Iowa, Iowa City, USA

Abstract
A typical challenge faced when developing a parametrized solver is to evaluate a set of strategies over a set of benchmark problems. When the set of strategies is large, the evaluation is often done with a restricted time limit and/or on a smaller subset of problems in order to estimate the quality of the strategies in a reasonable time. Firstly, considering the standard SMT-LIB benchmarks, we ask how much the evaluation time limit and the benchmark size can be restricted while still obtaining reasonable performance results. Furthermore, we propose a method to construct a benchmark characteristic subset which faithfully characterizes all benchmark problems. To achieve this, we collect problem performance statistics and employ clustering methods. We evaluate the quality of our benchmark characteristic subsets on the task of best cover construction, and we compare the results with randomly selected benchmark subsets. We show that our method achieves a smaller relative error than random problem selection.

Keywords
Benchmarking, Satisfiability Modulo Theories, Machine Learning, Clustering, Benchmark Characteristic Subsets

SMT'21: 19th International Workshop on Satisfiability Modulo Theories, July 18-19, 2021.

1. Introduction

Optimizing the performance of a Satisfiability Modulo Theories (SMT) solver, like CVC4 [1] or Z3 [2], on a large set of problems, like the SMT-LIB library [3, 4], is a time-consuming task. This is because of the large number of problems in SMT-LIB (over 100000). Even with the help of parallelization, it takes several hours to evaluate a single strategy over all benchmark problems with a standard time limit (like 60 seconds). The situation gets even worse when more than one strategy, or a parametrized strategy with different arguments, needs to be evaluated. Then it is common practice to restrict the evaluation to a subset of problems and/or to decrease the evaluation time limit. In this paper, we address the question of how much the evaluation can be restricted so that the results remain plausible. Furthermore, we propose a method to construct a small benchmark characteristic subset that faithfully represents all benchmark problems for the sake of evaluation.

Within a large problem library, one can expect a large number of syntactically similar problems or problems with similar performance with respect to many strategies. Identification of similar problems could help us to speed up the evaluation, as we can select a single representative from each similarity class. To identify classes of similar problems, we propose to perform short evaluation runs of several CVC4 strategies and to collect runtime statistics. Then, problems with similar statistics might be considered similar.
To evaluate this approach, we experiment with best cover construction, that is, the selection of strategies with the best performance on a given set of problems. We construct the best cover on a smaller benchmark subset and compare its performance with the optimal cover constructed on all problems. This gives us a quality measure of the problem subset selection.

The paper is structured as follows. Section 2 describes the proposed method for identifying similar problems based on performance statistics and for constructing a benchmark characteristic subset. Section 3 describes the best cover construction task, which is used for the empirical evaluation of the quality of the constructed subsets in Section 4. We conclude in Section 5.

2. Benchmark Characteristic Subsets by Clustering

Restricting the evaluation of a solver to a smaller random subset of all benchmark problems is a common and effective method used by developers. Considering a large set of benchmark problems, like the SMT-LIB library [3, 4], one can expect a significantly large number of very similar or even duplicate problems. Random selection does not take problem similarities into account and thus can lead to unnecessary and duplicate computations. In this section, we propose a simple method for selecting a characteristic subset of benchmark problems based on performance features and clustering algorithms. The desired property of this characteristic subset is that it faithfully characterizes all benchmark problems. That is, any development, like parameter tuning or scheduler construction, performed on this subset yields results similar to the same development performed on all benchmark problems, but faster.

Clustering algorithms [5] divide a set of entities into disjoint subsets, called clusters, such that similar entities end up in the same cluster. We propose to apply clustering algorithms to create a benchmark characteristic subset. Bearing in mind that the benchmark characteristic subset should allow us to avoid duplicate computations, we cluster the benchmark problems and create the characteristic subset by selecting one problem from each cluster.

To employ clustering algorithms, the entities, in our case benchmark problems, need to be represented by numeric feature vectors. We could create the feature vectors from the syntactic structure of the problems, similarly to ENIGMA [6, 7, 8] clause and problem features for first-order logic, but in this work we propose an alternative: computing the feature vectors from runtime statistics of short probing solver runs. In particular, we run the CVC4 solver on all SMT-LIB benchmark problems with a small abstract resource limit (we run CVC4 with the argument --rlimit=10000 to limit the resources, roughly the number of SAT conflicts). Regardless of whether the problem is solved or not, we collect the runtime statistics and extract selected statistic values (see Appendix A) to construct the performance feature vector of the problem. We use only statistic values that yield the same value on repeated runs; in particular, we do not use any real-time measurements. Moreover, we repeat this process with different CVC4 strategies to obtain longer and more descriptive feature vectors. Our performance features are similar to dynamic features found in the research literature [9, 10].
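To make the feature extraction step concrete, the following is a minimal sketch of how such probing runs could be scripted. The helper names are ours, the statistic keys follow Appendix A, and the strategy arguments follow Appendix B; the sketch assumes that CVC4 prints statistics as "key, value" lines, which may differ between CVC4 versions.

```python
import subprocess
import numpy as np

# Statistic keys used as performance features (Appendix A).
STAT_KEYS = [
    "sat::conflicts", "sat::decisions", "Instantiate::Instantiations_Total",
    "SharedTermsDatabase::termsCount", "resource::PreprocessStep",
    "resource::RewriteStep", "resource::resourceUnitsUsed",
]

def probe(problem, strategy_args):
    """One short probing run; returns one feature value per statistic key.

    Assumes the statistics are printed as 'key, value' lines; the exact
    output format may differ between CVC4 versions."""
    cmd = ["cvc4", "--stats", "--rlimit=10000"] + strategy_args + [problem]
    out = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    stats = {}
    for line in (out.stdout + out.stderr).splitlines():
        if "," in line:
            key, _, value = line.partition(",")
            stats[key.strip()] = value.strip()
    feats = []
    for key in STAT_KEYS:
        try:
            feats.append(float(stats.get(key, 0)))
        except ValueError:
            feats.append(0.0)   # ignore non-numeric statistic values
    return feats

def feature_vector(problem, strategies):
    """Concatenate the statistics of several probing strategies (cvc1..cvc10)."""
    return np.concatenate([probe(problem, s) for s in strategies])

# Example with two of the probing strategies from Appendix B:
# vec = feature_vector("problem.smt2",
#                      [["--simplification=none", "--full-saturate-quant"],
#                       ["--no-e-matching", "--full-saturate-quant"]])
```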
Once the performance feature vectors of all benchmark problems are constructed, we employ the 𝑘-means [11] clustering algorithm, one of the most common clustering methods. The 𝑘-means algorithm divides the benchmark problems into 𝑘 disjoint clusters, hence we set 𝑘 to the desired size of the benchmark characteristic subset. Briefly, the 𝑘-means algorithm works as follows. Firstly, 𝑘 points in the vector space, called cluster centroids, are selected. Different variants of the 𝑘-means algorithm use different centroid initialization methods, with random initialization being one of the common variants. In the next step, all distances between centroids and feature vectors are computed, and each feature vector is assigned to the closest cluster centroid. After that, the average of the vectors assigned to the same cluster is computed for each cluster, and the cluster centroid is moved to the computed average. This process is iterated until the centroids stop moving.

The above gives us the following method for the construction of the benchmark characteristic subset. We construct the performance feature vectors for all benchmark problems by extracting runtime statistics from short probing solver runs. As different features might have values in different units, we normalize the vectors by dividing each feature by its standard deviation across all problems. Then we employ the 𝑘-means algorithm to construct 𝑘 problem clusters and their centroids. From each cluster, we select the problem whose performance feature vector is closest to the cluster centroid. The selected problems form the benchmark characteristic subset of size 𝑘.
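The construction described above can be sketched in a few lines of Python. The function below is our illustration, not the authors' exact script; it uses scipy.cluster.vq.kmeans2 with minit set to "points", the implementation the paper reports using (see Section 4.4), and normalizes each feature by its standard deviation as described.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def characteristic_subset(problems, features, k):
    """Select k problems whose feature vectors are closest to the k-means centroids.

    problems : list of N problem names
    features : N x D array of performance feature vectors
    """
    X = np.asarray(features, dtype=float)
    # Normalize each feature by its standard deviation across all problems.
    std = X.std(axis=0)
    std[std == 0] = 1.0            # constant features are left unscaled
    X = X / std
    # 'points' initializes the centroids from randomly chosen data points.
    centroids, labels = kmeans2(X, k, minit="points")
    subset = []
    for c in range(k):
        members = np.where(labels == c)[0]
        if len(members) == 0:      # an empty cluster can occur; skip it
            continue
        dists = np.linalg.norm(X[members] - centroids[c], axis=1)
        subset.append(problems[members[np.argmin(dists)]])
    return subset
```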
3. Evaluation Tasks: Greedy and Exact Covers

To evaluate the quality of the benchmark characteristic subsets from Section 2, we consider the task of best cover construction. Let 𝑃 be a set of benchmark problems and let 𝑆 be a set of strategies. Suppose we evaluate all strategies 𝑆 over the problems 𝑃. The coverage of 𝑆 over 𝑃, denoted coverage(𝑆, 𝑃), is the number of problems solved by the strategies in 𝑆:

\[
  \mathrm{coverage}(S, P) = \Bigl|\, \bigcup_{s \in S} \{\, p \in P : \text{strategy } s \text{ solves problem } p \,\} \Bigr|
\]

The best cover of strategies 𝑆 over 𝑃 of size 𝑛 is a subset 𝑆0 ⊆ 𝑆 with |𝑆0| ≤ 𝑛 and the highest possible coverage(𝑆0, 𝑃).

Suppose we construct the best cover 𝑆0 ⊆ 𝑆 based only on a partial evaluation of the strategies 𝑆 over a subset 𝑃0 of the problems 𝑃. This is typically done when the set of strategies and the set of problems are large. We can then compare the performance of the best cover constructed on 𝑃0 with the optimal performance of the best cover constructed with the full evaluation over 𝑃. The performance of the two covers should be similar if 𝑃0 faithfully characterizes 𝑃. More formally, let us fix the strategies 𝑆 and the cover size 𝑛. Let 𝑆0 be the best cover over 𝑃0 and let 𝑆best be the best cover over 𝑃. We measure the difference between their coverages coverage(𝑆0, 𝑃) and coverage(𝑆best, 𝑃) over all problems 𝑃. In practice, we usually do not have all strategies evaluated over all problems, but for the sake of the evaluation in this paper we compute them.

We consider two methods to construct the best covers: first, an approximative but fast method called greedy cover (Section 3.1), and second, an exact but somewhat slower best cover construction using an ILP solver (Section 3.2).

Algorithm 1: Greedy cover algorithm.

    Function greedy(𝑃, 𝑆, 𝑛)
        Input:  a set of problems 𝑃, a set of strategies 𝑆, a size 𝑛
        Output: a subset of 𝑆 of size at most 𝑛 approximating the best cover
        𝐺 ← ∅
        while |𝐺| < 𝑛 and (∃𝑝 ∈ 𝑃)(∃𝑠 ∈ 𝑆) such that strategy 𝑠 solves problem 𝑝 do
            for each 𝑠 ∈ 𝑆 do
                𝑅𝑠 ← {𝑝 ∈ 𝑃 : strategy 𝑠 solves problem 𝑝}   // problems solved by 𝑠
            𝑔 ← argmax_{𝑠∈𝑆} |𝑅𝑠|                            // the best strategy on the current 𝑃
            𝑃 ← 𝑃 ∖ 𝑅𝑔                                        // remove the problems solved by 𝑔
            𝐺 ← 𝐺 ∪ {𝑔}
        return 𝐺

3.1. Greedy Cover

The greedy cover algorithm is an approximative method for constructing the best cover of strategies 𝑆 over problems 𝑃 of size 𝑛. The algorithm greedy(𝑃, 𝑆, 𝑛) (see Algorithm 1) proceeds as follows. It starts with an empty cover 𝐺 and finds the strategy 𝑔 ∈ 𝑆 which solves the most problems from 𝑃. The strategy 𝑔 is added to 𝐺 and the problems solved by 𝑔 are removed from 𝑃. This process is repeated until 𝐺 reaches the desired size 𝑛 or no strategy can solve any of the remaining problems.

We measure the relative error of constructing the greedy cover (of size 𝑛) from a partial evaluation of the strategies 𝑆 over a subset 𝑃0 of all problems 𝑃 using the formula below. The experimental evaluation of greedy cover construction on random subsets and on benchmark characteristic subsets based on performance features is given in Section 4.

\[
  \mathrm{error}(P_0, n) = 100 \cdot \Bigl|\, 1 - \frac{\mathrm{coverage}(\mathrm{greedy}(P_0, S, n),\, P)}{\mathrm{coverage}(\mathrm{greedy}(P, S, n),\, P)} \,\Bigr|
\]

3.2. Exact Cover

Apart from the approximative greedy cover construction, we also consider an exact best cover construction method based on a representation of the best cover problem as an Integer Linear Programming (ILP) problem. The ILP problems are solved by the Gurobi solver [12]. We consider a maximum set cover formulation, i.e., without taking the solving time into account. This means that given a set of strategies 𝑆, the set 𝑃𝑠 of problems solved by each strategy 𝑠 ∈ 𝑆, and the size of the cover 𝑛, the goal is to select a set of strategies 𝑆′ ⊆ 𝑆 with |𝑆′| = 𝑛 that maximizes the total number of instances solved, i.e., the cardinality of the union of 𝑃𝑠 over 𝑠 ∈ 𝑆′. We remark that the exact cover maximization is typically only slightly better than the greedy cover. However, interestingly, Gurobi has proven to be extremely efficient on these problems; typically we observe solving times around 1 s.
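For illustration, Algorithm 1 and the relative error measure translate directly into Python. The sketch below is ours; the representation of the evaluation results (a mapping from each strategy to the set of problems it solves) and the function names are our choice.

```python
def greedy_cover(solved, problems, n):
    """Algorithm 1: greedily pick up to n strategies maximizing coverage.

    solved   : dict mapping strategy -> set of problems it solves
    problems : set of problems P to cover
    """
    remaining = set(problems)
    cover = []
    while len(cover) < n:
        # the strategy solving the most remaining problems
        best = max(solved, key=lambda s: len(solved[s] & remaining))
        gained = solved[best] & remaining
        if not gained:              # no strategy solves any remaining problem
            break
        cover.append(best)
        remaining -= gained         # remove the problems solved by the chosen strategy
    return cover

def coverage(strategies, solved, problems):
    """Number of problems in `problems` solved by at least one chosen strategy."""
    return len(set(problems) & set().union(*(solved[s] for s in strategies)))

def relative_error(solved, P, P0, n):
    """Relative error (in %) of the greedy cover built on the subset P0."""
    approx = coverage(greedy_cover(solved, P0, n), solved, P)
    optimal = coverage(greedy_cover(solved, P, n), solved, P)
    return 100.0 * abs(1.0 - approx / optimal)
```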
Figure 1: Representation of different logics in SMT-LIB benchmark problems (left) and visualization of problem similarities based on performance features (right).

4. Experimental Evaluation

In this section, we describe the experimental setup (Section 4.1). We experiment with best cover construction on random benchmark subsets in Section 4.2. In Section 4.3, we discuss reasonable restrictions of the evaluation time limit. Finally, in Section 4.4, we evaluate the quality of benchmark characteristic subsets constructed using the performance features and the 𝑘-means clustering method introduced in Section 2.

4.1. SMT-LIB Experiments Setup

For our experiments, we consider the quantified fragment of the SMT-LIB library [3, 4]. This gives us altogether 75814 problems from 21 different logics. We consider 23 different CVC4 [1] strategies (see Appendix B) and Z3 [2] in the default mode as another strategy. We evaluate all strategies over all problems with a 60-second time limit. Any solver run terminating before the time limit with the result sat or unsat is considered successful, and the problem is considered solved by the respective strategy. Computing this database of results took around 6 days of real time (all experiments are done on a server with 28 hyperthreading Intel Xeon CPU cores @ 2.3 GHz and 256 GB of memory). Once we have this large database of results, we can easily compute partial results on problem subsets or with a smaller time limit.

Figure 1 graphically depicts the benchmark problem properties. The bar graph on the left shows the number of problems from the different logics. The plot on the right visualizes the performance feature vectors of the problems described in Section 2. The plot is produced by dimensionality reduction to 2 dimensions with the t-SNE [13] algorithm. Each problem is represented by one point colored with the color of its logic, corresponding to the left bar graph. Problems with similar feature vectors should appear close to each other in the plot. Here we just note that nearby points tend to have the same color, that is, the problems from one logic have similar performance feature vectors, which is to be expected.

Figure 2: Relative error of greedy (left) and exact (right) covers on random subsets.

4.2. Random Benchmark Subsets

In this section, we present an experiment on best cover construction on randomly selected subsets of the benchmark problems. From all 75814 benchmark problems, we construct random subsets of different sizes, ranging from 100 up to 75800 with a step of 100. Moreover, we construct 10 different collections of these subsets to measure the effect of random selection. On every benchmark subset, we construct the greedy and exact covers of the 14 different strategies, called validation strategies (see Appendix B). We construct covers of different sizes (𝑛 ∈ {1, 2, . . . , 10}). We evaluate the constructed covers on all benchmark problems and measure the error as described in Section 3.1 and Section 3.2.

Figure 2 presents the error for the greedy cover (left) and the exact cover (right) construction. For every random subset size, we plot the worst case error (red solid line) and the average error (dotted line). Given 10 random subset collections and 10 different cover sizes, each value is computed from 100 relative error values.

We can see that with benchmark subsets bigger than 1000 problems (roughly 1.3% of the benchmark), we obtain less than 2% error even in the worst case. With subsets bigger than 10000, the worst case error drops below 0.5%. From the relatively big difference between the worst case and the average, we can conclude that the accuracy of the benchmark approximation by a random subset depends considerably on the luck of the random selection. In Section 4.4 we show that performance features can help us reduce this dependence by constructing better benchmark characteristic subsets.
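A rough sketch of this evaluation loop, reusing the greedy_cover and relative_error helpers sketched after Section 3, could look as follows. The subset sizes, the number of collections, and the cover sizes mirror the description above, while the function names and the result layout are ours.

```python
import random

def random_subset_errors(solved, P, sizes, collections=10, cover_sizes=range(1, 11)):
    """For each subset size, return (worst-case error, average error) over all
    random collections and cover sizes, evaluated on the full problem set P."""
    P = list(P)
    results = {}
    for size in sizes:
        errors = []
        for _ in range(collections):
            P0 = random.sample(P, size)                     # one random subset
            for n in cover_sizes:
                errors.append(relative_error(solved, P, P0, n))
        results[size] = (max(errors), sum(errors) / len(errors))
    return results

# Example (hypothetical data):
# stats = random_subset_errors(solved, problems, sizes=range(100, 75801, 100))
```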
4.3. Time Limit Evaluation

In this section, we present basic statistics on the solver runtime. In particular, we are interested in how many problems are solved with a decreased evaluation time limit. This information can help developers to set time limits for evaluation experiments.

Figure 3: Number of problems solved in a specific time by cvc1 (left) and cvc16 (right).

For each of the 24 considered strategies, we measure how many problems are solved in the 𝑛-th second, that is, the count of problems solved with runtime between 𝑛 − 1 and 𝑛 seconds. Figure 3 shows the results for two representative strategies, namely, the strategy that solves the most problems in the first second (left) and the strategy that solves the most problems with runtime greater than 1 second (right). The corresponding strategies (cvc1 and cvc16) can be found in Appendix B. From Figure 3, we can see that the majority of problems are solved within the first second of runtime, and this is the case with all evaluated strategies. Note the logarithmic 𝑦-axis in the figure.

We have conducted experiments with greedy cover construction, similar to the experiments in Section 4.2. The experiments show that the greedy cover constructed with the evaluation restricted to a time limit of 1 second exhibits less than 0.25% error when compared with greedy covers constructed with the full time limit of 60 seconds. With a time limit of 10 seconds, we reach an error of less than 0.15%, and with a limit of 30 seconds the error drops below 0.05%.

4.4. Performance Features Benchmark Characteristics

In this section, we evaluate the quality of benchmark characteristic subsets constructed with the performance features and 𝑘-means clustering from Section 2. We compare the relative error of best cover construction on benchmark characteristic subsets and on random benchmark subsets. Out of the 24 strategies considered in this experiment (see Section 4.1), we use 10 CVC4 strategies to construct the performance features of the benchmark problems. We use only the remaining 14 strategies to construct the best covers, to provide an independent validation set of strategies. The performance features are computed by short CVC4 runs with limited resources, as described in Section 2. Computing the performance features took less than 2 hours. We construct benchmark characteristic subsets of different sizes corresponding to the sizes of the random benchmark subsets from Section 4.2 (we use scipy's implementation of 𝑘-means, namely the function scipy.cluster.vq.kmeans2 with the parameter minit set to points). In this case, however, we do not construct 10 different collections but only one subset for each size. On the benchmark characteristic subsets, we compute both greedy and exact covers of different sizes (𝑛 ∈ {1, 2, . . . , 10}), and we measure the relative error as in Section 4.2.

Figure 4: Worst case relative errors when constructing the greedy (left) and the exact (right) cover on random subsets and on benchmark characteristic subsets (𝑘-means).

Figure 4 presents the worst case relative errors for the greedy cover (left) and the exact cover (right) constructions. The red lines correspond to the relative error on random benchmark subsets, that is, they correspond to the red lines from Figure 2, but this time the results are computed only on the 14 validation strategies. The blue lines correspond to the relative errors on the benchmark characteristic subsets. We can see that in both cases, the relative errors on the benchmark characteristic subsets are significantly lower, especially for smaller benchmark subsets. Even with the smallest subset of size 100, the worst case error is only about 2%. Moreover, from size 600 onward we obtain a worst case error below 1%. Furthermore, the error on the benchmark characteristic subsets approaches the average error on random subsets (see the dotted lines in Figure 2). This suggests that the construction of benchmark characteristic subsets is less dependent on chance than random selection.
5. Conclusions and Future Work

We propose and evaluate a method for constructing a benchmark characteristic subset that represents the whole benchmark. We provide empirical evidence that our method, based on clustering of problem performance feature vectors, gives better results than random benchmark sampling. To evaluate our method, we use the task of best cover construction; however, we believe that our method will prove useful for many other tasks. Furthermore, we provide an experimental evaluation of how restricting the benchmark size and the evaluation time limit affects the observed solver performance.

The characteristic subsets of the quantified problems from the SMT-LIB library computed using the 𝑘-means clustering method are available for download at https://github.com/ai4reason/public/blob/master/SMT2021. Hence, SMT users can directly benefit from the results published in this paper by speeding up their solver evaluation.

As future work, we would like to experiment with different methods of measuring the quality of benchmark characteristic subsets, other than best cover construction. Moreover, we would like to employ clustering algorithms other than 𝑘-means. Finally, we would like to propose different methods of constructing problem characteristic feature vectors, for example, extracting features directly from the problem syntax.

Acknowledgments

The results were supported by the Ministry of Education, Youth and Sports within the dedicated program ERC CZ under the project POSTMAN no. LL1902, and by the ERC Project SMART Starting Grant no. 714034. This scientific article is part of the RICAIP project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 857306.

References

[1] C. W. Barrett, C. L. Conway, M. Deters, L. Hadarean, D. Jovanovic, T. King, A. Reynolds, C. Tinelli, CVC4, in: CAV, volume 6806 of Lecture Notes in Computer Science, Springer, 2011, pp. 171–177.
[2] L. M. de Moura, N. Bjørner, Z3: an efficient SMT solver, in: TACAS, volume 4963 of Lecture Notes in Computer Science, Springer, 2008, pp. 337–340.
[3] C. Barrett, P. Fontaine, C. Tinelli, The Satisfiability Modulo Theories Library (SMT-LIB), www.SMT-LIB.org, 2016.
[4] C. Barrett, A. Stump, C. Tinelli, The SMT-LIB Standard: Version 2.0, in: A. Gupta, D. Kroening (Eds.), Proceedings of the 8th International Workshop on Satisfiability Modulo Theories (Edinburgh, UK), 2010.
[5] M. Omran, A. Engelbrecht, A. Salman, An overview of clustering methods, Intell. Data Anal. 11 (2007) 583–605. doi:10.3233/IDA-2007-11602.
[6] J. Jakubův, K. Chvalovský, M. Olšák, B. Piotrowski, M. Suda, J. Urban, ENIGMA anonymous: Symbol-independent inference guiding machine (system description), in: IJCAR (2), volume 12167 of Lecture Notes in Computer Science, Springer, 2020, pp. 448–463.
[7] J. Jakubův, J. Urban, Enhancing ENIGMA given clause guidance, in: CICM, volume 11006 of Lecture Notes in Computer Science, Springer, 2018, pp. 118–124.
[8] J. Jakubův, J. Urban, ENIGMA: efficient learning-based inference guiding machine, in: CICM, volume 10383 of Lecture Notes in Computer Science, Springer, 2017, pp. 292–302.
[9] J. P. Bridge, S. B. Holden, L. C. Paulson, Machine learning for first-order theorem proving - learning to select a good heuristic, J. Autom. Reason. 53 (2014) 141–172.
[10] M. Rawson, G. Reger, Dynamic strategy priority: Empower the strong and abandon the weak, in: PAAR@FLoC, volume 2162 of CEUR Workshop Proceedings, CEUR-WS.org, 2018, pp. 58–71.
[11] S. P. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (1982) 129–136.
[12] Gurobi Optimization, LLC, Gurobi optimizer reference manual, 2021. URL: http://www.gurobi.com.
[13] G. E. Hinton, S. T. Roweis, Stochastic neighbor embedding, in: NIPS, MIT Press, 2002, pp. 833–840.

A. CVC4 Statistics Used as Performance Features

We use the following statistic keys, obtained by running CVC4 with the option --stats, to construct the problem performance feature vectors.

sat::conflicts
sat::decisions
Instantiate::Instantiations_Total
SharedTermsDatabase::termsCount
resource::PreprocessStep
resource::RewriteStep
resource::resourceUnitsUsed

B. CVC4 Strategies Used in Experiments

The following 23 CVC4 strategies, described as CVC4 command line arguments, are used in the experiments. The 24th strategy is Z3 in the default mode. The first ten {cvc1, . . . , cvc10} are used to construct the performance feature vectors of the problems. The remaining 14 are used as validation strategies.

cvc1   --simplification=none --full-saturate-quant
cvc2   --no-e-matching --full-saturate-quant
cvc3   --relevant-triggers --full-saturate-quant
cvc4   --trigger-sel=max --full-saturate-quant
cvc5   --multi-trigger-when-single --full-saturate-quant
cvc6   --multi-trigger-when-single --multi-trigger-priority --full-saturate-quant
cvc7   --multi-trigger-cache --full-saturate-quant
cvc8   --no-multi-trigger-linear --full-saturate-quant
cvc9   --pre-skolem-quant --full-saturate-quant
cvc10  --inst-when=full --full-saturate-quant
cvc11  --no-e-matching --no-quant-cf --full-saturate-quant
cvc12  --full-saturate-quant --quant-ind
cvc13  --decision=internal --simplification=none --no-inst-no-entail --no-quant-cf --full-saturate-quant
cvc14  --decision=internal --full-saturate-quant
cvc15  --term-db-mode=relevant --full-saturate-quant
cvc16  --fs-interleave --full-saturate-quant
cvc17  --finite-model-find --mbqi=none
cvc18  --finite-model-find --decision=internal
cvc19  --finite-model-find --macros-quant --macros-quant-mode=all
cvc20  --finite-model-find --uf-ss=no-minimal
cvc21  --finite-model-find --fmf-inst-engine
cvc22  --finite-model-find --decision=internal
cvc23  --full-saturate-quant