Optimal Strategy Schedules for Everyone

Hans-Jörg Schurr
University of Lorraine, CNRS, Inria, and LORIA, Nancy, France

Abstract
Parametrization of a theorem prover is critical for solving a problem. A specific parametrization is called a strategy, and the best strategy usually differs from problem to problem. Strategy scheduling means trying multiple strategies within a time limit. veriT-schedgen is a toolbox to work with strategy schedules. At its core is a simple tool that uses integer programming to generate strong schedules automatically. Another tool translates the schedules into shell scripts. Hence, the toolbox can be used with any theorem prover. The generated schedules are optimal with regard to the number of solved benchmarks. While generating optimal schedules is NP-hard, the generation time is short in practice.

Keywords
automated theorem proving, strategy scheduling, Python

PAAR’22: 8th Workshop on Practical Aspects of Automated Reasoning, August 11–12, 2022, Haifa, Israel
email: hans-jorg.schurr@inria.fr (H. Schurr)
orcid: 0000-0002-0829-5056 (H. Schurr)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Most components of any theorem prover can be parametrized and fine-tuned. Selecting the right values for the parameters is usually not easy. Often there is no clear best choice, and even if there is one, choices that are not optimal overall might work better for certain types of problems. A specific choice of values for all exposed parameters is a strategy. It is often crucial to use the correct strategy to solve a problem within a given timeout.

One approach to this problem is for the theorem prover to expose options that allow users to configure the strategy. This is a fig leaf: defining the right strategy usually requires intimate knowledge of the inner workings of the solver. Furthermore, the developers of the prover have to set a sensible default. This is not easy either: usually prover developers do not know upfront the type of problems the prover will encounter. The default should normally also be somewhat generic. Overall, designing and using strategies is a subject that deserves some attention.

Since for many problems there is a strategy that can solve the problem in a short time, it is natural to try multiple strategies on the problem. The easiest way to do this is to try strategies from a list one by one. A slightly more sophisticated approach is to prepare a list of strategies, each paired with a timeout: some strategies might be known to have diminishing returns if run for a longer time. We will call such a list a schedule.

In this paper we present a toolbox to generate and work with schedules. The core of the toolbox is a method based on integer programming to find a strong schedule for a given set of strategies and benchmarks that is optimal with regard to the number of benchmarks solved. The theorem prover developer has to define a list of strategies and allowed time slices, and has to record how much time each strategy needs to solve benchmarks from a training set. The method encodes the problem of finding the schedule that solves the most benchmarks within an overall timeout as an integer programming problem (Sections 2 to 4). Overall, the generation procedure is easy to understand and avoids surprises.
This encoding is used by our tool “veriT-schedgen” to generate schedules. It is implemented in Python (Section 5), has a simple user interface, and comes with auxiliary tools that can simulate and visualize schedules. The toolbox allows users to generate simple, static schedules and to integrate them with their solver of choice. Section 6 presents an evaluation of the practical usefulness of veriT-schedgen. Of particular interest to us is how well the generated schedules perform on benchmarks that are not used to generate the schedules. That is: do the generated schedules generalize to unseen benchmarks?

Originally, the veriT-schedgen toolbox was developed for the SMT solver veriT [1]. While the toolbox is fully independent of the solver, this paper discusses it from the perspective of using it with veriT.

2. Integer Programming

Let us first direct our attention to integer programming. It is widely used to express optimization problems. Solving an integer programming problem means finding an assignment to variables such that a linear combination of the variables is maximized and all linear inequalities from a given set are satisfied. If some variables must take integer values, while others allow real values, the problem is a mixed integer programming problem. Usually integer programming is presented in terms of linear arithmetic, but since we are interested in SMT solving we will use a notation similar to the many-sorted logic with theories commonly used in SMT solving [2].

Formally, a mixed integer programming problem is a tuple (C, O) where C is a set of formulas and O is a special term called the objective function. The formulas in C are formulas in the theory of linear real and integer arithmetic without boolean connectives and quantifiers. They are called the linear constraints. The objective function is a term of sort Real or Int. A solution to a mixed integer programming problem is an interpretation Σ such that t evaluates to true for all t ∈ C, and there is no interpretation Σ′ ≠ Σ satisfying C under which the evaluation of O is greater than its evaluation under Σ. Hence, Σ maximizes O.

Traditionally the objective function O is written as a linear combination

    (c1 × x1) + (c2 × x2) + · · · + (cn × xn)

of some variables xi : σ and constants ci : Int where the sort σ ∈ {Real, Int}. The linear constraints are traditionally written as equalities and inequalities of the form

    (c1 × x1) + (c2 × x2) + · · · + (cn × xn) + cm+1 ⋈ (cn+1 × xn+1) + · · · + (cm × xm) + cm+2

where m ≥ n, the xi and ci are variables and constants just as in the objective function, and ⋈ ∈ {=, ≠, <, >, ≤, ≥}. If the constant terms (cm+1 and cm+2) are zero, they are usually omitted.

A special case of a variable of sort Int is a binary variable. An integer variable x is a binary variable if the problem contains the two constraints x ≥ 0 and x ≤ 1. Hence, a solution can only assign 0 or 1 to x. We will mark a binary variable with a dot: ẋ. Nevertheless, the sort of ẋ is still Int. For simplicity we will also not explicitly show the two constraints for each binary variable.

Since integer programming is a well-established field, there is a lot of material available. One example is “A Tutorial on Integer Programming” by Cornuéjols et al. [3]. Furthermore, the online documentation of the PuLP library that is used by veriT-schedgen contains good introductory examples (PuLP is available at https://coin-or.github.io/pulp/).
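To illustrate the notation, the following minimal PuLP sketch encodes a tiny integer programming problem with two binary variables and one linear constraint, in the spirit of the introductory examples mentioned above. It is not part of the toolbox; the problem and all names are made up for illustration.

    import pulp

    # A tiny integer program: maximize 3*x + 2*y
    # subject to x + y <= 1, where x and y are binary variables.
    problem = pulp.LpProblem("tiny_example", pulp.LpMaximize)
    x = pulp.LpVariable("x", cat="Binary")
    y = pulp.LpVariable("y", cat="Binary")

    problem += 3 * x + 2 * y   # the objective function O
    problem += x + y <= 1      # a linear constraint from C

    # CBC is PuLP's default backend solver.
    problem.solve(pulp.PULP_CBC_CMD(msg=False))
    print(pulp.LpStatus[problem.status], pulp.value(x), pulp.value(y))

Solving this with the bundled CBC backend reports the status Optimal and the assignment x = 1, y = 0, which is exactly the kind of maximizing interpretation Σ described above.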
3. Schedule Generation as Integer Programming

In this section we will see how we can express the problem of finding an optimal schedule as a (mixed) integer programming problem. To do so, it is necessary to be a bit more precise about the notions of strategies, benchmarks, experiments, and schedules.

We have two finite, non-empty sets: the set S of strategies and the set B of benchmarks. Furthermore, we have a family of functions (Es : B → ℝ ∪ {∞})s∈S that represent the experiments. The value Es(b) is the time needed by the solver to solve the benchmark b when using the strategy s. If Es(b) = ∞, the benchmark was not solved. To generate a schedule we also have to define the overall timeout T > 0 and a list t1, . . . , tn of allowed time slices. We assume that the list of time slices is in ascending order, all slices are positive, and no time slice appears twice (i.e., 0 < ti < ti+1 for all 1 ≤ i < n). Furthermore, there is no value in having time slices that are longer than the overall timeout. Hence, tn ≤ T.

Within this environment, a schedule 𝒮 is a finite set of pairs (t, s) where t ∈ {t1, . . . , tn} and s ∈ S. The overall time used by the schedule must not exceed the total timeout:

    Σ_{(t,s)∈𝒮} t ≤ T.

The generated schedule should maximize the number of benchmarks solved within the overall timeout. Hence, we look for the schedule 𝒮 such that the set

    ⋃_{(t,s)∈𝒮} {b | b ∈ B and Es(b) ≤ t}

is maximal. If the optimal solution for this objective function is found, the generated schedule is optimal with regard to the number of benchmarks solved.

One aspect is missing here: the order in which the strategies in 𝒮 should be tried. This order can be determined in a second phase after the optimal set of timeout–strategy pairs has been calculated. This keeps the encoding simple. However, it is not obvious what measure should be optimized to find a good order. In Section 4 we will discuss approaches to this problem.

We start modelling the optimization problem by defining a binary variable ẋ(t,s) for every (t, s) ∈ {t1, . . . , tn} × S. If ẋ(t,s) is 1, then the schedule runs the strategy s for the time t. Since every benchmark solved by a strategy within a short timeout will also be solved within a longer one, it is pointless to pick one strategy for more than one time slice. To model this we define the additional binary variables ẋs expressing that the strategy s runs for some time slice. We can now add our first set of constraints. For each s ∈ S the constraint

    ẋs = Σ_{1≤i≤n} ẋ(ti,s)

expresses that each strategy can be picked at most once: since ẋs is a binary variable, only one of the summands can be 1. This constraint eliminates a corner case: if, for example, some schedule solves all benchmarks well before the overall timeout, the integer programming solver could otherwise pick a strategy twice without changing the objective function.

With regard to solved benchmarks we will use two variables for each benchmark b ∈ B. The binary variable ẋb is intended to be 1 iff the benchmark b is solved by at least one picked timeout–strategy pair. That is, there is a (t, s) ∈ {t1, . . . , tn} × S such that Es(b) ≤ t and ẋ(t,s) is 1. To model this we use an auxiliary integer variable x′b that counts the number of times b is solved. This is implemented for each x′b by the constraint

    x′b = Σ_{ẋ∈Xb} ẋ    with    Xb := {ẋ(t,s) | (t, s) ∈ {t1, . . . , tn} × S and Es(b) ≤ t}.

The following two constraints force ẋb to be 1 iff x′b ≥ 1.
    ẋb × |Xb| ≥ x′b        ẋb ≤ x′b + 0.5

If x′b is not 0, the first constraint ensures that ẋb has to be 1; if x′b is 0, the second constraint ensures that ẋb is 0 too, since a binary variable cannot take any value other than 0 that is smaller than 0.5.

Defining the objective function is now trivial. It is just the sum

    Σ_{b∈B} ẋb.

To extract the schedule 𝒮 from a solution for this encoding, we can collect the ẋ(t,s) that are set to 1. Due to the constraints, no two ẋ(t1,s), ẋ(t2,s) with t1 ≠ t2 are ever both 1.

A solution to this integer program maximizes the number of benchmarks solved within the overall timeout. Every strategy is assigned at most one of the allowed time slices. Since the time slices are predefined, this solves a limited form of the general optimal schedule problem [4], which is NP-hard. Our restricted version is also a variant of the knapsack problem: we want to maximize the value of the items (the number of benchmarks solved by the strategies) while respecting the weight limit (the timeout). Hence it is still NP-hard.

4. Strategy Order and Combined Schedules

The solution of the integer program as discussed above does not give an order of the timeout–strategy pairs. The goal for selecting pairs is clear: solve as many problems as possible. There is also an intuitive goal for the order: to minimize the sum of solving times of the benchmarks solved by the schedule. Indeed, if we want to generate a schedule that performs as well as possible at the SMT competition [5], we would have to find this order. The single-query track of the competition ranks solvers by the number of benchmarks solved within the overall timeout (20 min), but the sum of solving times serves as the tiebreaker [5, Section 7.1].

This is another optimization problem that is further complicated by the unpredictable behaviour of the solver on unsolved benchmarks. The solver might either run until terminated at the timeout, give up, or even crash. Since veriT uses quantifier instantiation, it often gives up when no new ground instances are generated because the input problem is satisfiable. In this case, no strategy can succeed, but all will be tried. Currently, the implementation of the optimization toolbox does not provide a procedure to compute this order. We tried an approach based on dynamic programming, but it was too slow in practice.

Furthermore, this order is not necessarily the best for all applications. For example, an application could interrupt schedules prematurely. To see why minimizing the sum of solving times is not the best choice in this situation, consider two strategies a and b such that a solves three benchmarks after 2 s and b solves only one benchmark after 1 s. The sum of solving times of the schedule [(2, a), (1, b)] is 9 s. The sum for the schedule [(1, b), (2, a)] is 10 s. However, if the first schedule is interrupted after one second, no benchmark is solved. If the second schedule is interrupted after one second, one benchmark is solved.

Alternatively, one might sort the timeout–strategy pairs by increasing timeout. The result is a schedule that tries short strategies in quick succession before using longer strategies. However, it might be that strategies that are only used with a short timeout are specialized to a small number of benchmarks, and the more general strategies are used with longer timeouts. Furthermore, since the time needed to solve benchmarks is usually not evenly distributed, even a strategy used with a long timeout might solve many benchmarks very fast.
An alternative approach is to pick the pair that solves the most benchmarks not solved so far. This best-effort order has the benefit that, if the schedule is interrupted between pairs before the total timeout has passed, the number of benchmarks not solved is minimized. The downside of this approach, however, is that the schedule would use a pair with a long timeout before using two pairs with short timeouts that cumulatively solve more benchmarks.

Hence, the order of a strong schedule should take the number of solved benchmarks, the time it takes to solve them, and the timeout of each pair into account. We can modify the best-effort order to achieve this. Instead of picking the pair that solves the most benchmarks, we pick the pair that has the minimal estimated cost. To estimate the cost of a pair we calculate the sum of solving times. For benchmarks solved by the pair this is straightforward. For the other benchmarks an approximation is needed. Our approximation is the solving time used by the virtual best solver formed by the other pairs in the schedule. The virtual best solver of a schedule 𝒮 is the function E𝒮 : B → ℝ ∪ {∞} such that

    E𝒮(b) = min({Es(b) | (t, s) ∈ 𝒮 ∧ Es(b) ≤ t} ∪ {∞}).

Based on this definition, the cost estimate for a pair (t, s) ∈ 𝒮 formally is

    c(t,s) = Σ_{b : Es(b) ≤ t} Es(b) + Σ_{b : Es(b) > t ∧ E𝒮′(b) < ∞} (t + E𝒮′(b)),

where 𝒮′ = 𝒮 \ {(t, s)}. We will also use the virtual best solver later as a baseline to evaluate schedules.

The ordered schedule is constructed by iteratively removing the pair (t, s) with the smallest cost estimate c(t,s) from 𝒮 until no pair is left. The resulting schedule will usually first use pairs with short timeouts. Longer timeouts will only be chosen earlier if the benchmarks solved by this strategy offset the delay to solving the other benchmarks. Since a schedule typically contains fewer than a hundred pairs, ordering a schedule with this procedure is fast.

It is also possible to target multiple predefined timeouts with one schedule. For example, at the SMT competition solvers compete with both a timeout of 20 min and a timeout of 24 s. A simple approach to generate a schedule that works well in this setting is to perform multiple rounds of optimization: Let T1 and T2 be two timeouts such that T1 < T2 and let 𝒮1 be the optimal schedule for the timeout T1. We can now define a new set of benchmarks B′ that contains exactly the benchmarks from B that are not solved by the schedule 𝒮1 within the timeout T1. Next, we can calculate another schedule 𝒮2 that solves the most benchmarks from B′ within the timeout T2 − T1. Both rounds of optimization can use a different set of allowed time slices. To build the ordered joint schedule, we just search for the best order for both schedules independently and concatenate the resulting lists.

A small downside of this approach is that some strategies appear in both schedules and the solver has to perform some repeated work. This repeated work can be avoided for one strategy. Assume that there is a repeated strategy s. Hence, there is a pair (t1, s) ∈ 𝒮1 such that (t2, s) ∈ 𝒮2 for some t2. In this case t2 > t1, since otherwise (t2, s) would solve no benchmarks not already solved by 𝒮1. If we remove (t1, s) from the first schedule 𝒮1, the schedule solves fewer benchmarks, but only runs for the time T1 − t1. To get those benchmarks back, we move (t2, s) to the beginning of the schedule 𝒮2. In the concatenated schedule, s has run for the time t1 before T1 passes and runs for the time t2 overall.
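To make the encoding of Section 3 and the ordering heuristic of Section 4 concrete, the following Python sketch builds the integer program with PuLP on a tiny made-up data set, extracts the chosen pairs, and orders them with the cost estimate c(t,s). It is only an illustration under these assumptions: the toy dictionary E, the helper names, and the data layout are invented here and are not the toolbox's actual implementation; unsolved benchmarks are represented by math.inf, mirroring Es(b) = ∞.

    import math
    import pulp

    # Toy experiment data (made up): E[s][b] is the solving time of strategy s
    # on benchmark b, or math.inf if the benchmark was not solved.
    E = {
        "s1": {"b1": 0.5, "b2": 3.0, "b3": math.inf},
        "s2": {"b1": math.inf, "b2": 1.5, "b3": 1.0},
    }
    strategies = list(E)
    benchmarks = ["b1", "b2", "b3"]
    slices = [1, 2, 4]   # allowed time slices t_1 < ... < t_n
    T = 5                # overall timeout

    prob = pulp.LpProblem("schedule", pulp.LpMaximize)

    # Binary variable x[(t, s)]: strategy s runs for time slice t.
    x = {(t, s): pulp.LpVariable(f"x_{t}_{s}", cat="Binary")
         for t in slices for s in strategies}
    # Binary variable xs[s]: strategy s is used for some time slice.
    xs = {s: pulp.LpVariable(f"xs_{s}", cat="Binary") for s in strategies}
    # Binary variable xb[b]: benchmark b is solved by some chosen pair.
    xb = {b: pulp.LpVariable(f"xb_{b}", cat="Binary") for b in benchmarks}

    # Each strategy is picked for at most one time slice.
    for s in strategies:
        prob += xs[s] == pulp.lpSum(x[(t, s)] for t in slices)

    # The schedule must fit into the overall timeout.
    prob += pulp.lpSum(t * x[(t, s)] for t in slices for s in strategies) <= T

    # Force xb[b] to 1 iff at least one chosen pair solves b.
    for b in benchmarks:
        solves_b = [x[(t, s)] for t in slices for s in strategies if E[s][b] <= t]
        count_b = pulp.lpSum(solves_b)            # plays the role of x'_b
        prob += len(solves_b) * xb[b] >= count_b  # count_b >= 1 implies xb[b] = 1
        prob += xb[b] <= count_b + 0.5            # count_b = 0 implies xb[b] = 0

    # Objective: the number of solved benchmarks.
    prob += pulp.lpSum(xb.values())
    prob.solve(pulp.PULP_CBC_CMD(msg=False))

    schedule = [(t, s) for (t, s) in x if pulp.value(x[(t, s)]) > 0.5]

    # Order the chosen pairs with the cost estimate c_(t,s) from Section 4.
    def virtual_best(sched, b):
        times = [E[s][b] for (t, s) in sched if E[s][b] <= t]
        return min(times) if times else math.inf

    def cost(pair, sched):
        t, s = pair
        rest = [p for p in sched if p != pair]
        total = 0.0
        for b in benchmarks:
            if E[s][b] <= t:                        # solved by the pair itself
                total += E[s][b]
            elif virtual_best(rest, b) < math.inf:  # solved by the other pairs
                total += t + virtual_best(rest, b)
        return total

    ordered = []
    remaining = list(schedule)
    while remaining:
        best = min(remaining, key=lambda p: cost(p, remaining))
        ordered.append(best)
        remaining.remove(best)

    print(ordered)

On this toy data the program picks a schedule that solves all three benchmarks within the timeout of 5 s, and the final loop emits the pairs in the order in which the schedule should try them; on real experiment data the same structure applies, only the dictionaries grow.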
5. The veriT-schedgen Toolbox

We implemented the schedule generation procedures described above as part of a toolbox called veriT-schedgen. Besides optimization, the toolbox also contains some other tools that are useful when working with strategy schedules, such as a visualizer and a simulator. This section provides an introduction – a user manual – to the toolbox. It also describes important implementation details of the toolbox. The source code for the toolbox is available at https://gitlab.uliege.be/verit/schedgen/.

The veriT-schedgen toolbox is implemented in the Python programming language and uses the PuLP [6] package to express and solve the linear programming problems. The PuLP package can use multiple linear programming solvers as backend, but the default solver is good enough for the linear programming problems arising from our use case. The default solver is the open source solver CBC (available at https://github.com/coin-or/Cbc). Both PuLP and CBC are developed by the COIN-OR foundation, which develops open source software for operations research (see https://www.coin-or.org/). To store the experiment results veriT-schedgen uses the Pandas library (see https://pandas.pydata.org/). This library provides a flexible data structure to work with tables.

All tools in the veriT-schedgen toolbox support the filtering of benchmarks by SMT-LIB logic. Since the SMT-LIB benchmark library is big (around 100 GiB), it is often not installed on the computer that is used for schedule generation. In this case the logic cannot be determined by parsing the benchmark file. Instead, usually the first component of the benchmark path is used. For example, a benchmark QF_UF/test.smt2 will have the logic QF_UF assigned. This matches the typical folder structure used by the SMT-LIB benchmark suite. In fact, it is not necessary to use the SMT-LIB logic for filtering. Instead, any string does the job.

The veriT-schedgen toolbox is a standard Python package. It has been tested with Python 3.10. To install the package it is enough to execute

    $ python setup.py install

This will install the following tools that are part of the veriT-schedgen toolbox on the user's system.

schedgen-optimize.py is the main tool that performs the schedule generation. It takes the name of a folder that contains the experiments, a list of time slices, and the total timeout. After the generation is finished the result is written as a schedule into a CSV file. It is also possible to provide an existing schedule to build a joint schedule. The benchmarks used to generate the schedule can be filtered as described above.

schedgen-finalize.py takes CSV files describing schedules, as generated by schedgen-optimize.py, and produces a shell script that runs a solver according to the schedules. The user can provide a template for the script and change various parameters used during generation. The default script can pick schedules by SMT-LIB logic.

schedgen-simulate.py can be used to predict the outcome of running a schedule. To do so the user has to provide experiments for the strategies included in the schedule, but can use different benchmarks. Furthermore, it is possible to add noise to the simulation.

schedgen-query.py is a small tool that produces lists of benchmarks for debugging and analysis purposes. For example, it can list the union of the benchmarks solved by some strategies, or find the benchmarks from this list that a specific schedule cannot solve.
schedgen-visualize.py can be used to visualize one or multiple schedules. It generates a bar plot that represents a schedule visually.

Each of these tools can generate extensive documentation of the options it exposes via the standard --help option.

The tools accept experimental data in two formats: the GridTPT format and a simple CSV format. The GridTPT system [7] is a platform for testing SMT solvers and other theorem provers on grid computers. Its file format is based on CSV, but extended with a header that stores information about the experiment. For every strategy one file is used that contains the results for all benchmarks. The header also contains the command line options, i.e., the strategy used for the experiment. Furthermore, the header contains information such as the solver, when it was executed, and what timeout was used. Since the GridTPT format uses one file per strategy, the user has to declare a folder that contains all data files. The veriT-schedgen input parser searches recursively for compatible files in this folder.

The simple CSV format collects all results within one file. The data file contains five columns and a header that stores the column names. Columns are separated by a literal “;”. The columns can be given in any order. The parser uses the header line to determine the order of the columns. The five columns are:

• benchmark contains the filename of the benchmark.
• logic stores the SMT-LIB logic of the benchmark or any other string that should be used to select subsets of benchmarks.
• strategy is the strategy used.
• solved is “yes” if the benchmark was solved, “no” otherwise.
• time is the solving time in seconds as a positive float number. Any value that is not a valid floating point number is assumed to be ∞. It is possible to indicate a time even if the benchmark was not solved. Sometimes the solver might give up.

It is simple to add a custom parser. One has to implement a Python class whose constructor takes two arguments: the path to the folder or file containing the data and a list of logics used as a filter. This list of logics might be None if the data should not be filtered. If parsing succeeds, the resulting object should have a member .frame that contains the parsed data. This member is a Pandas data frame where the rows are the benchmarks and the columns are the strategies. Each cell contains the solving time in seconds stored as a standard float. If the benchmark was not solved by the strategy the cell stores the float value ∞.

Generated schedules are also stored in CSV files. The CSV dialect is the same as for the input data, but this time only two columns are used: strategy and time. The file is intended to be read from top to bottom and each line indicates a step in the schedule that executes the strategy strategy for time seconds. Until schedules are transformed into a script by using schedgen-finalize.py they are called pre-schedules.

5.1. A veriT-schedgen Tutorial

This section shows how the veriT-schedgen toolbox can be used in practice. We will build and explore an optimal schedule for some artificial example benchmark data. This data can be found in the file contrib/example_data.csv and uses six made-up strategies. The first strategy base-strategy solves 20 benchmarks within one second. The five other strategies extra01 to extra05 solve up to five benchmarks exclusively, such that extra01 solves five benchmarks not solved by any other strategy, extra02 solves four, and so on.
There is one benchmark unsolved.smt2 that is not solved by any strategy and a strategy bad-strategy that solves only one benchmark in 1.5 seconds. Hence, if the total timeout is six seconds we expect the optimal schedule to run the strategies in the order just presented for one second each. The solving times have been sampled from a normal distribution to ensure the resulting graphs look interesting. Since the example data is in the CSV format described above, the first four lines of the data file are:

    benchmark;logic;strategy;solved;time
    base01.smt2;UF;base-strategy;yes;0.5189
    base02.smt2;UF;base-strategy;yes;0.2164
    base03.smt2;UF;base-strategy;yes;0.1754

Obviously, the first step is to generate a schedule. To do so we invoke the schedgen-optimize.py command:

    $ schedgen-optimize.py -l UF -s 0.9 1.0 1.1 -t 6 \
        -c -d contrib/example_data.csv \
        contrib/example_schedule.csv

The option -l UF selects the UF logic, the option -s 0.9 1.0 1.1 defines the time slices, and -t 6 sets the total timeout. Finally, -c tells the tool to use the CSV parser, and the option -d gives the data location. The tool then searches for the optimal schedule and writes the result to contrib/example_schedule.csv. As enforced by the synthetic data, the optimal schedule uses the strategies in the order outlined above. The first four lines of the schedule are:

    time;strategy
    1.100;base-strategy
    1.000;extra01
    0.900;extra02

Furthermore, the output of the tool predicts how well the schedule will perform:

    ['UF']: This schedule solves 35/37 in 52.19s.

Figure 1: The visualization of the example schedule without and with highlights.

Two benchmarks are not solved by this schedule: the benchmark only solved by the “bad” strategy and the benchmark not solved by any strategy. Furthermore, we learn that solving all benchmarks solved by this schedule will take 52.19 seconds. The output also tells us the solving time before order optimization, but in this case nothing changes.

The simulator can give us a better prediction of the schedule performance, but let us first visualize the schedule:

    $ schedgen-visualize.py -t 6 -p out.pgf \
        contrib/example_schedule.csv

Figure 1 shows the result of this command. The -t option works as before and defines the overall timeout. The option -p out.pgf tells the visualizer to generate a PGF/TikZ picture. If this option is omitted, the tool instead opens a window that shows the schedules. Finally, the tool expects a list of pre-schedules to visualize. Since the different strategies are all used for roughly the same timeout, every strategy is shown as a block of similar size. Because some strategies are used for less than one second, there is a small gap of 0.1 seconds at the end.

By default the visualizer simply assigns a number to each strategy when it is used the first time. Strategies often correspond to command line options and those can be long. A number ensures that there are no printing problems, while still making it possible to compare different schedules based on the same strategies. It is possible to customize the names used for some strategies and to highlight them. This is done with the help of a shorthand file: a CSV file with three columns. As in all other CSV files, the fields are separated by a semicolon and there must be a header line. The strategy column lists the full strategy string, the shorthand column gives the name that should be printed, and the highlight column can be set to true to indicate that the strategy should be highlighted (false otherwise).
Strategies not listed in the shorthand file get assigned a number and stay unhighlighted. In Figure 1 the base strategy and the second additional strategy are highlighted and every strategy has a custom shorthand name. The command to produce this figure is:

    $ schedgen-visualize.py -t 6 -c -p out.pgf \
        -a contrib/example_shorthand.csv \
        contrib/example_schedule.csv

Let us now simulate the generated schedule. To do so we can use the simulation tool:

    $ schedgen-simulate.py -l UF -t 6 \
        -c -d contrib/example_data.csv \
        --mu 0.05 --sigma 0.01 --seed 1 \
        contrib/example_schedule.csv simulation_1.txt

The options -l, -t, -c, and -d are as for the other tools. The tool expects two positional arguments: first, the schedule to simulate and, second, the file to write the results to. The output always uses the GridTPT [7] format, independent of the input format. The remaining options are --mu, --sigma, and --seed. These control the random jitter that is applied to each solving time in the input dataset. The jitter is sampled from a normal distribution. The options --mu and --sigma control the mean and the standard deviation, respectively. To ensure simulations can be repeated it is possible to fix an integer seed for the random number generator with the option --seed. In this case we choose a slight jitter.

Figure 2: Some simulated density functions (solved benchmarks over CPU time for the base strategy and for simulations with slight and intense jitter).

The cumulative density functions generated by the simulation are shown in Figure 2. As expected, the graph for the simulation with slight jitter follows almost exactly the base strategy for the first second. The plot also shows the result of using more jitter (µ = 0.2, σ = 0.05). In this case two benchmarks are no longer solved. (The script used to generate these graphs is contrib/cdf.py in the veriT-schedgen repository.)

If we want to have a list of the benchmarks not solved by the schedule, we can use the schedgen-query tool:

    $ schedgen-query.py -c -d contrib/example_data.csv \
        -q unsolved contrib/example_schedule.csv
    special01.smt2
    unsolved.smt2

The option -q unsolved asks the tool to print the list of all benchmarks not solved. An alternative is -q compare that shows benchmarks not solved by any strategy. To see the benchmarks that are solved by any strategy together with the minimal solving time the -q best option can be used. Finally, -q schedule lists all benchmarks solved by the given schedule together with the predicted solving time (without jitter).

Figure 2 is very artificial – it is based on synthetic example data. In the next section we see some real examples.

6. Evaluation

This section shows some empirical data points related to the real-world performance of the schedules generated by veriT-schedgen. The evaluation of veriT-schedgen is simplified by the fact that the generated schedules do not depend on randomized heuristics. Nevertheless, it is useful to know how many problems are solved by such schedules, especially in comparison to simpler approaches, such as using only the best overall strategy or constructing a schedule with a simple greedy approach.
To generate the greedy schedule we start with an empty schedule and then iteratively add the timeout–strategy pair that solves the most benchmarks not yet solved. We stop when the timeout is reached.

In machine learning, it is common to use k-fold cross validation for such evaluations. This means that the data set is split into k subsets of equal size. Then the model (here: the schedule) is trained on the union of k − 1 sets and evaluated on the remaining set. This is repeated k times and the results are combined (for example by calculating the mean). In our evaluation we use veriT-schedgen to generate one schedule for each training set.

A major problem of such an evaluation is that benchmark libraries, such as the SMT-LIB library, are biased towards the types of problems submitted to the library. Furthermore, the SMT-LIB benchmark collection organizes the benchmarks in categories according to application and submitter. The benchmarks of each category likely share many characteristics. Furthermore, some sets contain thousands of benchmarks, while others contain only tens. To address this issue to some degree the evaluation only uses categories with more than 1000 benchmarks, and uses three different scenarios. The first is a random sample of 1000 benchmarks from each category. The second is to hold out one category as the test set. This indicates how well a schedule performs on a category that was not seen during the schedule generation. Finally, to see how well such schedule generation can be used to adapt an SMT solver to one application, the last scenario uses only one category of benchmarks.

Our experiments here use the experimental data gathered for the evaluation of a recently published technique for SMT solvers [8]; the data is available on Zenodo [9]. The schedules and graphs used for this publication were generated by veriT-schedgen. Instead of running the solver with the schedules, we can estimate the outcome of each schedule with the simulator. All generated schedules use a timeout of 180 s and the time slices 1, 2, 3, 4, 5, 8, 16, 32, and 64 s. Since the benchmark solving rate decreases rapidly with longer timeouts, we can reduce the granularity of the time slices for bigger values. In the benchmarks used for the evaluation nine categories contain more than 1000 benchmarks. Table 1 lists them.

Table 1: Categories with more than 1000 benchmarks.

    Category                 Benchmarks
    AUFLIA/20170829-Rodin          3272
    AUFLIRA/nasa                  18446
    AUFLIRA/why                    1271
    UF/20170428-Barrett            2963
    UF/sledgehammer                4140
    UFLIA/boogie                   1191
    UFLIA/simplify2                2345
    UFLIA/sledgehammer             3462
    UFLIA/tokeneer                 1872

The evaluation uses two metrics: the total number of solved benchmarks and the PAR-2 score. The PAR-2 score is the sum of all solving times plus twice the timeout for each benchmark not solved. Hence, a lower PAR-2 score is better. The PAR-2 score is used to score competitions such as the SAT competition [10].

Overall, the first scenario uses 9000 benchmarks and 5-fold cross validation over all benchmarks. Therefore, the schedules are generated on 7200 benchmarks and evaluated on 1800 benchmarks. Table 2 shows the result of the evaluation.

Table 2: Results of the first scenario.

    Solved           Split 1   Split 2   Split 3   Split 4   Split 5   Mean (σ)
    virtual best        1355      1318      1328      1293      1338   1326 (23.1)
    generated           1349      1306      1317      1283      1326   1316 (24.4)
    greedy              1340      1303      1314      1275      1326   1312 (24.7)
    best strategy       1311      1267      1280      1243      1299   1280 (26.7)

    PAR-2 score
    virtual best     160 501   174 213   170 347   182 938   167 371   171 074 (8 316)
    generated        164 388   179 811   175 453   187 851   172 102   175 921 (8 736)
    greedy           169 183   183 040   178 817   192 482   173 655   179 435 (8 974)
    best strategy    176 844   192 438   187 772   201 248   180 966   187 854 (9 606)
The first part shows the total number of benchmarks solved by the virtual best solver, the generated optimal schedule, the schedule generated by the greedy approach, and the single strategy that solves the most benchmarks. Here, the virtual best solver is constructed from all available strategies. The second part shows the corresponding PAR-2 scores. We see that the generated optimal schedule is much better in all splits but split 5. In this case both the greedy and the generated optimal schedule solve the same number of problems. Nevertheless, even in this case the generated schedule has a better PAR-2 score. Overall, the PAR-2 scores also show that the optimal schedule is better than the greedy schedule and much better than not using a schedule at all. Furthermore, the variance is also reduced by using an optimal schedule.

The second scenario uses the same 9000 benchmarks, but for each fold a different category serves as the test set. Since there are nine categories, this scenario is a 9-fold cross validation with training sets containing 8000 benchmarks and test sets containing 1000. Note that the share of benchmarks that are unsolvable for veriT (e.g., satisfiable benchmarks) can vary widely from category to category. As a consequence the mean of the results is not very meaningful for this experiment. The results are in Table 4. To save space, the names of the categories are simplified. The category “Lsledgehammer” corresponds to the category UFLIA/sledgehammer.

This scenario is, in a certain sense, a stress test for the schedules. Each category might have some characteristics not shared with any other category. The generated schedule seems to struggle with the AUFLIA/20170829-Rodin category. It only improves slightly over the greedy schedule, while the virtual best solver is significantly better. For three categories both schedules and the virtual best solver solve the same number of benchmarks. In one case (UFLIA/tokeneer) the optimal schedule has the worst PAR-2 score. Furthermore, in two cases (AUFLIA/20170829-Rodin and UFLIA/simplify2) the heuristic optimization of the order slightly increased the PAR-2 score. Overall, optimal schedules are clearly doing well in this scenario too.

The third, and final, scenario uses only the category UF/sledgehammer. This category contains problems generated by Sledgehammer [11]. Since veriT is also used by Sledgehammer, this is a sensible choice. The UF/sledgehammer category contains 4140 benchmarks. Again 5-fold cross validation is used: 3312 benchmarks for optimization and 828 for testing. Table 3 shows the results, which mirror the other two scenarios.

Table 3: Results of generating schedules for UF/sledgehammer.

    Solved           Split 1   Split 2   Split 3   Split 4   Split 5   Mean (σ)
    virtual best         248       268       265       271       271   265 (10)
    generated            235       262       261       264       265   257 (13)
    greedy               237       255       259       261       262   255 (10)
    best strategy        222       245       244       252       245   242 (11)

    PAR-2 score
    virtual best     209 094   202 099   202 821   201 265   200 737   203 203 (3 388)
    generated        214 328   204 673   204 829   203 835   203 863   206 306 (4 508)
    greedy           215 158   207 890   206 677   205 539   206 027   208 258 (3 957)
    best strategy    218 768   210 590   210 727   208 061   210 516   211 732 (4 086)

For two splits, all schedules solved the same number of benchmarks, and the optimal schedule improves only slightly in the first split. Both effects are probably the result of the relatively small number of benchmarks in the test set.
In any case, the optimal schedule improves on the PAR-2 score over the greedy schedule.

We also collected the predicted sum of solving times for the schedule before and after order optimization. The order before optimization is an implementation detail of the toolbox and not deterministic. The average ratio of the time before optimization and after is 0.71.

To finish this section, let us have a look at the real-world behaviour of schedules in comparison with the simulation. This allows us to judge how reliable the simulated results presented so far are. The data gathered for Figure 3 in [8] serves as the base for this. Figure 3 shows a scatter plot of the solving times for the 180 s schedule with quantifier simplification by unification. The closer a point is to the diagonal, the more accurate the simulation was. This shows that the simulation is fairly accurate. In a few cases, benchmarks predicted to be solved are not solved in reality and vice versa. Those are benchmarks that are solved just before the time slice of a strategy ends. In this case a delay can make the benchmark unsolvable. The opposite is also possible: if a benchmark is solved during data gathering after the time slice ends, the jitter can reduce the solving time below the timeout. Overall, this graph indicates that the simulation is a good tool to assess the efficacy of a schedule.

Figure 3: Simulation versus reality (simulated solving time plotted against measured solving time).

The evaluation was performed on a laptop with an Intel i7-7600U CPU and 16 GB of memory. Generating the 19 schedules took 39 min 16 s overall. Hence, generation time is short. To generate the schedules for veriT at the SMT competition, GNU parallel [12] was used to generate the schedules for the different logics in parallel.

7. Conclusion

The usage of strategies with automatic theorem provers has attracted considerable attention. Many sophisticated systems exist [e.g., 13, 14, 15]. These systems often support the search for new strategies and can select strategies based on features of the input problem. They also often incorporate machine learning. Our approach is very simple. It is oblivious to the properties of the problems involved and generates static schedules.

The veriT-schedgen toolbox itself is now mature. It has been used to generate schedules for veriT's participation in the 2020 and 2021 SMT competitions. Furthermore, it helped in the evaluation of a new SMT solving technique [8]. Since it is independent of veriT, we hope that it can be used by developers of other theorem provers.

We plan to extend the toolbox with two additional components. The first is a flexible schedule executor that can also execute strategies in parallel. The second is a strategy generation tool that can generate new strategies with good performance from a specification of the available parameters. A good starting point for this can be an algorithm based on random permutations such as the one used by MaLeS [13].

Acknowledgments

The usage of strategy scheduling in veriT was started by Haniel Barbosa. We thank Pascal Fontaine and the anonymous reviewers for many comments that improved the text. We thank Jasmin Blanchette for insightful discussions on the role of schedules. The author has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 713999, Matryoshka).
Table 4: Results of the second scenario.

    Solved          Rodin   nasa    why   Barrett  sledgehammer  boogie  simplify2  Lsledgehammer  tokeneer  Mean (σ)
    virtual best      764    984    982       535           337     689        995            424       922  720 (255.8)
    generated         746    984    982       527           332     677        992            406       922  714 (260.1)
    greedy            745    984    982       517           328     673        991            402       922  711 (262.4)
    best strategy     716    981    979       491           296     648        893            374       922  693 (264.4)

    PAR-2 score
    virtual best    85 637  5 761  6 483   167 838       238 974  112 173      1 994       208 430    28 080  100 939 (92 338)
    generated       92 244  5 801  6 591   172 126       242 243  118 134      3 532       215 400    28 085  103 801 (94 333)
    greedy          95 380  6 146  6 697   176 215       245 105  120 403      9 799       218 310    28 081  105 909 (94 946)
    best strategy  102 387  6 840  7 585   183 730       253 850  127 337     38 707       226 300    28 081  110 879 (95 467)

References

[1] T. Bouton, D. C. B. de Oliveira, D. Déharbe, P. Fontaine, veriT: An open, trustable and efficient SMT-solver, in: R. A. Schmidt (Ed.), CADE 22, volume 5663 of LNCS, Springer Berlin Heidelberg, 2009, pp. 151–156. doi:10.1007/978-3-642-02959-2_12.
[2] C. Barrett, P. Fontaine, C. Tinelli, The satisfiability modulo theories library (SMT-LIB), 2016. Accessed: 2021-09-28.
[3] G. Cornuéjols, M. A. Trick, M. J. Saltzman, A tutorial on integer programming, 1995. Accessed: 2022-04-07.
[4] A. Wolf, R. Letz, Strategy parallelism in automated theorem proving, in: Proceedings of the Eleventh International Florida Artificial Intelligence Research Society Conference, AAAI Press, 1998, pp. 142–146. doi:10.5555/646811.706867.
[5] H. Barbosa, J. Hoenicke, A. Hyvärinen, 16th International Satisfiability Modulo Theories Competition (SMT-COMP 2021): Rules and procedures, https://smt-comp.github.io/2021/rules.pdf, 2021. Accessed: 2021-08-08.
[6] S. Mitchell, M. O'Sullivan, I. Dunning, PuLP: A linear programming toolkit for Python, 2011.
[7] T. Bouton, D. Caminha B. De Oliveira, D. Déharbe, P. Fontaine, GridTPT: a distributed platform for theorem prover testing, in: 2nd Workshop on Practical Aspects of Automated Reasoning (PAAR), Edinburgh, United Kingdom, 2010, pp. 33–39.
[8] P. Fontaine, H.-J. Schurr, Quantifier simplification by unification in SMT, in: B. Konev, G. Reger (Eds.), Frontiers of Combining Systems, Springer International Publishing, Cham, 2021, pp. 232–249. doi:10.1007/978-3-030-86205-3_13.
[9] Quantifier Simplification by Unification in SMT, Zenodo, 2021. doi:10.5281/zenodo.5088868.
[10] M. Heule, M. Järvisalo, M. Suda, SAT Race 2019, 2019. Accessed: 2022-02-28.
[11] J. C. Blanchette, S. Böhme, L. C. Paulson, Extending Sledgehammer with SMT solvers, Journal of Automated Reasoning 51 (2013) 109–128. doi:10.1007/s10817-013-9278-5.
[12] O. Tange, GNU parallel 20210722 (’blue unity’), 2021. URL: https://doi.org/10.5281/zenodo.5123056. doi:10.5281/zenodo.5123056. GNU Parallel is a general parallelizer to run multiple serial command line programs in parallel without changing them.
[13] D. Kühlwein, J. Urban, MaLeS: A framework for automatic tuning of automated theorem provers, Journal of Automated Reasoning 55 (2015) 91–116. doi:10.1007/s10817-015-9329-1.
[14] E. K. Holden, K. Korovin, Heterogeneous heuristic optimisation and scheduling for first-order theorem proving, in: F. Kamareddine, C. Sacerdoti Coen (Eds.), Intelligent Computer Mathematics, Springer International Publishing, Cham, 2021, pp. 107–123. doi:10.1007/978-3-030-81097-9_8.
[15] J. Scott, A. Niemetz, M. Preiner, S. Nejati, V. Ganesh, MachSMT: A machine learning-based algorithm selector for SMT solvers, in: J. F. Groote, K. G.
Larsen (Eds.), Tools and Algorithms for the Construction and Analysis of Systems, Springer International Publishing, Cham, 2021, pp. 303–325. doi:10.1007/978-3-030-72013-1_16.