<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimal Strategy Schedules for Everyone</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hans-Jörg Schurr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Lorraine</institution>
          ,
          <addr-line>CNRS, Inria, and LORIA, Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Parametrization of a theorem prover is critical for solving a problem. A specific parametrization is called a strategy and the best strategy usually difers from problem to problem. Strategy scheduling means trying multiple strategies within a time limit. veriT-schedgen is a toolbox to work with strategy schedules. In its core is a simple tool that uses integer programming to generate strong schedules automatically. Another tool translates the schedules into shell scripts. Hence, the toolbox can be used with any theorem prover. The generated schedules are optimal with regards to the number of solved benchmarks. While generating optimal schedules is NP hard, the generation time is short in practice.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;automated theorem proving</kwd>
        <kwd>strategy scheduling</kwd>
        <kwd>Python</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>a given set of strategies and benchmarks that is optimal with regard to the number
of benchmarks solved. The theorem prover developer has to define a list of strategies
and allowed time slices and has to record how much time each strategy uses to solve
benchmarks from a training set. The method encodes this problem of finding the schedule
that solves the most benchmarks within an overall timeout as an integer programming
problem (Section 2 to Section 4). Overall, the generation procedure is easy to understand
and avoids surprises.</p>
      <p>This encoding is used by our tool “veriT-schedgen” to generate schedules. It is
implemented in Python (Section 5), has a simple user interface, and comes with auxiliary tools
that can simulate and visualize schedules. The toolbox allows users to generate simple,
static schedules and to integrate them with their solver of choice. In Section 6 we see an
evaluation of the practical usefulness of veriT-schedgen. Of particular interest to us is
how well the generated schedules perform on benchmarks that are not used to generate
the schedules. That is: do the generated schedules generalize to unseen benchmarks?</p>
      <p>
        Originally, the veriT-schedgen toolbox was developed for the SMT solver veriT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
While the toolbox is fully independent of the solver, this paper discusses it through the
perspective of using it with veriT.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Integer Programming</title>
      <p>
        Let us first direct our attention to integer programming. It is widely used to express
optimization problems. Solving an integer programming problem means finding an
assignment to variables, such that a linear combination of the variables is maximized
and all linear inequalities from a given set are satisfied. If some variables must take
integer values, while others allow real values, the problem is a mixed integer programming
problem. Usually integer programming is presented in terms of linear arithmetic, but
since we are interested in SMT solving we will use a notation similar to the many-sorted
logic with theories commonly used in SMT solving [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Formally, a mixed integer programming problem is a tuple (C, O) where C is a set of
formulas and O is a special term called the objective function. The formulas in C are
formulas in the theory of linear real and integer arithmetic without boolean connectives
and quantifiers. They are called the linear constraints. The objective function is a term
of sort Real or Int.</p>
      <p>A solution to a mixed integer programming problem is an interpretation Σ such that
t evaluates to true for all t ∈ C and there is no Σ0 6= Σ such that the evaluation of O
under Σ0 is greater than its evaluation under Σ. Hence, Σ maximizes O.</p>
      <p>Traditionally the objective function O is written as a linear combination
(c1 × x1) + (c2 × x2) + · · · + (cn × xn)
of some variables xi : σ and constants ci : Int where the sort σ ∈ {Real, Int}. The linear
constraints are traditionally written as equalities and inequalities in the form
(c1 × x1) + (c2 × x2) + · · · + (cn × xn) + cm+1 ./ (cn+1 × xn+1) + · · · + (cm × xm) + cm+2
where m ≥ n, the xi and ci are variables and constants just as in the goal function, and
./ ∈ {=,6=,&lt;,&gt;,≤, ≥}. If the constant factors (cm+1 and cm+2) are zero, they are usually
omitted.</p>
      <p>A special case of a variable of sort Int is a binary variable. An integer variable x is
a binary variable if the problem contains the two constraints x ≥ 0 and x ≤ 1. Hence,
a solution can only assign 0 or 1 to x. We will mark a binary variable with a dot: x˙ .
Nevertheless, the sort of x˙ is still Int. For simplicity we will also not explicitly show the
two constraints for each binary variable.</p>
      <p>
        Since integer programming is a well established field, there is a lot of material available.
One example is “A Tutorial on Integer Programming” by Cornuéjols et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Furthermore,
the online documentation of the PuLP library that is used by veriT-schedgen contains
good introductory examples.1
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Schedule Generation as Integer Programming</title>
      <p>In this section we will see how we can express the problem of finding an optimal schedule
as a (mixed) integer programming problem. To do so, it is necessary to be a bit more
precise about the notion of strategies, benchmarks, experiments, and schedules.</p>
      <p>We have two finite, non-empty sets: the set S of strategies and the set B of benchmarks.
Furthermore, we have a family of functions (Es : B → R ∪ {∞})s∈S that represent the
experiments. The value Es(b) is the time needed by the solver to solve the benchmark b
when using the strategy s. If Es(b) = ∞, the benchmark was not solved. To generate a
schedule we also have to define the overall timeout T &gt; 0 and a list t1, . . . , tn of allowed
time slices. We assume that the list of time slices is in ascending order, all slices are
positive, and no time slice appears twice (i.e., 0 &lt; ti &lt; ti+1 for all 1 ≤ i &lt; n). Furthermore,
there is no value in having time slices that are longer than the overall timeout. Hence,
tn ≤ T .</p>
      <p>Within this environment, a schedule S is a finite set of pairs (t, s) ∈ S where t ∈
{t1, . . . , tn} and s ∈ S. The overall time used by the schedule must not exceed the total
timeout:</p>
      <p>X t ≤ T.</p>
      <p>(t,s)∈S</p>
      <p>The generated schedule should maximize the number of benchmarks solved within the
overall timeout. Hence, we look for the schedule S such that</p>
      <p>[
(t,s)∈S</p>
      <p>{b | b ∈ B and Es(b) ≤ t}
is maximal. If the optimal solution for this objective function is found, the generated
schedule is optimal with regards to the number of benchmarks solved.</p>
      <p>One aspect is missing here: the order in which the strategies in S should be tried. This
order can be determined in a second phase after the optimal set of timeout, strategy</p>
      <sec id="sec-3-1">
        <title>1The PuLP library is available at https://coin-or.github.io/pulp/</title>
        <p>pairs has been calculated. This keeps the encoding simple. However, it is not obvious
what measure should be optimized to find a good order. In Section 5 we will discuss
approaches to this problem.</p>
        <p>We start modelling the optimization problem by defining a binary variable x˙ (t,s) for
every (t, s) ∈ {t1, . . . , tn} × S. If a x˙ (t,s) is 1, then the schedule runs the strategy s for the
time t. Since every benchmark solved by a strategy in a short timeout will also be solved
by a longer timeout, it is pointless to pick one strategy for more than one time slice. To
model this we define the additional binary variables x˙ s expressing that the strategy s
runs for any timeslice. We can now add our first set of constraints. For each s ∈ S the
constraint
x˙ s =</p>
        <p>X</p>
        <p>x˙ (ti,s)
1≤i≤n
expresses that each strategy can be picked at most once: since x˙ s is a binary variable, only
one of the summands can be 1. This constraint eliminates corner cases. For example, if
there is a schedule that solves all benchmarks well before the overall timeout, the integer
programming solver can pick a strategy twice without changing the objective function.</p>
        <p>With regards to solved benchmarks we will use two variables for each benchmark b ∈ B.
The binary variable x˙ b is intended to be 1 if the benchmark b is solved by at least
one picked timeout, strategy pair. That is, there is a (t, s) ∈ {t1, . . . , tn} × S such that
Es(b) ≤ t and x˙ (t,s) is 1. To model this we use an auxiliary integer variable x0b that counts
the number of times b is solved. This is implemented for each x0b by the constraint
x0b = X x˙ with Xb := x˙ (t,s) | (t, s) ∈ {t1, . . . , tn} × S and Es(b) ≤ t .</p>
        <p>x˙ ∈Xb
The following two constraints force x˙ b to 1 if x0b ≥ 1.</p>
        <p>x˙ b|Xb| ≥ x0b
If x0b is not 0, then the first constraint ensures that x˙ b has to be 1, if x0b is 0, then the
second constraint ensures that x˙ b is 0 too. Since, xb is a binary variable there is no other
value smaller than 0.5.</p>
        <p>Defining the objective function is now trivial. It is just the sum</p>
        <p>To extract the schedule S from a solution for this encoding, we can collect the x˙ (t,s)
that are set to 1. Due to the constraints, no two x˙ (t1,s), x˙ (t2,s) with t1 6= t2 are ever 1.</p>
        <p>
          A solution to this integer program maximizes the number of benchmarks solved within
the overall timeout. Every strategy is assigned at most one of the allowed time slices.
Since the time slices are predefined this solves a limited form of the general optimal
schedule problem [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] which is NP hard. Our restricted version is also a variant of the
knapsack problem: we want to maximize the value of the items (the number of solved
benchmarks by the strategies) while respecting the weight limit (the timeout). Hence it
is still NP hard.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Strategy Order and Combined Schedules</title>
      <p>
        The solution of the integer program as discussed above does not give an order of the
timeout, strategy pairs. The goal for selecting pairs is clear: solve as many problems as
possible. There is also an intuitive goal for the order: to minimize the sum of solving times
of the benchmarks solved by the schedule. Indeed, if we want to generate a schedule that
performs as well as possible at the SMT competition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we would have to find this order.
The single-query track of the competition ranks solver by the number of benchmarks
solved within the overall timeout (20 min), but the sum of solving times serves as the
tiebreaker [5, Section 7.1].
      </p>
      <p>This is another optimization problem that is further complicated by the unpredictable
behaviour of the solver on unsolved benchmarks. The solver might either run until
terminated at the timeout, give up, or even crash. Since veriT uses quantifier instantiation,
it often gives up when no new ground instances are generated because the input problem
is satisfiable. In this case, no strategy can succeed, but all will be tried. Currently, the
implementation of the optimization toolbox does not provide a procedure to compute
this order. We tried an approach based on dynamic programming, but it was too slow in
practice.</p>
      <p>Furthermore, this order is not necessary the best for all application. For example, an
application could interrupt schedules prematurely. To see why minimizing the sum of
solving times is not the best in this situation consider two strategies a and b such that a
solves three benchmarks after 2 s and b solves only one benchmark after 1 s. The sum of
solving times of the schedule [(2, a), (1, b)] is 9 s. The sum for the schedule [(1, b), (2, a)]
is 10 s. However, if the first schedule is interrupted after one second, no benchmark is
solved. If the second benchmark is interrupted after one second, one benchmark is solved.</p>
      <p>Alternatively, one might sort the timeout, strategy pairs by increasing timeout. The
result is a schedule that tries short strategies in quick succession before using longer
strategies. However, it might be that strategies that are only used with a short timeout
are specialized to a small number of benchmarks and the more general strategies are
used with longer timeouts. Furthermore, since the time needed to solve benchmarks is
usually not evenly distributed, even a strategy used with a long timeout might solve many
benchmarks very fast.</p>
      <p>An alternative approach is to pick the pair that solves the most benchmarks not solved
so far. This best efort order has the benefit that if the schedule is not used for the total
timeout, the number of benchmarks not solved is minimized if the schedule is interrupted
between pairs. The downside of this approach, however, is that the schedule would use a
pair with a long timeout before using two pairs with short timeouts that cumulatively
solve more benchmarks.</p>
      <p>Hence, the order of a strong schedule should take the number of solved benchmarks,
the time it takes to solve them, and the timeout of each pair into account. We can
modify the best efort order to achieve this. Instead of picking the pair that solves the
most benchmarks, we pick the pair that has the minimal estimated cost. To estimate
the cost of a pair we calculate the sum of solving times. For benchmarks solved by the
pair this is straightforward. For the other benchmarks an approximation is needed. Our
approximation is the solving time used by the virtual best solver formed by the other pairs
in the schedule. The virtual best solver of a schedule S is the function ES : B → R ∪ {∞}
such that ES (b) = min({Es(b) | (t, s) ∈ S ∧ Es(b) ≤ t} ∪ {∞}). Based on this definition,
the cost estimate for a pair (t, s) ∈ S formally is
c(t,s) =</p>
      <p>X
b∈{b | Es(b)&lt;t}</p>
      <p>Es(b) +</p>
      <p>X
b∈{b | Es(b)&gt;t∧ES0 (b)&lt;∞}
(t + ES0 (b)) .
where S0 = S \ (s, t). We will also use the virtual best solver later as a baseline to evaluate
schedules.</p>
      <p>The ordered schedule is constructed by removing iteratively the pair (t, s) with the
smallest cost estimate c(t,s) from S until no pair is left. The resulting schedule will usually
ifrst use pairs with short timeout. Longer timeouts will only be chosen if the benchmarks
solved by this strategy ofset the delay to solving the other benchmarks. Since a schedule
typically contains less than a hundred pairs, ordering a schedule with this procedure is
fast.</p>
      <p>It is also possible to target multiple predefined timeouts with one schedule. For example,
at the SMT competition solvers compete with both a timeout of 20 min and a timeout of
24 s. A simple approach to generate a schedule that work well in this setting is to perform
multiple rounds of optimization: Let T1 and T2 be two timeouts such that T1 &lt; T2 and
let S1 be the optimal schedule for the timeout T1. We can now define a new set of
benchmarks B0 that contains exactly the benchmarks from B that are not solved by the
schedule S1 within the timeout T1. Next, we can calculate another schedule S2 that solves
the most benchmarks from B0 within the timeout T2 − T1. Both rounds of optimization
can use a diferent set of allowed time slices. To build the ordered joint schedule, we just
search for the best order for both schedules independently and concatenate the resulting
lists.</p>
      <p>A small downside of this approach is that some strategies appear in both schedules
and the solver has to perform some repeated work. This repeated work can be avoided
for one strategy. Assume that there is a repeated strategy s. Hence, there is a pair
(s, t1) ∈ S1 such that (s, t2) ∈ S2 for some t2. In this case t2 &gt; t1 since otherwise (s, t2)
would solve no benchmarks not already solved by S1. If we remove (s, t1) form the the
ifrst schedule S1, the schedule solves fewer benchmarks, but only runs for the time T1 − t1.
To get those benchmarks back, we move (s, t2) to the beginning of the schedule S2. In
the concatenated schedule s is used for the time t1 before T1 passes and is used for the
time t2 overall.</p>
    </sec>
    <sec id="sec-5">
      <title>5. The veriT-schedgen Toolbox</title>
      <p>We implemented the schedule generation procedures described above as part of a toolbox
called veriT-schedgen. Beside optimization, the toolbox also contains some other tools
that are useful when working with strategy schedules, such as a visualizer and simulator.
This section provides an introduction – a user manual – to the toolbox. It also describes
important implementation details of the toolbox. The source code for the toolbox is
available at https://gitlab.uliege.be/verit/schedgen/.</p>
      <p>
        The veriT-schedgen toolbox is implemented in the Python programming language and
uses the PuLP [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] package to express and solve the linear programming problems. The
PuLP package can use multiple linear programming solvers as backend, but the default
solver is good enough for the linear programming problems arising from our use case.
The default solver is the open source solver CBC2. Both, the PuLP and CBC project, are
developed by the COIN-OR foundation that develops open source software for operations
research3. To store the experiment results veriT-schedgen uses the Pandas library4. This
library provides a flexible data structure to work with tables.
      </p>
      <p>All tools in the veriT-schedgen toolbox support the filtering of benchmarks by SMT-LIB
logic. Since the SMT-LIB benchmark library is big (around 100 GiB), it is often not
installed on the computer that is used for schedule generation. In this case the logic
cannot determined by parsing the benchmark file. Instead, usually the first component of
the benchmark path is used. For example, a benchmark QF_UF/test.smt2 will have the
logic QF_UF assigned. This matches the typical folder structure used by the SMT-LIB
benchmark suite. In fact, it is not necessary to use the SMT-LIB logic for filtering.
Instead, any string does the job.</p>
      <p>The veriT-schedgen toolbox is a standard Python package. It has been tested with
Python 3.10. To install the package it is enough to execute
$ python setup.py install</p>
      <p>This will install the following tools that are part of the veriT-schedgen toolbox on the
user systems.
schedgen-optimize.py is the main tool that performs the schedule generation. It takes
the name of a folder that contains the experiments, a list of time slices, and the total
timeout. After the generation is finished the result is written as a schedule into a CSV
ifle. It is also possible to provide an existing scheduler to build a joint schedule. The
benchmarks used to generate the schedule can be filtered as described above.
schedgen-finalize.py takes CSV files describing schedules, as generated by
schedgen-optimize.py, and produces a shell script that runs a solver according to the
schedules. The user can provide a template for the script and change various parameters
used during generation. The default script can pick schedules by SMT-LIB logic.
schedgen-simulate.py can be used to predict the outcome of running a scheduler. To
do so the user has to provide experiments for the strategies included in the schedule, but
can use diferent benchmarks. Furthermore, it is possible to add noise to the simulation.
2Available on https://github.com/coin-or/Cbc.
3For more information see https://www.coin-or.org/
4See https://pandas.pydata.org/.
schedgen-query.py is a small tool that produces lists of benchmarks for debugging and
analysis purposes. For example, it can list the union of the benchmarks solved by some
strategies, or find the benchmarks from this list that a specific schedule can not solve.
schedgen-visualize.py can be used to visualize one or multiple schedules. It generates a
bar plot that represents a schedule visually.</p>
      <p>Each of those tools can generate an extensive documentation of the options it exposes
via the standard --help option.</p>
      <p>
        The tools accept experimental data in two formats: the GridTPT format and a simple
CSV format. The GridTPT system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a platform for testing SMT solvers and other
theorem provers on grid computers. Its file format is based on CSV, but extended with
a header that stores information about the experiment. For every strategy one file is
used that contains the results for all benchmarks. The header also contains the command
line options, i.e., the strategy used for the experiment. Furthermore, the header contains
information such as the solver, when it was executed, and what timeout was used. Since
the GridTPT format uses one file per strategy, the user has to declare a folder that contains
all data files. The veriT-schedgen input parser searches recursively for compatible files in
this folder.
      </p>
      <p>The simple CSV format collects all results within one file. The data file contains five
columns and header that stores the column names. Columns are seperated by a literal
“;”. The columns can be given in any order. The parser uses the header line to determine
the order of the columns. The five columns are:
• benchmark contains the filename of the benchmark.
• logic stores the SMT-LIB logic of the benchmark or any other string that should
be used to select subsets of benchmarks.
• strategy the strategy used.
• solved is yes if the benchmark was solved, “ no” otherwise.
• time is the solving time in seconds as a positive float number. Any value that is
not a valid floating point number is assumed to be ∞. It is possible to indicate a
time even if the benchmark was not solved. Sometimes the solver might give up.</p>
      <p>It is simple to add a custom parser. One has to implement a Python class whose
constructor takes two arguments: the path to the folder or file containing the data and
a list of logics used as filter. This list of logics might be Null if the data should not
be filtered. If parsing succeeds, the resulting object should have a member .frame that
contains the parsed data. This member is a Pandas data frame where the rows are the
benchmarks and the columns are the strategies. Each cell contains the solving time in
seconds stored as a standard float. If the benchmark was not solved by the strategy the
cell stores the float ‘ ∞’.</p>
      <p>Generated schedules are also stored in CSV files. The CSV dialect is as for the input
data, but this time only two columns are used: strategy and time. The file is intended
to be read from top to bottom and each line indicates a step in the schedule that executes
the strategy strategy for time seconds. Until schedules are transformed into a script by
using schedgen-finalize.py the are called pre-schedules.
5.1. A veriT-schedgen Tutorial
This section shows how the veriT-schedgen toolbox can be used in practice. We will build
and explore an optimal schedule for some artificial example benchmark data. This data
can be found in the file contrib/example_data.csv and uses six made up strategies.
The first strategy base-strategy solves 20 benchmarks within one second. The five
other strategies extra01 to extra05 solve up to five benchmarks exclusively such that
extra01 solves five benchmarks not solved by any other strategy, extra02 solves four,
and so on. There is one benchmark unsolved.smt2 that is not solved by any strategy
and a strategy bad-strategy that solves only one benchmark in 1.5 seconds. Hence, if
the total timeout is six seconds we expect the optimal schedule to run the strategies in
the order just presented for one second each. The solving times have been sampled from
a normal distribution to ensure the resulting graphs look interesting.</p>
      <p>Since the example data is in the CSV format described above the first four lines of
base-strategy are:
benchmark;logic;strategy;solved;time
base01.smt2;UF;base-strategy;yes;0.5189
base02.smt2;UF;base-strategy;yes;0.2164
base03.smt2;UF;base-strategy;yes;0.1754</p>
      <p>Obviously, the first step is to generate a schedule. To do so we invoke the schedgen-optimize.py
command:
$ schedgen-optimize.py -l UF -s 0.9 1.0 1.1 -t 6 \
-c -d contrib/example_data.csv \
contrib/example_schedule.csv</p>
      <p>The option -l UF selects the UF logic, the option -s 0.9 1.0 1.1 defines the time
slices, and -t 6 sets the total timeout. Finally, -c tells the tool to use the CSV parser,
and the option -d gives the data location.</p>
      <p>The tool then searches the optimal schedule and writes the result to contrib/example_schedule.csv.
As enforced by the syntetic data, the optimal schedule uses the strategies in the order
outlined above. The first four lines of the schedule are:
time;strategy
1.100;base-strategy
1.000;extra01
0.900;extra02</p>
      <p>Furthermore, the output by the tool predicts how well the schedule will perform:
[’UF’]: This schedule solves 35/37 in 52.19s.
base
2
e1
3
e2
4
e3
5
e4
6
e5
6</p>
      <p>Two benchmarks are not solved by this strategy: the benchmark only solved by the
“bad” strategy and the benchmark not solved by any strategy. Furthermore, we learn that
solving all benchmarks solved by this schedule will take 52.19 seconds. The output also
tells us the solving time before order optimization, but in this case nothing changes.</p>
      <p>The simulator can give us a better prediction of the schedule performance, but let us
ifrst visualize the schedule:
$ schedgen-visualize.py -t 6 -p out.pgf \</p>
      <p>contrib/example_schedule.csv</p>
      <p>Figure 1 shows the result of this command. The -t option works as before and defines
the overall timeout. The options -p out.pgf tells the visualizer to generate a PGF/TikZ
picture. If this option is omitted, the tool instead opens a window that shows the schedules.
Finally, the tool expects a list of pre-schedules to visualize.</p>
      <p>Since the diferent strategies are all used for roughly the same timeout, every strategy is
shown as a block of similar size. Because some strategies are used for less than one second,
there is a small gap of 0.1 seconds at the end. By default the visualizer simply assigns
a number to each strategy when it is used the first time. Strategies often correspond
to command line options and those can be long. A number ensures that there are no
printing problems, but makes it possible to compare diferent schedules based on the same
strategies.</p>
      <p>It is possible to customize the names used for some strategies and to highlight them.
This is done with the help of a shorthand file: a CSV file with three columns. As in
all other CSV files, the fields are separated by a semicolon and there must be a header
line. The strategy column lists the full strategy string, the shorthand column gives the
name that should be printed, and the highlight can be set to true to indicate that the
strategy should be highlighted (false otherwise). Strategies not listed in the shorthand
ifles get assigned a number and stay unhighlighted. In Figure 1 the base strategy and the
second additional strategy are highlighted and every strategy has a custom shorthand
name. The command to produce this figure is:
$ schedgen-visualize.py -t 6 -c -p out.pgf \
-a contrib/example_shorthand.csv \
contrib/example_schedule.csv
Let us now simulate the generated schedule. To do so we can use the simulation tool:</p>
      <p>3
CPU Time
4
intense jitter
slight jitter
base strategy
5
6
$ schedgen-simulate.py -l UF -t 6 \
-c -d contrib/example_data.csv \
--mu 0.05 --sigma 0.01 --seed 1 \
contrib/example_schedule.csv simulation_1.txt</p>
      <p>
        The options -l, -t, -c, and -d are as for the other tools. The tool expects two positional
arguments. First, the schedule to simulate and, second, the file to write the results to.
The output is always using the GridTPT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] format, independent of the input format.
The remaining options are --mu, --sigma, and --seed. Those control the random jitter
that is applied to each solving time in the input dataset. The jitter is sampled from a
normal distribution. The options --mu and --sigma control the mean and the standard
deviation, respectively. To ensure simulations can be repeated it is possible to fix an
integer seed for the random number generator with the option --seed. In this case we
choose a slight jitter.
      </p>
      <p>The cumulative density functions generated by the simulation are shown in Figure 2.
As expected, the graph for the simulation with slight jitter follows almost exactly the
base strategy for the first second. The plot also shows the result of using more jitter
(μ = 0.2, σ = 0.05). In this case two benchmarks are no longer solved.5</p>
      <p>If we want to have a list of the benchmarks not solved by the schedule, we can use the
schedgen-query tool:
$ schedgen-query.py -c -d contrib/example_data.csv \</p>
      <p>-q unsolved contrib/example_schedule.csv
special01.smt2
unsolved.smt2</p>
      <p>5The script used to generate these graphs is the script contrib/cdf.py in the veriT-schedgen
repository.</p>
      <sec id="sec-5-1">
        <title>AUFLIA/20170829-Rodin AUFLIRA/nasa AUFLIRA/why UF/20170428-Barrett</title>
        <p>UF/sledgehammer
UFLIA/boogie
UFLIA/simplify2
UFLIA/sledgehammer
UFLIA/tokeneer</p>
        <p>Benchmarks</p>
        <p>The option -q unsolved asks the tool to print the list of all benchmarks not solved. An
alternative is -q compare that shows benchmarks not solved by any strategy. To see
the benchmarks that are solved by any strategy together with the minimal solving time
the -q best option can be used. Finally, -q schedule lists all benchmarks solved by the
given schedule together with the predicted solving time (without jitter).</p>
        <p>Figure 2 is very artificial – it is based on synthetic example data. In the next section
we see some real examples.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <p>This section shows some empirical datapoints related to the real world performance of the
schedules generated by veriT-schedgen. The evaluation of veriT-schedgen is simplified by
the fact that the generated schedules do not depend on randomized heuristics.</p>
      <p>Nevertheless, it is useful to know how many problems are solved by such schedules.
Especially in comparison to simpler approaches, such as using only the best overall
strategy, or constructing a schedule with a simple greedy approach. We generate the
greedy schedule we start with an empty schedule and then iteratively add the timeout,
strategy pair that solves the most benchmarks not yet solved. We stop when the timeout
is reached.</p>
      <p>In machine learning, it is common to use k-fold cross validation for such evaluations.
This means that the data set is split into k subsets of equal size. Then the model (here:
the schedule) is trained on the union of k − 1 sets and evaluated of on the remaining
set. This is repeated k times and the results are combined (for example by calculating
the mean). In our evaluation we use veriT-schedgen to generate one schedule for each
training set.</p>
      <p>A major problem of such an evaluation is that benchmark libraries, such as the
SMT-LIB library, are biased towards the types of problem submitted to the library.
Furthermore, the SMT-LIB benchmark collection organizes the benchmarks in categories
according to application and submitter. The benchmarks of each category likely share
many characteristics. Furthermore, some sets contain thousands of benchmarks, while
other contain only tens. To address this issue to some degree the evaluation only uses
categories with more than 1000 benchmarks, and uses three diferent scenarios. The first
is a random sample of 1000 benchmarks from each category. The second is to hold out
one category as the test set. This indicates how well a schedule performs on a category
that was not seen during the schedule generation. Finally, to see how well such schedule
generation can be used to adapt an SMT solver to one application, the last scenario uses
only one category of benchmarks.</p>
      <p>
        Our experiments here use the experimental data gathered for the evaluation of a
recently published technique for SMT solvers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].6 The schedules and graphs used for this
publication were generated by veriT-schedgen.
      </p>
      <p>
        Instead of running the solver with the schedules, we can estimate the outcome of
each schedule with the simulator. All generated schedules use a timeout of 180 s and
the time slices 1, 2, 3, 4, 5, 8, 16, 32, 64. Since the benchmark solving rate decreases
rapidly with longer timeouts, we can reduce the granularity of the time slices for bigger
values. In the benchmarks used for the evaluation nine categories contain more than 1000
benchmarks. Table 1 lists them. The evaluation uses two metrics: the total number of
solved benchmarks and the PAR-2 score. The PAR-2 score is the sum of all solving times
plus twice the timeout for each benchmark not solved. Hence, a lower PAR-2 score is
better. The PAR-2 score is used to score competitions such as the SAT competition [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Overall, the the first scenario uses 9000 benchmarks and 5-fold cross validation over all
benchmarks. Therefore, the schedules are generated on 7200 benchmarks and evaluated on
1800 benchmarks. Table 2 shows the result of the evaluation. The first part shows the total
number of benchmarks solved by the virtual best solver, the generated optimal schedule,
the schedule generated by the greedy approach, and the single strategy that solves the
most benchmarks. Here, the virtual best solver is constructed from all available strategies.
The second part shows the corresponding PAR-2 scores. We see that the generated
optimal schedule is much better in all split, but split 5. In this case both the greedy and
the generated optimal schedule solve the same number of problems. Nevertheless, even in</p>
      <sec id="sec-6-1">
        <title>6The data is available on Zenodo [9].</title>
        <p>virtual best
generated
greedy
best strategy</p>
        <sec id="sec-6-1-1">
          <title>PAR-2 score</title>
          <p>virtual best
generated
greedy
best strategy
248
235
237
222
268
262
255
245
265
261
259
244
271
264
261
252
271
265
262
245
265 (10)
257 (13)
255 (10)
242 (11)
this case the generated schedule has a better PAR-2 score. Overall, the PAR-2 scores
also show that the optimal schedule is better than the greedy schedule and much better
then not using a schedule at all. Furthermore, the variance is also reduced by using an
optimal schedule.</p>
          <p>The second scenario uses the same 9000 benchmarks, but for each fold a diferent
category is served as a test set. Since there are nine categories, this scenario is a 9-fold
cross validation with training sets containing 8000 benchmarks, and test sets containing
1000. Note that the share of benchmarks that are unsolvable for veriT (e.g., satisfiable
benchmarks) can vary widely from category to category. As a consequence the mean of the
results is not very meaningful for this experiment. The results are in Table 4. To save space,
the names of the categories are simplified. The category “Lsledgehammer” corresponds to
the category UFLIA/sledgehammer. This scenario is, in a certain sense, a stress test for
the schedules. Each category might have some characteristics not shared with any other
category. The generated schedule seems to struggle with the AUFLIA/20170829-Rodin
category. It only improves slightly over the greedy schedule, while the virtual best solver
is significantly better. For three categories both schedules and the virtual best solver solve
the same number of benchmarks. In one case (UFLIA/tokeneer) the optimal schedule
has the worst PAR-2 score. Furthermore, in two cases (AUFLIA/20170829-Rodin and
UFLIA/simplify2) the heuristic optimization of the order slightly increased the PAR-2
score. Overall, optimal schedules are clearly doing well in this scenario too.</p>
          <p>
            The third, and final, scenario uses only the category UF/sledgehammer. This
category contains problems generated by Sledgehammer [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Since veriT is also used by
Sledgehammer, this is a sensible choice. The UF/sledgehammer category contains 4140
benchmarks. Again 5-fold cross validation is used: 3312 benchmarks for optimization and
828 for testing. Table 3 shows the results that mirror the other two scenarios. For two
splits, all schedules solved the same number of benchmarks, and for the optimal schedule
improves only slightly in the first split. Both efects are probably the result of the relative
small number of benchmarks in the test set. In any case, the optimal schedule improves
on the PAR-2 score over the greedy schedule.
25 50 75 100 125 150 180
          </p>
          <p>Measured Solving Time</p>
          <p>We also collected the predicted sum of solving times for the schedule before and after
order optimization. The order before optimization is an implementation detail of the
toolbox and not deterministic. The average ratio of the time before optimization and
after is 0.71.</p>
          <p>
            To finish this section, let us have a look at the real world behaviour of schedules in
comparison with the simulation. This allows us to judge how reliable the simulated results
presented so far are. The data gathered for Figure 3 in [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] serves as the base for this.
Figure 3 shows a scatter plot of the solving times for the 180 s schedule with quantifier
simplification by unification. The closer a point is to the diagonal, the more accurate
the simulation was. This shows that the simulation is fairly accurate. In a few cases,
benchmarks predicted to be solved are not solved in reality and vice versa. Those are
benchmarks that are solved just before the time slice of a strategy ends. In this case a
delay can make the benchmark unsolvable. The opposite is also possible: if a benchmark
is solved during data gathering after the timeslice ends, the jitter can reduce the solving
time below the timeout. Overall, this graph indicates that the simulation is a good tool
to asses the eficacy of a schedule.
          </p>
          <p>
            The evaluations has been performed on a laptop with an Intel i7-7600U CPU and 16 GB
of memory. Generating the 19 schedules took 39 min 16 s overall. Hence, generation time
is short. To generate the schedules for veriT at the SMT competition, GNU parallel [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]
was used to generate the schedules for the diferent logics in parallel.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>The usage of strategies with automatic theorem provers has attracted considerable
attention. Many sophisticated systems exist [e.g., 13, 14, 15]. These systems often
support the search for new strategies and can select strategies based on features of the
input problem. They also often incorporate machine learning. Our approach is very
simple. It is oblivious to the properties of the problems involved and generates static
schedules.</p>
      <p>
        The veriT-schedgen toolbox itself is now mature. It has been used to generate schedules
for veriTs participation in the 2021 and 2020 SMT competition. Furthermore, it helped
in the evaluation of a new SMT solving technique [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Since it is independent of veriT,
we hope that it can be used by developers of other theorem provers. We plan to extend
the toolbox with two additional components. The first is a flexible schedule executor
that can also execute strategies in parallel. The second is a strategy generation tool
that can generate new strategies with good performance from a specification of the
available parameters. A good starting point for this can be an algorithm based on random
permutations such as the one used by MaLeS [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The usage of strategy scheduling in veriT was started by Haniel Barbosa. We thank
Pascal Fontaine and the anonymous reviewers for many comments that improved the
text. We thank Jasmin Blanchette for insightful discussions on the role of schedules.
The author has received funding from the European Research Council (ERC) under the
European Union’s Horizon 2020 research and innovation program (grant agreement No.
713999, Matryoshka).
r 2 2 2 2
e 2 2 2 2
e
n 9 9 9 9
e
k
o
t
re 42 06 02 74
m 4 4 4 3
m
a
h
e
g
d
e
l
s
L
0 5 1 1
8 8 8 8
0 0 0 0
6 6 6 7
5 5 6 6</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bouton</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. C. B. de Oliveira</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Déharbe</surname>
          </string-name>
          , P. Fontaine, veriT:
          <article-title>An open, trustable and eficient SMT-solver</article-title>
          , in: R. A.
          <string-name>
            <surname>Schmidt</surname>
          </string-name>
          (Ed.),
          <source>CADE 22</source>
          , volume
          <volume>5663</volume>
          <source>of LNCS</source>
          , Springer Berlin Heidelberg,
          <year>2009</year>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>156</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>642</fpage>
          -02959-2_
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fontaine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tinelli</surname>
          </string-name>
          ,
          <article-title>The satisfiability modulo theories library (SMTLIB)</article-title>
          , ,
          <year>2016</year>
          . Accessed:
          <fpage>2021</fpage>
          -09-28.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cornuéjols</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Trick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Saltzman</surname>
          </string-name>
          ,
          <article-title>A tutorial on integer programming</article-title>
          , ,
          <year>1995</year>
          . Accessed:
          <fpage>2022</fpage>
          -04-07.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Letz</surname>
          </string-name>
          ,
          <article-title>Strategy parallelism in automated theorem proving</article-title>
          ,
          <source>in: Proceedings of the Eleventh International Florida Artificial Intelligence Research</source>
          Society Conference, AAAI Press,
          <year>1998</year>
          , pp.
          <fpage>142</fpage>
          --
          <lpage>146</lpage>
          . doi:
          <volume>10</volume>
          .5555/646811.706867.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoenicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Antti</surname>
          </string-name>
          ,
          <article-title>16th international satisfiability modulo theories competition (smt-comp</article-title>
          <year>2021</year>
          )
          <article-title>: Rules and procedures</article-title>
          , https://smt-comp.github.io/ 2021/rules.pdf,
          <year>2021</year>
          . Accessed:
          <fpage>2021</fpage>
          -08-08.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. O'Sullivan</surname>
            ,
            <given-names>I. Dunning</given-names>
          </string-name>
          ,
          <article-title>PuLP : A linear programming toolkit for</article-title>
          <source>Python</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bouton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Caminha B. De Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Déharbe</surname>
          </string-name>
          , P. Fontaine,
          <article-title>GridTPT: a distributed platform for theorem prover testing</article-title>
          ,
          <source>in: 2nd Workshop on Practical Aspects of Automated Reasoning (PAAR)</source>
          , Edinburgh, United Kingdom,
          <year>2010</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fontaine</surname>
          </string-name>
          , H.
          <article-title>-</article-title>
          <string-name>
            <surname>J. Schurr</surname>
          </string-name>
          ,
          <article-title>Quantifier simplification by unification in smt</article-title>
          , in: B.
          <string-name>
            <surname>Konev</surname>
          </string-name>
          , G. Reger (Eds.),
          <source>Frontiers of Combining Systems</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>232</fpage>
          -
          <lpage>249</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -86205-3_
          <fpage>13</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Quantifier</given-names>
            <surname>Simplification by Unification in</surname>
          </string-name>
          <string-name>
            <surname>SMT</surname>
          </string-name>
          , Zenodo,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .5281/ zenodo.5088868.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heule</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Järvisalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suda</surname>
          </string-name>
          , SAT race
          <year>2019</year>
          , ,
          <year>2019</year>
          . Accessed:
          <fpage>2022</fpage>
          -02-28.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Blanchette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Böhme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Paulson</surname>
          </string-name>
          ,
          <article-title>Extending Sledgehammer with SMT solvers</article-title>
          ,
          <source>Journal of Automated Reasoning</source>
          <volume>51</volume>
          (
          <year>2013</year>
          )
          <fpage>109</fpage>
          -
          <lpage>128</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10817- 013-9278-5.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Tange</surname>
          </string-name>
          ,
          <source>GNU parallel 20210722 ('blue unity')</source>
          ,
          <year>2021</year>
          . URL: https://doi.org/10.5281/ zenodo.5123056. doi:
          <volume>10</volume>
          .5281/zenodo.5123056,
          <string-name>
            <surname>GNU</surname>
          </string-name>
          <article-title>Parallel is a general parallelizer to run multiple serial command line programs in parallel without changing them</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kühlwein</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Urban,</surname>
          </string-name>
          <article-title>MaLeS: A framework for automatic tuning of automated theorem provers</article-title>
          ,
          <source>Journal of Automated Reasoning</source>
          <volume>55</volume>
          (
          <year>2015</year>
          )
          <fpage>91</fpage>
          -
          <lpage>116</lpage>
          . doi:
          <volume>10</volume>
          .1007/ s10817-015-9329-1.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Holden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korovin</surname>
          </string-name>
          ,
          <article-title>Heterogeneous heuristic optimisation and scheduling for first-order theorem proving</article-title>
          , in: F. Kamareddine, C. Sacerdoti Coen (Eds.),
          <source>Intelligent Computer Mathematics</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>123</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -81097-
          <issue>9</issue>
          _
          <fpage>8</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Niemetz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Preiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nejati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <surname>Machsmt:</surname>
          </string-name>
          <article-title>A machine learningbased algorithm selector for smt solvers</article-title>
          , in: J.
          <string-name>
            <given-names>F.</given-names>
            <surname>Groote</surname>
          </string-name>
          , K. G. Larsen (Eds.),
          <source>Tools and Algorithms for the Construction and Analysis of Systems</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>325</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -72013-1_
          <fpage>16</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>