<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Practical Aspects of Automated Reasoning</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Cautious Specialization of Strategy Schedules (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Filip Bártek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karel Chvalovský</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Suda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Czech Technical University in Prague - Czech Institute of Informatics</institution>
          ,
          <addr-line>Robotics and Cybernetics. Jugoslávských partyzánů 1580/3, 160 00 Prague 6 - Dejvice</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Czech Technical University in Prague - Faculty of Electrical Engineering.</institution>
          <addr-line>Technická 2, 166 27 Prague 6 - Dejvice</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <issue>2024</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Combining theorem proving strategies into schedules is a well-known recipe for improving prover performance, as a well-chosen set of complementary strategies in practice covers many more problems within a predetermined time budget than a single strategy could. A strategy schedule can be a monolithic sequence, but may also consist of multiple schedule branches, intended to be selected based on certain problem features and thus to specialize in solving the corresponding problem classes. When allowing schedule branching, there is an intuitive trade-off between covering the known (training) problems quickly, by splitting them into many classes, and the generalization of the obtained branching schedule, i.e., its ability to solve previously unseen (test) problems. In this paper, we attempt to shed more light on this trade-off. Starting off with a large set of proving strategies of the automatic theorem prover Vampire evaluated on the first-order part of the TPTP library, we report on two preliminary experiments with building branching schedules while keeping track of how well they generalize to test problems. We hope to attract more attention to the exciting topic of schedule construction by this work-in-progress report.</p>
      </abstract>
      <kwd-group>
        <kwd>strategy schedules</kwd>
        <kwd>schedule specialization</kwd>
        <kwd>regularization</kwd>
        <kwd>Vampire</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Proof search in most automatic theorem provers (ATPs) relies on parameterized heuristics. Strong
configurations (strategies) of an ATP are typically specialized to certain classes of input problems – no
configuration performs well on all possible problems of interest. Selecting a strong strategy for a given
input problem is, in general, hard. This is namely due to the chaotic behavior of the provers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: using
a fixed strategy, a small change in the input problem may lead to a substantially different proof search.
      </p>
      <p>
        Typically, if a strategy solves a problem, the solution is reached in a relatively short time [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In
contexts where sufficient time or parallelism is available, running a portfolio of diverse strategies is a
pragmatic approach to increase the power of a prover. Such a portfolio, or schedule in case a time limit is
assigned to each of the strategies, may be either hand-crafted [3] or constructed automatically [4, 5].
      </p>
      <p>Further gain in performance can be achieved by specializing the schedule: We partition the possible
input problems into classes and construct a specialized schedule for each class. We prefer the classes to
be homogeneous in terms of the problems’ response to strategies, so that a schedule specialized for a
class needs less time to cover the same percentage of problems. Increasing the granularity of the classes
increases the specialization and the associated potential performance gain.</p>
      <p>Decision trees [6] provide a suitable architecture to capture such gradual specialization. A set of
features, such as the number of atoms, the presence of certain syntactic traits, or the membership in
standard logical fragments, are computed for the input problem. A binary tree is then traversed from
the root to a leaf. Each internal node is labeled with a criterion that partitions the domain of one of
the features into two parts. When a node is traversed, the next node is selected depending on the
corresponding feature of the problem. The schedule inhabiting the final leaf is then used to attempt to
solve the problem.</p>
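The traversal described above can be sketched in a few lines of Python. All names here (the node representation, the feature names, the leaf schedules) are illustrative, not Vampire's internal API:

```python
# Minimal sketch of feature-based decision-tree dispatch (hypothetical names).

def make_leaf(schedule):
    return {"schedule": schedule}

def make_node(feature, threshold, low, high):
    # `low` is taken when features[feature] < threshold, otherwise `high`.
    return {"feature": feature, "threshold": threshold, "low": low, "high": high}

def dispatch(tree, features):
    """Walk from the root to a leaf and return that leaf's schedule."""
    node = tree
    while "schedule" not in node:  # internal nodes carry a branching criterion
        branch = "low" if features[node["feature"]] < node["threshold"] else "high"
        node = node[branch]
    return node["schedule"]

# Example tree: branch on the number of atoms, then on a syntactic bitfield.
tree = make_node("atoms", 750,
                 make_leaf("schedule_small"),
                 make_node("has_equality", 1,
                           make_leaf("schedule_large_no_eq"),
                           make_leaf("schedule_large_eq")))
```

A problem with 120 atoms would thus be routed straight to the "small" leaf, while larger problems are further discriminated by the second criterion.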
      <p>The ATP Vampire [7] won the prestigious First-Order Form theorems (FOF) division of the CADE
ATP System Competition (CASC) [8] every year from 2002 to 2023 [9]. Arguably, this continued success
is partially thanks to the famous “CASC mode” of the prover and the corresponding use of strong
strategy schedules. The schedules were constructed by Andrei Voronkov with the help of a system
called Spider [10] and arranged in a shallow decision tree for specialization.1</p>
      <p>When the specialization problem classes are chosen in a way that optimizes the performance on a
training set of problems, we run the risk of overfitting, that is, increasing performance on the training
problems at the cost of decreasing performance on unseen problems. Regularization is the process of
reducing the expressive power of a machine-learned model to prevent overfitting.</p>
      <p>In our previous work [4], we studied the regularization of monolithic schedule construction, i.e., the
case where the algorithm’s output is a single sequence with no specialization to problem features. We
discovered and evaluated more than a thousand Vampire strategies targeting the first-order fragment of
the TPTP library [11], which are now available as an independent data set for experimentation [12].</p>
      <p>In this work, we use the same data set to start shedding some light on regularization during schedule
specialization. Namely, after brief preliminaries (Sect. 2), we first discuss (and measure) how a single
branching criterion (i.e., a single problem feature) can affect the generalization of the simplest branching
schedule with just one level of branching (Sect. 3). We complement this view by an experiment with
gradient boosted trees (Sect. 4), which explores the opposite extreme case where every problem can —
based on its features — be, in principle, assigned its unique schedule. Our paper also briefly overviews
related work (Sect. 5) and offers several concluding remarks (Sect. 6).</p>
      <p>We are far from any definite answer on how specialization in schedule construction should be best
done. Yet we hope to spark more interest in this exciting topic by this work-in-progress report.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <sec id="sec-2-1">
        <title>2.1. Training and evaluation data</title>
        <p>In our previous work [4], we generated a collection of 1096 strong and mutually complementary
strategies for the ATP Vampire. We evaluated the performance of each of these strategies on a set of
7866 first-order logic (FOL) problems from TPTP [ 11] v8.2.0. The results of these evaluations constitute
a dataset suitable for experimenting with schedule construction and specialization [12].</p>
        <p>We denote the set of strategies as 𝒮 (|𝒮| = 1096), the set of problems as 𝒫 (|𝒫| = 7866), and the
evaluation matrix as E : 𝒮 × 𝒫 → ℕ ∪ {∞}. The value E(s, p) is the time at which strategy s solves
problem p, or ∞ in case strategy s failed to solve problem p within the time limit.</p>
        <p>The time is measured in whole CPU megainstructions (Mi), where 1 Mi = 2²⁰ instructions reported
by the tool perf, and 2000 Mi take approximately 1 second of wallclock time on our hardware.</p>
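The unit conversion is simple arithmetic; the constants below come from the text, while the helper names are ours:

```python
MI = 2**20            # one megainstruction, as counted by perf
MI_PER_SECOND = 2000  # approximate wall-clock calibration on the authors' hardware

def instructions_to_mi(instructions):
    """Whole megainstructions, as used by the evaluation matrix."""
    return instructions // MI

def mi_to_seconds(mi):
    """Convert a limit in Mi to approximate wall-clock seconds."""
    return mi / MI_PER_SECOND
```

For instance, the 2 million Mi needed to cover all training problems in Sect. 3 corresponds to roughly 1000 seconds of wall-clock time.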
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Vampire’s CASC-mode problem features</title>
        <p>For our schedule specialization experiments, we use problem features already available in Vampire
and used there by the CASC mode. This feature set consists mainly of various syntactic counters (the
number of: goal clauses, axiom clauses, Horn clauses, equational clauses, etc.), most prominent of which
is the number of atoms (roughly corresponding to problem size); problem class according to TPTP
classification (NEQ, HEQ, . . . , EPR, UEQ); and a number of bitfields checking for properties such as “has
 =  ”, “has functional definitions”, or “has inequality resolvable with deletion”.</p>
        <p>These features are not included in the mentioned dataset [12], but can be obtained easily by running
a modified version of Vampire 2 in “profile mode” via --mode profile &lt;problem_name&gt;.
1E.g., https://github.com/vprover/vampire/blob/c7564c1d65020771079f29787ca2d5d7743f5d6a/CASC/Schedules.cpp#L5676.
2https://github.com/vprover/vampire/tree/ourProfile</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Strategy schedules</title>
        <p>A strategy schedule is a mapping b : 𝒮 → ℕ. The value b(s) is the time limit assigned to strategy s. A
schedule may be used to attack an arbitrary problem with Vampire: the strategies are run sequentially
with the assigned time limits, each attempting to solve the problem, until a solution is found or every
strategy has depleted its allocated time. In the rest of this work, instead of running the prover, we use
the evaluation matrix to simulate the evaluation of schedules, restricting our scope to problems in 𝒫.</p>
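Such a simulation can be sketched directly from the evaluation matrix (here a dictionary mapping strategy–problem pairs to solving times, with infinity for failures; the representation is our assumption):

```python
import math

def simulate(schedule, evaluation, problem):
    """Return the total simulated time to solve `problem`, or math.inf.

    `schedule` is an ordered list of (strategy, time_limit) pairs;
    `evaluation[(s, p)]` is the solving time in Mi, or math.inf on failure.
    """
    elapsed = 0
    for strategy, limit in schedule:
        t = evaluation.get((strategy, problem), math.inf)
        if t <= limit:
            return elapsed + t  # solved within this strategy's time slice
        elapsed += limit        # slice exhausted, move on to the next strategy
    return math.inf             # the whole schedule failed on this problem

evaluation = {("s1", "p1"): 30, ("s2", "p1"): 5}
total = simulate([("s1", 10), ("s2", 10)], evaluation, "p1")  # 10 + 5 = 15
```

Note that a strategy's time counts toward the total even when it fails, which is exactly why long unproductive slices make a schedule expensive.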
        <p>Similarly to strategies, schedules are typically evaluated in a time-constrained setting. Schedule b
satisfies (time) budget B ∈ ℕ if and only if ∑_{s∈𝒮} b(s) ≤ B.</p>
        <p>In our previous work [4], we proposed a greedy algorithm that constructs a strategy schedule from
an evaluation matrix and a budget. The algorithm allows finding a relatively strong (with respect to the
evaluation matrix) schedule satisfying the budget quickly.</p>
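The exact algorithm of [4] is not reproduced here, but one plausible covering scheme in its spirit picks, at each step, the (strategy, time limit) pair with the best ratio of newly covered problems to time spent (all names and the matrix representation are illustrative; solve times are assumed to be positive integers in Mi):

```python
import math

def greedy_schedule(strategies, problems, evaluation, budget):
    """Greedy covering sketch: repeatedly add the most time-efficient slice."""
    uncovered = set(problems)
    schedule = []        # ordered (strategy, limit) pairs
    remaining = budget
    while uncovered:
        best = None      # (ratio, strategy, limit, newly_covered)
        for s in strategies:
            times = sorted(evaluation.get((s, p), math.inf) for p in uncovered)
            covered = 0
            for t in times:
                if not math.isfinite(t) or t > remaining:
                    break
                covered += 1
                ratio = covered / t  # problems covered per Mi spent
                if best is None or ratio > best[0]:
                    newly = {p for p in uncovered
                             if evaluation.get((s, p), math.inf) <= t}
                    best = (ratio, s, t, newly)
        if best is None:
            break        # nothing more can be covered within the budget
        _, s, limit, newly = best
        schedule.append((s, limit))
        uncovered -= newly
        remaining -= limit
    return schedule
```

Because every prefix of the produced schedule is itself a valid (smaller-budget) schedule, a construction of this kind is naturally anytime, which Sect. 3 exploits.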
        <p>A strategy schedule recommender produces a schedule for an arbitrary input problem. To evaluate a
schedule recommender on a set of problems, we estimate, using the evaluation matrix, the proportion
of problems solved by the respective schedules produced by the recommender.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Overfitting and regularization</title>
        <p>Extreme schedule specialization is discouraged in CASC [13]:</p>
        <p>All techniques used must be general purpose, and expected to extend usefully to new unseen
problems. [...] If machine learning procedures are used to tune a system, the learning must
ensure that sufficient generalization is obtained so that there is no specialization to
individual problems.</p>
        <p>In machine learning (ML), it is common for a trained system to perform better on the data it was
trained on than on previously unseen data. The system is said to overfit to the training data. In most
applications, we are ultimately interested in optimizing the performance on unseen data, so overfitting
is undesirable. A system that does not overfit is said to generalize (to unseen data).</p>
        <p>To evaluate generalization of our ML training procedures, we repeatedly randomly split the set of
problems  into a training set (80 %) and a test set (20 %), train the system on the training set, and
evaluate the trained system on the test set. The final score is the mean of the per-split test scores (in
our case, schedule recommender success rates). This way, we estimate the performance on unseen data.</p>
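The repeated-split protocol is standard; a minimal sketch follows, where `build` and `success_rate` stand in for the recommender's training and evaluation routines (both hypothetical callables):

```python
import random

def mean_test_score(problems, build, success_rate, n_splits=10, seed=0):
    """Average test success rate over random 80:20 train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = problems[:]
        rng.shuffle(shuffled)
        cut = int(0.8 * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        recommender = build(train)                  # train on 80 %
        scores.append(success_rate(recommender, test))  # evaluate on 20 %
    return sum(scores) / len(scores)
```

Averaging over several random splits reduces the variance of the estimate, at the cost of training the system once per split.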
        <p>Moreover, before training we exclude strategies that have their “witness problem” in the test set.
Roughly speaking, this prunes the set of strategies only to those that could have been discovered if a
given test set was fixed beforehand for the strategy discovery phase. See [4, Section 6] for more details.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Single branching point</title>
      <p>Creating a schedule can be seen as an act of “covering” the training problems with the help of the
known solutions of the previously evaluated strategies. It is clear that if many strategies need to be
involved with long running times in order to eventually cover every training problem, the running
time of the whole schedule will need to be high.3 Schedule specialization comes to the rescue to reduce
this running time, as it allows the covering process to focus, within each branch, on a different subset
of the training problems. In a nutshell, on each branch, there is less covering work to do, and during
evaluation, only the “right” branch is ever used.</p>
      <p>However, from the perspective of generalization, i.e., the effort of ensuring that our specializing
schedule performs well also on problems unseen in training, splitting the work into branches may be
detrimental. Indeed, if many training problems end up in their respective subset for a mostly random
reason, the chances of a similar testing problem (that could be solved by the same strategy) being
classified into the same subset and thus subjected to the same schedule branch are low.</p>
      <p>3Here we conceptually depart from the position where there is a fixed budget and we are trying to solve as many problems as
possible within that budget. This is because we aim, in this section, to present an analysis which answers a question for all
budgets at once. We can do it thanks to the anytime nature of the mentioned greedy algorithm we employ here [4].</p>
      <p>[Fig. 1 panel labels: “Random split into three groups” and “Split by # atoms: &lt; 750 / &lt; 2900”.]</p>
      <p>In this experiment, we set out to demonstrate this phenomenon at its extreme. Our method of
covering is the greedy algorithm presented in previous work [4] and we run it repetitively over random
80 : 20 train/test splits of our problem set  . We compare the average performance of a monolithic
(non-branching) schedule, a schedule branching into three groups based on the number of problem’s
atoms, and a schedule branching into three random equally-sized groups. For the branching on atoms,
we split the problems into small (number of atoms &lt; 750), large (number of atoms &gt; 2900) and medium
(the rest). The cutoff points were chosen not to split the problems into equally-sized subsets, but by
aiming for a split for which each subset can be fully covered within approximately the same budget.</p>
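The atom-based routing just described is a one-feature decision rule; as a sketch (branch names are ours, thresholds from the text):

```python
def atom_branch(num_atoms):
    """Route a problem to one of the three specialized schedules by its size."""
    if num_atoms < 750:
        return "small"
    if num_atoms <= 2900:
        return "medium"   # the rest: 750 <= num_atoms <= 2900
    return "large"        # num_atoms > 2900
```

Each branch then gets its own schedule, built (by the greedy algorithm) only from the training problems routed to it.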
      <p>The results are presented in cactus plot form in Fig. 1. The results of the monolithic schedule building
(1-Train/1-Test) are the same in both the left and the right plot (and roughly correspond to Fig. 7 in
Appendix E of our previous work [4]). The monolithic schedule requires approximately 2 million Mi
to cover all the training problems and the test performance converges to around 93 % of the training
one. The train performance of both the random 3-split (3-Train) and the atom split (Atom-Train) are
at all times higher than that of the monolithic schedule until they (necessarily) converge to the same
value and stabilize. (The atom split tends to finish its covering job a bit earlier than the random split.)
However, the pictures for the test performance of the branching schedules differ. While splitting on
atoms improves over monolithic even in test performance (up until around the 1 million Mi mark), and
for the lower budgets even improves over the monolithic’s train performance, the test performance of
the random 3-split (3-Test) is at all times worse.</p>
      <p>It is perhaps not surprising that random features are useless from the perspective of branching
schedule generalization; however, we find the “backwards version” of this perspective quite
revealing, namely the realization that even reasonable-sounding problem features may differ in how
useful (useless, or even detrimental) they could be in this regard. Schedule construction is sometimes
organized by first fixing the problem groups based on features and only then looking for strategies in
each group separately (one can save computational resources by not having to evaluate each strategy
on every training problem; evaluating just within the relevant group would be enough). However,
if such an initial split is based only on educated guessing of what problems reasonably belong together,
opportunities for better (or even the best possible) generalization can be missed.</p>
      <p>Ideally, we would like to come up with generalization-conducive splits of the problems beforehand,
by just studying our evaluation matrix and asking a question such as “What could be a nice,
easy-to-delineate subset of problems for which certain strategies work well together and complement each
other, while elsewhere they are mostly useless?” We remark that such splits might be defined by other
properties than those expressible by our current fixed feature set. Discovering such new properties
that would still be easy to compute (and thus easy to understand) would be an interesting instance of
the explainable AI endeavour. At this moment, however, we do not have a good way of doing this and
leave this research direction for future work.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Gradient boosting</title>
      <p>Unlike in the previous section, where a hand-crafted split based on the number of atoms occurring in the
problem is discussed, here we want to produce finer splits of the problems based on Vampire’s CASC-mode
problem features. In other words, we want to learn more complicated decision trees, or, as in our
case, ensembles of small decision trees. Hence we want to train a model that for a given problem p
produces a schedule b_1, . . . , b_{|𝒮|}, where ∑_{s∈𝒮} b_s ≤ B, such that there is a strategy s ∈ 𝒮 that solves
p ∈ 𝒫 in time at most b_s.</p>
      <p>
        Although we want to obtain schedules, we may characterize them by probability distributions. A
discrete probability mass function p : 𝒮 → [0, 1] defines a schedule by setting b_s = B · p_s. Hence we can
use standard approaches for multi-class classification, where usually we get a probability distribution
over possible classes and select the class as the one with the highest estimated probability. Here,
however, we take the estimated probability distribution and compute a schedule based on it.
      </p>
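Assuming the allocation rule described above (each strategy receives a share of the total budget proportional to its estimated probability), the conversion from raw per-strategy model scores to a schedule can be sketched as:

```python
import math

def softmax(scores):
    """Turn raw per-strategy scores into a probability distribution."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def schedule_from_scores(scores, budget):
    """Allocate b_s = budget * p_s, rounded down to whole Mi."""
    return [int(budget * p) for p in softmax(scores)]

alloc = schedule_from_scores([2.0, 0.0, 0.0], budget=1000)
# the highest-scoring strategy receives the largest time slice
```

Rounding down means a few Mi of the budget may go unused; how exactly to distribute the remainder is a detail the sketch leaves open.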
      <p>This of course immediately leads to the question of what is the true probability distribution we want
to estimate. One possible approach is to assign, for example, probability 1 to the best strategy for the
problem. However, we likely need a more fine-grained approach as we are interested in generalization.
First, we are happy to split the budget among more strategies, as long as at least one of them solves the
problem. Second, we do not require that the problem be solved by the best available strategy.</p>
      <p>Hence, instead of having, for each (training) problem, a fixed true distribution that we want to learn,
we may produce a “true” distribution iteratively during the training based on our previous estimates
and change it as necessary. For example, we have a training problem p and an estimated probability
distribution p_1, . . . , p_{|𝒮|} for it. We may want to produce a “true” distribution for p by assigning
p_s = 0 if the strategy s does not solve the problem in the given budget B, and hence increase p_s if the
strategy s solves the problem. In other words, we want to split the budget only among strategies that can possibly
solve the problem. Moreover, we can boost the strategy that is closest to solving the problem, i.e., the
strategy requiring the smallest increase in the estimated budget to solve the problem. And there are
many other ways how to get a new “true” distribution. It is worth noticing that we may obtain this “true”
distribution in different ways: one way for those training examples that our model predicts correctly
(estimated schedule solves the problem) and another for those that are predicted incorrectly.</p>
      <sec id="sec-4-0">
        <title>4.1. Model</title>
        <p>Gradient boosting for decision trees, see, e.g., [14], fits nicely into this picture. Loosely speaking, it
combines weak models that iteratively improve their predictions by correcting the previous mistakes.
In particular, we learn decision trees that split our problems into different subsets based on Vampire’s
static features. Or, more precisely, in each iteration, we learn |𝒮| regression trees (multi-class
problem); one per strategy. When we want to estimate the probability distribution after k iterations, we
combine the output of all k regression trees learned for each strategy, and get the probability distribution
by the softmax function. In the next iteration, we fix the “true” distribution based on the newly estimated
distribution and iterate this process until we get a satisfactory model or reach the limit on the number
of iterations.</p>
      </sec>
      <sec id="sec-4-1">
        <title>4.2. Initial experiments</title>
        <p>We use LightGBM [15], a framework for gradient boosting decision trees, with custom objective
functions. We get probability estimates from our models by the softmax function and use the
cross-entropy between predicted probabilities and “true” probabilities as the loss function.</p>
        <p>We report some preliminary results, as we have not performed much hyper-parameter tuning.4
We use a learning rate of 0.05, the maximal number of bins that feature values are bucketed into is 8, the
number of boosting iterations is 500, and, most importantly, we restrict the number of leaves to 2; hence
we use decision stumps as weak learners.</p>
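For concreteness, these settings can be written down as a parameter sketch; the key names follow LightGBM's documented parameters, but this should be read as an illustration of the reported values rather than the authors' exact configuration (the custom objective and loss are supplied separately as Python callables):

```python
# Hyper-parameter sketch mirroring the settings reported in the text.
params = {
    "learning_rate": 0.05,
    "max_bin": 8,       # feature values bucketed into at most 8 bins
    "num_leaves": 2,    # decision stumps as weak learners
}
NUM_BOOST_ROUND = 500   # number of boosting iterations
```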
        <p>Our custom objective functions create “true” probabilities p′_1, . . . , p′_{|𝒮|} from the estimated probabilities
p_1, . . . , p_{|𝒮|} for a problem p. We have tried various combinations of the following procedures:
(a) p′_s = 0 if the strategy s does not solve p in the time budget B, otherwise p′_s = p_s,
(b) p′_s = 1 if the strategy s is the best available strategy5 and p′_t = 0 for t ≠ s,
(c) boost6 the best available strategy.
(a) works quite well even though it can get stuck in a local minimum. Hence, for simplicity, all the
subsequent results use this objective function. It takes roughly 10 s, using one core, to train one
iteration of the model with our non-optimized custom objective function written in Python.</p>
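Procedure (a) can be sketched as follows. The renormalization step is our assumption (cross-entropy targets are usually normalized to sum to 1), as is the fallback when no strategy solves the problem:

```python
import math

def true_distribution_a(estimated, solve_times, budget):
    """Objective (a) sketch: zero out non-solving strategies, renormalize.

    `estimated[i]` is the model's probability for strategy i;
    `solve_times[i]` is its solving time for this problem (math.inf on failure).
    """
    raw = [p if t <= budget else 0.0
           for p, t in zip(estimated, solve_times)]
    total = sum(raw)
    if total == 0.0:
        return list(estimated)  # no strategy solves the problem: keep the estimate
    return [p / total for p in raw]

# Mass moves from the failing strategy to the solving ones:
target = true_distribution_a([0.5, 0.3, 0.2], [10, math.inf, 40], budget=100)
```

The solving strategies thus automatically receive more probability mass, which is the “hence increase p_s” effect described in the text.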
        <p>We train our models using 10 random 80:20 train/test splits over the dataset described in Section 2.1.
In Table 1, we report the success rate. The best iteration is selected based on the number of solved
problems on the training set. For shorter budgets, the gradient boosting model significantly outperforms
the simple split into small-medium-large subsets based on the number of atoms in the problem. As
expected, the gap narrows as the budget increases.</p>
        <p>Interestingly, when we inspect the produced schedules, the majority of problems (both on the train
and test sets) are solved by multiple strategies in the schedule. Moreover, a problem is rarely
solved by the overall best available strategy for it, supporting our idea behind the careful
creation of “true” probabilities.</p>
        <p>As the “true” probability that we aim to estimate evolves, it is possible that the majority of trees
learned do not offset incorrect estimates from previous iterations but rather compensate for the changing
“true” probability. Note that we often add the same tree (with different leaf values) multiple times.</p>
        <p>These initial experiments show that it is possible to automatically train a model that both leverages
the available feature space and keeps a reasonable level of generalization.
4The performance of LightGBM is sensitive to choices of hyper-parameters. Moreover, our custom objective functions make
these choices even more significant.
5There are many possible definitions here. For example, we have experimented with selecting the strategy that has the
maximal difference between the time assigned to the strategy and the time necessary to solve the problem by the strategy.
6We tried two variants. First, in combination with (a), we split “saved time” not among all strategies that solve the problem,
but we only boost the best available strategy. Second, we increase the allocation to the best available strategy just enough for
it to solve the problem.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related work</title>
      <p>Strategy schedule specialization is closely related to strategy selection (also referred to as “heuristic
selection”), a special case of algorithm selection [16, 17]. Automatic mode in the E prover [18] uses
hand-crafted problem classes to select a strategy for the input problem [19]. Strategy selection for E has
also been approached by machine learning [20]. Given an input problem, MaLeS [19] predicts solving
time for all strategies in a portfolio and selects the strategy that minimizes the predicted time. Similarly,
Grackle [21] selects a strategy using a trained LightGBM model that predicts a ranking of strategies in
a pre-computed portfolio.</p>
      <p>Strategy schedules have been constructed manually in Gandalf [3] and automatically in E-SETHEO
[22], Spider for Vampire [10], CPhydra [23], BliStrTune [24], HOS-ML [5], and in our previous work
[4]. Given an input problem, HOS-ML [5] selects a strategy schedule automatically. The problem’s
static features are used to identify a problem class, for which a schedule was optimized during the
training. CPhydra [23] identifies 10 training problems that are the most similar to the input problem
and optimizes a schedule for these problems using a pre-computed evaluation matrix.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Working with a dataset of 1096 Vampire strategies, which were evaluated on the first-order part of
TPTP, and a selection of syntactic problem features, which Vampire computes efficiently for its input
problem at startup, we presented two experiments with strategy schedule construction and schedule
specialization. In the first, we noticed that the selection of the splitting features not only influences how
fast a specialized schedule can cover the training problems, but, more interestingly, how well it can
generalize to previously unseen problems. In the second, we demonstrated that a good performance
can be achieved by using gradient boosted decision trees. The second approach is interesting in that it
departs from the concept of problem covering and instead develops a schedule in the form of a probability
distribution over the available strategies. This requires the use of a custom target distribution which
develops during the learning process.</p>
      <p>There are many avenues for future research. Our next target for experimentation is the
grow-and-prune technique for decision tree construction and regularization [6] and its adaptation to cautious
schedule specialization. However, we would also be more than happy for others to join us in
experimenting with our strategy data [12] to see which method could eventually lead to the most useful
approach.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work on this paper was supported by the Czech Science Foundation grant 24-12759S and the project
RICAIP no. 857306 under the EU-H2020 programme.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[3] T. Tammet, Towards efficient subsumption, in: C. Kirchner, H. Kirchner (Eds.), Automated
Deduction - CADE-15, 15th International Conference on Automated Deduction, Lindau, Germany,
July 5-10, 1998, Proceedings, volume 1421 of Lecture Notes in Computer Science, Springer, 1998, pp.
427–441. URL: https://doi.org/10.1007/BFb0054276. doi:10.1007/BFb0054276.
[4] F. Bártek, K. Chvalovský, M. Suda, Regularization in Spider-style strategy discovery and schedule
construction, 2024. URL: https://arxiv.org/abs/2403.12869. doi:10.48550/arXiv.2403.12869.
arXiv:2403.12869.
[5] E. K. Holden, K. Korovin, Heterogeneous heuristic optimisation and scheduling for first-order
theorem proving, in: F. Kamareddine, C. S. Coen (Eds.), Intelligent Computer Mathematics
14th International Conference, CICM 2021, Timisoara, Romania, July 26-31, 2021, Proceedings,
volume 12833 of Lecture Notes in Computer Science, Springer, 2021, pp. 107–123. URL: https:
//doi.org/10.1007/978-3-030-81097-9_8. doi:10.1007/978-3-030-81097-9\_8.
[6] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees,</p>
      <p>Wadsworth, 1984.
[7] L. Kovács, A. Voronkov, First-order theorem proving and Vampire, in: N. Sharygina, H. Veith (Eds.),
Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia,
July 13-19, 2013. Proceedings, volume 8044 of Lecture Notes in Computer Science, Springer, 2013, pp.
1–35. URL: https://doi.org/10.1007/978-3-642-39799-8_1. doi:10.1007/978-3-642-39799-8\
_1.
[8] G. Sutcliffe, The CADE ATP System Competition - CASC, AI Magazine 37 (2016) 99–101.
[9] G. Sutcliffe, The CADE ATP system competition, 2024. URL: https://tptp.org/CASC/, [Online;
accessed 26-April-2024].
[10] A. Voronkov, Spider: Learning in the sea of options, 2023. URL: https://easychair.org/
smart-program/Vampire23/2023-07-05.html#talk:223833, unpublished. Paper accepted at
Vampire23: The 7th Vampire Workshop.
[11] G. Sutcliffe, The TPTP problem library and associated infrastructure, Journal of Automated</p>
      <p>Reasoning 59 (2017). doi:10.1007/s10817-017-9407-7.
[12] F. Bártek, M. Suda, Vampire strategy performance measurements, 2024. URL: https://doi.org/10.</p>
      <p>5281/zenodo.10814478. doi:10.5281/zenodo.10814478.
[13] G. Sutcliffe, CASC design and organization, 2024. URL: https://tptp.org/CASC/J12/Design.html,
[Online; accessed 15-April-2024].
[14] J. H. Friedman, Greedy function approximation: A gradient boosting machine., The Annals of
Statistics 29 (2001) 1189–1232. URL: https://doi.org/10.1214/aos/1013203451. doi:10.1214/aos/
1013203451.
[15] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient
gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30,
Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/
6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
[16] J. R. Rice, The algorithm selection problem, Adv. Comput. 15 (1976) 65–118. URL: https://doi.org/
10.1016/S0065-2458(08)60520-3. doi:10.1016/S0065-2458(08)60520-3.
[17] P. Kerschke, H. H. Hoos, F. Neumann, H. Trautmann, Automated algorithm selection: Survey
and perspectives, Evol. Comput. 27 (2019) 3–45. URL: https://doi.org/10.1162/evco_a_00242.
doi:10.1162/EVCO\_A\_00242.
[18] S. Schulz, S. Cruanes, P. Vukmirovic, Faster, higher, stronger: E 2.3, in: P. Fontaine (Ed.),
Automated Deduction - CADE 27 - 27th International Conference on Automated Deduction,
Natal, Brazil, August 27-30, 2019, Proceedings, volume 11716 of Lecture Notes in Computer Science,
Springer, 2019, pp. 495–507. URL: https://doi.org/10.1007/978-3-030-29436-6_29. doi:10.1007/978-3-030-29436-6_29.
[19] D. Kühlwein, J. Urban, MaLeS: A framework for automatic tuning of automated theorem provers,
J. Autom. Reason. 55 (2015) 91–116. URL: https://doi.org/10.1007/s10817-015-9329-1. doi:10.1007/s10817-015-9329-1.
[20] J. P. Bridge, S. B. Holden, L. C. Paulson, Machine learning for first-order theorem proving:
learning to select a good heuristic, J. Autom. Reason. 53 (2014) 141–172. URL: https://doi.org/10.1007/s10817-014-9301-5. doi:10.1007/s10817-014-9301-5.
[21] J. Hůla, J. Jakubův, M. Janota, L. Kubej, Targeted configuration of an SMT solver, in: K. Buzzard,
T. Kutsia (Eds.), Intelligent Computer Mathematics - 15th International Conference, CICM 2022,
Tbilisi, Georgia, September 19-23, 2022, Proceedings, volume 13467 of Lecture Notes in Computer
Science, Springer, 2022, pp. 256–271. URL: https://doi.org/10.1007/978-3-031-16681-5_18. doi:10.1007/978-3-031-16681-5_18.
[22] S. Schulz, E-SETHEO, http://wwwlehre.dhbw-stuttgart.de/~sschulz/WORK/e-setheo.html, 2024. [Online; accessed 26-April-2024].
[23] D. Bridge, E. O’Mahony, B. O’Sullivan, Case-based reasoning for autonomous constraint solving,
in: Y. Hamadi, É. Monfroy, F. Saubion (Eds.), Autonomous Search, Springer, 2012, pp. 73–95. URL:
https://doi.org/10.1007/978-3-642-21434-9_4. doi:10.1007/978-3-642-21434-9_4.
[24] J. Jakubův, J. Urban, BliStrTune: hierarchical invention of theorem proving strategies, in: Y. Bertot,
V. Vafeiadis (Eds.), Proceedings of the 6th ACM SIGPLAN Conference on Certified Programs
and Proofs, CPP 2017, Paris, France, January 16-17, 2017, ACM, 2017, pp. 43–52. URL: https:
//doi.org/10.1145/3018610.3018619. doi:10.1145/3018610.3018619.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>M.</given-names> <surname>Suda</surname></string-name>,
          <article-title>Vampire getting noisy: Will random bits help conquer chaos? (system description)</article-title>,
          in: J. Blanchette, L. Kovács, D. Pattinson (Eds.),
          <source>Automated Reasoning - 11th International Joint Conference, IJCAR 2022, Haifa, Israel, August 8-10, 2022, Proceedings</source>,
          volume <volume>13385</volume> of Lecture Notes in Computer Science, Springer,
          <year>2022</year>, pp. <fpage>659</fpage>-<lpage>667</lpage>.
          URL: https://doi.org/10.1007/978-3-031-10769-6_38. doi:10.1007/978-3-031-10769-6_38.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>G.</given-names> <surname>Reger</surname></string-name>,
          <article-title>Boldly going where no prover has gone before</article-title>,
          in: M. Suda, S. Winkler (Eds.),
          <source>Proceedings of the Second International Workshop on Automated Reasoning: Challenges, Applications, Directions, Exemplary Achievements, ARCADE@CADE 2019, Natal, Brazil, August 26, 2019</source>,
          volume <volume>311</volume> of EPTCS,
          <year>2019</year>, pp. <fpage>37</fpage>-<lpage>41</lpage>.
          URL: https://doi.org/10.4204/EPTCS.311.6. doi:10.4204/EPTCS.311.6.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>