Empirical Properties of Term Orderings for Superposition

Stephan Schulz
DHBW Stuttgart, Stuttgart, Germany

Abstract

Term orderings play a critical role in most modern automated theorem proving systems, in particular for systems dealing with equality. In this paper we report on a set of experiments with the superposition theorem prover E, comparing the performance of the system under the Knuth-Bendix Ordering and the Lexicographic Path Ordering with a number of different precedence and weight generation schemes. Results indicate that relative performance under a short time limit seems to be a good predictor for longer run times and that KBO generally outperforms LPO. Also, if other strategy elements (especially clause selection) are independent of the ordering, performance of a given ordering instance with one strategy is a decent predictor of performance with a different strategy.

1. Introduction

The root cause of difficulty in automated theorem proving is the size of the search space, i.e. the space of possible derivations. Many improvements in calculi have introduced ways to reduce the branching factor of this search space by identifying strictly unnecessary inferences. One of the more important techniques in this context, in particular for equational reasoning, is the use of term orderings to break the symmetry of equality, and also to impose additional constraints on possible inferences. This was pioneered by Knuth and Bendix in their paper on completion [1], which also introduced what we today call the Knuth-Bendix Ordering (KBO), still one of the most important orderings for automated theorem proving. The completion approach was extended to a complete theorem proving method for unit equality problems [2, 3], and finally lifted to full clause logic with equality in the superposition calculus [4, 5], which still forms the basis of most modern saturating calculi and most high-performance theorem provers. In these calculi, term orderings are used in two principal ways: first, equalities can only be applied in (at least potentially) decreasing order, and secondly, many inferences can be restricted to be applied only on maximal literals.

When these theoretical results are translated to practical reasoning systems, there are additional challenges. Orderings need to be efficiently implemented, and they need to be instantiated for any given proof problem. Both the KBO and the LPO (Lexicographic Path Ordering), the two most frequently used orderings, are actually ordering schemata. In the concrete case, both need a symbol precedence. The KBO additionally needs an assignment of weights to function (and predicate) symbols.

While e.g. Otter [6] pioneered an automatic mode that would also set ordering parameters, other systems, such as DISCOUNT [7] and early Waldmeister [8], left the task to the user, who, at that time, was assumed to be familiar with the domain and have the necessary expertise. However, the CADE ATP System Competition [9], starting with CASC-13 in 1996 [10], established a paradigm where the provers were given only the logic formula proper, and had to determine all search control parameters themselves.
While nearly all competitive ATP systems have implemented some method for this, to our knowledge there has never been a systematic practical evaluation of different ways to solve the ordering generation problem. Over time, the theorem prover E [11, 12] has accumulated a number of different symbol precedence and weight generation schemes, based only on simple syntactic features of the symbols and the proof problems. It also features an efficient implementation of both KBO and LPO, based on Löchner's excellent and possibly definitive work [13, 14]. In this paper, we want to address a number of questions regarding the performance of these different schemes.

1. How do they compare for overall performance?
2. Does good performance of an ordering in one setting translate to different settings? In particular, if we replace only the ordering, can (relative) performance be extrapolated from one search strategy to another?
3. Similarly, can (relative) performance be extrapolated from shorter to longer run times?

Most of our experiments were done only on unit-equational examples, where the influence of the term ordering is more important. However, we have also run one experiment on all usable TPTP [15] first-order problems. That allows us to address one additional question:

4. How does the relative performance on UEQ compare to that on the larger FOF/TFF set?

Finally, most theorem provers we are aware of implement rewriting with unorientable unit equations in a way that is strictly weaker than supported by theory. In E, we have introduced rewriting with strong instances, a technique that often allows for more simplification than these conventional implementations. We also analyse the data with respect to this new technique.

We start the paper with a short definition of the most important terms and concepts, including the idea of rewriting with strong instances. In the next section, we discuss the experimental setting, including the different ordering generation schemes and applications to simplification. In the experimental section, we discuss the primary results. Finally, we mention some related and future work, and conclude.

(SP)
  Premises:   C ∨ s ≃ t    and    D ∨ u ≄ v
  Conclusion: (C ∨ D ∨ u[p ← t] ≄ v)σ
  if:
    • u|p is not a variable, σ = mgu(u|p, s)
    • sσ ≮ tσ, uσ ≮ vσ
    • (s ≃ t)σ is strictly maximal in (C ∨ s ≃ t)σ
    • (u ≄ v)σ is maximal in (D ∨ u ≄ v)σ

(RP)
  Premises:   s ≃ t    and    u ≃ v ∨ C
  Results:    s ≃ t    and    u[p ← tσ] ≃ v ∨ C
  if:
    • u|p = sσ, sσ > tσ
    • u ≃ v is not maximal, or u ≯ v, or u|p ≠ u

Note that (RP) is a simplification rule, i.e. a rule in which the premises are replaced by the results.

Figure 1: Core inferences of the superposition calculus

2. Preliminaries

We assume the standard first-order notions of a signature, consisting of function symbols with associated arities, and variable symbols. Terms are either variables, or are constructed from a function symbol with the proper number of arguments. Constants are a special case of function symbols with arity zero (and hence a constant is a term). Terms without variables are called ground terms. Equations are unordered pairs of terms, literals are either equations or negated equations, and clauses are multi-sets of literals (non-equational literals can be encoded as equations with a reserved constant ⊤ of a separate Boolean sort), interpreted (and written) as disjunctions. Substitutions replace variables by terms and can be applied to variables, terms, literals and clauses. We use s, t, u, v to denote terms, C, D to denote (partial) clauses, and σ for substitutions. We use t|p to denote the subterm of t at position p and t[p ← s] for term t with the subterm at position p replaced by s.
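To make the notation concrete, terms, positions, and replacement can be sketched as follows. This is an illustration only (a simplified sketch, not E's actual C data structures); terms are nested tuples, and positions are given as sequences of argument indices.

```python
# Minimal sketch (not E's data structures): a variable is a plain string,
# a compound term is (symbol, arg1, ..., argn), a constant is (symbol,).

def is_var(t):
    return isinstance(t, str)

def subterm_at(t, pos):
    """Return t|p for a position p given as a tuple of argument indices."""
    for i in pos:
        t = t[i + 1]          # skip the function symbol at index 0
    return t

def replace_at(t, pos, s):
    """Return t[p <- s], i.e. t with the subterm at position p replaced by s."""
    if not pos:
        return s
    i, rest = pos[0], pos[1:]
    return t[:i + 1] + (replace_at(t[i + 1], rest, s),) + t[i + 2:]

# Example: in f(g(a), X), replace the subterm at position (0, 0) (the 'a') by b.
t = ("f", ("g", ("a",)), "X")
assert subterm_at(t, (0, 0)) == ("a",)
assert replace_at(t, (0, 0), ("b",)) == ("f", ("g", ("b",)), "X")
```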
Saturating theorem provers try to show that a set of clauses is unsatisfiable by the process of saturation, i.e. by using inference rules to derive new clauses from existing clauses with the aim of eventually producing the empty clause. The proof search is usually organised by some variant of the given-clause algorithm. This algorithm maintains the proof state in the form of two separate sets of clauses. The set P of processed clauses is initially empty. The set U of unprocessed clauses initially contains all clauses of the proof problem. The algorithm repeatedly selects a given clause g from U and performs all inferences in which g is one premise and all other premises are from P. The given clause is then moved into P, the resulting new clauses into U. The algorithm terminates if the empty clause is derived (representing a proof that the original clause set is unsatisfiable), or if the set U runs empty (in which case the proof search fails). In addition to these basic operations, a significant proportion of time is spent on simplification, removing redundant clauses and replacing clauses with, in a certain sense, smaller and simpler clauses. The selection of the given clause has to be fair (i.e. no clause can be delayed infinitely long), and is also one of the most important heuristic choice points for the proof search.

The most important inference in modern calculi is superposition, a restricted version of paramodulation. A paramodulation inference has two premises: the main premise, an instance of which forms the core of the newly inferred clause, and a side premise, which contains at least one positive equational literal and is used as a (lazy) conditional rewrite rule to modify the instance of the main premise. To perform an inference, one side of a positive literal of the side premise is overlapped into a subterm of the main premise, followed by instantiation and replacement of the overlapped term by the instance of the other side of the inference literal in the side premise. The instances of other literals of the side premise are inherited by the newly generated child clause.

Paramodulation inferences can be restricted in certain ways. In particular, we can use a simplification ordering on terms, lift it to literals, and restrict inferences to maximal literals. In this case, we call such an inference a superposition inference. Figure 1 shows the inference rule (SP) for superposition into a negative literal. Term orderings also allow us to use strictly decreasing instances of positive unit clauses for simplification (i.e. destructive rewriting, in which a parent clause is replaced by the result). Rule (RP) in Figure 1 describes rewriting of a positive literal. Both rules have very similar counterparts for overlapping into positive literals and for rewriting negative literals. See e.g. [4] for details and [11] for a more practice-oriented exposition.

As described above, the two most frequently used classes of orderings are the KBO [1] and the LPO [16] (but see [13, 14] for excellent modern treatments). The KBO requires a weight assignment to function symbols, and a precedence on these symbols. It compares terms first by weight (i.e. numerical comparison of the summed weight of all symbols in the term), breaking ties by top symbol (using the precedence), and further breaking ties by lexicographic comparison of subterms. It additionally requires each variable to occur in the smaller term no more often than in the larger term.
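As a simplified illustration of this comparison, the following sketch checks whether one ground term is KBO-greater than another, given a weight function and a precedence. It covers ground terms only (so the variable condition is vacuous) and does not reflect E's optimised implementation [13]; it is a sketch, not the full ordering.

```python
# Simplified KBO on ground terms (illustration only): terms are nested tuples
# (symbol, arg1, ..., argn); `weight` maps symbols to positive integers and
# `prec` maps symbols to integers encoding the precedence (larger = greater).

def term_weight(t, weight):
    return weight[t[0]] + sum(term_weight(s, weight) for s in t[1:])

def kbo_greater(s, t, weight, prec):
    """Return True if s > t in this (ground, simplified) KBO."""
    ws, wt = term_weight(s, weight), term_weight(t, weight)
    if ws != wt:                       # 1. compare by total symbol weight
        return ws > wt
    if s[0] != t[0]:                   # 2. break ties by top symbol precedence
        return prec[s[0]] > prec[t[0]]
    for si, ti in zip(s[1:], t[1:]):   # 3. break ties lexicographically on args
        if si != ti:
            return kbo_greater(si, ti, weight, prec)
    return False                       # the terms are equal

# f(a, b) > f(b, a) because a > b in the precedence (all weights are equal here).
weight = {"f": 1, "a": 1, "b": 1}
prec = {"f": 2, "a": 1, "b": 0}
assert kbo_greater(("f", ("a",), ("b",)), ("f", ("b",), ("a",)), weight, prec)
```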
The LPO requires only a precedence on function symbols, and essentially compares terms lexicographically, recursively ensuring that no larger subterms are hidden deep in the term structure of the potentially smaller term.

3. Term Orderings in Practice

As stated above, both LPO and KBO need to be instantiated with parameters: a symbol precedence in the case of LPO, and both a symbol precedence and a weight assignment for KBO. In E, we have implemented 13 different mature precedence generation schemes, and 26 different mature weight assignment schemes. All are based on simple features of the signature (primarily arity) and the number of occurrences in the original input specification (frequency). E always generates the symbol precedence first, and the weight assignment scheme has the symbol precedence available as an extra input parameter. Since any precedence can be combined with any weight assignment, this represents 13 different LPO instances, and 13 times 26, or 338, different KBO instances.

3.1. Term Ordering Generation Schemes

The complete list of precedence generation schemes used in the experiments is as follows: unary_first, unary_freq, arity, invarity, const_max, const_min, freq, invfreq, invconjfreq, invfreqconjmax, invfreqconjmin, invfreqconstmin, invfreqhack. We will discuss some selected schemes in some detail; see [17] for more detailed notes.

While naming is not perfectly consistent, there are some general rules. Any scheme starting with inv will just order symbols in the opposite (or inverse) way of the corresponding non-inv scheme. As a simple and long-lived example, the arity function just orders symbols by arity, making symbols with a bigger arity greater. Since all our schemes generate total precedences, ties are broken by the order of appearance in the signature table (which corresponds to the order in which symbols occur in the input). The unary_first scheme is a variant that will make unary function symbols maximal. This allows the prover to generate the ordering originally suggested in [1] for the completion of the group axioms. It also turns out to perform very well for UEQ problems in general. Another useful scheme, in particular for general problems, is invfreq. This makes rare symbols bigger than more frequent symbols.

The 26 different weight assignment schemes are: firstmaximal0, arity, aritymax0, modarity, modaritymax0, aritysquared, aritysquaredmax0, invarity, invaritymax0, invaritysquared, invaritysquaredmax0, precedence, invprecedence, precrank5, precrank10, precrank20, freqcount, invfreqcount, freqrank, invfreqrank, invconjfreqrank, freqranksquare, invfreqranksquare, invmodfreqrank, invmodfreqrankmax0, constant.

An interesting case is firstmaximal0, which complements unary_first to generate the original KBO for group completion. It assigns a weight of 1 to all symbols, except to the first (non-constant) maximal symbol in the signature, which is assigned weight 0. A more complex example is invfreqconjmin. In this scheme, symbols are ordered by inverse frequency, but symbols that occur in the problem's conjecture (if any) are smaller than symbols that occur only in axioms. Weights are then assigned sequentially, starting with 1 for the most frequent conjecture symbol. The precedence scheme assigns weights according to the (total) precedence, with symbols that are larger in the precedence also getting higher weights. KBO also has the constraint that constants must have at least the weight of variables. In our experiments, we always assigned a fixed weight of 1 to both constants and variables.
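To illustrate the flavour of such generation schemes, the following sketch shows simplified stand-ins for an arity-based precedence and a frequency-rank style weight assignment. These are illustrative approximations, not the exact schemes implemented in E; in particular, the direction of the tie-break by order of appearance is an assumption here.

```python
# Illustrative (simplified) generation schemes, not E's exact implementations.
# `symbols` is the signature in order of appearance; `arity` and `freq` map
# symbols to their arity and number of occurrences in the input problem.

def arity_precedence(symbols, arity):
    """'arity'-style precedence: higher arity = greater symbol; ties broken by
    order of appearance in the signature (earlier = greater in this sketch)."""
    ranked = sorted(symbols, key=lambda f: (arity[f], -symbols.index(f)))
    return {f: i for i, f in enumerate(ranked)}    # larger value = greater

def invfreqrank_weights(symbols, freq):
    """Frequency-rank style weights: rarer symbols get larger weights."""
    ranked = sorted(symbols, key=lambda f: -freq[f])   # most frequent first
    return {f: i + 1 for i, f in enumerate(ranked)}    # weights 1, 2, 3, ...

symbols = ["f", "g", "a", "b"]
arity = {"f": 2, "g": 1, "a": 0, "b": 0}
freq = {"f": 7, "g": 3, "a": 5, "b": 1}
prec = arity_precedence(symbols, arity)       # f > g > a > b
weight = invfreqrank_weights(symbols, freq)   # f:1, a:2, g:3, b:4
```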
3.2. Rewriting with strong instances

One of the major advantages of superposition-based calculi is that they allow for strong redundancy elimination. One of the most important simplification techniques in practice is unconditional rewriting. It allows the use of a positive unit-equational clause (i.e. a single equation) to potentially replace a term in another clause by a smaller term. For this, two conditions have to be met (rewriting is also restricted for a small set of term positions by the calculus, but that is not our concern here): On the one hand, one side of the equation (called the left-hand side or lhs) must match the subterm to be rewritten. On the other hand, the instance used for rewriting must be decreasing, i.e. the instantiated other side (rhs) of the equation has to be strictly smaller than the replaced term.

Most provers combine the matching and the instantiation operation. Thus, if an lhs matches a candidate term, the corresponding substitution is applied to both sides of the equation, and then the ordering condition is checked. This is often sufficient; however, the test will automatically fail if the rhs contains a variable not occurring in the lhs, because such a free variable is always incomparable with any term in which it does not occur. To be able to still use these equations, we have introduced the concept of a strong instance of an equation. In a strong instance, we dynamically instantiate all variables of the equation before performing the ordering test. Variables in the lhs of an equation are automatically bound by the match against the term to be rewritten. Additionally, we substitute any remaining unbound rhs variables with the smallest constant (of the proper sort).

As an example, consider the equation R = f(x, a) ≃ g(x, y) (where f is greater than g) applied to the term t = f(a, a), and let ⊥ be the smallest constant in the signature. The lhs f(x, a) matches t with σ = {x ↦ a}. But σ(R) = f(a, a) ≃ g(a, y) is unorientable. If we extend σ to also replace the unbound variable y by ⊥, we do get f(a, a) ≃ g(a, ⊥), which is orientable, and hence can be used to reduce the term. This is sound, because variables are universally quantified, i.e. the equality holds for all instances. We have implemented this simple concept in E, and it has shown a significant and consistent improvement in performance.
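The core of the idea can be sketched as follows (a simplified sketch, not E's actual implementation): after matching the lhs against the term to be rewritten, any rhs variables left unbound are mapped to the smallest constant before the ordering check. The ordering is passed in as a callable; the example replays the f/g equation from above with a toy stand-in for the KBO.

```python
# Sketch of rewriting with strong instances (simplified, not E's actual code).
# Terms are nested tuples (symbol, arg1, ...); variables are plain strings.

def is_var(t):
    return isinstance(t, str)

def match(pattern, term, subst):
    """Extend subst so that subst(pattern) == term, or return None."""
    if is_var(pattern):
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if is_var(term) or pattern[0] != term[0]:
        return None
    for p, t in zip(pattern[1:], term[1:]):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst

def apply_subst(t, subst):
    if is_var(t):
        return subst.get(t, t)
    return (t[0],) + tuple(apply_subst(s, subst) for s in t[1:])

def variables(t):
    if is_var(t):
        return {t}
    return set().union(*(variables(s) for s in t[1:])) if len(t) > 1 else set()

def strong_instance_rhs(lhs, rhs, redex, greater, smallest_const):
    """If lhs matches redex and the *strong* instance is decreasing, return the
    instantiated rhs that replaces the redex; otherwise return None."""
    sigma = match(lhs, redex, {})
    if sigma is None:
        return None
    # Bind any rhs variables not bound by the match to the smallest constant.
    for x in variables(rhs) - set(sigma):
        sigma[x] = smallest_const
    new = apply_subst(rhs, sigma)
    return new if greater(redex, new) else None

# Example from the text: f(x, a) = g(x, y) applied to f(a, a), with y -> bottom.
lhs, rhs = ("f", "x", ("a",)), ("g", "x", "y")
redex, bottom = ("f", ("a",), ("a",)), ("bot",)
greater = lambda s, t: s[0] == "f" and t[0] == "g"   # toy stand-in for the KBO
print(strong_instance_rhs(lhs, rhs, redex, greater, bottom))  # -> g(a, bot)
```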
4. Experimental results

4.1. Experimental settings

We have employed four different search strategies. All normal search parameters were fixed to (usually) known good values. The only difference between the four strategies was in the clause selection heuristic. We broadly describe the four strategies below; see [18] for a detailed exposition on clause selection in E.

SC: This is a pure symbol-counting strategy. All function, predicate, and variable symbol occurrences are counted with a weight of one, and the lightest clauses are processed first. The exact command line options for search control are available in the online supporting material (see below).

20:1: This selection strategy interleaves selecting the lightest and the oldest clause, with a ratio of 20 to 1.

EVO: This clause selection strategy has been honed through various optimizations [19, 20]. It interleaves a number of goal-directed and conventional selection queues, and is one of the best heuristics on TPTP we have found so far.

UEQ: This is similar to EVO in structure and intent, but optimised only towards unit-equational problems.

We ran tests on all suitable problems from TPTP 7.5.0. Most experiments were run on unit-equational problems. This set contained 1497 different problems. Some tests used all suitable first-order problems from TPTP (i.e. all CNF and FOF problems). This set was more than 10 times larger, containing 17283 problems. Run time per problem was limited to 5 seconds of CPU time. Tests were run on a machine with four Intel Xeon Gold 6130 CPUs, running a total of 64 physical cores at a nominal frequency of 2.10 GHz and with 512 GB of RAM. RAM size was sufficient to guarantee no contention or swapping between parallel processes.

We have run 351 different ordering schemes with 4 different search strategies and two different rewrite settings, i.e. 2808 different strategies, on 1497 UEQ problems, for a total of 4 203 576 individual UEQ test runs. We have run the same 351 strategies, with only the 20:1 clause selection strategy and conventional rewriting, on all 17283 problems, for 6 066 333 individual test runs for the general case. Thus, the total corpus has data on a bit over 10 million test runs.

A version of the prover suitable for reproducing the actual dataset runs, as well as the summarized results of the test runs, is available as an online supplement at http://eprover.eu/E-eu/Term_Orderings.html. The protocols included there also specify the exact command line options and contain the lists of problems used. TPTP 7.5.0, the source of the test problems, can be downloaded from http://www.tptp.org. The latest official release of E (and future updates) can be found at https://www.eprover.org.

4.2. Performance on unit-equational problems

We will first discuss the performance of the different orderings with conventional rewriting on the 1497 UEQ problems. Table 1 gives a general overview of the data. Each row corresponds to a search strategy (differentiated by clause selection strategy and use of rewriting with strong instances). For each row, we show the number of problems solved by the best, worst, and median term ordering variant, and by the union of all variants, separately for LPO, KBO, and both. We can see that the difference between the best and the worst ordering is roughly 110-120 successes, and about comparable to the difference between the weakest and strongest search strategy, which is roughly 130-145. We can also already see that for UEQ problems, the best LPO strategy is systematically worse than the worst KBO strategy. However, for each search strategy the union of both LPO and KBO strategies solves more problems than either alone.

              All orderings              KBO only                   LPO only
Strategy      Min   Median  Max   Total  Min   Median  Max   Total  Min   Median  Max   Total
SC            610   707     719   782    691   708     719   740    610   655     668   733
20:1          680   783     798   857    762   784     798   824    680   720     740   820
EVO           773   872     883   920    848   872     883   897    773   802     820   877
UEQ           742   852     865   923    834   852     865   896    742   774     809   890
SC-SI         577   729     718   796    700   729     718   747    577   651     627   747
20:1-SI       679   803     789   867    771   803     790   833    679   729     709   826
EVO-SI        752   877     865   929    850   877     865   910    752   809     771   881
UEQ-SI        769   887     873   929    849   887     873   902    769   820     804   892
20:1-All      6523  7427    7298  8527   6792  7427    7306  8244   6523  7262    6998  8199

Remark: Strategies marked with -SI use rewriting with strong instances. The last line contains data for the full problem set, not just UEQ problems.

Table 1: Performance summary (number of successes) on UEQ and all problems

To get a better understanding of the distribution of results, we have plotted the data in Figure 2. In this figure, each individual measurement is a lightly coloured dot. The corresponding thin line is a gliding average. For each search strategy, the individual orderings are sorted in ascending order by number of successes. This results in a very distinct hockey stick graph, with a strongly ascending "blade", followed by a more slowly rising "handle". The blade consists of the results for the 13 different LPO instances, the handle of the 13 · 26 = 338 KBO instances.

Figure 2: Performance spectrum on UEQ problems. The vertical black line corresponds to the transition from the 13 LPO ordering variants (on the left) to the KBO variants (on the right).
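For illustration, the curves of Figure 2 can be reproduced from per-ordering success counts roughly as follows; this is a sketch with hypothetical data, and the window size is an assumption, not the setting used for the actual plots.

```python
# Sketch: sort per-ordering success counts in ascending order and smooth them
# with a gliding (moving) average, as for the thin lines in Figure 2.
def gliding_average(values, window=15):
    """Centred moving average; the window size is an assumption of this sketch."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

successes = sorted([610, 702, 715, 655, 719, 707, 668, 691])  # hypothetical counts
smoothed = gliding_average(successes, window=3)
```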
While this graph suggests a very similar distribution of performance for each search strategy, it does not tell us if we see the same orderings in the same order, or if the different ordering variants appear in different orders for the different search strategies. To remedy this, we have replotted the same data in Figure 3. In this figure, we use the SC strategy as the reference that determines the order of ordering variants for all 4 search strategies. In other words, the ordering variants are represented by a fixed order on the X-axis for all different search strategies. We can see that the basic hockey stick shape remains; however, there is significant local variation, and the increase in the "handle" part of the graph is much less pronounced.

To quantify how far the order of strategies has changed between the different strategies, we compute Kendall's Tau value [21]. This metric is based on the number of bubble-sort style swaps needed to bring two rankings into agreement, and measures correspondence between them (in this case, the two rankings of orderings induced by the performance data with different strategies). The result lies in [−1, 1], with 1 indicating perfect agreement and −1 completely opposite rankings. The algorithm also provides a p-value for the null-hypothesis that the two rankings have been produced by uncorrelated processes. All rankings are related to the SC base case. The correlation is quite weak for 20:1, but much stronger for the other strategies. The null-hypothesis can be comfortably rejected for all three pairings.

[Plot: performance on UEQ problems (successes per ordering strategy, ordered by the successes of the base strategy SC) for SC, 20:1, EVO, and UEQ.]

  Rankings     Kendall's Tau   p-Value
  SC vs 20:1   0.084           1.84e-02
  SC vs EVO    0.321           2.67e-19
  SC vs UEQ    0.358           1.41e-23

Figure 3: Relative performance on UEQ problems

We can identify a number of consistently badly performing orderings. For all 4 search strategies, LPO with the precedence generation schemata invarity and const_max is always among the worst 5 orderings. It is less easy to pick out the winners, because the curves are fairly flat at the top. However, all strong orderings are KBO instances, and the precedence schemata unary_first, const_min and invfreqconstmin appear frequently among the top strategies. For weight generation, invaritymax0 and invaritysquaredmax0 appear most frequently. The absolute best single combination is unary_first combined with firstmaximal0; however, there are 5 strategies with essentially the same performance.
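Such rank comparisons are easy to reproduce with standard library routines; the following sketch applies scipy's kendalltau to hypothetical success counts for two strategies, listed in a common ordering-variant order.

```python
# Sketch: compare the rankings of ordering variants induced by two strategies.
# The success counts below are hypothetical and only illustrate the data shape.
from scipy.stats import kendalltau

succ_sc  = [610, 655, 668, 691, 708, 719]   # successes per ordering, strategy 1
succ_evo = [773, 802, 820, 848, 883, 872]   # same orderings, strategy 2

tau, p_value = kendalltau(succ_sc, succ_evo)
print(f"Kendall's tau = {tau:.3f}, p = {p_value:.2e}")
```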
4.3. Different CPU limits

Figure 4 shows the relative performance of the orderings for a CPU limit of 1 second, compared to a CPU limit of 5 seconds. For all 4 base strategies, there is good similarity between the curves, and for all but 20:1 there is a strong correlation. In all cases, the null-hypothesis is strongly rejected. Thus, it seems possible to extrapolate ordering performance from shorter to longer run times in most cases.

[Plots: performance for different CPU limits (1 s vs. 5 s) for SC, 20:1, EVO, and UEQ; successes per ordering strategy, ordered by the 1 s base strategy's successes.]

  Rankings        Kendall's Tau   p-Value
  SC 1s vs 5s     0.554           3.30e-54
  20:1 1s vs 5s   0.259           4.23e-13
  EVO 1s vs 5s    0.502           8.98e-45
  UEQ 1s vs 5s    0.627           6.99e-69

Figure 4: Comparison of performance for 1 s and 5 s CPU limits

4.4. Rewriting with strong instances

We have plotted the relative performance of the different orderings and strategies in Figure 5. As already visible from the summary data in Table 1, rewriting with strong instances performs consistently better than conventional rewriting. The difference is significant for all base strategies but UEQ, where it is minuscule. In all 4 cases we have an excellent agreement between relative performance of normal and strong instance variants, indicating that strong instances can probably be added to any prover and yield an improvement without disrupting existing strategies.

[Plots: normal rewriting vs. rewriting with strong instances for SC, 20:1, EVO, and UEQ; successes per ordering strategy, ordered by the base strategy's successes.]

  Rankings          Kendall's Tau   p-Value
  SC vs SC-SI       0.685           7.12e-82
  20:1 vs 20:1-SI   0.813           1.51e-114
  EVO vs EVO-SI     0.808           4.58e-113
  UEQ vs UEQ-SI     0.790           3.32e-108

Figure 5: Performance with conventional and strong-instance rewriting

4.5. Performance on full first-order problems

Finally, we can compare the performance of all orderings on the full set of first-order problems, not just unit-equational problems. Since the number of problems is over 10 times larger now, we had to restrict the experiment to only one clause selection strategy. We picked 20:1, because it has decent performance and is easy to transfer to most other provers. The result is plotted in Figure 6.
While the overall number of solutions for each ordering variant is several times higher for the full set (since the number of problems is higher), we can see that again performance on all problems correlates with performance on unit-equational problems, if not quite as pronounced as for some of the other comparisons.

[Plot: relative performance on UEQ and all problems for the 20:1 strategy; successes per ordering strategy, ordered by the base strategy's successes.]

  Rankings     Kendall's Tau   p-Value
  UEQ vs All   0.345           4.72e-22

Figure 6: Relative performance on UEQ and all problems

5. Related and Future Work

The long term aim of this line of research is to automatically find good term orderings for any given problem. Our current approach is driven by the very successful standard saturation approach based on the given-clause loop. For the unit-equational case, there has been a lot of impressive work done for the Waldmeister prover, some of which is described in [22], but significant parts of which are only available as folklore. In particular, Waldmeister recognised domains by matching axioms on higher-order patterns, and instantiated ordering schemes based on the domain. Generalising this approach to full clausal logic, and also automating the knowledge acquisition, would be a major step into the future. One step in this direction is the work by Bártek and Suda [23], which describes an attempt to learn good precedences for KBO from features of the proof problem using a neural network architecture.

There has been some work about orienting a static system of equations with a KBO, e.g. [24]. However, this was a theoretical result with no experiments on the practical effects of computing such a static ordering before starting the proof search. Based on a similar premise is MædMax [25], a completion-based system for unit equality. It is not built on the given-clause loop, but rather on a kind of level saturation. At each iteration of its main loop, the system tries to find an ordering to orient as many equations as possible before the next round, before computing and normalising all critical pairs of the resulting system.

Our work with strong instances is quite conservative, in that it does not modify the clause set or inference rules, but only makes the implementation correspond more closely to the theoretical case. Twee [26] has an interesting alternative approach. It splits such unorientable equations into an orientable part and a permutation equation. It would be interesting to see a direct comparison of both techniques.

Finally, saturating theorem provers are moving into higher-order logic [27]. One significant question is how many of these techniques can be lifted to this case, and how well they will work there.

6. Conclusion

Term orderings are a critical component of the search strategies of modern theorem provers. We have presented a first systematic overview of the performance of relatively naive term ordering generation schemes. With respect to our research questions, we can say the following:

1. KBO performs almost uniformly better than LPO. However, both KBO and LPO instances are necessary to solve all solvable problems. Differences between different LPO instances are much bigger than those between different KBO instances, but the latter are still significant.

2. There is some correlation between the performance of different orderings under otherwise different settings.
However, relative performance between similarly performing strategies cannot be reliably predicted.

3. It is, on the other hand, quite possible to predict relative performance for longer run times from shorter run times. That is particularly valuable for system tuning, since tuning experiments can be done with relatively short run times.

4. Maybe surprisingly, there is a quite good correlation between performance of an ordering in the UEQ domain and over all problems, despite the fact that orderings play additional roles for reasoning with general clauses.

The raw data is available for further evaluation and experiments, and there is a clear path to both generate more diverse and stronger orderings, and to couple ordering generation to the presence of certain axioms and axiom structures.

Acknowledgments

The various anonymous reviewers of this paper have been extremely helpful, identifying, in addition to several language issues and typos, a wrong number in Table 1 that slipped in during editing.

References

[1] D. Knuth, P. Bendix, Simple Word Problems in Universal Algebras, in: J. Leech (Ed.), Computational Algebra, Pergamon Press, 1970, pp. 263–297.
[2] J. Hsiang, M. Rusinowitch, On Word Problems in Equational Theories, in: Proc. of the 14th ICALP, Karlsruhe, volume 267 of LNCS, Springer, 1987, pp. 54–71.
[3] L. Bachmair, N. Dershowitz, D. Plaisted, Completion Without Failure, in: H. Aït-Kaci, M. Nivat (Eds.), Resolution of Equations in Algebraic Structures, volume 2, Academic Press, 1989, pp. 1–30.
[4] L. Bachmair, H. Ganzinger, Rewrite-Based Equational Theorem Proving with Selection and Simplification, Journal of Logic and Computation 3 (1994) 217–247.
[5] R. Nieuwenhuis, A. Rubio, Paramodulation-Based Theorem Proving, in: A. Robinson, A. Voronkov (Eds.), Handbook of Automated Reasoning, volume I, Elsevier Science and MIT Press, 2001, pp. 371–443.
[6] W. McCune, L. Wos, Otter: The CADE-13 Competition Incarnations, Journal of Automated Reasoning 18 (1997) 211–220. Special Issue on the CADE 13 ATP System Competition.
[7] J. Denzinger, M. Kronenburg, S. Schulz, DISCOUNT: A Distributed and Learning Equational Prover, Journal of Automated Reasoning 18 (1997) 189–198. Special Issue on the CADE 13 ATP System Competition.
[8] T. Hillenbrand, A. Buch, R. Vogt, B. Löchner, WALDMEISTER: High Performance Equational Deduction, Journal of Automated Reasoning 18 (1997) 265–270. Special Issue on the CADE 13 ATP System Competition.
[9] G. Sutcliffe, The CADE ATP System Competition – CASC, AI Magazine 37 (2016) 99–101. doi:10.1609/aimag.v37i2.2620.
[10] F. J. Pelletier, G. Sutcliffe, C. Suttner, Conclusions about the CADE-13 ATP system competition, Journal of Automated Reasoning 18 (1997) 287–296.
[11] S. Schulz, E – A Brainiac Theorem Prover, Journal of AI Communications 15 (2002) 111–126.
[12] S. Schulz, S. Cruanes, P. Vukmirović, Faster, higher, stronger: E 2.3, in: P. Fontaine (Ed.), Proc. of the 27th CADE, Natal, Brasil, number 11716 in LNAI, Springer, 2019, pp. 495–507.
[13] B. Löchner, Things to Know when Implementing KBO, Journal of Automated Reasoning 36 (2006) 289–310.
[14] B. Löchner, Things to Know When Implementing LPO, International Journal on Artificial Intelligence Tools 15 (2006) 53–80.
[15] G. Sutcliffe, The TPTP problem library and associated infrastructure – from CNF to TH0, TPTP v6.4.0, Journal of Automated Reasoning 59 (2017) 483–502.
[16] S. Kamin, J.-J. Levy, Two generalizations of the recursive path ordering, Technical Report, Department of Computer Science, University of Illinois, Urbana, IL, 1980.
[17] S. Schulz, E 2.4 User Manual, EasyChair preprint no. 2272, 2019. URL: https://easychair.org/publications/preprint/RjDx.
[18] S. Schulz, M. Möhrmann, Performance of clause selection heuristics for saturation-based theorem proving, in: N. Olivetti, A. Tiwari (Eds.), Proc. of the 8th IJCAR, Coimbra, volume 9706 of LNAI, Springer, 2016, pp. 330–345.
[19] S. Schäfer, S. Schulz, Breeding theorem proving heuristics with genetic algorithms, in: G. Gottlob, G. Sutcliffe, A. Voronkov (Eds.), Proc. of the Global Conference on Artificial Intelligence, Tbilisi, Georgia, volume 36 of EPiC, EasyChair, 2015, pp. 263–274.
[20] J. Urban, BliStr: The blind strategymaker, in: G. Gottlob, G. Sutcliffe, A. Voronkov (Eds.), Proc. of the Global Conference on Artificial Intelligence, Tbilisi, Georgia, volume 36 of EPiC, EasyChair, 2015, pp. 312–319.
[21] M. G. Kendall, Rank Correlation Methods, 4th ed., Charles Griffin & Co., 1970.
[22] T. Hillenbrand, A. Jaeger, B. Löchner, System Abstract: Waldmeister – Improvements in Performance and Ease of Use, in: H. Ganzinger (Ed.), Proc. of the 16th CADE, Trento, volume 1632 of LNAI, Springer, 1999, pp. 232–236.
[23] F. Bártek, M. Suda, Neural Precedence Recommender, in: A. Platzer, G. Sutcliffe (Eds.), Proc. of the 28th CADE, Pittsburgh, volume 12699 of LNAI, Springer, 2021, pp. 525–542.
[24] K. Korovin, A. Voronkov, Orienting rewrite rules with the Knuth–Bendix order, Journal of Information and Computation 183 (2003) 165–186.
[25] S. Winkler, G. Moser, MædMax: A maximal ordered completion tool, in: D. Galmiche, S. Schulz, R. Sebastiani (Eds.), Proc. of the 9th IJCAR, Oxford, volume 10900 of LNAI, Springer, 2018, pp. 472–480.
[26] N. Smallbone, Twee: An Equational Theorem Prover, in: A. Platzer, G. Sutcliffe (Eds.), Proc. of the 28th CADE, Pittsburgh, volume 12699 of LNAI, Springer, 2021, pp. 602–613.
[27] P. Vukmirović, J. C. Blanchette, S. Cruanes, S. Schulz, Extending a Brainiac Prover to Lambda-free Higher-Order Logic, International Journal on Software Tools for Technology Transfer (2021). doi:10.1007/s10009-021-00639-7.