An Empirical Perspective on Ten Years of QBF Solving Paolo Marin2 , Massimo Narizzano1 , Luca Pulina3 , Armando Tacchella1 , and Enrico Giunchiglia1 1 DIBRIS, Università di Genova, Via Opera Pia, 13 – 16145 Genova – Italy {giunchiglia,narizzano,tacchella}@unige.it 2 Lehrstuhl für Rechnerarchitektur, Georges-Köhler-Allee 051 – 79110 Freiburg i.B. – Germany marin@informatik.uni-freiburg.de 3 POLCOMING, Università di Sassari, Viale Mancini n. 5 – 07100 Sassari – Italy lpulina@uniss.it Abstract. Twelve years have elapsed since the first QBF evaluation was held as an event linked to SAT conferences. During this period, researchers have strived to propose new algorithms and tools to solve challenging problems, with evaluations periodically trying to assess the current state of the art. In this paper, we present an experimental account of solvers and benchmarks with the aim to understand the progress, if any, in the QBF arena. Unlike typical evaluations, the analysis is not confined to the snapshot of submitted solvers and problems, but rather we consider several tools that were proposed over the last decade, and we run them on different problem sets. The main contribution of our analysis, which is also the message we would like to pass along to the research community is that some faded-to-oblivion techniques turn out to be still quite effective. 1 Introduction The first non-competitive QBF solvers evaluation (QBFEVAL’03) [3] was held as an associated event of the SAT 2003 conference. If the purpose of QBFEVAL’03 was to assess the state of the art in the relatively young – in the time – QBF reasoning field, the ensuing QBFEVAL series was established with the purpose of measuring the progress in QBF reasoning techniques – see, e.g., [18]. Since the last evaluation, what has been the progress (if any) in the QBF arena? After more than a decade of new solvers being developed and new challenge problems being proposed, we believe that QBFEVAL and, more recently, QBF Gallery [6] events offer a series of snapshots about QBF solving and related aspects, but somehow fail to provide a long-term picture about what has been achieved. Covering the whole time span of QBFEVAL and QBF Gallery events, our experiments enable us to assess the progress in the QBF field, and put the current state of the art in a historical perspective. In order to achieve this goal, the experimental setup is not confined to a snapshot in time offered by recently proposed systems. In particular, as far as systems are concerned, we consider 62 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving some legacy solvers, i.e., tools that were proposed in the literature, but are not considered in more recents comparative events, e.g., because they are no longer maintained or updated. We call new solvers all the other tools that we consider and which are not legacy. In particular, out of 9 solvers considered, the legacy ones are AIGSolve [19], aqme [21], quantor [4], QuBE [8], sKizzo [1], and StruQS [22]. These tools are chosen among winners of at least one category in the past QBFEVAL events, conditioned to their maintenance ending before 2010. The set of new solvers is assembled by including the winners of the last QBF Gallery 2014, namely depqbf [15], ghostq [13] and rareqs [10]. As for problems, we consider two different pools, namely QBF Gallery 2014 Track 1 and QBF Gallery 2014 Track 2. Overall, the problem set is purposefully biased towards more recently submitted instances, in order to (try to) assess legacy solvers on problems that are probably “unseen” to them, i.e., for which their developers did not have a chance to optimize the solver. The main conclusion that we draw by analyzing the results of our compar- ison is that the techniques implemented in legacy solvers are far from being outdated. Just to get an idea of what we observe – more details can be found in Section 4 – consider that, if we rank the tools using the number of prob- lems solved, then it turns out that at least two legacy solvers rank among the first three solvers, for all the pools considered. Further evidence in this direction can be obtained considering the “state-of-the-art” (SOTA) solver abstraction, i.e., the ideal system that always fares the best time among the systems in a solver portfolio. If we build “legacy-SOTA” and “new-SOTA” solvers based on the corresponding portfolios of legacy and new solvers, then we observe that legacy-SOTA outperforms new-SOTA – and this remains true even looking at specific subcategories in most cases. While it is difficult to single out the con- tribution of specific algorithmic techniques by looking at the performances of implemented systems – most of which are closed source – the results we observe strongly suggest that, while new solvers are better engineered than legacy ones, the latter have some combination of techniques that are probably worth taking into account for further developments. The rest of the paper is structured as follows. In Section 2 we review QBF syntax and semantics. In Section 3 we briefly describe the solvers and the prob- lems used in our experiments. Section 4 presents the results, while in Section 5 we conclude the paper with some final remarks. 2 Preliminaries In this section we consider the definition of QBFs and their satisfiability as given in the literature of QBF decision procedures (see, e.g., [9, 2, 4]), and we define features describing the structure of QBFs. Syntax and Semantics A variable is an element of a set P of propositional letters and a literal is a variable or the negation thereof. We denote with |l| the variable occurring in the literal l, and with l the complement of l, i.e., ¬l if l is a variable 63 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving and |l| otherwise. A literal is positive if |l| = l and negative otherwise. A clause C is an n-ary (n ≥ 0) disjunction of literals such that, for any two distinct disjuncts l, l0 in C, it is not the case that |l| = |l0 |. A propositional formula is a k-ary (k ≥ 0) conjunction of clauses. A quantified Boolean formula is an expression of the form Q1 z1 . . . Qn zn Φ (1) where, for each 1 ≤ i ≤ n, zi is a variable, Qi is either an existential quantifier Qi = ∃ or a universal one Qi = ∀, and Φ is a propositional formula in the variables {z1 , . . . , zn }. The expression Q1 z1 . . . Qn zn is the prefix and Φ is the matrix of (1). A literal l is existential if |l| = zi for some 1 ≤ i ≤ n and ∃zi belongs to the prefix of (1), and it is universal otherwise. The semantics of a QBF ϕ can be defined recursively as follows. A QBF clause is contradictory exactly when it does not contain existential literals. If the matrix of ϕ contains a contradictory clause then ϕ is false. If the matrix of ϕ has no conjuncts then ϕ is true. If ϕ = Qzψ is a QBF and l is a literal, we define ϕl as the QBF obtained from ψ by removing all the conjuncts in which l occurs and removing l from the others. Then we have two cases. If ϕ is ∃zψ, then ϕ is true exactly when ϕz or ϕ¬z are true. If ϕ is ∀zψ, then ϕ is true exactly when ϕz and ϕ¬z are true. The QBF satisfiability problem (QSAT) is to decide whether a given formula is true or false. It is easy to see that if ϕ is a QBF without universal quantifiers, solving QSAT is the same as solving propositional satisfiability (SAT). Representing QBFs To correlate the structure of QBFs with the performances of solvers, we extract representative features from QBFs — see, e.g., [21]. A first class is given by syntactic features: – c, total number of clauses; c1 , c2 , c3 total number of clauses with 1, 2 and more than two existential literals, respectively; ch , cdh total number of Horn and dual-Horn clauses, respectively; – v, total number of variables; v∃ , v∀ , total number of existential and universal variables, respectively; ltot , total number of literals; vs, vs∃ , vs∀ , distribution of the number of variables per quantifier set, considering all the variables, and focusing on existential and universal variables, respectively; s, s∃ , s∀ , number of total, existential and universal, quantifier sets; – l, distribution of the number of literals in each clause; l+ , l− , l∃ , l∃+ , l∃− , l∀ , l∀+ , l∀− , distribution of the number of positive, negative, existential, pos- itive existential, negative existential, universal, positive universal, negative universal number of literals in each clauses, respectively. – r, distribution of the number of variable occurrences r+ , r− , r∃ , r∃+ , r∃− , r∀ , r∀+ , r∀− , distribution of the number of positive, negative, existential, pos- itive existential, negative existential, universal, positive universal, negative universal variable occurrences, respectively. We also take into account the following combined features: 64 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving – vc , the classic clauses-to-variables ratio, and for each x ∈ {l, r} the following ratios (on mean values): • xx+ , xx− , xx− + , balance ratios; x∃ x∃+ x∃− x∃+ x∃− x∃+ x∃+ x∃− • x , x , x , x∃ , x∃ , x∃− , x+ , x− , balance ratios (existential part); • xx∀ , x∀+ x∀− x∀+ x∀− x∀+ x∀− x∀+ x , x , x+ , x− , x∀ , x∀ , x∀− , balance ratios (universal part); – cc1 , cc2 , cc3 , cch , cdh ch c , cdh , i.e., balance ratios between different kinds of clauses. A second class of features is computed on graph models of QBFs. From previous related work on SAT, see, e.g. [17], we borrow variable graphs (VG) and the clause graphs (CG). The former has a node for each variable and an edge between variables that occur together in at least one clause, while the latter has nodes representing clauses and an edge between two clauses whenever they share a negated literal. For each graph, we consider the average value on their node degree. Finally, we also consider a treewidth measure twp which accounts for the treewidth of the VG adjusted to keep into account that only elimination orders compatible to the prefix p are viable — see [20, 23] for details, and also for extensive empirical evidence about the correlation of twp with hardness of QBFs. 3 Setup In this section we present solvers and problems that we selected for our analysis. As for solvers, we consider systems participating to QBF Gallery 20141 as well as solvers participating to past QBFEVAL editions. Considering the former, we choose the winners of Track 1 and Track 2 [12] which are shortly described in the following. depqbf (v. 3.0.4) [15] is a search-based solver performing non-chronological backtracking from conflicts and solutions; depqbf can select branching vari- ables without following the prefix order by leveraging a compact representa- tion of the dependencies among variables. ghostq (v. qdimacs-gal-2014) [13] is a non-prenex DPLL-based solver which makes use of auxiliary variables to force necessary assignments, i.e., to force a value to an existential (resp. universal) variable if the opposite value directly makes the formula evaluate to false (resp. true). Additionally, it features a CEGAR-based learning to further prune the search space when the last decision literal is existential (resp. universal) and a conflict (resp. solution) is detected. rareqs (v. 1.1) [10] is a counterexample guided abstraction refinement (CE- GAR) based solver which performs a kind of resolution and expansion pro- cedure but in a depth-first way, i.e., by expanding first only one value of a variable, and learns abstractions of the local partial solutions to refine the global solution. 1 http://qbf.satisfiability.org/gallery 65 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving We did not consider the system hiqqer [12] because we could not find a version available for download. In the remainder of the paper, we refer to this pool of solvers as s-new. Solvers participating to past editions of QBFEVAL – to which we refer as s-legacy from now on – are described in the following. AIGSolve [19] uses And-Inverter Graphs (AIGs) as the main data structure, and AIG-based operations to reason about the input formula. The solver includes preliminary phases devoted to simplification, structure extraction and early quantification of the input formula. aqme [21] is a multi-engine solver, i.e., a tool using Machine Learning tech- niques to select among its reasoning engines the one which is more likely to yield optimal results. The reasoning engines of aqme are a subset of those submitted to QBFEVAL’06, namely 2clsQ, quantor, QuBE, sKizzo, and sSolve. Engine selection is performed according to the adaptive strategy de- scribed in [21]. quantor [4] is based on Q-resolution (to eliminate existential variables) and Shannon expansion (to eliminate universal variables), plus a number of fea- tures, such as equivalence reasoning, subsumption checking, pure literal de- tection, unit propagation, and also a scheduler for the elimination step. QuBE [8] is a solver that first applies, among other simplification techniques, deep equivalence reasoning and removes variables by Q-Resolution. Then, it uses a search-based decision procedure that performs monotone and “don’t care” literal propagation, non-chronological backtracking from conflicts and solutions, in which it produces and removes less clauses/terms made tauto- logical by blocking universal/existential literals than its predecessor. sKizzo [1] is a reasoning engine for QBF featuring several techniques, including search, resolution and skolemization. StruQS [22] main feature is a dynamic combination of search – with solution- and conflict-backjumping – and variable-elimination. The key point in this approach is to implicitly leverage graph abstractions of QBFs to yield struc- tural features which support an effective decision between search and variable elimination. We included AIGSolve because it is the only system employing AIG-based operations to reason on input QBF. We involved aqme for its multi-engine archi- tecture; as a by-product, it can return an approximated picture of state-of-the-art QBF solvers back in 2006, so it can be used as “yardstick” to assess improve- ments. quantor, QuBE, and sKizzo implement key QBF solution techniques, namely resolution and expansion, DPLL-search, and Skolemization, respectively. Finally, we included StruQS because it represents the first — and, to the best of our knowledge, the only — attempt to combine dynamically very different solu- tion techniques. Almost all the s-legacy solvers also collected accolades in past QBF evaluations. AIGSolve was the winner of the QBFEVAL’10 small hard track, while aqme was the system able to solve the highest number of formulas in QBFEVAL’07, ’08, and in the main track of QBFEVAL’10. quantor was the 66 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving winner of QBFEVAL’04, while QuBE won the 2QBF track of QBFEVAL’10. Finally, sKizzo has been the winner of QBFEVAL’05 and ’07. We evaluate the above mentioned systems on different pools of problem in- stances. The syntax of the instances is prenex-CNF using the qdimacs 1.1 format. The problem pools we consider are briefly outlined in the following. – The formulas included in QBF Gallery 2014 Track 1. These are 276 instances collectively denoted as qbfg-t1. – The formulas included in QBF Gallery 2014 Track 2. These are collectively denoted as qbfg-t2. The pool qbfg-t2 includes formulas coming from six different families, namely: bomb and dungeon [14] are encodings of conformant planning problems with optimal length and uncertainty of the initial state. complexity [11] result from a QBF encoding of automatic reduction between decision problems. The original problem is undecidable in general, but it can be reduced to Σ2p if the dimension of the reduction is fixed and given, and the size of the inputs is bounded. hardness [16] Black-Box bounded model checking instances for an incomplete parametrized arbiter of a bus system. planning [5] This instance set include different planning problems encoded into QBF using two different strategies: the first one is based on the iterative squaring formulation, and the second one relies on a more compact tree-like encoding. testing [24] The solutions to these problems are test patterns for sequential circuits coming from ISCAS 89 and ITC 99 benchmarks having a maximum amount of inputs set to don’t care. 4 Experimental Analysis In this section we report and analyze the results of our empirical evaluation. All the experiments ran on a cluster of Intel Xeon E3-1245 PCs at 3.30 GHz equipped with 64 bit Ubuntu 12.04. All solvers were limited to 600 seconds of CPU time and to 4GB of memory. 4.1 QBF Gallery 2014 formulas – Track 1 The aim of our first experiment is to evaluate the selected solvers in the qbfg- t1 pool of instances. In Table 1 we report the raw results of such evaluation. Looking at the results, we can see that only 6 solvers out of 9 were able to solve at least 25% of the test set. If we rank solvers according to the number of problems solved within the time limit, then the best system is AIGSolve, which can solve about 42% of the test set, followed by QuBE and aqme. To find the best solver in s-new, namely ghostq, we must go down to the fourth position. ghostq performs only slightly worse than aqme, and it tops at 33% of the 67 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving Solver Total True False Unique # Time # Time # Time # Time AIGSolve 116 5333.01 56 2177.45 60 3155.56 22 1458.26 QuBE 106 8764.73 53 3997.78 53 4766.95 8 1195.58 aqme 97 3287.20 39 1098.00 58 2189.20 – – ghostq 91 4814.73 48 2912.38 43 1902.17 4 158.97 depqbf 88 2388.32 39 1163.15 49 1225.17 5 454.77 rareqs 79 2588.64 32 1593.25 47 995.39 6 787.33 sKizzo 51 948.81 18 556.76 33 392.06 – – quantor 50 1498.37 28 911.72 22 586.65 2 161.67 StruQS 43 6092.64 31 4052.98 12 2039.66 1 16.53 Table 1. Runtime of solvers on qbfg-t1. For each solver, the table reports its name (column “Solver”), the total number of instances solved and the cumulative time to solve them (columns “#” and “Time”, group “Total”), the number of instances found satisfiable and the time to solve them (columns “#” and “Time”, group “True”), the number of instances found unsatisfiable and the time to solve them (columns “#” and “Time”, group “False”), and, finally, the number of instances uniquely solved and the time to solve them (columns “#” and “Time”, group “Unique”); a “–” (dash) means that the solver did not solve any instance. The table is sorted in descending order according to the number of instances solved, and, in case of a tie, in ascending order according to the cumulative time taken to solve them. test set. This result is relevant for our case in point, particularly if we consider that both AIGSolve and QuBE are systems dating back to 2010, while aqme combines solvers dating back to QBFEVAL 2006. Finally, despite quantor and StruQS were not able to solve more than 20% of qbfg-t1, still they were the only ones able to solve some instances — 2 and 1, respectively. If we consider the structure of the instances comprised in qbfg-t1, then we can observe several structural differences between those solved by at least one solver and those that remained unsolved. For instance, if we focus on the formula size in terms of variables v and clauses c, then we can see that unsolved instances feature, on average, higher values of both parameters, i.e., they are somewhat larger. Looking at the median values v̂ and ĉ of the parameters v and c, we can see that v̂ = 3412 if the population is restricted to solved instances, whereas v̂ = 10188 on the population of unsolved ones. A similar picture holds for c, with ĉ = 14818 and ĉ = 57130 for solved and unsolved instances, respectively. As expected, twp is also indicative of this spread, since tw ˆ p = 486 for solved instances, whereas tw ˆ p = 1102 for unsolved ones. Another perspective about the results of Table 1 can be obtained by resorting to the state-of-the-art solver abstraction (sota in the following), i.e., the ideal solver that always fares the best time among all the solvers in a portfolio. In this case, sota was able to cope with about 73% of qbfg-t1 (202 formulas). What is more relevant is that all the systems contributed to its composition. In particular, the main contributors — in percentage — were AIGSolve, depqbf, 68 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving variables clauses avg clause length treewidth Fig. 1. Box-plots of different features distributions related to QBFs comprised in qbfg- t1. Features are v and c (top-left and top-right, respectively), l (bottom left), and twp (bottom right). Each distribution in the plots is labeled as follows: “SOTA-L” and “SOTA-N” are placeholders for sota-legacy and sota-new, respectively, while “S” and “U” – in parentheses – stand for “solved” and “unsolved”. For each plot, we show a box-and-whiskers diagram representing the median (bold line), the first and third quartile (bottom and top edges of the box), the minimum and maximum (whiskers at the top and the bottom) of a distribution. An approximated 95% confidence interval for the difference in the two medians is represented by the notches cut in the boxes: if the notches of two plots do not overlap, this is strong evidence that the two medians differ. Finally, in the y-axes of each plot are reported the values of the related features. and rareqs with 22%, 20%, and 17%, respectively. Notice that 2 out of 3 of the main contributors are indeed in s-new. With the aid of the sota abstraction we can also compare the overall perfor- mances of solvers in s-legacy vs. those in s-new. In order to do that, we com- pute two abstractions, namely sota-legacy — considering only legacy systems 69 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving — and sota-new— considering only new solvers. The rationale of this analysis is twofold: on one hand, we want to evaluate the advancement of the state of the art with respect to legacy systems (and related solving techniques); on the other, we want to look for patterns, expressed by means of features, enabling us to spot differences in the type of QBFs solved by old and new systems. As far as advancing the state of the art is concerned, we report that sota-legacy solves 185 formulas — about 92% of those solved by sota — while sota-new tops at 70% (142 instances). In view of these results, and considering that most of the formulas in qbfg-t1 where not available at the time in which the solvers in s-legacy were developed, there does not seem to be a stark advancement in solvers’ abilities from s-legacy to s-new. As for the nature of the instances solved by legacy vs. new solvers, we can try to observe differences in the structure of QBFs solved by solvers in sota- legacy and solvers in sota-new. In Figure 1 we present the distributions of four features across four different populations obtained by combining sota-legacy, sota-new with solved and unsolved formulas. In the figure, we can see that the parameter l (average clause length) is not significantly different among the various classes of problems — all the notches overlap. If we consider v (number of variables) then we see that for sota-new the value of v̂ is significantly different between solved and unsolved instances, while the same is not true for sota- legacy. Therefore, it seems that the sheer number of variables matters most for solvers in sota-new. However, also notice that there is no significant difference between sota-legacy and sota-new when considering (un)solved formulas. As for c (number of clauses) both sota-legacy and sota-new are sensitive to this parameter: higher values of c imply harder formulas. Also in this case, no significant difference can be spotted when considering sota-legacy and sota- new on (un)solved formulas. Finally, looking at the distributions of twp , we can see that its median value is not a significant hardness predictor for solvers in sota-new, whereas it is a hardness predictor for solvers in sota-legacy, but there are no differences when considering (un)solved formulas. Overall, we can conclude that no clear pattern emerges that could help to differentiate (un)solved formulas between solvers in sota-new and sota-legacy, at least looking at the parameters shown in Figure 1. 4.2 QBF Gallery 2014 formulas – Track 2 Our next experiment aims at assessing solvers on the pool qbfg-t2. Before delv- ing into the analysis, we wish to point out that there are structural differences between the formulas in qbfg-t2 and those in qbfg-t1. On average, they are characterized by a smaller number of median variables v̂ (3374 vs. 4708), but a considerably larger number of median clauses ĉ (29492 vs. 17397). Formulas in qbfg-t2 are also characterized by a relatively small value of universal variables since v̂v̂∀ = 0.006 in the case of qbfg-t2, while the same ratio is 0.02 in the case of qbfg-t1. Finally, we report that qbfg-t1 formulas usually have a higher value of average clause length since ˆl = 2.58, whereas the same value is 2.37 for qbfg-t2. 70 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving Family Solver Total True False Unique # Time # Time # Time # Time AIGSolve 83 1003.23 40 165.20 43 838.03 – – rareqs 83 1420.61 34 165.41 49 1255.19 6 1094.71 quantor 82 923.25 53 217.92 29 705.33 – – aqme 80 674.38 53 345.02 27 329.36 – – bomb depqbf 67 2410.16 40 1693.36 27 716.79 – – (132) sKizzo 57 609.41 31 2.39 26 607.03 – – ghostq 56 532.47 29 42.66 27 489.81 – – QuBE 47 1168.86 23 470.47 24 698.39 – – StruQS 36 1051.46 19 813.58 17 237.88 – – rareqs 75 1559.65 29 466.77 46 1092.88 15 1148.51 depqbf 49 1553.73 22 1086.35 27 467.38 – – ghostq 42 1791.86 11 499.21 27 467.38 – – QuBE 39 1273.95 19 277.87 20 996.09 – – complexity aqme 33 528.28 15 188.76 18 339.52 – – (104) quantor 26 170.44 11 11.29 15 159.14 – – StruQS 21 1855.53 13 1677.81 8 177.72 – – AIGSolve 15 70.26 7 12.24 8 58.02 – – sKizzo 9 316.60 4 315.82 5 0.78 – – quantor 104 525.30 18 54.81 86 470.48 – – aqme 104 1121.43 18 86.11 86 1035.32 – – AIGSolve 87 1220.22 17 417.12 70 803.10 – – rareqs 57 1870.73 18 54.89 39 1815.85 – – dungeon depqbf 44 535.22 18 300.44 26 234.77 – – (107) QuBE 34 1429.60 7 212.89 27 1216.71 – – ghostq 7 385.11 4 4.62 3 380.49 – – sKizzo 2 0.99 – – 2 0.99 – – StruQS 1 21.96 1 21.96 – – – – StruQS 88 7826.42 1 372.74 87 7453.68 12 3033.19 QuBE 76 1346.11 – – 76 1346.11 2 328.75 ghostq 51 2649.30 2 239.56 49 2409.74 1 224.22 aqme 50 265.14 – – 50 265.14 – – hardness rareqs 14 1431.05 – – 14 1431.05 – – (114) AIGSolve 12 2038.84 – – 12 2038.84 – – depqbf 8 617.99 – – 8 617.99 – – quantor – – – – – – – – sKizzo – – – – – – – – AIGSolve 147 2371.36 38 114.02 109 2257.34 10 861.70 rareqs 137 1093.01 38 125.66 99 967.35 – – quantor 131 6750.13 37 122.68 94 6627.44 – – aqme 123 9263.25 37 464.97 86 8798.28 – – planning sKizzo 74 71.57 34 24.02 40 47.55 – – (147) depqbf 57 5134.24 29 1876.90 28 3257.34 – – QuBE 14 1270.35 12 743.61 2 526.74 – – ghostq 11 2155.26 8 1420.71 3 734.55 – – StruQS 4 1229.67 4 1229.67 – – – – aqme 71 2675.64 64 2339.68 7 335.95 1 3.00 StruQS 65 1770.09 63 1488.09 2 282.00 4 236.18 depqbf 57 692.38 46 672.96 11 19.42 2 359.15 AIGSolve 51 4194.44 46 4163.65 5 30.79 2 11.88 testing QuBE 41 765.08 31 734.85 10 30.23 1 1.24 (131) rareqs 34 428.00 22 317.04 12 110.95 1 0.53 ghostq 32 269.13 29 66.10 3 203.03 – – quantor 26 121.15 25 110.52 1 10.63 – – sKizzo 1 0.02 1 0.02 – – – – Table 2. Performances of QBF solvers on qbfg-t2: The table is split in six horizontal parts, one for each family. The first column contains families names, as well as its total amount of instances. The rest of the table is organized as Table 1. 71 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving In Table 2, we show the results of our experiments on qbfg-t2. The formulas in bomb, when compared to the whole qbfg-t2 formulas, are characterized by higher median values of l− c l (0.96 vs 0.84), v (13.61 vs 8.74), and twp (914 vs 758). On this subcategory, the best systems are AIGSolve, rareqs, and quan- tor, which are the only ones able to solve more than 60% of the total. Also in this case, two of the top three performers are solvers in s-legacy. However, we should also point out that rareqs is the only system able to solve instances uniquely. Overall, it seems that the solvers which are not purely search-based are also the most effective ones in this subcategory. This difference cuts across the separation between s-legacy and s-new, and it could be due to the fact that these formulas are relatively easy to expand into SAT instances, so solvers featuring this technique, e.g., quantor and rareqs, handle them more effec- tively. Indeed, if we consider the sota abstraction, its major contributors are quantor and rareqs, with 41 and 38 formulas, respectively. Overall, sota is able to solve 77% of the total (102 instances out of 132). In spite of the very good performances of rareqs, still sota-legacy solved 96 instances, while sota-new 89, thus confirming the picture that we observed in qbfg-t1. Regarding the results on complexity, looking at Table 2 we can see that the best solvers are all comprised in s-new. Noticeably, this is the only subcategory of qbfg-t2 and the only case throughout our experimental analysis in which this is true. rareqs, depqbf, and ghostq are able to solve 75, 49, and 42 instances, respectively. In particular, rareqs solves 15 of them uniquely. Looking at the structure of QBFs, we can see that complexity instances are smaller than bomb ones: the median values of c, v, and l are 1101, 2533, and 6601, respectively. Moreover, with respect to bomb, we also report a smaller values of v̂v̂∀ , and of median clause-to-variable ratio vc . On the other hand, the parameter ˆl on these formulas is 2.66, higher than qbfg-t2 (2.37) and bomb (2.07). Since depqbf and ghostq do not perform very well on bomb and also on other subcategories in qbfg-t2, we conjecture that (i) relatively small instances with (ii) relatively small number of universally quantified variables even with (iii) relatively long clauses, could correlate with positive performances of depqbf and ghostq. Considering the sota abstraction, we report that it solves the same number of formulas solved by the best solver (rareqs). Unsurprisingly, in this case sota- new outperforms sota-legacy — 75 and 44 solved instances, respectively. Considering dungeon, we can see that the three best solvers are comprised in s-legacy. quantor and aqme solved 97% of dungeon, while AIGSolve topping at about 81%. The structure of dungeon is characterized by large values of v̂, ĉ, and ˆl (27781, 128155, and 265184, respectively). On the other hand, we report small values of ˆl (1.99) and vˆ∀ (5). Moreover, it is worth noticing that dungeon formulas have a large amount of c1 and ch (number of unary and Horn clauses, respectively) with respect to the whole QBFs in qbfg-t2. The value of cˆ1 related to dungeon is 20299, while the one reported for qbfg-t2 is 3. Considering ĉh , the values in dungeon and qbfg-t2 are 125611 and 23277, respectively. Given, e.g., the large number of unary and Horn clauses, these formulas should not be particularly challenging in general. Despite that, looking 72 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving at the result we can see that otherwise effective solvers such as ghostq solved only 6% of the total. This fact makes us conjecture that for this family sheer size becomes an issue for some solvers. Finally, we report that sota can solve all but one formula (106 solved out of 107) and, in this case, sota-legacy outperforms sota-new (106 and 57 solved instances, respectively). Considering hardness, looking at Table 2 we can see that the best system is StruQS with 88 solved formulas, followed by QuBE and ghostq with 76 and 51 solved instances, respectively. This result is quite surprising because, considering the results described so far, StruQS always ranks among the worst three solvers. To investigate this phenomenon, we analyzed the structure of the instances comprised in hardness. First, we report that, on one hand, both v̂ and ĉ are relatively small (2191 and 7793, respectively); on the other, we can report for hardness the highest value of several features, such as ˆl (9.80), vv̂ˆ∀ (in percentage, 5%), and the number of quantified sets ŝ (26, against a value of 3 reported for qbfg-t2). This can partially explain the performance of StruQS because its hybrid resolution-search algorithm works best with small formu- las having many quantifier alternations. Finally, we report that sota was able to solve 91 instances, and its best contributors – in percentage – are QuBE, StruQS, and ghostq, with 65%, 16%, and 13% of the total, respectively. Also in this case, sota-legacy outperformed sota-new (90 and 51 solved instances, respectively). Concerning the results on planning, we can see from Table 2 the best solver is AIGSolve, able to deal with all the instances in the family. It is followed by rareqs and quantor, that solved 137 and 131 instances, respectively. The picture seems to be very similar to bomb and, indeed we can report that this family is characterized by a low value of v̂ (1947 vs. 3374 of qbfg-t2), but ˆ ) (326955 vs 96532) and ĉ (112826 vs 29492). These data also large values of ltot implies that planning has the largest value of the median clause to variable ratio. Finally, we report that variables in planning are highly connected: the median value of the VG node degree is 170.05, while the same value in qbfg-t2 is 21.61. As a final comment, we report that the performances of sota-legacy and sota-new are quite close in this case, with 147 and 137 instances solved, respectively. To conclude, looking at the results on testing, we can see that aqme is the best solver, dealing with about 54% of the instances. It is worth noticing that aqme solved 13% of the instances running sSolve [7], a system dating back to year 2000. Second and third are StruQS and depqbf, solving about 50% and 43%, respectively. Values of dimensional features of these formulas are quite similar to bomb, with the noticeable exception of the total amount of universal variables (more than 2% of the total), that makes the family more similar to hardness, which may also can explain the good performances of StruQS. As a final consideration, we report that sota solved 69% of the total, and its ma- jor contributors is depqbf (53%). Notice that also in this case sota-legacy outperforms sota-new (85 and 65 solved instances, respectively). 73 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving 5 Conclusions In the paper we have shown the results of a massive evaluation of QBF solvers and benchmarks. The picture that we have obtained is significant both because it is the first historical perspective on QBF solving technologies, and because of the results that emerged clearly from the analysis. In particular, we have shown that recently proposed solvers might benefit from some techniques implemented in legacy ones which defy aging. Indeed, new solvers seems to be fairly well en- gineered – the majority of the overall SOTA solver is made by new systems – and they made a relevant contribution to the QBF field, as witnessed by the fact that they are most often among the ones solving a formula uniquely. However, by comparing the sota-legacy and sota-new abstractions we have also shown that legacy systems still outperform the new ones in many problem categories. Therefore, we believe that it would be interesting to blend new techniques, e.g., CEGAR or dependency schemas, with legacy ones – modulo the inevitable engi- neering challenges that might arise – in order to really push forward the state of the art in the QBF arena. A contribution to the development and optimization of such blended solvers might come, e.g., from the significant number of problems that emerged as challenging throughout our evaluation. References 1. M. Benedetti. Evaluating QBFs via Symbolic Skolemization. In Eleventh Interna- tional Conference on Logic for Programming, Artificial Intelligence and Reasoning (LPAR 2004), volume 3452 of Lecture Notes in Computer Science. Springer Verlag, 2004. 2. M. Benedetti. sKizzo: a Suite to Evaluate and Certify QBFs. In 20th Int.l. Con- ference on Automated Deduction, volume 3632 of LNCS, pages 369–376. Springer Verlag, 2005. 3. D. L. Berre, L. Simon, and A. Tacchella. Challenges in the QBF arena: the SAT’03 evaluation of QBF solvers. In Sixth International Conference on Theory and Ap- plications of Satisfiability Testing (SAT 2003), volume 2919 of Lecture Notes in Computer Science, pages 468–485. Springer Verlag, 2003. 4. A. Biere. Resolve and Expand. In Seventh Intl. Conference on Theory and Ap- plications of Satisfiability Testing (SAT’04), volume 3542 of LNCS, pages 59–70. Springer Verlag, 2005. 5. M. Cashmore, M. Fox, and E. Giunchiglia. Partially grounded planning as quan- tified boolean formula. In D. Borrajo, S. Kambhampati, A. Oddi, and S. Fratini, editors, Proceedings of the Twenty-Third International Conference on Automated Planning and Scheduling, ICAPS 2013, Rome, Italy, June 10-14, 2013. AAAI, 2013. 6. A. V. G. F. Lonsing, M. Seidl. QBF gallery 2013, 2013. http://www.kr.tuwien. ac.at/events/qbfgallery2013/. 7. R. Feldmann, B. Monien, and S. Schamberger. A distributed algorithm to evaluate quantified boolean formulae. In Proceedings of the Seventeenth National Conference in Artificial Intelligence (AAAI’00), pages 285–290. AAAI Press / The MIT Press, 2000. 74 P.Marin et al. An Empirical Perspective on Ten Years of QBF Solving 8. E. Giunchiglia, P. Marin, and M. Narizzano. Qube7.0. JSAT, 7(2-3):83–88, 2010. 9. E. Giunchiglia, M. Narizzano, and A. Tacchella. Clause-Term Resolution and Learning in Quantified Boolean Logic Satisfiability. Artificial Intelligence Research, 26:371–416, 2006. 10. M. Janota, W. Klieber, J. Marques-Silva, and E. Clarke. Solving QBF with coun- terexample guided refinement. In Theory and Applications of Satisfiability Testing– SAT 2012, pages 114–128. Springer Berlin Heidelberg, 2012. 11. C. Jordan and L. Kaiser. Experiments with reduction finding. In M. Jarvisalo and A. Van Gelder, editors, Theory and Applications of Satisfiability Testing SAT 2013, volume 7962 of Lecture Notes in Computer Science, pages 192–207. Springer Berlin Heidelberg, 2013. 12. C. Jordan and M. Seidl. The QBF Gallery 2014, 2014. 13. W. Klieber, S. Sapra, S. Gao, and E. Clarke. A non-prenex, non-clausal QBF solver with game-state learning. In Theory and Applications of Satisfiability Testing–SAT 2010, pages 128–142. Springer, 2010. 14. M. Kronegger, A. Pfandler, and R. Pichler. Conformant planning as a benchmark for QBF solvers. In Intl. Workshop on Quantified Boolean Formulas (QBF 2013), page 15, 2013. 15. F. Lonsing and A. Biere. Depqbf: A dependency-aware QBF solver. JSAT, 7(2- 3):71–76, 2010. 16. C. Miller, C. Scholl, and B. Becker. Proving QBF-hardness in Bounded Model Checking for Incomplete Designs. In 14th International Workshop on Microproces- sor Test and Verification, MTV 2013, Austin, TX, USA, December 11-13, 2013, pages 23–28. IEEE Computer Society, 2013. 17. E. Nudelman, K. Leyton-Brown, A. Devkar, Y. Shoham, and H. Hoos. SATzilla: An Algorithm Portfolio for SAT. In In Seventh International Conference on Theory and Applications of Satisfiability Testing, SAT 2004 Competition: Solver Descrip- tions, pages 13–14, 2004. 18. C. Peschiera, L. Pulina, A. Tacchella, U. Bubeck, O. Kullmann, and I. Lynce. The seventh QBF solvers evaluation (QBFEVAL’10). In Theory and Applications of Satisfiability Testing–SAT 2010, pages 237–250. Springer Berlin Heidelberg, 2010. 19. F. Pigorsch and C. Scholl. An AIG-based QBF-solver using SAT for preprocessing. In Design Automation Conference (DAC), 2010 47th ACM/IEEE, pages 170–175. IEEE, 2010. 20. L. Pulina and A. Tacchella. Treewidth: a useful marker of empirical hardness in quantified Boolean logic encodings. In 15th Int.l Conf. on Logic for Program- ming, Artificial Intelligence, and Reasoning, volume 5330 of LNCS, pages 528–542. Springer, 2008. 21. L. Pulina and A. Tacchella. A self-adaptive multi-engine solver for quantified Boolean formulas. Constraints, 14(1):80–116, 2009. 22. L. Pulina and A. Tacchella. A structural approach to reasoning with quantified Boolean formulas. In 21st International Joint Conference on Artificial Intelligence (IJCAI 2009), pages 596–602, 2009. 23. L. Pulina and A. Tacchella. An empirical study of QBF encodings: from treewidth estimation to useful preprocessing. Fundamenta Informaticae, 102(3):391–427, 2010. 24. M. Sauer, S. Reimer, I. Polian, T. Schubert, and B. Becker. Provably optimal test cube generation using quantified boolean formula solving. In Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific, pages 533–539, Jan 2013. 75