1. Introduction

A Conjecture Regarding SMT Instability

Can Cebeci

Nikolaj Bjørner

George Candea

Clément Pit-Claudel

0 0 EPFL , Lausanne , Switzerland 1 Microsoft Research , Redmond , USA

2025

136 147

Automated verifiers rely on SMT solvers as a backend for proof automation. This simplifies verification in the common case, but also creates usability challenges due to instability. Previous work proposes mitigation strategies motivated by the assumption that instability is a fundamental theoretical limitation. We conjecture that the instability experienced by verifiers today is often caused by fixable engineering problems, and is thus not fundamental. Should this be true, the more consequential approach to addressing instability would be to identify root causes and connect them with solver improvements. We conducted case studies on 11 deemed-unstable queries from existing verification projects and from issues reported on the Z3 repository, with the goal of diagnosing the root causes of instability. For all of them, instability is attributable to solver bugs, misconfiguration, or misaligned expectations. We present our analysis and draw conclusions regarding better instability metrics, interfaces that make instability explicit, systematic debugging methods, and easier-to-debug SMT solvers.

eol>instability debugging

1. Introduction

Formal verification often relies on SMT solvers as a backend for proof automation. In the common case, SMT solvers significantly lower the amount of efort and expertise required for verification, as they discharge most proofs quickly.

Yet, SMT-based verifiers face significant usability challenges. A major reason is that SMT solvers sufer from instability, whereby minor changes to the input query, such as addition or removal of redundant constraints, can have disproportionate efects on performance, or even the outcome (e.g. turn an unsat into an unknown). Recent work has shown that even purely syntactic mutations such as renaming variables or reordering assertions can have such efects [ 1 ]. This makes verifiers unpredictable and often frustrating to use [ 2, 3, 4 ].

Existing (and ongoing) work [ 5, 6, 7 ] proposes mitigations for instability issues but does not analyze their root causes, so it is unclear why the mitigations work, or whether they generalize. Accordingly, the state of the art in building SMT-based verifiers is based on intuition: experts know empirically that some features do not work well together (e.g., quantifier-heavy encodings and model-based quantifier instantiation [ 8 ]), and that some features are best avoided (e.g., non-linear arithmetic). Overall, experts perceive a fundamental tradeof: that instability needs to be addressed at the cost of either the expressiveness of the encoding (by using fewer solver features), or proof automation (by reducing the context visible to the solver, so that there is less room for “bad luck”). We are not convinced that this tradeof is inherent to using SMT solvers.

We formulate a conjecture that the instability experienced by verifiers today is mostly caused by ifxable engineering problems, not fundamental theoretical limitations. SMT is combinatorially hard, so solvers must rely on heuristics to be eficient: in that sense, instability may be inescapable. However, we speculate that, in the context of a finite set of existing tools and their encodings, the heuristics that already work well (albeit sometimes unstably) can be made stable through dedicated engineering.

Motivated by this conjecture, we set out to investigate the root causes of instability. The SMT community does not have the tools and techniques today to pinpoint root causes at scale, generically, and automatically. In lieu of this and as a starting point, we manually examined a handful of queries in detail. We conducted case studies on 11 queries that were unstable with Z3 [ 9 ], taken from existing verification projects [ 10, 2, 11 ] and the Z3 forum.

In all 11 case studies, we found that the heart of the problem was neither a bad encoding nor bad luck. Rather, they boiled down to concrete, fixable root causes: • 6 of the queries were unstable due to bugs in the solver (§4.1). We have implemented three bug ifxes addressing all 6 queries, which have been merged upstream. • 2 were unstable due to misconfiguration (§4.2). These are each stabilized by a solver parameter that is set through the command line. In both cases, the parameter’s efect is intuitive and its necessity is obvious in hindsight. • 3 were unstable due to misasligned expectations (§4.3), specifically relating to how triggers are treated in the presence of relevancy filtering [ 12 ], which behaves according to its specification but still violates user expectations. These queries are stabilized by rewriting the triggers. In the following sections, we present our case studies and make three calls to action: • Tools and methods for debugging instability. Our case studies were costly: a few months of training plus multiple person-days per query. Some of our work can be systematized and automated, so as to make instability debugging accessible and cost-feasible. Such tool support would not only benefit users, but also the authors of SMT solvers, who stand to reap significant benefits when addressing user issues or during development. • Better instability metrics. Existing metrics [ 1, 13 ] are timeout-dependent and sensitive to variability in solving time, even when the variability is expected. This may both lead to stable queries being categorized as unstable (if the timeout is too short) and vice versa (if the timeout is too long). We need a metric that goes beyond simple statistical measures and captures instability as experienced by users (e.g., bimodality of time to solve). • Controlled randomness in SMT solvers. Modern solvers are afected by many sources of randomness (e.g., variable and function names, assertion order). With current verifiers, exploring the space of randomized choices is nearly impossible. To remedy this, we need to refactor solvers such that all randomized choices are determined explicitly by user-controlled random seeds. This would be a step towards making instability easier to identify and solvers easier to debug.

2. Background

In this section we define instability, and discuss recent eforts towards mitigating it. Then, we present an overview of a CDCL(T)-based SMT solver’s architecture and operation, highlighting the aspects most relevant to our discussion. We focus specifically on Z3, as it is the solver we used in our case studies.

2.1. Defining, measuring, and addressing instability

Proof instability, also called brittleness [ 14 ], the butterfly efect [ 15 ] and solver explosion [ 10 ], is a commonly experienced and acknowledged problem with SMT-based verifiers. While there is no universally agreed-upon definition, in this paper we use “instability” to broadly refer to solver behavior that is unpredictable and inconsistent from the user’s perspective. This includes disproportionate performance variations caused by minor changes to input queries, proofs breaking after solver or verifier upgrades, as well as sensitivity to redundant constraints, seemingly trivial simplifications, or reseeding. Existing work quantifies instability by fuzzing queries with syntactic mutations [ 1 ] and random seeds [ 13 ], and measuring variations in performance or outcome. These techniques detect a wide range of unstable queries, but not all. As shown in the rest of this paper, they are not perfect metrics, and certain kinds of instability are out of their reach.

Recent work has proposed various measures to reduce instability. One general direction is to pre-process queries, to simplify [ 16 ], prune [ 6 ], or canonicalize [ 5 ] them. Another is to restrict the verifier’s SMT encoding to simpler theories [ 10, 17, 18, 19 ], often at the cost of limiting expressivity or automation. Some verifiers ofer language features that let users control the encoding for specific code sections [ 7, 20 ], confining the use of theories and axioms believed to cause instability. A common thread to these techniques is that they aim to mitigate rather than fix instability. The mitigations have empirically been shown to work, but it is not clear whether they address root causes or whether the tradeofs they impose are strictly necessary.

2.2. The architecture and operation of a CDCL(T)-based solver

This section presents an overview of Z3’s architecture and operation, using the following query as a running example:

( < 0) ∧ ( < 0) ∧ (( + > 0) ∨ ( · < 0))

Z3 comprises a CDCL(T) core, and a set of theories. The former is essentially a SAT solver that treats theory atoms (e.g., ( < 0), ( + > 0)) as independent boolean variables and aims to construct a boolean model that satisfies the propositional constraints (e.g., {( < 0) :− true, ( < 0) :− true, ( + > 0) :− false, ( · < 0) :− true}). Theories are mainly responsible for ensuring that the set of propositional assignments in the model is consistent with respect to theory semantics. For this model, the theory of arithmetic would raise a conflict: ( < 0) :− true, ( < 0) :− true, and ( · < 0) :− true cannot simultaneously hold.

The CDCL(T) core makes decisions on propositional assignments one by one, either through boolean constraint propagation (BCP) or by branching. After each round of decisions, it performs theory propagation to see if the current (partial) model is consistent with the theories. In this example, before any branching, it decides that any model must have ( < 0) :− true and ( < 0) :− true through BCP, as these appear as unit clauses. Then, it propagates both assignments to the theory of arithmetic, which says that the model is consistent. When there is no unit clause left to be propagated, Z3 branches on a decision, creating a backtracking point. Let’s assume that it first decides to explore the case where ( + > 0) := true. As soon as this decision is theory-propagated, the theory of arithmetic raises a conflict. In response, the CDCL(T) core backtracks to the previous partial model, and augments it with ( + > 0) := false. Finally, the CDCL(T) core decides ( · < 0) :− true (through BCP), which causes another arithmetic conflict. As the CDCL(T) core cannot backtrack anymore, it decides that the query is unsat.

Z3 refines the basic CDCL(T) architecture with many heuristics and optimizations. One that is particularly important in the context of this paper is relevancy filtering [ 12 ]: in order to minimize the load on theories, the CDCL(T) core theory-propagates only a subset of the current assignments, those it deems relevant to the current branch of the proof. For instance, in the last round of theory propagation in our example, ( + > 0) := false would be filtered out, as ( + > 0) only appears in the query within a disjunction, and the fact that it has been assigned to false has no consequence on the truth value of the disjunction.

Another important mechanism to highlight is trigger-based quantifier instantiation , also called E-matching. We do not present E-matching here, and instead refer to prior literature [ 15 ]. We do, however, note that the CDCL(T) core treats the E-matching engine like any theory in the context of relevancy filtering: it keeps ground terms that only appear in irrelevant atoms from matching any triggers. Previous work [ 15 ] has explored the contribution of matching loops to instability. We do not discuss matching loops further, as they did not turn out to be the root cause for any of our case studies.

3. Experimental Setup

We have conducted case studies over 11 unstable queries. 4 of these were produced by early versions of TPot [ 10 ]; their instability motivated some of TPot’s encoding choices and optimizations. 4 of the queries come from Z3 bug reports or private email exchanges with verifier developers. These include queries produced by Dafny [ 21 ], Ivy [ 22 ] and Verus [ 7 ]. The remaining 3 queries are taken from the Mariposa benchmark [ 23 ].

We did not choose the queries for our case studies systematically. Instead, we started with examples we organically came across and were puzzled by. Then, we investigated issues that were recently raised with Z3 developers or directly shared with us, as well as some of the simpler queries in the Mariposa benchmark.

We used Z3 builds between versions 4.13.4 and 4.14.2 for all of our case studies.

4. Findings Origin Query name

TPot raw [ 24 ] TPot april2nd [ 26 ] TPot ofset [ 28 ] TPot bv_rewrite [ 29 ] Forum/Dafny issue_7444 [ 30 ] Forum/Dafny IfNorm0 Forum/Verus root0 Forum/Ivy thing2 [ 31 ] Mariposa [ 23 ] set_bit_to_0_self_uint64 Mariposa set_bit_to_1_self_uint64 Mariposa lemma_2toX Issue solver bug solver bug solver bug misconfiguration wrong expectations wrong expectations wrong expectations solver bug solver bug solver bug misconfiguration Solution patch-dynack [ 25 ] patch-fixedeq [ 27 ] patch-fixedeq tactic.default_tactic=smt rewrite triggers remove extraneous quantifier rewrite triggers patch-mbqi [ 32 ] patch-dynack patch-dynack smt.qi.eager_threshold=200

4.1. Instability may be caused by solver bugs

SMT solvers, like all software, have bugs. It is well-known [ 33 ] that these bugs may lead to unsoundness; It is not as well-documented that they may lead to instability as well.

We have identified and fixed three bugs that lead to instability in Z3. All three patches have been merged upstream. (declare-fun loc ((_ BitVec 64)) Int) mdata: struct metadata page aligned (declare-fun addr_mdata () (_ BitVec 64)) bytes..[.0+:8] p: char * ... (((dddeeeccclllaaarrreee---fffuuunnn maaddradrtra_(a)(r)r(A((rA)rrar(y_ayIBniIttnVt(_ec(_B6iB4t)iV)teVce8c)8)))))

arr: char[] ;(demcdlaatrae-ifsunpafge((a_liBginteVdec 8)) Int) arr[0] arr[ 1 ] arr[ 2 ] ... ;asmsdearttal.opc(paodidnrt_smdtaotat)he%s4t0a9r6t =o=f 0the array

assert mdata[0+:8] == addr_arr (a) initial state ; the assumption

assert f(arr[ 2 ]) < f(arr[ 3 ]) ;[C] ;(7)(:6)m,da[tBa][,p_[oFf]fset+:8] != mdata[0+:8] //*c/maa//ph/essatssAW=rRaeusredrm0*siaate;putdt((meaff=pp((*t0oaaaaiirrddotnrrddnht[[22rrre]]__sor))mmhuddogf<<aauhrttloffaad*m((__paaaas,prrlltarriiiwg[[gglhe33nnli-]]eeca))ddhhl))-oi;;=>lpgpdon(;.ie&ndmtdsmaetttaoa/d4aa0rt9ra6[)0s*]t4.r0u9c6t;. llll;aeeeesttttnsl#eeapploxgrr_oc0atro:c(0tn'f=_p)i;ofm)ot:smdn=eda-ftato(stalfat:a_oro=[actrrpl(h'el_iae[(oogd2acfnda]r(fers)rasd_s,deae<dt:rrr+=rtf_:)i8(m(,o]adln;raortc'a([)a3d]-d)rl_omcd_amtdaa)t/4a0_9a6l)i*g4n0e9d6 ;;;;[[[[EGDF]]]] (;;((;981))0([()::9A7:)])lp,,,po__co[[ao_fEDrfmf]]rfds,asaeyettmtaoa_!dx=a=ui=llo0iam0grsneadr;t=ic=homnletotrcia(cdaidcdtri_omndawtiat)h 8 (b) symbolically executed code (c) pseudo-SMT representing the query produced (d) proof sketch ; [C] (1): f(arr[ 2 ]) < f(arr[ 3 ]) ; [G] (2): f(arr'[ 2 ]) >= f(arr’[ 3 ]) ; (1), (2) (3): arr'[ 2 ] != arr[ 2 ] ∨ arr'[ 3 ] != arr[ 3 ] ; (3), array axioms (4): loc(p)-loc(addr_arr)==2 ∨ loc(p)-loc(addr_arr)==3 ;[A] ; ackermann axiom

(5): p == addr_arr ==> loc(p) == loc(addr_arr) ;[B] ;(6)(:4)p, !(=5)addr_arr

• patch-dynack fixes a bug that kept dynamic Ackermannization axioms (see §4.1.1) from being propagated to theories properly and that caused the relevancy of certain atoms to be forgotten during backtracking. • patch-fixedeq fixes relevancy tracking for equality lemmas between bitvectors that are assigned the same value by the bitvector theory, such that the equality is propagated to the egraph structure. • patch-mbqi fixes a bug related to model-based quantifier instantiation for uninterpreted sorts with finite domains, whereby Z3 would complete partial functions models with arbitrary values. patch-dynack and patch-fixedeq are bugs that afect the communication between the CDCL(T) core and the theories. This is a channel that is regulated by relevancy filtering, and can easily keep the solver from finding the intended proof. Section §4.1.1 presents patch-dynack in detail.

4.1.1. Deep dive on a solver bug

We discovered the bug related to patch-dynack while experimenting with an early version of TPot, when the solver got stuck on the raw query. This was an instance of instability: raw mostly comprised the same assertions as previous queries TPot made on the same symbolic execution path, and used the same logic (quantifier-free bitvectors and arrays), so we expected Z3 to discharge the query with similar reasoning.

Fig. 1 presents a simplified (and still unstable) version of the query, along with the context that produced it and the sketch of an intended proof. The original query included 86 assertions spanning 3600 lines of SMT2. The proof sketch justifies why we expect the query to be easily solvable, and helps explain how the bug leads to explosion. We have found writing pen-and-paper proof sketches to be helpful for debugging as well.

Dynamic Ackermannization [ 34 ] is a technique used by Z3 to significantly speed up reasoning about transitivity of equality and congruence. In the intended proof, the Ackermann axiom in step 5 is essential. Without it, the solver would have to discover the same fact through congruence closure, which results in producing much weaker lemmas [ 35 ].

While Ackermannization itself worked as expected for the query, a Z3 bug related to the interaction between Ackermannization and relevancy filtering kept step (6) from proceeding: Ackermannization would create fresh literals for step (5) (e.g., p == addr_arr) without marking them as relevant. This efectively made the Ackermann clauses invisible to the theories, so the theory of arithmetic would not learn that assuming p == addr_arr propagates loc(p) == loc(addr_arr). Consequently, it would not learn that p != addr_arr.

When Z3 backtracks it will retain certain axioms (i.e., theory lemmas, including Ackermann axioms) but forget about their relevancy. So the lifetime of the lemma is not scope-bound, but its relevancy is. This essentially results in the axioms only participating in propositional propagation (within the CDCL(T) core), but not theory propagation. This behavior is intended if the solver creates theory axioms using existing literals, but works against leveraging theory axioms with fresh literals across backtracking points.

4.2. Instability may be caused by misconfiguration

Solvers come with many configurable parameters that toggle heuristics, determine thresholds, set resource limits, and so on. The current version of Z3 (4.14.2) has 657 parameters configurable through the command line. Intuitively, the configuration should play a role in whether the solver behaves as expected consistently for an encoding scheme; indeed, two of our case studies turned out to be configuration issues.

; a, b, arr, i are bitvectors of size 32 assert b = arr + <elem_sz> * i assert a + 200 <= b assert a + 100 >= b lemma lemma_2toX() ensures power2(64) == 18446744073709551616; ensures power2(60) == 1152921504606846976; ensures power2(32) == 4294967296; ensures power2(24) == 16777216; ensures power2(19) == 524288; ensures power2(16) == 65536; ensures power2(8) == 256; { } reveal_power2();

4.2.1. A misconfigured heuristic

bv_rewrite.smt2 is a minimized (to 14 lines) version of an unstable query that was produced by TPot. It involves 4 variables, all of which are bitvectors of size 32. As shown in Fig. 2, the time it takes Z3 to solve bv_rewrite.smt2 depends heavily on a bitvector coeficient ( elem_sz) appearing only in an assertion that is redundant and therefore should be inconsequential.

With the default configuration, Z3 determines that the query only involves quantifier-free constraints over bitvectors, and so it is essentially a boolean satisfiability problem. Then Z3 decides to bit-blast the query, representing each bit with a boolean variable, and discharge it using the built-in SAT solver (which is separate from the CDCL(T) core), without ever involving the actual SMT solver. Using the command line argument tactic.default_tactic=smt prevents this from happening and makes Z3 terminate instantly irrespective of the coeficient.

4.2.2. An inappropriate resource limit

lemma_2toX.smt2 is a query produced by Dafny to prove the lemma in Fig. 3, as a part of the IronFleet [ 36 ] project. The query is unstable with variable outcome: depending on the random seed, Z3 may quickly return either unsat or unknown.

In the SMT encoding, Dafny represents power2 with an uninterpreted function, and provides the usual [ 37 ] recursive definition ( ∀.power2() = ITE( = 0, 1, 2 · power2( − 1)))) as a quantified axiom. For each ensures statement in Fig. 3, Dafny expects Z3 to unfold the recursive definition times to compute power2().

The problem is that the global threshold Dafny sets on quantifier instantiation depth (smt.qi.eager_threshold=100) is too small to fully unfold power2(64). We have empirically observed using SMTScope [ 38 ], the latest version of the Axiom Profiler [ 39 ], that it does not allow power2 to be unrolled more than 32 times. So, Z3 can not close the branches of the proof involving power2(64) and power2(60).

The misconfiguration here is obvious in hindsight. In practice, it was not easily noticable as it made the query unstable, not consistently unknown. This is because when we get lucky, Z3 reuses unfoldings that were used to close branches corresponding to the last five ensures to close the first two as well. So whether Z3 returns unsat or unknown depends on the order in which it explores diferent branches. Setting the threshold to 200 allows all branches to close in isolation and stabilizes the query.

One may also configure Z3 such that quantifier instances are efectively shared across branches, by disabling relevancy filtering ( smt.relevancy=0). Doing so indeed stabilizes this particular query, but makes most other Dafny queries to explode, due to Dafny’s quantifer-heavy encoding.

4.3. Instability may be caused by misaligned expectations

Developers of SMT-based verifiers treat solver optimizations as black boxes that are only relevant as far as performance is concerned. They assume that, barring timeouts and resource limits, toggling an optimization may change the solver’s performance but not its completeness (e.g., may not turn an unsat into an unknown), otherwise the solver must have a bug. This assumption is incorrect, at least in the case of relevancy filtering combined with trigger-based quantifier instantiation.

Our case studies include two queries with unstable outcomes for which the root cause is a mismatch between how verifier developers expect triggers to be treated and how Z3 treats them in the presence of relevancy filtering. These are not Z3 bugs; relevancy filtering behaves according to how it is specified, through semantic tableau rules, even though it violates user expectations. In §4.3.1 we present a minimal reproduction and in-depth discussion on this mismatch.

4.3.1. A minimal example

Consider the query in Fig. 4. The query is clearly unsat, as instantiating the quantifier with :− 0 trivially contradicts the first assertion. The conventional understanding of triggers, as described by Leino and Pit-Claudel [ 15 ], dictates that the ground term (A 0) would match the trigger and produce the expected instance, and Z3 would return unsat. Surprisingly, in the presence of relevancy filtering, Z3 may return unknown depending on the branching order, precisely 50% of the time. We explain this frequency in §4.3.2.

The core of the problem is that there is a branch of the proof that does not close itself; it depends on the quantifier instance but does not trigger instantiation. In the branch where (0) is false and (0) is true, Z3 considers (0) to be an irrelevant disjunct in (or (A 0) (B 0)), as its truth value does not afect the value of the disjunction. Therefore, Z3 keeps the assignment (0) :− false from being propagated to any theory, including the E-matching engine. So, the branch only closes if Z3 visits one of the branches that triggers instantiation (e.g., (0) :− true, (0) :− false) before this one and reuses the instance.

Another way to understand the problem is to realize that relevancy filtering efectively simulates tableau-search, whereby the solver structurally branches on the problem. In this example, the solver looks at the first assertion and branches on satisfying (0) versus (0). The latter branch is essentially equivalent to solving a version of the query with (0) omitted from the first assertion, and that version should be unknown according to Leino and Pit-Claudel’s description.

4.3.2. Quantitatively understanding the minimal example

The query’s success depends purely on the branching order and 50% of branching orders lead to success. We experimented with 8 variants of the query, changing the polarities of (0) and (0), and toggling the pattern between () and ().

Some of these seemed to be consistently unsat or unknown irrespective of the random seed, suggesting that these variations play a role in the query’s success. On closer look, some variations may fix a branching order with Z3’s default configuration: Z3 will always guess false for the first atom it branches on irrespectively of the random seed (the actual guessing process - phase caching - is somewhat more involved, but the caches are initialized to false). Also, the way Z3 chooses the first atom to branch on involves randomness but it is not uniformly random. To work around these issues, we made a surgical modification on Z3’s source code to set the frequency of randomized case splits to 1.00 (the default is 0.01). This frequency is hard-coded and is not configurable through the command line. We then ran Z3 with smt.phase_selection=5 (random).

Assignment 1 branch? literal relevant?

Assignment 2 branch? literal relevant? yes

A(0) ¬A(0) B(0) ¬B(0) yes no yes no no yes no

B(0) A(0) ¬A(0) A(0) yes no no yes Total:

Result unsat unknown unknown unknown unsat unsat unknown

Probability 1/4 1/4 1/8 1/8 1/4 1/2 1/2

5. Discussion

In this section, we discuss our experience with the case studies. We hope the lessons we learned will inform future work on addressing instability.

5.1. Sources of randomness

In §4.3.2, we describe how certain variations on the query in Fig. 4 seem to have implications on stability, when in fact all they do is resample a random distribution that is unafected by the random seed. For this query, the order of disjuncts, and the symmetric choice of the trigger term between the two disjuncts, are sources of randomness. This is an example of how having multiple sources of randomness can lead to confusion while identifying and debugging instability.

Another such example is the IfNorm0 query. The original query produced by Dafny succeeds with all random seeds, but asking Dafny to normalize the names of functions and variables (through a Dafny command line option) makes the query explode with all random seeds. This pattern is confusing, as it suggests that normalization is to blame for instability. The pattern occurs because Z3 reorders disjunctions using a name-dependent order and Dafny sets the smt.case_split=3 parameter, which orders case splits structurally, starting from the leftmost disjunct. This makes the case split order insensitive to the random seed, but sensitive to variable and function names.

Ensuring that all randomness is controllable through an explicit random seed would be a step towards making instability easier to identify and address.

5.2. Examining, comparing, and replaying solver traces

Existing approaches to examining Z3 traces [ 39 ] are based on statistical measures (e.g., number of instances of each axiom, length of instantiation chains, heuristic cost assignments to axioms). For many of our examples, root causes lie with individual events that throw the solver of. Identifying such events is hard, and to the best of our knowledge there currently is no debugging tool that helps do so.

We adopted a methodology, which we think can be largely automated, based on comparing solver traces using standard text dif tools. Given an unstable query, we could often find a “good” configuration (of solver parameters, variable names, etc.) that produces the expected behavior. When a good configuration does not appear organically, we had success with parameter fuzzing or randomness fuzzing through Mariposa. We then difed the trace from a good run against one from a bad run, identified the first point of divergence, and made surgical modifications on the Z3 source to unify the solver’s behavior over the two. Of course, the first point of divergence usually was not itself related to the root cause, but repeating this process until both configurations succeed allowed us to discover particularly elusive bugs.

This methodology would be greatly simplified if we were able to modify and replay solver traces. This would allow us to unify behavior without having to modify the solver’s source code. Beyond trace difs, replaying traces would make it much easier to test hypotheses of the form “had been done diferently, the query would not have exploded”.

We also note that we used the tracing mechanism built into Z3, which was built for VCC, and not tailored for our purpose. There are many relevant steps that do not get logged (e.g. theory propagations, the state of the egraph, relevancy), which sometimes hides root causes and makes it hard to unify the next point of divergence. 5.3. Proof sketches §4.1.1 uses a proof sketch to explain the root cause of instability for a simplified version of the raw query; we wrote similar sketches for most case studies. In our experience, writing down such pen-and-paper proof sketches and trying to identify what keeps the solver from finding them has been useful for debugging, as well as explaining root causes after the fact to convince oneself that the “right problem” was fixed. We believe debugging tools that ingest proof sketches along with solver traces and pinpoint the missing proof step would have greatly simplified our case studies.

Debugging instability through proof sketches mimics how one debugs functional correctness problems in most classes of software: by having a clear picture of the expected behavior and iteratively narrowing down where the program departs from it. Without proof sketches, this is only possible for SMT solvers at two extreme granularities, neither practical. On one extreme, one may say the expected behavior is simply “the solver must return unsat quickly”. This does not allow for iteratively narrowing down on the problem beyond the solver’s interface. Debugging at the interface by giving the solver iteratively simplified/reduced versions of the query and trying to deduce which simplification step caused the problem often leads to ill-informed and misleading conclusions. On the other extreme, one may examine solver traces. This requires the developer to be intimately familiar with the solver’s internals and even then they are often too large and too low-level to be practically understood by humans. Importantly, traces also mostly consist of “dead ends”, i.e., branches explored and axioms instantiated by the solver as part of proof search that ultimately do not make progress towards the expected proof, and are uninteresting in pinpointing the divergence from expected behavior.

Proof sketches can be written at arbitrary granularity, and deepened only in steps that need to be narrowed down on (e.g., Fig. 1 elides the bitvector and arithmetic axioms that need to be involved for steps (8) and (10); the solver follows these steps as expected so the user does not need more detail). This makes proof-sketch debugging scale well with the size of the query being debugged. The original raw query spans 55× the number of SMT lines the simplified version in Fig. 1 does, but its proof sketch still consists of 10 (considerably bigger) steps.

6. Conclusion

We conjecture that the instability experienced by SMT-based verifiers today is mostly caused by fixable engineering problems, not fundamental theoretical limitations. We support this conjecture with 11 case studies, which identified three classes of root causes: solver bugs, misconfigurations, or misaligned expectations. We hope our experience will inform future work on addressing instability in three ways. First, some of the unstable queries we studied were not flagged as such by existing metrics; we argue for an instability metric that reflects user experience. Second, part of our debugging methodology can be automated, and would greatly benefit from tool support for debugging through proof sketches, and replaying solver traces. Third, having multiple sources of randomness makes instability debugging harder, so unifying all random behavior under explicit random seeds would be a step forward.

Acknowledgements

This material is based upon work supported by the Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agencies (DARPA) under Contract No. FA8750-24-C-B044. Any opinions, ifndings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the AFRL and DARPA.

Declaration on Generative AI

The authors did not use any generative AI tools in the writing of this paper.

[1]

Zhou ,

Bosamiya ,

Takashima ,

Li ,

Heule ,

Parno , Mariposa: Measuring SMT instability in automated program verification , in: FMCAD , 2023 .

[2]

Hawblitzel ,

Howell ,

Lorch ,

Narayan ,

Parno ,

Zhang ,

Zill , Ironclad apps: End-to-end security via automated full-system verification , in: OSDI , 2014 .

[3]

Ferraiuolo ,

Baumann ,

Hawblitzel ,

Parno , Komodo: Using verification to disentangle secure-enclave hardware from software , in: SOSP , 2017 .

[4]

McLaughlin ,

G.-A.

Jaloyan ,

Xiang ,

Rabe , Enhancing proof stability , in: Dafny Workshop , 2024 .

[5]

Amrollahi ,

Preiner ,

Niemetz ,

Reynolds ,

Charikar ,

Tinelli ,

Barrett , Using normalization to improve SMT solver stability , 2024 .

[6]

Zhou ,

Bosamiya ,

Li ,

M. J.

Heule ,

Parno , Context pruning for more robust SMT-based program verification , in: FMCAD , 2024 .

[7]

Lattuada ,

Hance ,

Bosamiya ,

Brun ,

Cho , H. LeBlanc, P. Srinivasan,

Achermann ,

Chajed ,

Hawblitzel ,

Howell ,

J. R.

Lorch ,

Padon ,

Parno , Verus: A practical foundation for systems verification , in: SOSP, 2024 .

[8]

Ge , L. de Moura, Complete instantiation for quantified formulas in satisfiabiliby modulo theories , in: CAV , 2009 .

[9] L. M. de Moura , N. Bjørner, Z3: An eficient SMT solver , in: TACAS, 2008 .

[10]

Cebeci ,

Zou ,

Zhou , G. Candea,

Pit-Claudel , Practical verification of system-software components written in standard c , in: SOSP , 2024 .

[11]

Hance ,

Lattuada ,

Hawblitzel ,

Howell ,

Johnson ,

Parno , Storage systems are distributed systems (so verify them that way!) , in: OSDI, 2020 .

[12] L. de Moura , N. Bjørner, Relevancy Propagation, Technical Report MSR-TR-2007-140 , 2007 .

[13]

Dafny

Reference Manual: Measuring stability , https://dafny. org/v3.9 .0/DafnyRef/DafnyRef.html# 25751-measuring-stability, 2025 .

[14]

Bordis , K. R. M. Leino , Free facts: An alternative to ineficient axioms in Dafny , in: Formal Methods, 2024 .

[15] K. R. M. Leino , C. Pit-Claudel , Trigger selection strategies to stabilize program verifiers , in: CAV , 2016 .

[16]

Cadar ,

Dunbar ,

D. R.

Engler , KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs , in: OSDI, 2008 .

[17]

Padon , G. Losa,

Sagiv ,

Shoham , Paxos made epr: decidable reasoning about distributed protocols , in: OOPSLA , 2017 .

[18]

Pulte ,

D. C.

Makwana ,

Sewell ,

Memarian ,

Sewell ,

Krishnaswami , CN: Verifying systems C code with separation-logic refinement types , in: POPL , 2023 .

[19]

Hamza ,

Voirol ,

Kunčak , System

: Formalized foundations for the Stainless verifier , in: OOPSLA, 2019 .

[20]

Willems ,

Schlaipfer ,

Bouissou , Helping users to reduce brittleness in their Dafny programs - a success story , in: Dafny Workshop , 2025 .

[21] K. R. M. Leino , Dafny: An automatic program verifier for functional correctness , in: LPAR , 2010 .

[22] K. L. McMillan , O. Padon , Ivy: A multi-modal verification tool for distributed algorithms , in: CAV, 2020 .

[23]

Zhou ,

Parno , The Mariposa Benchmark, https://github.com/secure-foundations/ mariposa-data, 2023 .

[24] TPot query: raw .smt2, https://github.com/dslab-epfl/tpot/blob/unstable-queries/unstable-queries/ raw.smt2, 2025 .

[25]

Cebeci ,

Bjørner , Can's fix to relevancy propagation , https://github.com/Z3Prover/z3/commit/ 8c5abdf818ebd674d57a8dc43f3fe04d73f6ab73, 2025 .

[26] TPot query: april2nd .smt2, https://github.com/dslab-epfl/tpot/blob/unstable-queries/ unstable-queries/ april2nd .smt2, 2025 .

[27]

Cebeci , Mark fixed_eq literals as relevant , https://github.com/Z3Prover/z3/pull/7533, 2025 .

[28] TPot query: ofset .smt2, https://github.com/dslab-epfl/tpot/blob/unstable-queries/ unstable-queries/ofset.smt2, 2025 .

[29] TPot query: bv_rewrite .smt2, https://github.com/dslab-epfl/tpot/blob/unstable-queries/ unstable-queries/bv_rewrite.smt2, 2025 .

[30]

Leino , Z3 issue: Unexpected proof failure due to minor input tweak , https://github.com/Z3Prover/ z3/issues/7444, 2024 .

[31]

McMillan , Z3 issue: Performance regression on stratified EPR formula , https://github.com/ Z3Prover/z3/issues/7515, 2025 .

[32]

Cebeci ,

Bjørner , Fix complete_partial_func for finite domains , https://github.com/Z3Prover/ z3/pull/7533, 2025 .

[33]

Winterer ,

Zhang ,

Su , Validating SMT solvers via semantic fusion , in: PLDI , 2020 .

[34]

Dutertre , L. de Moura, The Yices SMT solver , https://yices.csl.sri.com/papers/tool-paper.pdf, 2006 .

[35]

N. S.

Bjørner , L. M. de Moura, Tractability and modern satisfiability modulo theories solvers , in: L. Bordeaux , Y. Hamadi , P. Kohli (Eds.), Tractability: Practical Approaches to Hard Problems, 2014 .

[36]

Hawblitzel ,

Howell ,

Kapritsos ,

Lorch ,

Parno ,

M. L.

Roberts ,

Setty , B. Zill, IronFleet: Proving practical distributed systems correct , 2015 .

[37]

Amin , K. R. M. Leino , T. Rompf , Computing with an SMT solver , in: TAP, 2014 .

[38] Jonáš

Fiala

, SMTScope, https://github.com/viperproject/smt-scope, 2024 .

[39]

Becker ,

Müller ,

A. J.

Summers , The Axiom Profiler: Understanding and debugging SMT quantifier instantiations , in: TACAS , 2019 .