<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Conjecture Regarding SMT Instability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Can Cebeci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolaj Bjørner</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Candea</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Clément Pit-Claudel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EPFL</institution>
          ,
          <addr-line>Lausanne</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft Research</institution>
          ,
          <addr-line>Redmond</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>136</fpage>
      <lpage>147</lpage>
      <abstract>
        <p>Automated verifiers rely on SMT solvers as a backend for proof automation. This simplifies verification in the common case, but also creates usability challenges due to instability. Previous work proposes mitigation strategies motivated by the assumption that instability is a fundamental theoretical limitation. We conjecture that the instability experienced by verifiers today is often caused by fixable engineering problems, and is thus not fundamental. Should this be true, the more consequential approach to addressing instability would be to identify root causes and connect them with solver improvements. We conducted case studies on 11 deemed-unstable queries from existing verification projects and from issues reported on the Z3 repository, with the goal of diagnosing the root causes of instability. For all of them, instability is attributable to solver bugs, misconfiguration, or misaligned expectations. We present our analysis and draw conclusions regarding better instability metrics, interfaces that make instability explicit, systematic debugging methods, and easier-to-debug SMT solvers.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;instability</kwd>
        <kwd>debugging</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Formal verification often relies on SMT solvers as a backend for proof automation. In the common case,
SMT solvers significantly lower the amount of efort and expertise required for verification, as they
discharge most proofs quickly.</p>
      <p>
        Yet, SMT-based verifiers face significant usability challenges. A major reason is that SMT solvers
sufer from instability, whereby minor changes to the input query, such as addition or removal of
redundant constraints, can have disproportionate efects on performance, or even the outcome (e.g.
turn an unsat into an unknown). Recent work has shown that even purely syntactic mutations such as
renaming variables or reordering assertions can have such efects [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This makes verifiers unpredictable
and often frustrating to use [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>
        Existing (and ongoing) work [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ] proposes mitigations for instability issues but does not analyze
their root causes, so it is unclear why the mitigations work, or whether they generalize. Accordingly,
the state of the art in building SMT-based verifiers is based on intuition: experts know empirically
that some features do not work well together (e.g., quantifier-heavy encodings and model-based
quantifier instantiation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]), and that some features are best avoided (e.g., non-linear arithmetic).
Overall, experts perceive a fundamental tradeof: that instability needs to be addressed at the cost of
either the expressiveness of the encoding (by using fewer solver features), or proof automation (by
reducing the context visible to the solver, so that there is less room for “bad luck”). We are not convinced
that this tradeof is inherent to using SMT solvers.
      </p>
      <p>We formulate a conjecture that the instability experienced by verifiers today is mostly caused by
ifxable engineering problems, not fundamental theoretical limitations. SMT is combinatorially hard, so
solvers must rely on heuristics to be eficient: in that sense, instability may be inescapable. However,
we speculate that, in the context of a finite set of existing tools and their encodings, the heuristics that
already work well (albeit sometimes unstably) can be made stable through dedicated engineering.</p>
      <p>
        Motivated by this conjecture, we set out to investigate the root causes of instability. The SMT
community does not have the tools and techniques today to pinpoint root causes at scale, generically,
and automatically. In lieu of this and as a starting point, we manually examined a handful of queries in
detail. We conducted case studies on 11 queries that were unstable with Z3 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], taken from existing
verification projects [
        <xref ref-type="bibr" rid="ref10 ref11 ref2">10, 2, 11</xref>
        ] and the Z3 forum.
      </p>
      <p>
        In all 11 case studies, we found that the heart of the problem was neither a bad encoding nor bad
luck. Rather, they boiled down to concrete, fixable root causes:
• 6 of the queries were unstable due to bugs in the solver (§4.1). We have implemented three bug
ifxes addressing all 6 queries, which have been merged upstream.
• 2 were unstable due to misconfiguration (§4.2). These are each stabilized by a solver parameter
that is set through the command line. In both cases, the parameter’s efect is intuitive and its
necessity is obvious in hindsight.
• 3 were unstable due to misasligned expectations (§4.3), specifically relating to how triggers are
treated in the presence of relevancy filtering [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which behaves according to its specification
but still violates user expectations. These queries are stabilized by rewriting the triggers.
In the following sections, we present our case studies and make three calls to action:
• Tools and methods for debugging instability. Our case studies were costly: a few months of training
plus multiple person-days per query. Some of our work can be systematized and automated, so
as to make instability debugging accessible and cost-feasible. Such tool support would not only
benefit users, but also the authors of SMT solvers, who stand to reap significant benefits when
addressing user issues or during development.
• Better instability metrics. Existing metrics [
        <xref ref-type="bibr" rid="ref1 ref13">1, 13</xref>
        ] are timeout-dependent and sensitive to variability
in solving time, even when the variability is expected. This may both lead to stable queries being
categorized as unstable (if the timeout is too short) and vice versa (if the timeout is too long). We
need a metric that goes beyond simple statistical measures and captures instability as experienced
by users (e.g., bimodality of time to solve).
• Controlled randomness in SMT solvers. Modern solvers are afected by many sources of randomness
(e.g., variable and function names, assertion order). With current verifiers, exploring the space of
randomized choices is nearly impossible. To remedy this, we need to refactor solvers such that
all randomized choices are determined explicitly by user-controlled random seeds. This would be
a step towards making instability easier to identify and solvers easier to debug.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>In this section we define instability, and discuss recent eforts towards mitigating it. Then, we present
an overview of a CDCL(T)-based SMT solver’s architecture and operation, highlighting the aspects
most relevant to our discussion. We focus specifically on Z3, as it is the solver we used in our case
studies.</p>
      <sec id="sec-2-1">
        <title>2.1. Defining, measuring, and addressing instability</title>
        <p>
          Proof instability, also called brittleness [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], the butterfly efect [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and solver explosion [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], is
a commonly experienced and acknowledged problem with SMT-based verifiers. While there is no
universally agreed-upon definition, in this paper we use “instability” to broadly refer to solver behavior
that is unpredictable and inconsistent from the user’s perspective. This includes disproportionate
performance variations caused by minor changes to input queries, proofs breaking after solver or
verifier upgrades, as well as sensitivity to redundant constraints, seemingly trivial simplifications, or
reseeding. Existing work quantifies instability by fuzzing queries with syntactic mutations [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and
random seeds [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], and measuring variations in performance or outcome. These techniques detect a
wide range of unstable queries, but not all. As shown in the rest of this paper, they are not perfect
metrics, and certain kinds of instability are out of their reach.
        </p>
        <p>
          Recent work has proposed various measures to reduce instability. One general direction is to
pre-process queries, to simplify [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], prune [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], or canonicalize [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] them. Another is to restrict the
verifier’s SMT encoding to simpler theories [
          <xref ref-type="bibr" rid="ref10 ref17 ref18 ref19">10, 17, 18, 19</xref>
          ], often at the cost of limiting expressivity
or automation. Some verifiers ofer language features that let users control the encoding for specific
code sections [
          <xref ref-type="bibr" rid="ref20 ref7">7, 20</xref>
          ], confining the use of theories and axioms believed to cause instability. A common
thread to these techniques is that they aim to mitigate rather than fix instability. The mitigations
have empirically been shown to work, but it is not clear whether they address root causes or
whether the tradeofs they impose are strictly necessary.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The architecture and operation of a CDCL(T)-based solver</title>
        <p>This section presents an overview of Z3’s architecture and operation, using the following query as a
running example:</p>
        <p>( &lt; 0) ∧ ( &lt; 0) ∧ (( +  &gt; 0) ∨ ( ·  &lt; 0))</p>
        <p>Z3 comprises a CDCL(T) core, and a set of theories. The former is essentially a SAT solver that
treats theory atoms (e.g., ( &lt; 0), ( +  &gt; 0)) as independent boolean variables and aims to construct
a boolean model that satisfies the propositional constraints (e.g., {( &lt; 0) :− true, ( &lt; 0) :−
true, ( +  &gt; 0) :− false, ( ·  &lt; 0) :− true}). Theories are mainly responsible for ensuring that
the set of propositional assignments in the model is consistent with respect to theory semantics. For
this model, the theory of arithmetic would raise a conflict: ( &lt; 0) :− true, ( &lt; 0) :− true, and
( ·  &lt; 0) :− true cannot simultaneously hold.</p>
        <p>The CDCL(T) core makes decisions on propositional assignments one by one, either through boolean
constraint propagation (BCP) or by branching. After each round of decisions, it performs theory
propagation to see if the current (partial) model is consistent with the theories. In this example, before
any branching, it decides that any model must have ( &lt; 0) :− true and ( &lt; 0) :− true through BCP,
as these appear as unit clauses. Then, it propagates both assignments to the theory of arithmetic, which
says that the model is consistent. When there is no unit clause left to be propagated, Z3 branches on a
decision, creating a backtracking point. Let’s assume that it first decides to explore the case where
( +  &gt; 0) := true. As soon as this decision is theory-propagated, the theory of arithmetic raises
a conflict. In response, the CDCL(T) core backtracks to the previous partial model, and augments it
with ( +  &gt; 0) := false. Finally, the CDCL(T) core decides ( ·  &lt; 0) :− true (through BCP), which
causes another arithmetic conflict. As the CDCL(T) core cannot backtrack anymore, it decides that the
query is unsat.</p>
        <p>
          Z3 refines the basic CDCL(T) architecture with many heuristics and optimizations. One that is
particularly important in the context of this paper is relevancy filtering [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]: in order to minimize the
load on theories, the CDCL(T) core theory-propagates only a subset of the current assignments, those it
deems relevant to the current branch of the proof. For instance, in the last round of theory propagation
in our example, ( +  &gt; 0) := false would be filtered out, as ( +  &gt; 0) only appears in the query
within a disjunction, and the fact that it has been assigned to false has no consequence on the truth
value of the disjunction.
        </p>
        <p>
          Another important mechanism to highlight is trigger-based quantifier instantiation , also called
E-matching. We do not present E-matching here, and instead refer to prior literature [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. We do,
however, note that the CDCL(T) core treats the E-matching engine like any theory in the context of
relevancy filtering: it keeps ground terms that only appear in irrelevant atoms from matching any
triggers. Previous work [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] has explored the contribution of matching loops to instability. We do not
discuss matching loops further, as they did not turn out to be the root cause for any of our case studies.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>
        We have conducted case studies over 11 unstable queries. 4 of these were produced by early versions
of TPot [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]; their instability motivated some of TPot’s encoding choices and optimizations. 4 of the
queries come from Z3 bug reports or private email exchanges with verifier developers. These include
queries produced by Dafny [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], Ivy [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and Verus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The remaining 3 queries are taken from the
Mariposa benchmark [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>We did not choose the queries for our case studies systematically. Instead, we started with examples
we organically came across and were puzzled by. Then, we investigated issues that were recently raised
with Z3 developers or directly shared with us, as well as some of the simpler queries in the Mariposa
benchmark.</p>
      <p>We used Z3 builds between versions 4.13.4 and 4.14.2 for all of our case studies.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Findings</title>
      <sec id="sec-4-1">
        <title>Origin Query name</title>
        <p>
          TPot raw [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]
TPot april2nd [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]
TPot ofset [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]
TPot bv_rewrite [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]
Forum/Dafny issue_7444 [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]
Forum/Dafny IfNorm0
Forum/Verus root0
Forum/Ivy thing2 [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]
Mariposa [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] set_bit_to_0_self_uint64
Mariposa set_bit_to_1_self_uint64
Mariposa lemma_2toX
Issue
solver bug
solver bug
solver bug
misconfiguration
wrong expectations
wrong expectations
wrong expectations
solver bug
solver bug
solver bug
misconfiguration
Solution
patch-dynack [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]
patch-fixedeq [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]
patch-fixedeq
tactic.default_tactic=smt
rewrite triggers
remove extraneous quantifier
rewrite triggers
patch-mbqi [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]
patch-dynack
patch-dynack
smt.qi.eager_threshold=200
        </p>
        <sec id="sec-4-1-1">
          <title>4.1. Instability may be caused by solver bugs</title>
          <p>
            SMT solvers, like all software, have bugs. It is well-known [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ] that these bugs may lead to unsoundness;
It is not as well-documented that they may lead to instability as well.
          </p>
          <p>We have identified and fixed three bugs that lead to instability in Z3. All three patches have been
merged upstream.
(declare-fun loc ((_ BitVec 64)) Int)
mdata: struct metadata page aligned (declare-fun addr_mdata () (_ BitVec 64))
bytes..[.0+:8] p: char * ... (((dddeeeccclllaaarrreee---fffuuunnn maaddradrtra_(a)(r)r(A((rA)rrar(y_ayIBniIttnVt(_ec(_B6iB4t)iV)teVce8c)8)))))</p>
          <p>
            arr: char[] ;(demcdlaatrae-ifsunpafge((a_liBginteVdec 8)) Int)
arr[0] arr[
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] arr[
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] ... ;asmsdearttal.opc(paodidnrt_smdtaotat)he%s4t0a9r6t =o=f 0the array
          </p>
          <p>assert mdata[0+:8] == addr_arr
(a) initial state ; the assumption</p>
          <p>
            assert f(arr[
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]) &lt; f(arr[
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]) ;[C] ;(7)(:6)m,da[tBa][,p_[oFf]fset+:8] != mdata[0+:8]
//*c/maa//ph/essatssAW=rRaeusredrm0*siaate;putdt((meaff=pp((*t0oaaaaiirrddotnrrddnht[[22rrre]]__sor))mmhuddogf&lt;&lt;aauhrttloffaad*m((__paaaas,prrlltarriiiwg[[gglhe33nnli-]]eeca))ddhhl))-oi;;=&gt;lpgpdon(;.ie&amp;ndmtdsmaetttaoa/d4aa0rt9ra6[)0s*]t4.r0u9c6t;. llll;aeeeesttttnsl#eeapploxgrr_oc0atro:c(0tn'f=_p)i;ofm)ot:smdn=eda-ftato(stalfat:a_oro=[actrrpl(h'el_iae[(oogd2acfnda]r(fers)rasd_s,deae&lt;dt:rrr+=rtf_:)i8(m(,o]adln;raortc'a([)a3d]-d)rl_omcd_amtdaa)t/4a0_9a6l)i*g4n0e9d6 ;;;;[[[[EGDF]]]] (;;((;981))0([()::9A7:)])lp,,,po__co[[ao_fEDrfmf]]rfds,asaeyettmtaoa_!dx=a=ui=llo0iam0grsneadr;t=ic=homnletotrcia(cdaidcdtri_omndawtiat)h 8
(b) symbolically executed code (c) pseudo-SMT representing the query produced (d) proof sketch
; [C]
(1): f(arr[
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]) &lt; f(arr[
            <xref ref-type="bibr" rid="ref3">3</xref>
            ])
; [G]
(2): f(arr'[
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]) &gt;= f(arr’[
            <xref ref-type="bibr" rid="ref3">3</xref>
            ])
; (1), (2)
(3): arr'[
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] != arr[
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] ∨ arr'[
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] != arr[
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]
; (3), array axioms
(4): loc(p)-loc(addr_arr)==2 ∨ loc(p)-loc(addr_arr)==3
;[A] ; ackermann axiom
          </p>
          <p>(5): p == addr_arr ==&gt; loc(p) == loc(addr_arr)
;[B] ;(6)(:4)p, !(=5)addr_arr</p>
          <p>• patch-dynack fixes a bug that kept dynamic Ackermannization axioms (see §4.1.1) from being
propagated to theories properly and that caused the relevancy of certain atoms to be forgotten
during backtracking.
• patch-fixedeq fixes relevancy tracking for equality lemmas between bitvectors that are assigned
the same value by the bitvector theory, such that the equality is propagated to the egraph structure.
• patch-mbqi fixes a bug related to model-based quantifier instantiation for uninterpreted sorts
with finite domains, whereby Z3 would complete partial functions models with arbitrary values.
patch-dynack and patch-fixedeq are bugs that afect the communication between the CDCL(T)
core and the theories. This is a channel that is regulated by relevancy filtering, and can easily keep the
solver from finding the intended proof. Section §4.1.1 presents patch-dynack in detail.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.1. Deep dive on a solver bug</title>
        <p>We discovered the bug related to patch-dynack while experimenting with an early version of TPot,
when the solver got stuck on the raw query. This was an instance of instability: raw mostly comprised
the same assertions as previous queries TPot made on the same symbolic execution path, and used
the same logic (quantifier-free bitvectors and arrays), so we expected Z3 to discharge the query with
similar reasoning.</p>
        <p>Fig. 1 presents a simplified (and still unstable) version of the query, along with the context that
produced it and the sketch of an intended proof. The original query included 86 assertions spanning
3600 lines of SMT2. The proof sketch justifies why we expect the query to be easily solvable, and helps
explain how the bug leads to explosion. We have found writing pen-and-paper proof sketches to be
helpful for debugging as well.</p>
        <p>
          Dynamic Ackermannization [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] is a technique used by Z3 to significantly speed up reasoning about
transitivity of equality and congruence. In the intended proof, the Ackermann axiom in step 5 is
essential. Without it, the solver would have to discover the same fact through congruence closure,
which results in producing much weaker lemmas [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ].
        </p>
        <p>While Ackermannization itself worked as expected for the query, a Z3 bug related to the interaction
between Ackermannization and relevancy filtering kept step (6) from proceeding: Ackermannization
would create fresh literals for step (5) (e.g., p == addr_arr) without marking them as relevant. This
efectively made the Ackermann clauses invisible to the theories, so the theory of arithmetic would not
learn that assuming p == addr_arr propagates loc(p) == loc(addr_arr). Consequently, it would
not learn that p != addr_arr.</p>
        <p>When Z3 backtracks it will retain certain axioms (i.e., theory lemmas, including Ackermann axioms)
but forget about their relevancy. So the lifetime of the lemma is not scope-bound, but its relevancy
is. This essentially results in the axioms only participating in propositional propagation (within the
CDCL(T) core), but not theory propagation. This behavior is intended if the solver creates theory
axioms using existing literals, but works against leveraging theory axioms with fresh literals across
backtracking points.</p>
        <sec id="sec-4-2-1">
          <title>4.2. Instability may be caused by misconfiguration</title>
          <p>Solvers come with many configurable parameters that toggle heuristics, determine thresholds, set
resource limits, and so on. The current version of Z3 (4.14.2) has 657 parameters configurable through
the command line. Intuitively, the configuration should play a role in whether the solver behaves
as expected consistently for an encoding scheme; indeed, two of our case studies turned out to be
configuration issues.</p>
          <p>; a, b, arr, i are bitvectors of size 32
assert b = arr + &lt;elem_sz&gt; * i
assert a + 200 &lt;= b
assert a + 100 &gt;= b
lemma lemma_2toX()
ensures power2(64) == 18446744073709551616;
ensures power2(60) == 1152921504606846976;
ensures power2(32) == 4294967296;
ensures power2(24) == 16777216;
ensures power2(19) == 524288;
ensures power2(16) == 65536;
ensures power2(8) == 256;
{
}
reveal_power2();</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.2.1. A misconfigured heuristic</title>
        <p>bv_rewrite.smt2 is a minimized (to 14 lines) version of an unstable query that was produced by TPot.
It involves 4 variables, all of which are bitvectors of size 32. As shown in Fig. 2, the time it takes Z3
to solve bv_rewrite.smt2 depends heavily on a bitvector coeficient ( elem_sz) appearing only in an
assertion that is redundant and therefore should be inconsequential.</p>
        <p>With the default configuration, Z3 determines that the query only involves quantifier-free constraints
over bitvectors, and so it is essentially a boolean satisfiability problem. Then Z3 decides to bit-blast
the query, representing each bit with a boolean variable, and discharge it using the built-in SAT solver
(which is separate from the CDCL(T) core), without ever involving the actual SMT solver. Using the
command line argument tactic.default_tactic=smt prevents this from happening and makes Z3
terminate instantly irrespective of the coeficient.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.2.2. An inappropriate resource limit</title>
        <p>
          lemma_2toX.smt2 is a query produced by Dafny to prove the lemma in Fig. 3, as a part of the
IronFleet [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] project. The query is unstable with variable outcome: depending on the random seed, Z3
may quickly return either unsat or unknown.
        </p>
        <p>
          In the SMT encoding, Dafny represents power2 with an uninterpreted function, and provides the
usual [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] recursive definition ( ∀.power2() = ITE( = 0, 1, 2 · power2( − 1)))) as a quantified
axiom. For each ensures statement in Fig. 3, Dafny expects Z3 to unfold the recursive definition 
times to compute power2().
        </p>
        <p>
          The problem is that the global threshold Dafny sets on quantifier instantiation depth
(smt.qi.eager_threshold=100) is too small to fully unfold power2(64). We have empirically
observed using SMTScope [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], the latest version of the Axiom Profiler [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], that it does not allow power2
to be unrolled more than 32 times. So, Z3 can not close the branches of the proof involving power2(64)
and power2(60).
        </p>
        <p>The misconfiguration here is obvious in hindsight. In practice, it was not easily noticable as it made
the query unstable, not consistently unknown. This is because when we get lucky, Z3 reuses unfoldings
that were used to close branches corresponding to the last five ensures to close the first two as well.
So whether Z3 returns unsat or unknown depends on the order in which it explores diferent branches.
Setting the threshold to 200 allows all branches to close in isolation and stabilizes the query.</p>
        <p>One may also configure Z3 such that quantifier instances are efectively shared across branches, by
disabling relevancy filtering ( smt.relevancy=0). Doing so indeed stabilizes this particular query, but
makes most other Dafny queries to explode, due to Dafny’s quantifer-heavy encoding.</p>
        <sec id="sec-4-4-1">
          <title>4.3. Instability may be caused by misaligned expectations</title>
          <p>Developers of SMT-based verifiers treat solver optimizations as black boxes that are only relevant as
far as performance is concerned. They assume that, barring timeouts and resource limits, toggling
an optimization may change the solver’s performance but not its completeness (e.g., may not turn an
unsat into an unknown), otherwise the solver must have a bug. This assumption is incorrect, at least in
the case of relevancy filtering combined with trigger-based quantifier instantiation.</p>
          <p>Our case studies include two queries with unstable outcomes for which the root cause is a mismatch
between how verifier developers expect triggers to be treated and how Z3 treats them in the presence of
relevancy filtering. These are not Z3 bugs; relevancy filtering behaves according to how it is specified,
through semantic tableau rules, even though it violates user expectations. In §4.3.1 we present a
minimal reproduction and in-depth discussion on this mismatch.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.3.1. A minimal example</title>
        <p>
          Consider the query in Fig. 4. The query is clearly unsat, as instantiating the quantifier with  :− 0
trivially contradicts the first assertion. The conventional understanding of triggers, as described by
Leino and Pit-Claudel [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], dictates that the ground term (A 0) would match the trigger and produce
the expected instance, and Z3 would return unsat. Surprisingly, in the presence of relevancy filtering,
Z3 may return unknown depending on the branching order, precisely 50% of the time. We explain this
frequency in §4.3.2.
        </p>
        <p>The core of the problem is that there is a branch of the proof that does not close itself; it depends on
the quantifier instance but does not trigger instantiation. In the branch where (0) is false and (0)
is true, Z3 considers (0) to be an irrelevant disjunct in (or (A 0) (B 0)), as its truth value does
not afect the value of the disjunction. Therefore, Z3 keeps the assignment (0) :− false from being
propagated to any theory, including the E-matching engine. So, the branch only closes if Z3 visits one
of the branches that triggers instantiation (e.g., (0) :− true, (0) :− false) before this one and reuses
the instance.</p>
        <p>Another way to understand the problem is to realize that relevancy filtering efectively simulates
tableau-search, whereby the solver structurally branches on the problem. In this example, the solver
looks at the first assertion and branches on satisfying (0) versus (0). The latter branch is essentially
equivalent to solving a version of the query with (0) omitted from the first assertion, and that version
should be unknown according to Leino and Pit-Claudel’s description.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.3.2. Quantitatively understanding the minimal example</title>
        <p>The query’s success depends purely on the branching order and 50% of branching orders lead to success.
We experimented with 8 variants of the query, changing the polarities of (0) and (0), and toggling
the pattern between () and ().</p>
        <p>Some of these seemed to be consistently unsat or unknown irrespective of the random seed, suggesting
that these variations play a role in the query’s success. On closer look, some variations may fix a
branching order with Z3’s default configuration: Z3 will always guess false for the first atom it branches
on irrespectively of the random seed (the actual guessing process - phase caching - is somewhat more
involved, but the caches are initialized to false). Also, the way Z3 chooses the first atom to branch on
involves randomness but it is not uniformly random. To work around these issues, we made a surgical
modification on Z3’s source code to set the frequency of randomized case splits to 1.00 (the default is
0.01). This frequency is hard-coded and is not configurable through the command line. We then ran Z3
with smt.phase_selection=5 (random).</p>
        <p>Assignment 1
branch? literal relevant?</p>
        <p>Assignment 2
branch? literal relevant?
yes</p>
        <p>A(0)
¬A(0)
B(0)
¬B(0)
yes
no
yes
no
no
yes
no</p>
        <p>B(0)
A(0)
¬A(0)
A(0)
yes
no
no
yes
Total:</p>
        <p>Result
unsat
unknown
unknown
unknown
unsat
unsat
unknown</p>
        <p>Probability
1/4
1/4
1/8
1/8
1/4
1/2
1/2</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, we discuss our experience with the case studies. We hope the lessons we learned will
inform future work on addressing instability.</p>
      <sec id="sec-5-1">
        <title>5.1. Sources of randomness</title>
        <p>In §4.3.2, we describe how certain variations on the query in Fig. 4 seem to have implications on stability,
when in fact all they do is resample a random distribution that is unafected by the random seed. For this
query, the order of disjuncts, and the symmetric choice of the trigger term between the two disjuncts,
are sources of randomness. This is an example of how having multiple sources of randomness can lead
to confusion while identifying and debugging instability.</p>
        <p>Another such example is the IfNorm0 query. The original query produced by Dafny succeeds with
all random seeds, but asking Dafny to normalize the names of functions and variables (through a Dafny
command line option) makes the query explode with all random seeds. This pattern is confusing,
as it suggests that normalization is to blame for instability. The pattern occurs because Z3 reorders
disjunctions using a name-dependent order and Dafny sets the smt.case_split=3 parameter, which
orders case splits structurally, starting from the leftmost disjunct. This makes the case split order
insensitive to the random seed, but sensitive to variable and function names.</p>
        <p>Ensuring that all randomness is controllable through an explicit random seed would be a step towards
making instability easier to identify and address.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Examining, comparing, and replaying solver traces</title>
        <p>
          Existing approaches to examining Z3 traces [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] are based on statistical measures (e.g., number of
instances of each axiom, length of instantiation chains, heuristic cost assignments to axioms). For many
of our examples, root causes lie with individual events that throw the solver of. Identifying such events
is hard, and to the best of our knowledge there currently is no debugging tool that helps do so.
        </p>
        <p>We adopted a methodology, which we think can be largely automated, based on comparing solver
traces using standard text dif tools. Given an unstable query, we could often find a “good” configuration
(of solver parameters, variable names, etc.) that produces the expected behavior. When a good
configuration does not appear organically, we had success with parameter fuzzing or randomness fuzzing
through Mariposa. We then difed the trace from a good run against one from a bad run, identified
the first point of divergence, and made surgical modifications on the Z3 source to unify the solver’s
behavior over the two. Of course, the first point of divergence usually was not itself related to the root
cause, but repeating this process until both configurations succeed allowed us to discover particularly
elusive bugs.</p>
        <p>This methodology would be greatly simplified if we were able to modify and replay solver traces.
This would allow us to unify behavior without having to modify the solver’s source code. Beyond trace
difs, replaying traces would make it much easier to test hypotheses of the form “had  been done
diferently, the query would not have exploded”.</p>
        <p>We also note that we used the tracing mechanism built into Z3, which was built for VCC, and not
tailored for our purpose. There are many relevant steps that do not get logged (e.g. theory propagations,
the state of the egraph, relevancy), which sometimes hides root causes and makes it hard to unify the
next point of divergence.
5.3. Proof sketches
§4.1.1 uses a proof sketch to explain the root cause of instability for a simplified version of the raw query;
we wrote similar sketches for most case studies. In our experience, writing down such pen-and-paper
proof sketches and trying to identify what keeps the solver from finding them has been useful for
debugging, as well as explaining root causes after the fact to convince oneself that the “right problem”
was fixed. We believe debugging tools that ingest proof sketches along with solver traces and pinpoint
the missing proof step would have greatly simplified our case studies.</p>
        <p>Debugging instability through proof sketches mimics how one debugs functional correctness problems
in most classes of software: by having a clear picture of the expected behavior and iteratively narrowing
down where the program departs from it. Without proof sketches, this is only possible for SMT solvers
at two extreme granularities, neither practical. On one extreme, one may say the expected behavior is
simply “the solver must return unsat quickly”. This does not allow for iteratively narrowing down on
the problem beyond the solver’s interface. Debugging at the interface by giving the solver iteratively
simplified/reduced versions of the query and trying to deduce which simplification step caused the
problem often leads to ill-informed and misleading conclusions. On the other extreme, one may examine
solver traces. This requires the developer to be intimately familiar with the solver’s internals and even
then they are often too large and too low-level to be practically understood by humans. Importantly,
traces also mostly consist of “dead ends”, i.e., branches explored and axioms instantiated by the solver
as part of proof search that ultimately do not make progress towards the expected proof, and are
uninteresting in pinpointing the divergence from expected behavior.</p>
        <p>Proof sketches can be written at arbitrary granularity, and deepened only in steps that need to be
narrowed down on (e.g., Fig. 1 elides the bitvector and arithmetic axioms that need to be involved for
steps (8) and (10); the solver follows these steps as expected so the user does not need more detail). This
makes proof-sketch debugging scale well with the size of the query being debugged. The original raw
query spans 55× the number of SMT lines the simplified version in Fig. 1 does, but its proof sketch still
consists of 10 (considerably bigger) steps.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We conjecture that the instability experienced by SMT-based verifiers today is mostly caused by fixable
engineering problems, not fundamental theoretical limitations. We support this conjecture with 11 case
studies, which identified three classes of root causes: solver bugs, misconfigurations, or misaligned
expectations. We hope our experience will inform future work on addressing instability in three ways.
First, some of the unstable queries we studied were not flagged as such by existing metrics; we argue
for an instability metric that reflects user experience. Second, part of our debugging methodology can
be automated, and would greatly benefit from tool support for debugging through proof sketches, and
replaying solver traces. Third, having multiple sources of randomness makes instability debugging
harder, so unifying all random behavior under explicit random seeds would be a step forward.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This material is based upon work supported by the Air Force Research Laboratory (AFRL) and Defense
Advanced Research Projects Agencies (DARPA) under Contract No. FA8750-24-C-B044. Any opinions,
ifndings, and conclusions or recommendations expressed in this material are those of the author(s) and
do not necessarily reflect the views of the AFRL and DARPA.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors did not use any generative AI tools in the writing of this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bosamiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takashima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heule</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parno</surname>
          </string-name>
          ,
          <article-title>Mariposa: Measuring SMT instability in automated program verification</article-title>
          ,
          <source>in: FMCAD</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawblitzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Howell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lorch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zill</surname>
          </string-name>
          ,
          <article-title>Ironclad apps: End-to-end security via automated full-system verification</article-title>
          ,
          <source>in: OSDI</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferraiuolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawblitzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parno</surname>
          </string-name>
          ,
          <article-title>Komodo: Using verification to disentangle secure-enclave hardware from software</article-title>
          ,
          <source>in: SOSP</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>McLaughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-A.</given-names>
            <surname>Jaloyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rabe</surname>
          </string-name>
          ,
          <article-title>Enhancing proof stability</article-title>
          ,
          <source>in: Dafny Workshop</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Amrollahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Preiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Niemetz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Charikar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <article-title>Using normalization to improve SMT solver stability</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bosamiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Heule</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parno</surname>
          </string-name>
          ,
          <article-title>Context pruning for more robust SMT-based program verification</article-title>
          ,
          <source>in: FMCAD</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lattuada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bosamiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cho</surname>
          </string-name>
          , H. LeBlanc, P. Srinivasan,
          <string-name>
            <given-names>R.</given-names>
            <surname>Achermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chajed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawblitzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Howell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Lorch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Padon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parno</surname>
          </string-name>
          ,
          <article-title>Verus: A practical foundation for systems verification</article-title>
          , in: SOSP,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          , L. de Moura,
          <article-title>Complete instantiation for quantified formulas in satisfiabiliby modulo theories</article-title>
          ,
          <source>in: CAV</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>L. M. de Moura</surname>
          </string-name>
          , N. Bjørner,
          <article-title>Z3: An eficient SMT solver</article-title>
          , in: TACAS,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cebeci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , G. Candea,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pit-Claudel</surname>
          </string-name>
          ,
          <article-title>Practical verification of system-software components written in standard c</article-title>
          ,
          <source>in: SOSP</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lattuada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawblitzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Howell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parno</surname>
          </string-name>
          ,
          <article-title>Storage systems are distributed systems (so verify them that way!)</article-title>
          , in: OSDI,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>L. de Moura</surname>
          </string-name>
          , N. Bjørner, Relevancy Propagation,
          <source>Technical Report MSR-TR-2007-140</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Dafny</given-names>
            <surname>Reference</surname>
          </string-name>
          <article-title>Manual: Measuring stability</article-title>
          , https://dafny.
          <source>org/v3.9</source>
          .0/DafnyRef/DafnyRef.html# 25751-measuring-stability,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bordis</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. R. M. Leino</surname>
          </string-name>
          ,
          <article-title>Free facts: An alternative to ineficient axioms in Dafny</article-title>
          , in: Formal Methods,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>K. R. M. Leino</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Pit-Claudel</surname>
          </string-name>
          ,
          <article-title>Trigger selection strategies to stabilize program verifiers</article-title>
          ,
          <source>in: CAV</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cadar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dunbar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Engler</surname>
          </string-name>
          ,
          <article-title>KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs</article-title>
          , in: OSDI,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Padon</surname>
          </string-name>
          , G. Losa,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sagiv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shoham</surname>
          </string-name>
          ,
          <article-title>Paxos made epr: decidable reasoning about distributed protocols</article-title>
          ,
          <source>in: OOPSLA</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pulte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Makwana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sewell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Memarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sewell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Krishnaswami</surname>
          </string-name>
          , CN:
          <article-title>Verifying systems C code with separation-logic refinement types</article-title>
          ,
          <source>in: POPL</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hamza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Voirol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kunčak</surname>
          </string-name>
          ,
          <string-name>
            <surname>System</surname>
            <given-names>FR</given-names>
          </string-name>
          :
          <article-title>Formalized foundations for the Stainless verifier</article-title>
          , in: OOPSLA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Willems</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlaipfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bouissou</surname>
          </string-name>
          ,
          <article-title>Helping users to reduce brittleness in their Dafny programs - a success story</article-title>
          ,
          <source>in: Dafny Workshop</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>K. R. M. Leino</surname>
            ,
            <given-names>Dafny:</given-names>
          </string-name>
          <article-title>An automatic program verifier for functional correctness</article-title>
          ,
          <source>in: LPAR</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>K. L. McMillan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Padon</surname>
          </string-name>
          ,
          <article-title>Ivy: A multi-modal verification tool for distributed algorithms</article-title>
          , in: CAV,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parno</surname>
          </string-name>
          , The Mariposa Benchmark, https://github.com/secure-foundations/ mariposa-data,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <article-title>TPot query: raw</article-title>
          .smt2, https://github.com/dslab-epfl/tpot/blob/unstable-queries/unstable-queries/ raw.smt2,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cebeci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bjørner</surname>
          </string-name>
          ,
          <article-title>Can's fix to relevancy propagation</article-title>
          , https://github.com/Z3Prover/z3/commit/ 8c5abdf818ebd674d57a8dc43f3fe04d73f6ab73,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <article-title>TPot query: april2nd</article-title>
          .smt2, https://github.com/dslab-epfl/tpot/blob/unstable-queries/ unstable-queries/
          <year>april2nd</year>
          .smt2,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cebeci</surname>
          </string-name>
          ,
          <article-title>Mark fixed_eq literals as relevant</article-title>
          , https://github.com/Z3Prover/z3/pull/7533,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <article-title>TPot query: ofset</article-title>
          .smt2, https://github.com/dslab-epfl/tpot/blob/unstable-queries/ unstable-queries/ofset.smt2,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <article-title>TPot query: bv_rewrite</article-title>
          .smt2, https://github.com/dslab-epfl/tpot/blob/unstable-queries/ unstable-queries/bv_rewrite.smt2,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R.</given-names>
            <surname>Leino</surname>
          </string-name>
          ,
          <article-title>Z3 issue: Unexpected proof failure due to minor input tweak</article-title>
          , https://github.com/Z3Prover/ z3/issues/7444,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>K.</given-names>
            <surname>McMillan</surname>
          </string-name>
          ,
          <article-title>Z3 issue: Performance regression on stratified EPR formula</article-title>
          , https://github.com/ Z3Prover/z3/issues/7515,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cebeci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bjørner</surname>
          </string-name>
          ,
          <article-title>Fix complete_partial_func for finite domains</article-title>
          , https://github.com/Z3Prover/ z3/pull/7533,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D.</given-names>
            <surname>Winterer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Validating SMT solvers via semantic fusion</article-title>
          ,
          <source>in: PLDI</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dutertre</surname>
          </string-name>
          , L. de Moura,
          <article-title>The Yices SMT solver</article-title>
          , https://yices.csl.sri.com/papers/tool-paper.pdf,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Bjørner</surname>
          </string-name>
          , L. M. de Moura,
          <article-title>Tractability and modern satisfiability modulo theories solvers</article-title>
          , in: L.
          <string-name>
            <surname>Bordeaux</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Hamadi</surname>
          </string-name>
          , P. Kohli (Eds.), Tractability: Practical Approaches to Hard Problems,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawblitzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Howell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kapritsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lorch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Zill,</surname>
          </string-name>
          <article-title>IronFleet: Proving practical distributed systems correct</article-title>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>N.</given-names>
            <surname>Amin</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. R. M. Leino</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Rompf</surname>
          </string-name>
          ,
          <article-title>Computing with an SMT solver</article-title>
          , in: TAP,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Jonáš</surname>
            <given-names>Fiala</given-names>
          </string-name>
          , SMTScope, https://github.com/viperproject/smt-scope,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>N.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <article-title>The Axiom Profiler: Understanding and debugging SMT quantifier instantiations</article-title>
          ,
          <source>in: TACAS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>