The Blame Game for Property-based Testing

Alberto Momigliano and Mario Ornaghi
DI, Università di Milano

Abstract. We report on work in progress aiming to add blame features to property-based testing in logic programming, in particular w.r.t. the mechanized meta-theory model checker αCheck. Once the latter reports a counterexample to a property stated in αProlog, a combination of abduction and proof explanation tries to help the user locate the part of the program that is responsible for the unintended behavior. To evaluate whether these explanations are in fact useful, we need an unbiased collection of faulty programs, whose bug locations are unknown to us. We have thus implemented a mutation testing tool for αProlog that generates such a set. Preliminary experiments point to the usefulness of our blame allocator. The mutator is of independent interest, allowing us to gauge the effectiveness of the various strategies of αCheck in finding bugs in αProlog specifications.

1 Introduction

Property-based testing [13] (PBT) is a lightweight validation technique by which we try to refute executable specifications against automatically generated data, typically produced in a pseudo-random way. Once the idea broke out in the functional programming community with QuickCheck, it spread to most programming languages and has also turned into a commercial enterprise [15].

PBT also has a pedagogical value, and our students seem to enjoy using it during their functional programming coursework. After type-checking succeeds, the PBT message "Ok, passed 100 tests", whereby the tool reports the failure to refute some property, gives the students a warm fuzzy feeling of having written correct code. This feeling comes to a sudden halt when a counterexample is reported, signaling a mismatch between code and specification. In our experience, the student freezes, disconnects the brain, typically uttering something like "Prof, there's a problem here", and refrains from taking any further action.

Now, no one likes to be wrong, not only students, and this is why testers and programmers may be different entities in a software company. While a counterexample to a property is a more useful response than, say, the false of a failed logic programming query (or a core dump, for that matter), we can do better and go beyond the mere reporting of that counterexample; namely, try to circumscribe its origin. After all, the program under test may be arbitrarily large and, arguably, even partial solutions pinpointing the slice of the program involved in an error can help with the painful process of bug fixing.

"Where do bugs come from?" is a crucial question that has sparked a huge amount of research. In this short paper, we report on preliminary work addressing a much, much smaller sub-problem. First, the domain is mechanized meta-theory model-checking [4], that is, the validation of the mechanization in a logical framework of the meta-theory of programming languages and related calculi [11, 10]. To fix ideas, think about the formal verification of compiler correctness [21], and look no further than [18] for more impressive case studies. In this domain, the specifications under validation correspond to the theorems that the formal system under test should obey and are therefore trusted. Hence, we put the blame for a counterexample on the system encoding, not on a possibly erroneous property.
Secondly, and consequently, we restrict ourselves to (nominal) logic programming, that is, to encodings written in αProlog [7] and properties checked by αCheck [5]. Logic programming, of course, is particularly suited to the task of blame assignment. In fact, the related notion of declarative debugging [3], starting with Shapiro's thesis, originated there. Our approach may be applicable outside those two boundaries, but we do not have enough evidence to make any strong claim.

Our blame tool is written in Prolog, a choice taken mostly out of convenience, since it allows us the flexibility to experiment with ideas by meta-interpretation, rather than wiring them in for good in αProlog's OCaml implementation. In fact, meta-interpreters are ruled out in αProlog by the interaction of its strong typing with its being a first-order language.

We picture the architecture of the tool in Fig. 1: we write models and specs in αProlog and we pass them to αCheck for validation. If a counterexample is found, we translate the αProlog code to standard Prolog and feed it to our explanation tool together with the said counterexample. The tool returns an explanation that can hopefully be used to debug the αProlog model, and so on. This architecture, of course, applies only to αProlog code that can be faithfully executed in standard Prolog, and therefore it does not natively support nominal features.

Fig. 1. Flow of the blame game

2 A motivating example

Suppose that we wish to study the following grammar (adapted from [1]), which characterizes all the strings with the same number of a's and b's:

S ::= . | bA | aB
A ::= aS | bAA
B ::= bS | aBB

Here is the bulk of the αProlog code, where we have omitted the obvious definitions of list-related predicates:

ab : type.
a,b : ab.
pred ss(list(ab)).
pred aa(list(ab)).
pred bb(list(ab)).
pred count(ab,list(ab),nat).

ss([]).
ss([b|W]) :- ss(W).
ss([a|W]) :- bb(W).
bb([b|W]) :- ss(W).
bb([a|VW]) :- bb(V), bb(W), append(V,W,VW).
aa([a|W]) :- ss(W).
aa([b|VW]) :- aa(V), aa(W), append(V,W,VW).

However, our encoding is flawed; we ask the reader to suspend her disbelief, since the bug is quite apparent. Still, it is easy to conjure an analogous scenario where the grammar has dozens of productions and the bugs are harder to spot.

We shall use αCheck to debug it. We split the characterization of the grammar into soundness and completeness properties:

#check "sound" 10 : ss(W), count(a,W,N1), count(b,W,N2) => N1 = N2.
#check "compl" 10 : count(a,W,N), count(b,W,N) => ss(W).

The tool dutifully reports (at least) two counterexamples:

Checking for counterexamples to
sound: ss(W), count(a,W,N1), count(b,W,N2) => N1 = N2
Checking depth 1 2 3
Total: 0.000329 s:
N1 = z, N2 = s(z), W = [b]

compl: count(a,W,N), count(b,W,N) => ss(W)
Checking depth 1 2 3 4 5 6 7 8 9
Total: 0.016407 s:
N = s(z), W = [b,a]

Now that we have the counterexamples, where is the bug? More precisely, which clause(s) shall we blame?

The #check pragma of αCheck corresponds to specification formulas of the form

∀X. G ⊃ A    (1)

where G is a goal and A an atomic formula (including equality and freshness constraints). A (finite) counterexample is a grounding substitution θ providing values for X that satisfy the negation of (1): that is, such that θ(G) is derivable, but the conclusion θ(A) is not. In this paper, we identify negation with negation-as-failure (NAF) and we will not make essential use of nominal features.
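Operationally, refuting (1) means enumerating candidate values for the variables in X and then checking, via NAF, whether the conclusion fails while the premise holds. To convey the flavor of the data generation involved, here is a hand-rolled Prolog sketch of an exhaustive, type-driven generator for lists of a's and b's; this is not the code αCheck synthesizes (in particular, we add an explicit depth bound, whereas the tool couples its generators with iterative deepening), and the predicate names gen_ab/1 and gen_ablist/2 are ours:

% Sketch only: a bounded, exhaustive generator for values of type list(ab).
% The depth bound (a numeral z, s(z), ...) caps the length of the list,
% so that larger bounds enumerate strictly more candidates.
gen_ab(a).
gen_ab(b).

gen_ablist(_, []).                  % the empty list, at any depth
gen_ablist(s(D), [X|Xs]) :-         % otherwise pick a letter ...
    gen_ab(X),
    gen_ablist(D, Xs).              % ... and recurse with a smaller bound

% ?- gen_ablist(s(s(z)), W).
% W = [] ; W = [a] ; W = [a,a] ; W = [a,b] ; W = [b] ; W = [b,a] ; W = [b,b].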
The tool synthesizes (trusted) type-based exhaustive generators for the free variables of the conclusion, so that NAF executes safely. A complete search strategy, here naive iterative deepening, searches for a proof of:

∃X1:τ1, . . . , Xn:τn. G ∧ gen[[τ1]](X1) ∧ ··· ∧ gen[[τn]](Xn) ∧ not(A)    (2)

To wit, completeness will be the query:

∃W. count(a,W,N), count(b,W,N), gen_ablist(W), not(ss(W)).

The predicate gen_ablist(W) corresponds to the generator formula gen[[list(ab)]](W) for lists of a's and b's. In the soundness query we know by mode analysis that we do not need generators for numerals.

For a query such as the above and, more generally, such as (2) to unexpectedly succeed, two (possibly overlapping) things may have gone wrong [22]:

MA: the atom A fails, whereas it belongs to the intended interpretation of its definition (missing answer);
WA: a bug in G creates some erroneous bindings, which again make A fail (wrong answer): some atom in G succeeds but it is not in the intended model.

Our "old-school" idea consists in coupling 1) abduction [17], to try and diagnose MA's, with 2) proof verbalization, that is, presenting, at various levels of abstraction, proof-trees for WA's as a means to explain where, if not why, the WA occurred. Of course, since we do not know, in principle, which is which and the two phenomena can occur simultaneously, we are going to use some simple heuristics to drive this process. What we would rather keep our distance from is full declarative debugging (see, e.g., [8]), as our experience suggests that making the user the oracle potentially at each step is just too much work.

Our notion of abduction is a simple-minded proof procedure that finds a set of assumptions ∆ such that Γ, ∆ ⊢ G0, but Γ ⊬ G0, for given Γ and G0. In our limited setting of first-order logic programming, we, as expected, internalize Γ as a fixed program P and we see ∆ as an (additive) conjunction of atoms. An abduction procedure depends on the selection of the abducibles. This choice is related to which part of the program we trust and which we want to investigate. The system will suggest abducibles based on the dependency graph of the conclusion of the property under investigation, but we give the final say to the user, while trying to make it easy to change one's mind.

We now work through the above example. Our tool reads the αProlog files into standard Prolog, adding names to rules and treating type information as inactive predicates via appropriate operator declarations. For example, here is the ss predicate and the related declarations:

ab :: type.
a :: ab.
b :: ab.
pred ss(list(ab)).

ss([]) :- rule(s1).
ss([b|W]) :- rule(s2), ss(W).
ss([a|W]) :- rule(s3), bb(W).

To begin with, we need to decide what to investigate and what to trust. Consistently with our faith in properties, we will trust count and append:

trusted(append(_,_,_)).
trusted(count(_,_,_)).

The dependency graph suggests (quite unhelpfully here) putting all of aa, bb, ss as abducibles:

assumable(ss(_)).
assumable(bb(_)).
assumable(aa(_)).

For exposition's sake, let us investigate the completeness bug first, that is, the failure of ss([b,a]). This is a case of MA, since that string should be accepted. The tool returns this explanation:

ss([b,a]) for rule s2, since:
  ss([a]) for rule s3, since:
    bb([]) for assumed

Once you get used to the fairly primitive verbalization, you see that, had bb([]) held, ss([b,a]) would have succeeded and the counterexample would not have been found.
However, there is no production for B (i.e., bb) that accepts the empty string, so we should blame ss, namely either s2 or s3. By comparing the clauses to the productions they are supposed to encode, we see that the body of s2 should be aa(W) and not ss(W).

For the soundness bug, it is obvious that it is a case of WA. We trust unification, all the more so since here it is just syntactic equality. Hence we need to understand what went wrong in the premise ss(W) (remember, we trust count). The tool answers:

ss([b]) for rule s2, since:
  ss([]) for fact s1

Since s1 is OK, it is again s2's fault. Once we fix that bug, αCheck does not report any more issues.

3 How it works

It is natural to split the definitions of the program under test into builtin, which we assume to be non-problematic, and those we want to look into (pred, for compatibility with αProlog type declarations). By default, every user-defined predicate is under suspicion. However, once a pred overcomes our scrutiny, or we believe it, perhaps temporarily, to be well-behaved, we can graduate it to being trusted and let it off the hook. Builtins and trusted predicates are not meta-interpreted, that is, we simply call them. Abducibles must be non-trusted pred definitions.

The abduction/explanation mechanism can be formulated in terms of a calculus AE(P), based on the notion of proof-tree (see [14], for example), whose nodes are constructed with the following inference rules:

  Gσ            G1   G2            Gi
 ---- r        --------- ∧       --------- ∨i  (i = 1 or 2)      ------ >
  Aσ            G1 ∧ G2           G1 ∨ G2                          true

 ---- asm      ===== builtin     ===== trusted
  Aσ             Aσ                Aσ

Given a fixed program P, rule r corresponds to a single SLD step with clause A :- rule(r), G in P. Rules >, ∧, ∨ are self-explanatory. Rule asm closes a proof whenever A is declared as abducible. The last two rules are somewhat different (hence the double bar): builtin and trusted both close the proof returning an answer substitution. In the builtin case, the substitution σ comes from an external computation of the builtin predicate. In the trusted case, we apply the rule only if there is a closed proof-tree, i.e., one without assumptions, returning σ. In fact, this is not technically a finitary rule, in the same sense that NAF is not; nevertheless, the way in which we use the calculus, namely by replaying already found counterexamples, will ensure that the rule is effective. For example, the second clause for bb asks for a computation of append(V,W,VW) where VW is ground. This is simply executed as a Prolog query outside the calculus, returning the ground substitution σ.

The meta-interpreter computes over those proof-trees, simultaneously collecting abducible assumptions and Prolog terms encoding the proof-tree itself, to be later verbalized. The meta-interpreter enjoys, we claim, the following correctness property: a proof-tree computed by the meta-interpreter is a proof in the calculus above. Conversely, for all goals G, if there is a proof in AE(P) of Gσ, the interpreter will produce a more general proof-tree for Gδ with the same sequence of rules.

Proof verbalization. Clearly, proof-trees of height 2 such as the ones occurring in Section 2 are not that difficult to interpret. However, we address the general case by offering various levels of explanation, from printing out the whole tree to presenting a skeleton containing only the rule names involved. We currently do this with different forms of printf, while it would be more flexible to use a form of proof distillation [2], by which the proof-tree itself is pruned at the desired level of explanation and only then verbalized.
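To make the preceding description concrete, the following is a minimal sketch of such an abductive meta-interpreter; it is our own simplification, not the actual tool. Here solve(Goal, Assumptions, ProofTree) mirrors the rules of AE(P), collecting the abduced atoms and a term encoding the proof-tree; the declarations builtin/1, trusted/1 and assumable/1 are those of Section 2, the rule-named object program is assumed to be loaded and inspectable via clause/2 (e.g., declared dynamic), and negation and freshness are omitted.

:- dynamic builtin/1, trusted/1, assumable/1.    % declarations as in Section 2

% solve(+Goal, -Assumptions, -ProofTree): each clause mirrors one rule of AE(P).
solve(true, [], tt) :- !.
solve((G1,G2), As, and(T1,T2)) :- !,             % rule for conjunction
    solve(G1, As1, T1),
    solve(G2, As2, T2),
    append(As1, As2, As).
solve((G1;G2), As, or(T)) :- !,                  % rule for disjunction
    ( solve(G1, As, T) ; solve(G2, As, T) ).
solve(A, [], builtin(A)) :-
    builtin(A), !, call(A).                      % rule "builtin": external call
solve(A, [], trusted(A)) :-
    trusted(A), !, call(A).                      % rule "trusted": called, not inspected
solve(A, As, rule(R, A, T)) :-                   % rule "r": one SLD step with a named clause
    clause(A, Body0),
    ( Body0 = (rule(R), Body) -> true ; Body0 = rule(R), Body = true ),
    solve(Body, As, T).
solve(A, [A], assumed(A)) :-
    assumable(A).                                % rule "asm": abduce A as a leaf

% On the running example, the first solution reproduces the explanation of Section 2:
% ?- solve(ss([b,a]), As, T).
% As = [bb([])],
% T  = rule(s2, ss([b,a]), rule(s3, ss([a]), assumed(bb([])))).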
Heuristics. Suppose you have a counterexample for property ∀X. G ⊃ A: how do you go about trying to decide whether it is an MA or a WA? We suggest the following steps with the blame tool:

1. If A is a builtin (including constraints) or a trusted predicate, it is clear that it is a case of WA and we directly verbalize the proof tree of the premise(s).
2. Otherwise, we use abduction to investigate the failure of A; start by adding A and its dependency graph as abducibles.
   2.1 If the failure of A points to a case of MA, use the explanation mechanism to see if there is a clause missing in the program or which clause is otherwise involved.
   2.2 If the set of assumptions found by abduction is "unreasonable", that is, we have atoms that should not hold in the intended model, it is likely a case of WA and we turn to verbalizing the proof-tree for G.

4 Mutation testing for αProlog

Mutation analysis [16] aims to evaluate software testing techniques with a form of white-box testing, whereby a source program is changed in a localized way by introducing a single (syntactic) fault. The resulting program is called a "mutant" and hopefully is also semantically different from its ancestor. A test suite should recognize the faulty code, which is known as "killing" the mutant. The higher the number of killed mutants, the better the test suite.

What is the connection with our endeavor? A killed mutant is a good candidate for blame assignment, since it should simulate reasonable bugs that occur in a program without being planted by us. This is justified by the "competent programmer assumption" [16], according to which programmers tend to develop programs close to the correct version, and thus the difference between current and correct code for each fault is small. A mutator provides us (automatically) with a multitude of faulty programs to try and explain. At the same time, such a tool helps to gauge the effectiveness of αCheck in locating bugs, under its various strategies. In previous work we either had to use faulty models from the literature [6], or to resort to manual generation of mutants [19] to compare αCheck with other PBT tools; this is obviously biased, labor-intensive and very hard to scale.

We have thus developed a mutator for αProlog programs, with an emphasis on trying to devise mutation operators for our intended domain, namely models of programming language artifacts. Those operators specify the mutations that the tool will inject, and therefore a mutator is only as effective as its operators are relevant. Mutation testing originated with imperative languages and the operators developed there are of little use to us. Even operators proposed for declarative programs [9, 20] make only partial sense. In particular, those in [9] are not only completely oblivious to αProlog's type discipline, but mostly address the operational semantics of Prolog: clause reordering, permutation of cuts, etc. These are not relevant to us, since the checker uses a complete search strategy and does not understand cuts. Keeping in mind that we strive to produce mutants that cannot already be detected by static analysis, such as type checking in αProlog or singleton-variable analysis in standard Prolog, we have converged on this initial set of operators:

Clause mutations: deletion of a predicate in the body of a clause, deleting the whole clause if a fact (see the sketch after this list). Replacement of disjunction by conjunction and vice versa.
Operator mutations: arithmetic and relational operator mutation, say < into ≤.
Variable mutations: replacing a variable with an (anonymous) variable and vice versa.
Constant mutations: replacing a constant by a constant (of the same type), or by an (anonymous) variable and vice versa.
αProlog-specific mutations: removing freshness annotations (#), changing their scope or confusing them with equality.
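To give the flavor of the implementation, here is a sketch of the first clause mutation operator. It is written over a simplified clause representation of our own choosing (a Head-Body pair, with the body as a list of atoms), not the tool's actual data structures, and delete_in_body/2 is a hypothetical name:

:- use_module(library(lists)).                   % select/3

% delete_in_body(+Clause, -Mutant): drop one body atom; if the clause is a
% fact, mark the whole clause for deletion.
delete_in_body(Head-[], deleted(Head)).          % a fact: delete the clause itself
delete_in_body(Head-Body, Head-Smaller) :-
    Body = [_|_],
    select(_Atom, Body, Smaller).                % nondeterministically remove one atom

% Each solution is one mutant of the clause, e.g. on the second clause of bb:
% ?- delete_in_body(bb([a|VW])-[bb(V),bb(W),append(V,W,VW)], M).
% M = bb([a|VW])-[bb(W),append(V,W,VW)] ;
% M = bb([a|VW])-[bb(V),append(V,W,VW)] ;
% M = bb([a|VW])-[bb(V),bb(W)].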
In our implementation, a simple driver produces a given number of randomly generated mutants from an αProlog source P and a list LM of mutation operators. At every generation step we make a random choice of an operator in LM and of a clause in P such that the operator is applicable, in the spirit of [27]. If the mutant is new, i.e., it has not been produced before, it is passed to the next phase: here, we try to weed out semantically equivalent mutants following the proposal in [23] of comparing them with their ancestor on a finite portion of input. We achieve this by asking αCheck to try to refute their equivalence up to a given bound. Keeping the bound small makes this a fairly quick, if imperfect, filter. If the mutant survives, we pass it again to αCheck, this time under the property that the original model was supposed to satisfy. If it fails the check, we go back to the flow in Fig. 1.

We have been experimenting with the mutation and explanation of three main case studies:

1. The preservation and progress properties for the typed arithmetic language in Chapter 8 of [25].
2. Non-interference for static information-flow systems, which we had started to address in [6].
3. Type preservation for low-level abstract machines, as in [19].

While the results are encouraging, we are still in the process of collecting evidence.

5 Conclusions and future work

We have given an "appetizer" of our approach to blame assignment for counterexamples found by PBT. We have used old-fashioned ideas from the literature, such as abduction and explanation trees, trying to strike a reasonable balance with declarative debugging: we only ask for the user's input when we let her decide whether the answer given by abduction is sensible; this is quite different, we argue, from the constant prodding that declarative debugging requires. Our approach is also much simpler (perhaps simple-minded) than previous work aiming to explain why a query is true or false in a model, typically under the answer set semantics, as in, e.g., [26].

There is so much work that lies in front of us: we need to gather more experience using the tool, especially with users other than the authors. This is particularly crucial for tuning the amount of detail in proof verbalization.

Mutation testing for αProlog is of independent interest, beyond the way we have used it here. It allows us to evaluate how well αCheck performs in spotting buggy models, similarly to the interaction of MuCheck [20] with QuickCheck. However, designing mutation operators that make sense in our intended domain is an open question; see [24] for the more general issue of mutations vs. real faults. The operators that we have proposed are a starting point, but it would be nice to provide the user with a basic DSL to encode her own.

Finally, the main drawback of the current setup is that we can play the blame game only for αProlog programs that are indeed Prolog code, i.e., those that do not use nominal features, since these cannot be faithfully replayed in standard Prolog.
The new version, already under development, will instead handle the whole of αProlog, by porting the blame tool to the latter, where αProlog programs are reified as object-level clauses following the two-level approach [12].

References

1. J. Blanchette. Picking Nits: A User's Guide to Nitpick for Isabelle/HOL. TUM, 2018. http://isabelle.in.tum.de/dist/doc/nitpick.pdf.
2. R. Blanco, Z. Chihani, and D. Miller. Translating between implicit and explicit versions of proof. In CADE, volume 10395 of LNCS, pages 255–273. Springer, 2017.
3. R. Caballero, A. Riesco, and J. Silva. A survey of algorithmic debugging. ACM Comput. Surv., 50(4):60:1–60:35, Aug. 2017.
4. J. Cheney and A. Momigliano. Mechanized metatheory model-checking. In PPDP, pages 75–86. ACM, 2007.
5. J. Cheney and A. Momigliano. αCheck: A mechanized metatheory model checker. TPLP, 17(3):311–352, 2017.
6. J. Cheney, A. Momigliano, and M. Pessina. Advances in property-based testing for αProlog. In B. K. Aichernig and C. A. Furia, editors, TAP, volume 9762 of LNCS, pages 37–56. Springer, 2016.
7. J. Cheney and C. Urban. Nominal logic programming. ACM Trans. Program. Lang. Syst., 30(5):26:1–26:47, 2008.
8. N. Dershowitz and Y. Lee. Logical debugging. J. Symb. Comput., 15(5/6):745–773, 1993.
9. A. Efremidis, J. Schmidt, S. Krings, and P. Körner. Measuring coverage of Prolog programs using mutation testing. CoRR, abs/1808.07725, 2018.
10. G. Fachini and A. Momigliano. Validating the meta-theory of programming languages. In SEFM, volume 10469 of LNCS, pages 367–374. Springer, 2017.
11. A. P. Felty and A. Momigliano. Reasoning with hypothetical judgments and open terms in Hybrid. In PPDP, pages 83–92. ACM, 2009.
12. A. P. Felty and A. Momigliano. Hybrid - A definitional two-level approach to reasoning with higher-order abstract syntax. J. Autom. Reasoning, 48(1):43–105, 2012.
13. G. Fink and M. Bishop. Property-based testing: a new approach to testing for assurance. ACM SIGSOFT Software Engineering Notes, pages 74–80, July 1997.
14. W. Hodges. Logical features of Horn clauses. In D. M. Gabbay, C. J. Hogger, and J. A. Robinson, editors, Handbook of Logic in Artificial Intelligence and Logic Programming (Vol. 1), pages 449–503. Oxford University Press, Inc., 1993.
15. J. Hughes. QuickCheck testing for fun and profit. In PADL'07, LNCS, pages 1–32, Berlin, Heidelberg, 2007. Springer-Verlag.
16. Y. Jia and M. Harman. An analysis and survey of the development of mutation testing. IEEE Trans. Softw. Eng., 37(5):649–678, Sept. 2011.
17. A. C. Kakas, R. A. Kowalski, and F. Toni. Abductive logic programming. Journal of Logic and Computation, 2(6):719–770, 1992.
18. C. Klein et al. Run your research: on the effectiveness of lightweight mechanization. In POPL '12, pages 285–296, New York, NY, USA, 2012. ACM.
19. F. Komauli and A. Momigliano. Property-based testing of the meta-theory of abstract machines: an experience report. In CILC, volume 2214 of CEUR, pages 22–39, 2018.
20. D. Le, M. A. Alipour, R. Gopinath, and A. Groce. MuCheck: An extensible tool for mutation testing of Haskell programs. In ISSTA 2014, pages 429–432. ACM, 2014.
21. X. Leroy. Formal verification of a realistic compiler. CACM, 52(7):107–115, 2009.
22. L. Naish. A declarative debugging scheme. Journal of Functional and Logic Programming, 1997(3):3–29, 1997.
23. A. J. Offutt and J. Pan. Automatically detecting equivalent mutants and infeasible paths. Software Testing, Verification and Reliability, 7(3):165–192, 1997.
24. M. Papadakis, D. Shin, S. Yoo, and D.-H. Bae. Are mutation scores correlated with real fault detection? In ICSE '18, pages 537–548. ACM, 2018.
25. B. C. Pierce. Types and Programming Languages. MIT Press, 2002.
26. C. Viegas Damásio, A. Analyti, and G. Antoniou. Justifications for logic programming. In P. Cabalar and T. C. Son, editors, LPNMR, pages 530–542. Springer, 2013.
27. L. Zhang, M. Gligoric, D. Marinov, and S. Khurshid. Operator-based and random mutant selection: Better together. In ASE, pages 92–102. IEEE, 2013.