=Paper=
{{Paper
|id=Vol-1785/W47
|storemode=property
|title=The Impact of Proof Steps Sequence on Proof Readability - Experimental Setting
|pdfUrl=https://ceur-ws.org/Vol-1785/W47.pdf
|volume=Vol-1785
|authors=Karol Pak,Aleksy Schubert
|dblpUrl=https://dblp.org/rec/conf/cikm/PakS16
}}
==The Impact of Proof Steps Sequence on Proof Readability - Experimental Setting==
The impact of proof steps sequence on proof readability — experimental setting? Karol Pąk1 and Aleksy Schubert2 1 Institute of Informatics, University of Białystok, ul. K. Ciołkowskiego 1M, 15-245 Białystok, Poland pakkarol@uwb.edu.pl 2 Institute of Informatics, University of Warsaw ul. S. Banacha 2, 02–097 Warsaw, Poland alx@mimuw.edu.pl Abstract. Different sequences of proof steps may result in different ex- periences of the overall formal proof clarity. There is a conjecture that sequence in which more steps refer to the content of the preceding state- ment (1) is more comprehensible than the one in which references span longer distances (2). We studied the claim in experimental setting where subjects indicated their reading preference with two versions of the same proof. The difference between their reports indicates that contrary to the conjecture, proofs of the kind (2) can give cognitive advantage in a statistically significant way. 1 Introduction The readability of formal proofs, as in the case of computer programs, has signifi- cant impact on the process of proof formalisation development and maintenance. This phenomenon has already been observed in the programme of Nicolas Bour- baki — the resulting proofs written in formalist fashion were perceived as too obscure and the whole project gained the opinion of unwieldy [12]. Moreover, there are numerous situations in which proof scripts are indeed read, e.g. when a proof development is used as a library of facts [4], when users try to learn new proving techniques [8], or when users want to strengthen theorems [2]. There are many factors that have an impact on readability. In this paper we focus on the psychological readability factors associated with locality of refer- ence. They can be summarised in a high-level statement due to Behaghel that elements that belong close together intellectually should be placed close together [1] and this statement concerns particularly systems such as Mizar [3] and Is- abelle/Isar [13], which try to construct proofs close to natural language ones. Discussions among Mizar Mathematical Library developers [9] led to the conclusion that proof steps that refer to the preceding step are perceived as more comprehensible. This led to construction of algorithms that rearranged ? The paper has been supported by the resources of the Polish National Science Centre granted by decision n◦ DEC-2012/07/N/ST6/02147. 2 Karol Pąk and Aleksy Schubert Mizar proof scripts so that they maximise the number of such references [9] and further analysis of the complexity of the rearrangement process [10,11]. One important step in understanding the applicability of the techniques is to assess their impact on real readers. This paper is an attempt in this direction. We investigate the influence of one of the employed techniques — changing the sequence of steps so that the number of then steps in continuous sequences is maximal. In our study we give two versions of the same proof to a substantial number of subjects and obtain their readability assessment. The paper is constructed as follows. In Section 2 we present the design of our experiment. Next, we present the statistical methods we employ and present the way our experiment was executed. This is done in Section 3. The results of the experiments are presented in Section 4 and discussed in Section 5. Many people helped us in organising the experiment so we acknowledge them in Section 6. 2 The Design of the Experiment 2.1 The General Idea of the Experiment It is not straightforward to devise an experiment in which two text structures are compared for readability. First of all, the feature of readability is subjective so a comparing judgement of a single person must be the basic unit of the measurement. Moreover, one must note that it is not directly valid to ask subjects to compare texts of different proofs for readability since then there may be many other structural factors that must be recognised and well understood to make such a comparison informative. Therefore, our first design choice is that one subject at one moment compares two proofs of the same mathematical statement and the result of the judgement is accounted in the study. This assumption brings one major difficulty in. Comparison of essentially the same proofs requires reading two texts that convey the same content. The texts are by the nature of the reading process read in sequence. Clearly, the process of reading of the second text is strongly influenced by that of the first text. We can now expect two effects: 1. either the subjects will judge the second text as more readable due to the fact that it reflects something that is already known, 2. or the subjects will judge the second text as less readable since they already familiarised themselves with the structure of the first one and perceive the text of the second one as alien and so less favourable. It turns out that the second effect is much stronger than the first one (we provide statistical evidence for this in Section 4.2). Therefore, we decided that meaningful responses are only those that counter this strong trend and that such responses indicate visible cognitive advantage to the subjects that behave in this way. This assumption requires us to create two contrasting situations in which subjects can counter this trend. We decided here to distribute among subjects two variants of the test: one that presents a version A of the proof first and then a version B, and one that presents the version B first and A subsequently. The impact of proof steps sequence on proof readability. . . 3 One more important issue to deal with here is making sure that the tests are not neglected by the subjects. That is why we decided that the test should have a form of an assignment in which the analysis of the proofs is required to give an answer. In the end we decided that a good way to achieve this is to introduce a mistake into the proof and instruct subjects to find it. The subjects were to obtain an award for fulfilling the task. 2.2 The Actual Tests The assignments given to the students were based upon a faulty proof of the wrong set-theoretic formula (X ∪ Y )\(X ∩ Y ) = (X\Y ) ∩ (Y \X). It has a correct counterpart (X ∪ Y )\(X ∩ Y ) = (X\Y ) ∪ (Y \X) the proof of which can be formalised in Mizar as presented in Fig. 1 (version A of the proof). This proof was taken as the basis for the experiment. We devised another proof (see Fig. 2) of the same fact (version B) in which certain two steps were permuted, which is reflected by the changed line numbers (7, 8) on the left margins of the proofs. The actual internal structure of the two versions is the same and is presented in Fig. 3 where labels of the nodes correspond directly to the labels on the left margins of both the proof in version A and B.3 In this way we can see that the proofs indeed differ only in the sequence of steps and the necessary Mizar syntax reorganisations. We expected that the version B will be perceived by subjects in our experiments as less readable than the version A (and this turned out to be actually false). Since we wanted to obtain judgements of many subjects and it is difficult to gather significantly many Mizar experts in one place, we had to rewrite the proof texts in natural language so that they are digestible by people who are not familiar with Mizar syntax. Moreover, the subjects in the experiment were Polish native speakers so the assignments had to be prepared in that language.4 We provide a faithful translation of the tests in Appendix A. The two versions of the proof were printed on one page of an A4 leaf. Half of the tests had version A of the proof located on the left-hand side of the page and version B on the right-hand one and half of the tests had version B located on the left-hand side and version A on the right-hand one. To make distribution of tests quick we did not strictly keep the size of two resulting groups even. However, we made sure that the groups are of substantial size (see Section 4). 3 Graph representation of proofs was defined by Pąk [9]. We invite interested readers to consult his publication for details. 4 The original tests are available in the package with experiment data at the address http://www.mimuw.edu.pl/~alx/proof-readability.zip 4 Karol Pąk and Aleksy Schubert theorem 1: (X ∨ Y) \ (X ∧ Y) = (X \ Y) ∨ (Y \ X) proof 2: f o r x h o l d s x i n (X ∨ Y) \ (X ∧ Y) i f f x i n (X \ Y) ∨ (Y \ X) proof 3: let x; 4: t h u s x i n (X ∨ Y) \ (X ∧ Y) i m p l i e s x i n (X \ Y) ∨ (Y \ X) proof 5: a s s u m e A1 : x i n (X ∨ Y) \ (X ∧ Y ) ; 6: t h e n n o t x i n (X ∧ Y) by XBOOLE_0: d e f 5 ; 7: t h e n A2 : n o t x i n X o r n o t x i n Y by XBOOLE_0: d e f 4 ; 8: x i n X o r x i n Y by A1 ,XBOOLE_0: d e f 3 ; 9: t h e n x i n (X \ Y) o r x i n ( Y \ X) by A2 ,XBOOLE_0: d e f 5 ; 10: h e n c e x i n (X \ Y) ∨ (Y \ X) by XBOOLE_0: d e f 3 ; end ; 11: t h u s x i n (X \ Y) ∨ (Y \ X) i m p l i e s x i n (X ∨ Y) \ (X ∧ Y) proof 12: a s s u m e x i n (X \ Y) ∨ (Y \ X ) ; 13: t h e n x i n (X \ Y) o r x i n (Y \ X) by XBOOLE_0: d e f 3 ; 14: t h e n A3 : x i n X & n o t x i n Y o r x i n Y & n o t x i n X by XBOOLE_0: d e f 5 ; 15: t h e n A4 : x i n (X ∨ Y) by XBOOLE_0: d e f 3 ; 16: n o t x i n (X ∧ Y) by A3 ,XBOOLE_0: d e f 4 ; 17: h e n c e x i n (X ∨ Y) \ (X ∧ Y) by A4 ,XBOOLE_0: d e f 5 ; end ; end ; h e n c e t h e s i s by TARSKI : 2 ; end ; Fig. 1. The Mizar proof the experiments were based upon, version A. theorem 1: (X ∨ Y) \ (X ∧ Y) = (X \ Y) ∨ (Y \ X) proof 2: f o r x h o l d s x i n (X ∨ Y) \ (X ∧ Y) i f f x i n (X \ Y) ∨ (Y \ X) proof 3: let x; 4: t h u s x i n (X ∨ Y) \ (X ∧ Y) i m p l i e s x i n (X \ Y) ∨ (Y \ X) proof 5: a s s u m e A1 : x i n (X ∨ Y) \ (X ∧ Y ) ; 6: t h e n A2 : n o t x i n (X ∧ Y) by XBOOLE_0: d e f 5 ; 8: A3 : x i n X o r x i n Y by A1 ,XBOOLE_0: d e f 3 , d e f 4 ; 7: n o t x i n X o r n o t x i n Y by A3 , A2 ,XBOOLE_0: d e f 4 ; 9: t h e n x i n (X \ Y) o r x i n ( Y \ X) by A3 ,XBOOLE_0: d e f 5 ; 10: h e n c e x i n (X \ Y) ∨ (Y \ X) by XBOOLE_0: d e f 3 ; end ; 11: t h u s x i n (X \ Y) ∨ (Y \ X) i m p l i e s x i n (X ∨ Y) \ (X ∧ Y) proof 12: a s s u m e x i n (X \ Y) ∨ (Y \ X ) ; 13: t h e n x i n (X \ Y) o r x i n (Y \ X) by XBOOLE_0: d e f 3 ; 14: t h e n A4 : x i n X & n o t x i n Y o r x i n Y & n o t x i n X by XBOOLE_0: d e f 5 ; 15: t h e n A5 : x i n (X ∨ Y) by XBOOLE_0: d e f 3 ; 16: n o t x i n (X ∧ Y) by A4 ,XBOOLE_0: d e f 4 ; 17: h e n c e x i n (X ∨ Y) \ (X ∧ Y) by A5 ,XBOOLE_0: d e f 5 ; end ; end ; 18: h e n c e t h e s i s by TARSKI : 2 ; end ; Fig. 2. A Mizar proof in which steps were permuted, version B. The impact of proof steps sequence on proof readability. . . 5 1 2 18 4 11 3 6 7 12 15 5 10 14 17 8 9 13 16 Fig. 3. The structures of the two versions of the proof. 3 The Execution of the Experiment The experiments were conducted on students of informatics at Faculty of Math- ematics, Informatics and Mechanics, University of Warsaw and Faculty of Math- ematics and Computer Science, University of Białystok. They were executed in three rounds of two slightly different settings. Round I took place in January 2015 and was conducted as a separate assign- ment in an exam with a slot of 10 minutes dedicated exclusively to the task of comparing the proofs. The award for students were additional points in the exam score. The topic of the exam was Logic for Computer Scientists. The subjects of the experiment were students of first year postgraduate studies. These students had studied the full curriculum of undergraduate informatics, including Foun- dations of Mathematics as taught on the first year of undergraduate studies. We can assume that their experience with formal systems, including programming languages, was good enough for this test. The experiment was conducted in parallel in two lecture halls which con- tained groups of students of approximately 35 people each. The assignments were given in parallel and in the conditions of the exam, which guarantee the lack of communication between subjects. Therefore, we can assume that the results were independent. The instruction of the assignment was as follows You are presented two proofs of the same fact. Both have the same flaw. Choose the version of the reasoning for which you can easier perform the following task: please show the place where the flaw is located and describe in one sentence what is the reason of the mistake. 6 Karol Pąk and Aleksy Schubert Therefore, the students were given information that the two texts are essentially equivalent and they were instructed to indicate which of the texts gave them cognitive advantage. Both versions of the proof were formulated using the same natural language phrases so that one formulation corresponds to the same kind of inference step. Round II took place in February 2015 and was conducted in Białystok. The procedure and tests were the same as in the case of Round I. The topic of the exam was Logic and Set Theory. The subjects of the experiment were students of first year undergraduate studies of mathematics and informatics. We can assume that their experience with formal systems was fair and enough for this test, since they were taught specifically to prove facts using natural deduction format that was close to the one used in Mizar. Still, we obtained significantly more answers than in Round I which indicated that subjects did not understand the subject matter of the proof. The exam was taken in two groups of approximately 20 and 40 people respectively Round III took place in January 2016 in Warsaw and was conducted as a sep- arate assignment during blackboard classes with a slot of 10 minutes dedicated exclusively to the task of comparing the proofs. The award for students were additional points in their score of the classes. The topic of the classes was Foun- dations of Mathematics. The subjects of the experiment were students of first year undergraduate studies. The experience of the students with formal systems was limited. The experiment was conducted at the end of the course on foun- dations of mathematics so their understanding of the notion of proof should be satisfactory for the purpose of our procedure. The experiment was conducted in 4 different groups of classes at different times. The size of the groups was approximately 10 people. However, we can assume that the measurements were independent since the students of the first year of undergraduate studies rarely communicate between different groups (all other classes are given in groups of the same composition) and the details of the assignment necessary to complete it successfully are not very specific and so very difficult to explain without having the actual test at hand. Therefore, we can safely assume that despite the assignments were not given in parallel their results are independent. Of course, the assignments within each of the group were executed so that subjects did not communicate one with another. The instruction of the assignment was as follows You are presented two proofs of the same fact. Find in them as many flaws as you can. In case a mistake occurs in both proofs, mark with a star the version in which you found it first. Is the structure of some of the proofs more readable for you? Therefore, the students were given information that the two texts are essentially equivalent and they were instructed to indicate which of the texts gave them cognitive advantage. Additionally, they were instructed to search for as many flaws as possible and mark places where the mistake was found first. We intro- The impact of proof steps sequence on proof readability. . . 7 duced this change to get deeper confidence that the proofs were read thoroughly and check if this change has impact on the effects (E1) and (E2) In this case the two versions of the proof were formulated in a different fashion. Moreover, as the proofs actually demonstrate equivalence, they divide into two parts: one for (⇒) and one for (⇐). We decided to have a different sequence of these parts in each version. The rationale behind these changes was to make line to line comparison more difficult and force subjects to perceive the structure of the proofs first and then judge their clarity. We would like to point out that in Round I and II the instruction was less direct and referred to easiness of the task while in Round III the instruction appealed directly to the intuitive understanding of the term readability. However, we would like to stress that each of the formulations actually means that the subjects were instructed to indicate their cognitive preference for the two versions of the proof, i.e. which of the texts is easier to digest. 3.1 The Mann-Whitney Test For our verification, we use the non-parametric Mann-Whitney test (also called Willcoxon test), as the use of standard parametric tests assumes the distributions are normal. This test is used to statistically refute a null hypothesis H0 in favour of its negation, i.e. the alternative hypothesis HA . This kind of test can be applied when the following prerequisites are met: Hypotheses The null hypothesis H0 for the study should be formulated in the fashion such that the probability of an observation from the population G1 exceeding an observation from the second population G2 is less then or equal the probability of an observation from G2 exceeding an observation from G1 , i.e. P (score(G1 ) > score(G2 )) ≤ P (score(G2 ) > score(G1 )). Consequently, the alternative hypothesis HA is that the probability of an observation from the group G1 exceeding an observation from the group G2 is greater than the probability of an observation from G2 exceeding an obser- vation from G1 , i.e. P (score(G1 ) > score(G2 )) > P (score(G2 ) > score(G1 )). Uniformity of populations We should also be able to assume that in case the distribution of results from G1 and G2 is the same, the probability of an observation from G1 exceeding one from G2 is equal to the probability of an observation from G1 exceeding one from G2 . In other words the background of both groups is the same and their members were chosen randomly from the point of view of the experiment. Independence We should be able to assume that the observations in G1 and G2 are independent. Comparable responses The scores should be numeric so they can be easily compared one with another so a single ranking of the subjects can be formed. The basic idea of the test is that we build a ranking list of the scores from the lowest to the highest (with possible ties). Then we try to get a numerical representation on which of the groups has more results close to the top one. 8 Karol Pąk and Aleksy Schubert Consider the following pattern n2 n2 (n2 + 1) X U = n1 n2 + − Ri 2 i=1 where U is the value of the Mann-Whitney U test, n1 is the sample size of G1 , n2 is the sample size of G2 and Ri is the ranking position of the i-th score in the group G2 . We can see that the more scores of G2 are closer to the top the number U is greater. Contemporary statistics packages return a normalised value Z of the statis- tical test that is computed according to the pattern U − mU Z= σU where mU is the mean of U and σU is the standard deviation of U , which are in turn computed using the formulas r n1 n2 n1 n2 (n1 + n2 + 1) mU = , σU = . 2 12 However, the formula for σU must be corrected in case ties (caused by equal scores of certain subjects) are present and it is then v u k u n1 n2 X t3i − ti σU = t ((n + 1) − ) 12 i=1 n(n − 1) where k is the number of tied positions, ti is the number of subjects that attained the i-th tied rank, and n = n1 + n2 . 3.2 Stouffer-Lipták Method Since the tests given to the subjects were different and the students were of different maturity we cannot directly sum the groups that took part in the two rounds. Therefore we need a method to combine results of three experiments. For this we use a meta-analysis technique called Stouffer-Lipták method [7]. In this method we compute a combined Z value based upon Z-values obtained from the Mann-Whitney test using the pattern (tailored to the case where three experiments are combined): w1 Z1 + w2 Z2 + w3 Z3 Z= p w12 + w22 + w32 where Z1 , Z2 , Z3 are the cumulative normal distribution values that match the probabilities 1−p1 , 1−p2 and 1−p3 with p1 , p2 , p3 being p-values obtained in two sub-experiments while w1 , w2 , w3 are weights of the sub-experiments that can be used to accommodate for different sample sizes (note that sub-experiment with The impact of proof steps sequence on proof readability. . . 9 a bigger number of subjects should weigh more than the sub-experiment with the smaller one, according to Lipták square roots of sample sizes are optimal here). The p-value for Z-value obtained in this way is 1−v where v is the probability that a normally distributed random number will be less than that this Z-value. 3.3 The Binomial Test The binomial test is used to check which of two outcomes has greater proba- bility. We assume here that our process has n outcomes represented by random X1 , . . . , Xn . These outcomes are either 0 (failure) or 1 (success) and variables P n now Y = i=1 Xi represents the number of successes in the process. We observe now that n k P (Y = k) = q (1 − q)n−k for k ∈ {0, . . . , n} k where q is the probability of success (i.e. that Xi = 1). We can now use the value Qn,q (α) of the quantile of order α for the distribution to reject H0 telling that q ≥ p0 versus the alternate hypothesis HA that q < p0 if and only if Y ≤ Qn,p0 (α) where α is the desired level of confidence. Typically p0 = 0.5 and we actually check that the probability of success is greater than the one of failure. 4 Results of the Experiment Since we use the Mann-Whitney test, we have to make sure that the assumptions of the test are met. First of all we have to ensure that the scores are numeric. In our setting we give the score 1 to those assignments that clearly indicate the right-hand side of the test. The other score, 0, is given when the left-hand side is indicated as preferred, when there is no preference and also when the assignment was solved incorrectly by the subject. We can now compare two populations depending on what was the content of the left-hand side (i.e. first side) of obtained assignment: (G1 ) these who obtained version A there with (G2 ) those who obtained version B. We understand that subjects in G1 , by choosing the right-hand side in their test, judged that version B gives them cognitive advantage while subjects in G2 , by choosing the same right-hand side, judged that version A gives them this. Our main statistical hypotheses can be formulated in the following way – H0a : the probability that a subject from G2 has the score 1 is greater or equal to the probability that a subject from G1 does it; a – HA : the probability that a subject from G2 has the score 1 is smaller than the probability that a subject from G1 does it. We have two more populations, G01 with subjects that obtained 0 in the assess- ment of assignments and G02 with subjects that obtained 1 there. This time we assign score 1 to all subjects of the groups. We can now formulate two other hypotheses 10 Karol Pąk and Aleksy Schubert – H0f : the probability that a subject from G02 has the new score 1 is greater than the probability that a subject from G01 does it; f – HA : the probability that a subject from G02 has the new score 1 is smaller or equal to the probability that a subject from G01 does it.5 Let us take a look at the as- sumptions of the Mann-Whitney test. This formulation of the two Round I pairs of hypotheses directly falls Version size mean sd under the format prescribed in Sec- V. A first (G1 ) 33 0 0 tion 3.1. In each case the leaves V. B first (G2 ) 43 0.1627907 0.3735437 with assignments were distributed Total 76 0.09210526 0.2910959 in sequence without any taking into account the subjects’ history (in Round II particular scores in earlier exams) Version size mean sd so we can assume that the popu- V. A first (G1 ) 34 0.08823529 0.2879022 lations of all the groups had the V. B first (G2 ) 31 0.09677419 0.3005372 same background. The assignments Total 65 0.09230769 0.2917125 in each case were worked out by Round III the subjects in the exam-like con- ditions, i.e. without any communi- Version size mean sd cation with their colleagues solv- V. A first (G 1 ) 17 0.2941176 0.4696682 ing the same assignment. There- V. B first (G2 ) 17 0.1764706 0.3929526 fore, we may assume that the mea- Total 34 0.2352941 0.4305615 sures for each of the assignment were taken independently. At last, the final Fig. 4. The basic statistics of the experiment. scores were numeric so they fulfil the assumption of comparability of responses. 4.1 General Overview of Results Round I Version score 0 score 1 The overall number of subjects who took part V. A first (G1 ) 36 7 in Round I of the experiment was 76. Out of V. B first (G2 ) 33 0 the number, 33 people got the assignment leaf with version A of the proof on the left-hand Round II side and 43 people got the assignment with Version score 0 score 1 version B on the left-hand side. In the case V. A first (G ) 28 3 1 of Round II the total number of subjects was V. B first (G ) 31 3 2 65 with 34 assignments where version A of the proof was on the left-hand side and 31 Round III where version B was there. In Round III the Version score 0 score 1 total number of subjects was 34 and each of V. A first (G1 ) 14 3 the groups had 17 people. We have to note V. B first (G2 ) 12 5 here that the uneven number of subjects in the two groups does not prevent applicability Fig. 5. The distribution of the as- signed scores. 5 The small difference in the formulation of the hypotheses H0a , HA a and H0f , HA f is probabilistically negligible, but it is easier to handle in computations by R. The impact of proof steps sequence on proof readability. . . 11 of the Mann-Whitney statistical test, it only impairs its accuracy. The basic statistics of the groups are presented in Fig. 4. We can see the sizes of the groups (column size) as well as the mean score (column mean) and standard deviation of the scores (column sd). We immediately see that the mean of the score in the group G1 during Round I is 0. This is due to the fact that none of the subjects returned the assignment with indicated version A on the right-hand side. We can confirm this by looking in the table displayed in Fig. 5 where the numerical scores of all the six subgroups are presented. 4.2 First Text Is Favoured Round p-value Round I 3.208984 · 10−14 We evaluate first the hypotheses H0f ver- Round II 2.482325 · 10−12 f sus HA as they give the background for Round III 0.001467528 the whole experiment. For this we use Stouffer-Lipták 0 a more direct binomial test based upon the Bernoulli distribution. The p-values for this Fig. 6. The p-values for the test on test are 3.208984 · 10−14 , 2.482325 · 10−12 , which side is favoured. and 0.001467528 for the samples taken in Round I, Round II, and Round III respectively. The combined p-value obtained with the Stouffer-Lipták method is so small that it is below the precision range of our tools. Therefore the overall result is 0.6 All the figures are gathered in the table presented in Fig. 6. Sum- ming up, we obtained a prevalent evidence that the left hand side is favoured in such comparison tests. 4.3 Readability Assessment Round Z-value p-value The picture in the readability as- Round I 2.41645 0.007836335 sessment is more complicated. Al- Round II 0.117872 0.4530845 ready a look at the table in Fig. 5 Round III −0.7966275 0.7871663 reveals that in Round I the prefer- Stouffer-Lipták 1.31315 0.0945662 ence for version B was strong, in Round II the scores were even and Fig. 7. The statistics of the readability ex- in Round III turned in the opposite periment. direction. This is reflected in obtained Z-values, which are 2.41645, 0.117872 and −0.7966275 respectively. The negative value of the second statistics reflects the a fact that we are closer to rejecting HA . The p-values obtained in the tests are 0.007836335, 0.4530845 and 0.7871663 respectively.7 This shows that the first test gives a strong statistically significant result, which supports our claim that the proofs in version A format can give cognitive advantage at the statistically significant level, actually the significance here is higher than 99%. This evidence is so strong that even when we combine the three rounds with Stouffer-Lipták 6 The results were obtained with help of the standard R function binom.test. 7 The results were obtained with help of the R package coin [5,6] with its heuristics mid-ranks to handle ties in input data. 12 Karol Pąk and Aleksy Schubert test we obtain result of reasonable significance at the level over 90%. These results are summarised in the table presented in Fig. 7. 5 Discussion The fact that the left-hand side of tests devised in our experiment is strongly favoured has very strong support in our data. The case of readability is more curious. We can see that Round I resulted in strong preference for version B of the proof, which is more convoluted. However slight changes to the conditions in Rounds II and III gave a picture that tends to support the opposite claim. We can observe that Round I (during an exam) was likely more stressful than Round III, and this could make the short-term memory less effective. As a result, the bigger number of labels present in version B could turn out to be advantageous for the understanding process. Some of the subjects shared with us this opinion on their assessments. However, this line of argument is not supported by the result of Round II, which was also taken under exam conditions. One more phenomenon that can also play role here is that the proof in version A indeed more often referred to the previous step. However, each step depends not only on the directly preceding one, but also on some other ones. In the case of version A we have a step (line 8. in Fig. 1) which required subjects to refer 3 steps back, while the proof in version B required subjects to refer at most 2 steps back. Further investigations are necessary to check which of the reasons actually holds. However, this investigation shows that the tools to improve read- ability based upon the principle used here should be used cautiously. Readability is a feature that depends highly on individual abilities of proof readers so the tools to improve it should not enforce any fixed method of proof improvement, but rather offer a number possibilities that can be chosen by users. The subjects in this experiment actually were not trained in extensive reading of mathematical proofs. We can say that they were accidental proof readers. This experiments showed that they prefer proofs in version B. Still, experienced users, as implied by our initial expectations, seem to prefer proofs structured as in version A. It may be that the experienced users learn this preference as they gain experience. This conjecture is also worth further investigation. 6 Acknowledgements A number of people helped us to execute the experiment. We would like to thank professors Jerzy Tyszkiewicz from University of Warsaw and Krzysztof Prażmowski from University of Białystok for allowing us to take the test during exams of their classes. The tests were also part of classes conducted by Daria Walukiewicz-Chrząszcz, Jacek Chrząszcz, and Piotr Wasilewski so we are also grateful for their help. These people were also very kind to discuss with us the design of the experiment which helped us in its better organisation. The impact of proof steps sequence on proof readability. . . 13 References 1. Otto Behaghel. Beziehungen zwischen umfang und reihenfolge von satzgliedern. Indogermanische Forschungen, (29):110–142, 1909. 2. Georges Gonthier. Formal proof the four-color theorem. Notices of the AMS, 55(11):1382–1393, 2008. 3. Adam Grabowski, Artur Korniłowicz, and Adam Naumowicz. Four decades of Mizar. Journal of Automated Reasoning, 55(3):191–198, October 2015. 4. Adam Grabowski and Christoph Schwarzweller. Revisions as an essential tool to maintain mathematical repositories. In Proceedings of the 14th Symposium on Towards Mechanized Mathematical Assistants: 6th International Conference, Calculemus ’07 / MKM ’07, pages 235–249, Berlin, Heidelberg, 2007. Springer- Verlag. 5. Torsten Hothorn, Kurt Hornik, Mark A. van de Wiel, and Achim Zeileis. Im- plementing a class of permutation tests: The coin package. Journal of Statistical Software, 28(8), 2008. 6. Torsten Hothorn, Kurt Hornik, Mark A. van de Wiel, and Achim Zeileis. Package ‘coin’: Conditional Inference Procedures in a Permutation Test Framework. CRAN, 2013. 7. T. Lipták. On the combination of independent tests. Magyar Tud Akad Mat Kutato Int Közl., 3:171–196, 1958. 8. Adam Naumowicz and Czesław Byliński. Improving Mizar texts with properties and requirements. In Andrea Asperti, Grzegorz Bancerek, and Andrzej Trybulec, editors, Mathematical Knowledge Management, Third International Conference, MKM 2004 Proceedings, volume 3119 of MKM’04, Lecture Notes in Computer Science, pages 290–301, 2004. 9. Karol Pąk. The Algorithms for Improving and Reorganizing Natural Deduction Proofs. Studies in Logic, Grammar and Rhetoric, 22(35):95–112, 2010. 10. Karol Pąk. Improving Legibility of Natural Deduction Proofs is Not Trivial. Logical Methods in Computer Science, 10(3), 2014. 11. Karol Pąk. Improving Legibility of Formal Proofs Based on the Close Reference Principle is NP-Hard. Journal of Automated Reasoning, 55(3):295–306, 2015. 12. Ernst Snapper. The three crises in mathematics: Logicism, intuitionism and for- malism. Mathematics Magazine, 52(4):207–216, 1979. 13. Makarius Wenzel. The Isabelle/Isar Reference Manual. University of Cambridge, 2013. A English Translations of the Tests Used in the Experiment In each of the two versions of the experiment the assignment text and the for- mulation of the statement repeat on each side of the page. Only the proof part is different. We give the repeating parts only once in this translation. A.1 Version Used in January 2015 Assignment You are presented two proofs of the same fact. Both have the same flaw. Choose the version of the reasoning for which you can easier perform the following task: please show the place where the flaw is located and describe in one sentence what is the reason of the mistake. 14 Karol Pąk and Aleksy Schubert Statement (X ∪ Y )\(X ∩ Y ) = (X\Y ) ∩ (Y \X) Proof (version A) We show first that for each x the equivalence holds x ∈ (X ∪ Y )\(X ∩ Y ) if and only if x ∈ (X\Y ) ∩ (Y \X). For the proof from left to right: – A1: Let us take any x ∈ (X ∪ Y )\(X ∩ Y ). – We immediately obtain that x 6∈ (X ∩ Y ) (def. of set difference), – A2: and further that x 6∈ X or x 6∈ Y (def. of set intersection). – Additionally x ∈ X or x ∈ Y (see A1, def. of set sum) – therefore x ∈ (X\Y ) or x ∈ (Y \X) (see A2, def. of set difference), – which gives the expected x ∈ (X\Y ) ∩ (Y \X) (def. of set intersection). For the proof from right to left: – Let us take any x ∈ (X\Y ) ∩ (Y \X). – We immediately obtain that x ∈ (X\Y ) or x ∈ (Y \X) (def. of set intersec- tion), – A4: and further x ∈ X and x 6∈ Y or x ∈ Y and x 6∈ X (def. of set difference), – and further x ∈ (X ∪ Y ) (def. of set sum). – Additionally x 6∈ (X ∩ Y ) (see A4, def. of set intersection) – which gives the expected x ∈ (X ∪ Y )\(X ∩ Y ) (def. of set difference). From the proved equivalence and definition of set equality we immediately obtain the goal statement. Proof (version B) We show first that for each x the equivalence holds x ∈ (X ∪ Y )\(X ∩ Y ) if and only if x ∈ (X\Y ) ∩ (Y \X). For the proof from left to right: – A1: Lat us take any x ∈ (X ∪ Y )\(X ∩ Y ). – A2: We immediately obtain that x 6∈ (X ∩ Y ) (def. of set difference), – A3: Additionally x ∈ X or x ∈ Y (see A1, def. of set sum) – Observe that x 6∈ X or x 6∈ Y (see A2, def. of set intersection) – Consequently x ∈ (X\Y ) or x ∈ (Y \X) (see A3, def. of set difference), – which gives the expected x ∈ (X\Y ) ∩ (Y \X) (def. of set intersection). For the proof from right to left: – Let us take any x ∈ (X\Y ) ∩ (Y \X). – We immediately obtain that x ∈ (X\Y ) or x ∈ (Y \X) (def. of set intersec- tion), – A3: and further x ∈ X and x 6∈ Y or x ∈ Y and x 6∈ X (def. of set difference), – and further x ∈ (X ∪ Y ) (def. of set sum). – Additionally x 6∈ (X ∩ Y ) (see A3, def. of set intersection) – which gives the expected x ∈ (X ∪ Y )\(X ∩ Y ) (def. of set difference). From the proved equivalence and definition of set equality we immediately obtain the goal statement. A.2 Version Used in January 2016 Assignment You are presented two proofs of the same fact. Find in them as many flaws as you can. In case a mistake occurs in both proofs, mark with a star The impact of proof steps sequence on proof readability. . . 15 the version in which you found it first. Is the structure of some of the proofs more readable for you? Statement (X ∪ Y )\(X ∩ Y ) = (X\Y ) ∩ (Y \X) Proof (version A) We show first that for each x the equivalence holds x ∈ (X ∪ Y )\(X ∩ Y ) if and only if x ∈ (X\Y ) ∩ (Y \X). For the proof from left to right: – [1] Let us take any x ∈ (X ∪ Y )\(X ∩ Y ). – We immediately obtain that x 6∈ (X ∩ Y ) (def. of set difference), – [2] and further that x 6∈ X or x 6∈ Y (def. of set intersection). – Additionally x ∈ X or x ∈ Y (see [1], def. of set sum) – therefore x ∈ (X\Y ) or x ∈ (Y \X) (see [2], def. of set difference), – summing up, we obtain in the end that x ∈ (X\Y ) ∩ (Y \X) (def. of set intersection). For the proof from right to left: – Let us take any x ∈ (X\Y ) ∩ (Y \X). – We immediately obtain that x ∈ (X\Y ) or x ∈ (Y \X) (def. of set intersec- tion), – [3] and then x ∈ X ∧ x 6∈ Y or x ∈ Y ∧ x 6∈ X (def. of set difference), – [4] and then x ∈ (X ∪ Y ) (def. of set sum). – Additionally x 6∈ (X ∩ Y ) (see [3], def. of set intersection) – summing up, we obtain in the end that x ∈ (X ∪ Y )\(X ∩ Y ) (see [4] and def. of set difference). From the proved equivalence and definition of set equality we immediately obtain the goal statement. Proof (version B) We show first that for each x the equivalence holds x ∈ (X ∪ Y )\(X ∩ Y ) if and only if x ∈ (X\Y ) ∩ (Y \X). For the proof from right to left: – Let us fix arbitrary x ∈ (X\Y ) ∩ (Y \X). – From this, we obtain that x ∈ (X\Y ) or x ∈ (Y \X) (def. of set intersection), – [1] and then x ∈ X ∧ x 6∈ Y or x ∈ Y ∧ x 6∈ X (def. of set difference), – [2] and then x ∈ (X ∪ Y ) (def. of set sum). – Except from that x 6∈ (X ∩ Y ) (see [1], def. of set intersection) – which gives the expected x ∈ (X ∪ Y )\(X ∩ Y ) (see [2] and def. of set difference). For the proof from left to right: – [3] Let us fix arbitrary x ∈ (X ∪ Y )\(X ∩ Y ). – [4] From this we obtain that x 6∈ (X ∩ Y ) (def. of set difference), – [5] Additionally x ∈ X or x ∈ Y (see [3], def. of set sum) – Observe that x 6∈ X or x 6∈ Y (see [4], def. of set intersection). – therefore x ∈ (X\Y ) or x ∈ (Y \X) (see [5], def. of set difference), – which gives the expected x ∈ (X\Y ) ∩ (Y \X) (def. of set intersection). From the proved equivalence and definition of set equality we immediately obtain the goal statement.