High Stakes Automatic Assessments: Developing an Online Linear Algebra Examination

Christopher J. Sangwin
School of Mathematics, University of Edinburgh, Edinburgh
C.J.Sangwin@ed.ac.uk

Abstract

In this paper I investigate the development of an automatically marked online version of a current paper-based examination for a university mathematics course, and the extent to which the outcomes are equivalent to those of a paper-based exam. An online examination was implemented using the STACK online assessment tool, which is built using computer algebra and in which students' answers are normally typed expressions. The study group was 376 undergraduates taking a year 1 Introduction to Linear Algebra course. The results of this experiment are cautiously optimistic: a significant proportion of current examination questions can be automatically assessed.

1 Introduction

To what extent can we produce an automatically marked online version of a current paper-based examination for methods-based university mathematics courses using contemporary technology? To what extent are the outcomes of this exam equivalent to the outcomes of a paper-based exam? And more speculatively: if we automate exams are we perpetuating "incantation" or moving towards "enlightenment"?

In this paper I report a pilot to develop, use, and evaluate an online examination for a university linear algebra course. The STACK online assessment tool [San13] is built using computer algebra, and students' interactions move significantly beyond multiple choice questions with their well-known difficulties for mathematics, see [SJ17]. In particular, STACK uses the computer algebra system Maxima to generate random questions; interpret students' typed algebraic expressions; establish objective mathematical properties of students' answers; and assign outcomes such as feedback and marks.

My work is based upon the epistemological position that to successfully automate a process it is necessary to understand it profoundly. It follows that automation of a process necessitates the development of a certain kind of understanding, and we learn a lot about the underlying process through automation.

Assessment provides students with challenge, interest and motivation: assessment is a key driver of students' activity in education. Assessment often defines the course of study, and even defines the nature of the subject itself. To many students mathematics is defined in large part by what we expect students to do in examinations, [Bur87]. Online assessment has for many years been used widely in formative settings, see [San13]. Developing an online automatic high-stakes final examination is a natural extension of the automation of formative assessment.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: O. Hasan, W. Neuper, Z. Kovács, W. Schreiner (eds.): Proceedings of the Workshop CME-EI: Computer Mathematics in Education - Enlightenment or Incantation, Hagenberg, Austria, 17-Aug-2018, published at http://ceur-ws.org

Figure 1: Question 19 of the current study in STACK

Automation also has practical benefits, reducing the marking load and potentially speeding up examination processes. However, changing written examinations, with centuries of custom and practice, is a high-stakes and high-risk undertaking. My previous joint work [SK16] examined questions set in school-level examination papers with a view to developing automatically marked online versions. The results of [SK16] were cautiously optimistic that a significant proportion of current questions could be automatically assessed. In this paper I extend this work, create examination questions, and trial their use with a large group of university students.
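To make the pattern described above concrete, the following is a minimal sketch of the kind of check a CAS-backed system performs: generate a random question variant, interpret a typed answer, test algebraic equivalence, and assign a mark with feedback. It is illustrative only: STACK itself carries out these steps with question variables, input validation and answer tests written in Maxima, not in Python, and the question, function names and feedback strings below are invented for this illustration.

    # Illustrative sketch only: STACK is implemented with the Maxima CAS and this
    # is not its code. SymPy is used purely to show the pattern: generate a random
    # variant, parse the typed expression, test a property, assign an outcome.
    import random

    import sympy as sp
    from sympy.parsing.sympy_parser import (parse_expr, standard_transformations,
                                            convert_xor)

    x = sp.Symbol('x')
    TRANSFORMS = standard_transformations + (convert_xor,)  # lets students type ^

    def generate_variant(seed):
        """Produce one random question variant and its model answer."""
        rng = random.Random(seed)
        a, b = rng.randint(2, 9), rng.randint(2, 9)
        question = f"Expand ({a}x + {b})^2."
        model_answer = sp.expand((a * x + b) ** 2)
        return question, model_answer

    def grade(typed_answer, model_answer):
        """Parse the typed answer and test algebraic equivalence with the model."""
        try:
            ans = parse_expr(typed_answer, local_dict={'x': x},
                             transformations=TRANSFORMS)
        except Exception as err:  # a real system reports specific syntax errors
            # 'Validity' feedback: the attempt is rejected, not penalised.
            return None, f"Your answer could not be interpreted: {err}"
        if sp.simplify(ans - model_answer) == 0:
            return 1, "Correct."
        return 0, "Your answer is not equivalent to the expected expression."

    question, model = generate_variant(seed=42)
    print(question)
    print(grade(str(model), model))   # a correct attempt, typed in full
    print(grade("x^2 + 1", model))    # an incorrect but valid attempt
    print(grade("2x+", model))        # an invalid attempt triggers validity feedback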
2 Methodology

For this study I added a mock online examination to Introduction to Linear Algebra (ILA). This is a year 1, semester 1, mathematics course worth 20 credits, taken by mathematics, computer science and other undergraduate students. Students normally take 120 credits per year, over two semesters. The course is defined by [Poo11], Chapters 1 to 6.2, with a selection of the applications included and selected topics omitted. ILA had over 600 students, of whom 578 took the final written examination and had a non-zero examination mark.

Students had requested exam practice, but it was impractical to administer and mark students' attempts (approximately 35 person-days for the genuine exam) in the short period between the end of teaching and the scheduled examination. In this context, a mock examination was likely to be taken seriously by a significant proportion of the student cohort as a valuable practice and learning opportunity. Since the mock examination did not contribute to the overall course grade there was no incentive for students to cheat, or to be impersonated. Introduction to Linear Algebra has an "open book" examination, and so possible access to materials is less of a threat to this experiment than would be the case for a closed-book examination. The lack of certainty over who was sitting the online tests, the circumstances of participation, the potential use of internet resources and so on is certainly a compromise. Such uncertainty does not affect the extent to which I could produce questions at a technical level, or the effectiveness of the scoring mechanism in the face of students' attempts. The results consist of a report on the extent to which current questions can be faithfully automated, and I give a preliminary report on students' attempts.

3 Results

The existing paper-based ILA examination takes 180 minutes and consists of Section A: compulsory questions totalling 40 marks, and Section B: four questions each worth 20 marks, from which we take the student's best three marks. Students may use any standard scientific calculator, but graphical calculators with matrix functions are not permitted.

Figure 2: Histograms of achievement in the online mock and paper-based examinations (panels: online exam, study group paper, paper examination; frequency against grade)

The primary goal was to provide students with an online examination which was as close as possible to the forthcoming paper-based summative course examination. ILA has been running for many years, with a stable (but not invariant) syllabus, and I had access to examinations going back to December 2011 (two per year: the main exam and an equivalent resit paper). I therefore decided to remove the oldest exam papers from easy access through the course website and base the online examination on those questions. Using as few papers as possible helps provide a representative online examination. Technically it is difficult to operate a "best 3 out of 4" mark scheme in the STACK online system, and in any case this makes little sense for a formative mock exam. In deciding how to allocate marks I have taken a strict interpretation.
Specifically, where the original intention of the examiners included "with justification", I awarded only a minimum number of marks online for giving the answer alone. For example, Q5 on our online exam asked the following.

5. Is it possible for A and B to be 3 × 3 rank 2 matrices with AB = 0? True/False.

The original paper awarded 7 marks for the answer and justification, whereas online only one mark was awarded for the correct answer. I did ask students to provide typed free-text justifications, even though these would not be marked and no feedback was provided.

Ultimately I used two papers (120 marks each) to create the online exam, with 59 marks of the online exam coming from Dec-11 and 50 marks from Aug-12. I took one question from Dec-13 to add a mark to Section A, making the online exam total 110 marks. Of the paper-based questions selected for the online exam, 44 marks are not awarded online. These missing marks are for justification which cannot, at this time, be automatically assessed. This resulted in Section A having fewer marks than would be the case with a paper-based submission. Of the 240 marks available on the Dec-11 and Aug-12 papers, 109/240 marks (45%) were automated in a way faithful to the original examinations. However, the online versions do lack some partial credit and do not (in this experiment) implement follow-on marking, which in some Section B questions is substantial.

An example question is shown in Figure 1, illustrating "validity" feedback which was available during the exam. Validity feedback is normally available to a student, and provides information on syntax errors and other input problems, helping reduce the extent to which students are penalized on a technicality. For ILA, online coursework quizzes were already implemented using STACK. All students were expected to sit 30 online quizzes using the STACK system as part of the ILA course before the mock examination, and so would be thoroughly familiar with how to enter answers into the system.

The online examination was made available to students to do in their own time for a period of one week in December 2017, between the end of formal teaching and the scheduled paper-based exam. Students could choose when to sit the online examination, but were given one attempt of 180 minutes to do so, to simulate examination practice. All data was downloaded from the online STACK system and, after ratification by the exam board, combined with overall achievement data. Students were assigned a unique number to ensure anonymity, and the data loaded into R-studio for analysis.

There were 395 attempts at the mock online exam in December 2017. One student who was granted a second attempt for technical reasons had their first attempt disregarded, giving 394 attempts. There were no other significant technical problems affecting the conduct of the online examination. For the online exam (including those who scored zero) the mean grade was 47.9% with a standard deviation of 23.2%. The coefficient of internal consistency (Cronbach's alpha) for the online exam was 0.87. There was a moderate positive correlation between time taken (M = 132 mins, SD = 48.6 mins) and the online exam result (M = 47.9%, SD = 23.2%), r(392) = 0.517, p < 10^-16, as might be expected. Despite a small number of outlier questions, the mock online exam appears to have operated successfully in its own right as a test.
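The reliability and correlation statistics quoted above are standard. The analysis for this paper was carried out in R; purely as an illustration of the computation, the sketch below shows the same two statistics in Python from exported attempt data. The file name and column names (mock_attempts.csv, time_mins, grade_pct, item columns beginning with "q") are hypothetical.

    # Illustrative sketch of the reliability and correlation statistics; the
    # actual analysis was done in R, and the data layout here is assumed.
    import pandas as pd
    from scipy import stats

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    attempts = pd.read_csv("mock_attempts.csv")              # one row per attempt
    item_cols = [c for c in attempts.columns if c.startswith("q")]

    alpha = cronbach_alpha(attempts[item_cols])
    r, p = stats.pearsonr(attempts["time_mins"], attempts["grade_pct"])

    print(f"Cronbach's alpha = {alpha:.2f}")
    print(f"Pearson r = {r:.3f}, p = {p:.2g}")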
Figure 3: Online mock exam grades (vertical axis) against paper exam grades (horizontal axis) for the study group, with the fitted linear model y = 0.727x + 0.727, R^2 = 0.3513

Question  Responses  Length (characters)
Q4        276        M = 187, SD = 167
Q6        211        M = 125, SD = 117
Q8        242        M = 204, SD = 155
Q10       249        M = 128, SD = 124
Q12       220        M = 201, SD = 194
Q15       227        M = 142, SD = 145
Q22       213        M = 173, SD = 145
Q26       194        M = 67,  SD = 69.4

Table 1: The number of free-text responses to each question, together with the mean number of characters M and the standard deviation of the response length

The final mark for ILA is made up of coursework (20%) and a final paper-based exam (80%). There were 394 attempts at the online mock examination, and all but one of these students also sat the paper-based examination. Note that 17 students scored 0 for the online exam, perhaps indicating students who looked at the online questions but made no serious attempt at them. Technically there is a difference between students who never sat the online exam, and those who opened the exam and scored 0. For the analysis I excluded the 17 students who scored 0 in the online exam: this leaves a study group of N = 376 students with both paper and mock exam information. For the study group, the online exam results had (M = 50.2, SD = 21.3) and the paper exam (M = 68.0, SD = 17.3). For all students who sat the ILA paper exam (M = 63.1, SD = 21.6). Histograms of achievement in the online mock and paper-based examinations are shown in Figure 2. There is a significantly larger failure rate (score less than 40%) in the online examination, and a significantly lower mean. These differences could be explained by the level of engagement: the online exam carried no credit, and students may have lost motivation when tired.

A scatter plot of the online mock exam grades against paper exam grades is shown in Figure 3, together with a linear regression model. The dashed line shows the (ideal) linear relationship in which the online mock examination has identical outcomes to the paper-based exam. Notice the online exam scores are clearly below those of the paper exam, supporting the hypothesis that students may have lost motivation when tired and not performed to their full potential in the online mock exam. The mock exam grades and paper exam grades were moderately correlated, r(374) = 0.593, p < 10^-15.

The number of non-empty free-text responses to each of the "justify" questions is given in Table 1, together with the mean and standard deviation of the response length (in number of characters). It is clear from reading through the free-text responses that over 200 students took the exercise seriously, providing sensible (and often correct) justifications in good English. For the Section A questions on paper there were 59 marks available, whereas in the STACK exam only 24 marks were awarded. I did not expect students to make serious use of the free-text entry. The fact that students entered sensible justifications to many of these questions, and received no marks, could easily account for the difference in mean scores between the paper-based and online exams. There were a large number of empty responses (as there are on paper as well), together with some incoherent utterances, and some plaintive messages. I did not assess these free-text responses, or subject them to comprehensive analysis for the purposes of this paper. However, in a genuine online examination such responses could be assessed (1) manually in the traditional way on-screen, (2) using automatic assessment technology such as that described in [BJ10, Jor12], or (3) using comparative judgement for longer passages, see [JSP14, Pol12].
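The comparison in Figure 3 is a routine regression of matched grades after excluding the zero-score online attempts. As with the earlier statistics, the analysis was performed in R; the sketch below is an illustrative Python equivalent, with a hypothetical file and column names (matched_grades.csv, online_pct, paper_pct).

    # Illustrative sketch of the Figure 3 analysis: drop zero-score online
    # attempts, then regress online mock grade on paper exam grade.
    import pandas as pd
    from scipy import stats

    results = pd.read_csv("matched_grades.csv")       # one row per matched student
    study = results[results["online_pct"] > 0]        # exclude zero online scores

    fit = stats.linregress(study["paper_pct"], study["online_pct"])
    print(f"online = {fit.slope:.3f} * paper + {fit.intercept:.3f}")
    print(f"R^2 = {fit.rvalue**2:.4f}, r = {fit.rvalue:.3f}, p = {fit.pvalue:.2g}")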
4 Discussion

The implementation of the mock online examination for linear algebra was a modest success. There were no serious technical problems during its conduct, and no students complained of inaccurate or unfair marking. The results of the online examination were broadly comparable with a paper-based exam, with the consistently lower online performance explained by a combination of (1) potential disengagement in a low-stakes setting, (2) lack of assessment of students' justification, and (3) lack of partial credit and follow-through marking. Both partial credit and follow-through marking are technically possible in STACK, but are expensive (in staff time) to implement. The results give us confidence to use such assessments in higher-stakes settings in the future.

This research has done nothing to address serious practical problems associated with online examinations. Problems include the need for invigilation to reduce plagiarism and impersonation, and security to eliminate communication during the exam (such as answer sharing) or access to unauthorised resources. These examination conduct problems must be solved, but they have nothing to do with mathematics.

Automatic assessment is an area of mathematics which would particularly benefit from tools which automate explanation, justification and reasoning. In particular, "proof checking" software, as applied to students' understanding, is necessary to move beyond assessing only a final answer, as shown in Figure 1, to a full mathematical answer. In this study only students' final answers were subject to automatic assessment, which is a serious limitation. However, progress is being made to assess working, especially in the area of reasoning by equivalence as discussed briefly in [SK16].
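To make the idea of reasoning by equivalence concrete, the sketch below checks a short piece of typed working line by line, accepting each step only if it has the same real solution set as the previous line. This is an illustration of the principle only, written in Python with SymPy; the corresponding functionality in STACK works inside Maxima and handles a far wider range of steps, and the example working is invented.

    # Minimal sketch of line-by-line checking of 'reasoning by equivalence':
    # each line of working should be an equation equivalent to the previous one.
    import sympy as sp

    x = sp.Symbol('x')

    def parse_line(line: str) -> sp.Eq:
        """Parse a typed line such as '2*x - 2 = 8' into a SymPy equation."""
        lhs, rhs = line.split("=")
        return sp.Eq(sp.sympify(lhs), sp.sympify(rhs))

    def equivalent(eq1: sp.Eq, eq2: sp.Eq) -> bool:
        """Check that the two equations have the same real solution set in x."""
        return (sp.solveset(eq1, x, domain=sp.S.Reals)
                == sp.solveset(eq2, x, domain=sp.S.Reals))

    working = ["2*(x - 1) = 8", "2*x - 2 = 8", "2*x = 10", "x = 5"]
    equations = [parse_line(line) for line in working]

    for step, (prev, curr) in enumerate(zip(equations, equations[1:]), start=1):
        ok = equivalent(prev, curr)
        print(f"line {step} -> line {step + 1}: {'equivalent' if ok else 'NOT equivalent'}")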
I was surprised at the large extent to which existing questions could be faithfully automatically assessed with the current tools, based on computer algebra. However, there is nothing sacrosanct about current examination questions. Why should the online examination be exactly the same as a paper-based examination? Current questions are written explicitly for the paper-based format, and it is sensible to seek to write questions which take advantage of the online format as appropriate. Many Section A questions are true/false, but the justification of a "false" response is via appropriate examples. Computer algebra is ideally suited to assessing answers, such as counterexamples, which expect the teacher to perform some time-consuming and error-prone calculation. For this research I did not rephrase questions as "give me examples, such that ...", but this would be one option.

This analysis raises the question of whether we, as a mathematics community, believe current mathematics examinations are a valid test of mathematical achievement. Do current examinations actually represent valid mathematical practice, as undertaken by researchers, industrial mathematicians and for pure recreation as an intellectual pursuit? Construct validity is a central educational concern, but it is not relevant to the research question of whether we can actually automate current exams.

My personal views about the nature of mathematics broadly align with those expressed in [Pol54] and [Lak76]. That is, that setting up abstract problems and solving them lies at the heart of mathematics. [Pol62] identified four patterns of thought to help structure thinking about solving mathematical problems. His "Cartesian" pattern is one where a problem is translated into a system of equations and solved using algebra. Note that the algebraic manipulation is the technical middle step in the process: setting up the equations and interpreting the solutions are essential parts of completing this pattern. My previous work [SK16] examined questions set in school-level examination papers and found that line-by-line algebraic reasoning, termed reasoning by equivalence [NBC04], is the most important single form of reasoning in school mathematics. However, many examination questions do not relate to a problem at all; rather, they instruct students to undertake a well-rehearsed set of techniques, isolated from any problem. Many of the questions in the ILA examinations also rely on predictable methods which can be well rehearsed. Predictable methods predominate in school examinations, such as those considered in my previous research in [SK16]. Current examinations tend towards "incantation" by students, and there is a real danger that national examination boards, universities, and others with responsibilities for examinations will replicate traditional examinations online without a critical reassessment of the purpose of mathematics education.

5 Conclusion

The increasing use of software tools in online assessment will affect mathematics education. It is likely that automatic online examinations for mathematics in school, as well as for some methods-based university courses, will become feasible and will be used in the very near future. A pragmatic combination of computer algebra supported assessment and automatic assessment of short-answer questions will assess a significant proportion of current questions automatically. Traditional expert marking and the use of comparative judgement will potentially widen the scope of exams at the expense of complete automation. A further pragmatic approach will be to split courses into two summative assessment components: largely skill-based questions can be automatically assessed online, with the justification and rhetorical discussion in a traditional written examination. Replication of traditional examinations online without a critical reassessment of the purpose of mathematics education would be a wasted opportunity to define the subject through valid assessments.

Acknowledgements

The online questions were created with the help of Dr Konstantina Zerva, of the University of Edinburgh.

References

[BJ10] P. G. Butcher and S. E. Jordan. A comparison of human and computer marking of short free-text student responses. Computers and Education, 55(2):489–499, September 2010.

[Bur87] H. Burkhardt. What you test is what you get. In I. Wirszup and R. Streit, editors, The Dynamics of Curriculum Change in Developments in School Mathematics Worldwide. University of Chicago School Mathematics Project, 1987.

[Jor12] S. Jordan. Student engagement with assessment and feedback: Some lessons from short-answer free-text e-assessment questions. Computers and Education, 58(2):818–834, 2012.

[JSP14] I. Jones, M. Swan, and A. Pollitt. Assessing mathematical problem solving using comparative judgement. International Journal of Science and Mathematics Education, 13(1):151–177, 2014.

[Lak76] I. Lakatos. Proofs and Refutations. Cambridge University Press, 1976.
[NBC04] J. F. Nicaud, D. Bouhineau, and H. Chaachoua. Mixing microworlds and CAS features in building computer systems that help students learn algebra. International Journal of Computers for Mathematical Learning, 9(2):169–211, 2004.

[Pol54] G. Polya. Mathematics and Plausible Reasoning. Vol. 1: Induction and Analogy in Mathematics. Vol. 2: Patterns of Plausible Inference. Princeton University Press, 1954.

[Pol62] G. Polya. Mathematical Discovery: On Understanding, Learning, and Teaching Problem Solving. Wiley, London, UK, 1962.

[Pol12] A. Pollitt. The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3):281–300, 2012.

[Poo11] D. Poole. Linear Algebra: A Modern Introduction. Brooks/Cole, Cengage Learning, third edition, 2011.

[San13] C. J. Sangwin. Computer Aided Assessment of Mathematics. Oxford University Press, Oxford, UK, 2013.

[SJ17] C. J. Sangwin and I. Jones. Asymmetry in student achievement on multiple choice and constructed response items in reversible mathematics processes. Educational Studies in Mathematics, 94:205–222, 2017.

[SK16] C. J. Sangwin and N. Köcher. Automation of mathematics examinations. Computers and Education, 94:215–227, 2016.