=Paper=
{{Paper
|id=Vol-2481/paper48
|storemode=property
|title=Automated Short Answer Grading: A Simple Solution for a Difficult Task
|pdfUrl=https://ceur-ws.org/Vol-2481/paper48.pdf
|volume=Vol-2481
|authors=Stefano Menini,Sara Tonelli,Giovanni De Gasperis,Pierpaolo Vittorini
|dblpUrl=https://dblp.org/rec/conf/clic-it/MeniniTGV19
}}
==Automated Short Answer Grading: A Simple Solution for a Difficult Task==
Stefano Menini†, Sara Tonelli†, Giovanni De Gasperis‡, Pierpaolo Vittorini‡
† Fondazione Bruno Kessler (Trento), ‡ University of L'Aquila
{menini,satonelli}@fbk.eu, {giovanni.degasperis,pierpaolo.vittorini}@univaq.it

Abstract

English. The task of short answer grading is aimed at assessing the outcome of an exam by automatically analysing students' answers in natural language and deciding whether they should pass or fail the exam. In this paper, we tackle this task by training an SVM classifier on real data taken from a University statistics exam, showing that simple concatenated sentence embeddings used as features yield results around 0.90 F1, and that adding more complex distance-based features leads only to a slight improvement. We also release the dataset, which to our knowledge is the first freely available dataset of this kind in Italian.¹

¹ Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Human grading of open-ended questions is a tedious and error-prone task, a problem that has become particularly pressing when such an assessment involves a large number of students, as in an academic setting. One possible solution to this problem is to automate the grading process, so that it can facilitate teachers in the correction and enable students to receive immediate feedback. Research on this task has been active since the '60s (Page, 1966), and several computational methods have been proposed to automatically grade different types of texts, from longer essays to short text answers. The advantages of this kind of automatic assessment do not concern only the limited time and effort required to grade tests compared with a manual assessment, but include also the reduction of mistakes and bias introduced by humans, as well as a better formalisation of assessment criteria.

In this paper, we focus on tests comprising short answers to natural language questions, proposing a novel approach to binary automatic short answer grading (ASAG). This task has proven particularly challenging because an understanding of natural language is required without much textual context, whereas multiple-choice questions can be assessed straightforwardly, given that there is only one possible correct response to each question. Furthermore, the tests considered in this paper are taken from real exams on statistical analyses, with low variability, a limited vocabulary and therefore little lexical difference between correct and wrong answers.

The contribution of this paper is two-fold: we create and release a dataset for short-answer grading containing real examples, which can be freely downloaded at https://zenodo.org/record/3257363#.XRsrn5P7TLY. Besides, we propose a simple approach that, making use only of concatenated sentence embeddings and an SVM classifier, achieves up to 0.90 F1 after parameter tuning.

2 Related Work

In the literature, several works have been presented on automated grading methods to assess the quality of answers in written examinations. Several types of answers have been addressed, from essays (Kanejiya et al., 2003; Shermis et al., 2010) to code (Souza et al., 2016). Here we focus on works related to short answers, which are the target of our tests. With short answers we refer to open questions, answered in natural language, usually with the length of one paragraph and recalling external knowledge (Burrows et al., 2015). When assessing the grading of short answers we face two main issues: i) the grading itself and ii) the presence of appropriate datasets.

ASAG can be tackled with several approaches, including pattern matching (Mitchell et al., 2002), looking for specific concepts or keywords in the answers (Callear et al., 2001; Leacock and Chodorow, 2003; Jordan and Mitchell, 2009), using bags of words and matching terms (Cutrone et al., 2011), or relying on LSA (Klein et al., 2011). Other solutions rely more heavily on NLP techniques, for example by extracting metrics and features that can be used for text classification, such as the overlap of n-grams or POS tags between the student's and the teacher's answers (Bailey and Meurers, 2008; Meurers et al., 2011). Some attempts have also been made to use the similarity between word embeddings as a feature (Sultan et al., 2016; Sakaguchi et al., 2015; Kumar et al., 2017).

Another aspect that can affect the performance of different ASAG approaches is the target of the automated evaluation. We can for instance assess the quality of the text (Yannakoudakis et al., 2011), its comprehension and summarisation (Madnani et al., 2013), or, as in our case, the knowledge of a specific notion. Each task would therefore need a specific dataset as a benchmark. Other dimensions affecting the approach to ASAG and its performance are the school level for which an assessment is required (e.g. primary school vs. university) as well as its domain, e.g. computer science (Gütl, 2007), biology (Siddiqi and Harrison, 2008) or math (Leacock and Chodorow, 2003). As for Italian, we are not aware of existing automated grading approaches, nor of available datasets specifically released to foster research in this direction. These are indeed the main contributions of the current paper.
3 Task and Data Description

The short answer grading task that we analyse in this paper is meant to automatise part of the exam that students of Health Informatics in the degree course of Medicine and Surgery of the University of L'Aquila (Italy) are required to pass. The exam includes two activities: a statistical analysis in R and the explanation of the results in terms of clinical findings. While the evaluation of the first part has already been automatised through automated grading of R code snippets (Angelone and Vittorini, 2019), the second task had been addressed by the same authors using a string similarity approach, which however did not yield satisfying results. Indeed, they used Levenshtein distance to compute the distance between the students' answer and a gold standard (i.e. correct) answer, but the approach failed to capture the semantic equivalence between the two sentences, focusing only on the lexical one.

For example, an exam provided students with data about surgical operations, subjects, scar visibility and hospital stay, and asked them to compute several statistical measures in R, such as the absolute and relative frequencies of the surgical operations. Then, students were required to comment in plain text on some of the analyses, for example to state whether some data are extracted from a normal distribution. For this second part of the exam, the teacher prepared a "gold answer", i.e. the correct answer. Two real examples from the dataset are reported below (English translations in brackets).

Correct answer pair:
(Student) Poiché il p-value è maggiore di 0.05 in entrambi i casi, la distribuzione è normale, procediamo con un test parametrico per variabili appaiate. [Since the p-value is greater than 0.05 in both cases, the distribution is normal, and we proceed with a parametric test for paired variables.]
(Gold) Siccome tutti i test di normalità presentano un p>0.05, posso utilizzare un test parametrico. [Since all normality tests show p>0.05, I can use a parametric test.]

Wrong answer pair:
(Student) Siccome p<0.05, la differenza fra le due variabili è statisticamente significativa. [Since p<0.05, the difference between the two variables is statistically significant.]
(Gold) Siccome il t-test restituisce un p-value > di 0.05, non posso generalizzare alla popolazione il risultato osservato nel mio campione, e quindi non c'è differenza media di peso statisticamente significativa fra i figli maschi e femmine. [Since the t-test returns a p-value > 0.05, I cannot generalise the result observed in my sample to the population, and therefore there is no statistically significant mean weight difference between male and female children.]

The goal of our task is, given each pair, to train a classifier that labels students' answers as correct or wrong.
An important aspect of our task is that the correctness of an answer is not defined with respect to the question, which is not used for classification. For the moment we also focus on binary classification, i.e. determining whether an answer is correct or not, without providing a numeric score of how correct or wrong it is. With the data organised into student–professor answer pairs, the classification considers i) the semantic content of the answers (represented through word embeddings) and ii) features related to the pair structure of the data, such as the overlap or the distance between the two texts. The adopted features are explained in detail in Section 4.1.

3.1 Dataset

The dataset available at https://zenodo.org/record/3257363#.XR5i8ZP7TLY has been partially collected using data from real statistics exams spanning different years, and partially extended by the authors of this paper. The dataset contains the list of sentences written by students, each with a unique sentence ID, the type of statistical analysis it refers to (either a hypothesis test or a normality test), its degree in a range from 0 to 1, and its fail/pass result, flanked by a manually defined gold standard (i.e. the correct answer). The degree is a numerical score manually assigned to each answer, which takes into account whether an answer is partially correct, mostly correct or completely wrong. Based on this degree, the pass/fail decision was taken, i.e. if degree < 0.6 then fail, otherwise pass.
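The pass/fail labels can thus be derived directly from the released degrees. The following is a minimal sketch of that thresholding step; it assumes a CSV export of the dataset with hypothetical column names (`student_answer`, `gold_answer`, `degree`), which may differ from the actual release.

```python
# Sketch: derive binary pass/fail labels from the annotated degree.
# Column names are illustrative and may differ in the released dataset.
import pandas as pd

PASS_THRESHOLD = 0.6  # degree < 0.6 -> fail, otherwise pass (as described above)

pairs = pd.read_csv("asag_dataset.csv")  # hypothetical export of the Zenodo data
pairs["label"] = (pairs["degree"] >= PASS_THRESHOLD).map({True: "pass", False: "fail"})

print(pairs["label"].value_counts())     # should roughly reflect the 663/406 split
```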
In order to increase the number of training instances and achieve a better balance between the two classes, we manually negated a set of correct answers and reversed the corresponding fail/pass result, adding a set of negated gold standard sentences for a total of 332 new pairs. We also manually paraphrased 297 of the original gold standard sentences, thereby creating additional pairs. Overall, the dataset consists of 1,069 student/gold standard answer pairs, 663 of which are labeled as "pass" and 406 as "fail".

4 Classification framework

Although several works have explored the possibility to automatically grade short text answers, these attempts have mainly focused on English. Furthermore, the best performing ones strongly rely on knowledge bases and syntactic analyses (Mohler et al., 2011), which are hard to obtain for Italian. We therefore test for the first time the potential of sentence embeddings to capture pass or fail judgments in a supervised setting, where the only required data are a) a training/test set and b) sentence embeddings (Bojanowski et al., 2017) trained using fastText (https://fasttext.cc/).

4.1 Method

Since we cast the task in a supervised classification framework, we first need to represent the pairs of student/gold standard sentences as features. Two different types of features are tested: distance-based features, which capture the similarity of the two sentences using measures based on lexical and semantic similarity, and sentence embedding features, whose goal is to represent the semantics of the two sentences in a distributional space.

All sentences are first preprocessed by removing stopwords such as articles and prepositions, and by replacing mathematical notation with its transcription in plain language, e.g. ">" with "maggiore di" (greater than). We also perform part-of-speech tagging, lemmatisation and affix recognition using the TINT NLP Suite for Italian (Aprosio and Moretti, 2018). Then, on each pair of sentences, the following distance-based features are computed (a code sketch is given after the list):

• Token overlap: the number of overlapping tokens between the two sentences, normalised by their length. This feature captures the lexical similarity between the two strings.

• Lemma overlap: the number of overlapping lemmas between the two sentences, normalised by their length. Like the previous one, this feature captures the lexical similarity between the two strings.

• Presence of negations: this feature represents whether a content word is negated in one sentence and not in the other. For each sentence, negations are recognised based on the NEG PoS tag or the affixes 'a-' or 'in-' (e.g. indipendente), and then the first content word occurring after the negation is considered. We extract two features, one for each sentence, and the values are normalised by the sentence length.
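The sketch below illustrates the overlap-style features under simplifying assumptions: whitespace tokenisation and a toy stopword list stand in for the TINT pipeline used in the paper, and the normalisation by the student-answer length is one of several reasonable choices rather than the authors' exact formula.

```python
# Sketch of the lexical distance-based features (token/lemma overlap).
# Real preprocessing in the paper uses the TINT NLP suite; this is a simplified stand-in.
ITALIAN_STOPWORDS = {"il", "la", "le", "i", "di", "e", "in", "con", "un", "una", "per"}

def normalize(text):
    # Replace mathematical notation with plain language, as described above.
    text = text.replace(">", " maggiore di ").replace("<", " minore di ")
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [t for t in tokens if t not in ITALIAN_STOPWORDS]

def overlap(student_tokens, gold_tokens):
    # Shared items normalised by the length of the student answer (illustrative choice).
    if not student_tokens:
        return 0.0
    return len(set(student_tokens) & set(gold_tokens)) / len(student_tokens)

student = normalize("Siccome p<0.05, la differenza fra le due variabili è statisticamente significativa")
gold = normalize("Siccome il t-test restituisce un p-value > di 0.05, non posso generalizzare il risultato")
print("token overlap:", overlap(student, gold))
```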
Other distance-based features are computed at sentence level. To this purpose we employ fastText (Bojanowski et al., 2017), an extension of word embeddings (Mikolov et al., 2013; Pennington et al., 2014) developed at Facebook that is able to deal with rare words by including subword information, representing sentences essentially by combining vectors for both words and subwords. To generate these embeddings we start from the pre-computed Italian language model (https://fasttext.cc/docs/en/crawl-vectors.html) trained on Common Crawl and Wikipedia. The latter, in particular, is suitable for our domain, since it also includes scientific content and statistics pages, so the language of the exam should be well represented in our model. The embeddings are created using continuous bag-of-words with position weights, a dimension of 300, character n-grams of length 5, a window of size 5 and 10 negatives.

Then, the embeddings of the sentences written by the students and of the gold standard ones are created by combining the word and the subword embeddings with the fastText library. Each sentence is therefore represented through a 300-dimensional embedding. Based on this, we extract four additional distance-based features (see the sketch after the list):

• Embeddings cosine: the cosine between the two sentence embeddings. The intuition behind this feature is that the embeddings of two sentences with a similar meaning should be close in the multidimensional space.

• Embeddings cosine (lemmatised): the same feature as the previous one, with the only difference that the sentences are first lemmatised before creating the embeddings.

• Word Mover's Distance (WMD): WMD is a similarity measure based on the minimum distance that the embedded words of one document need to travel to reach the embedded words of another document (Kusner et al., 2015) in a multidimensional space. Compared with other existing similarity measures, it works well also when two sentences have a similar meaning despite having few words in common. We apply this algorithm to measure the distance between the solutions proposed by the students and the ones in the gold standard.

• Word Mover's Distance (lemmatised): the same feature as the previous one, with the only difference that the sentences are first lemmatised before creating the embeddings.
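A minimal sketch of the embedding-based distances follows. It assumes the pre-trained Italian fastText files (cc.it.300.bin for sentence vectors, cc.it.300.vec for word vectors) and uses gensim's Word Mover's Distance implementation; the file names and tooling are illustrative, not necessarily what the authors used.

```python
# Sketch: cosine between fastText sentence embeddings and Word Mover's Distance.
# Assumes the pre-trained Italian vectors from https://fasttext.cc/docs/en/crawl-vectors.html
import fasttext
import numpy as np
from gensim.models import KeyedVectors

ft = fasttext.load_model("cc.it.300.bin")                  # 300-dim subword-aware model
kv = KeyedVectors.load_word2vec_format("cc.it.300.vec")    # word vectors for WMD

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

student = "la distribuzione è normale, procediamo con un test parametrico"
gold = "posso utilizzare un test parametrico"

# Embeddings cosine feature
emb_cosine = cosine(ft.get_sentence_vector(student), ft.get_sentence_vector(gold))

# Word Mover's Distance feature (gensim implementation of Kusner et al., 2015)
wmd = kv.wmdistance(student.split(), gold.split())

print("cosine:", emb_cosine, "WMD:", wmd)
```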
The sentence embeddings used to compute the distance features are also tested as features in isolation: a 600-dimensional vector is created by concatenating the two sentence embeddings composing a student answer – gold standard pair. This representation is then directly fed to the classifier. We adopt this solution inspired by recent approaches to natural language inference using the concatenation of premise and hypothesis (Bowman et al., 2015; Kiros and Chan, 2018).

As for the supervised classifier, we use support vector machines (Scholkopf and Smola, 2001), which generally yield satisfying results in classification tasks with a limited number of training instances (as opposed to deep learning approaches). We then found the best C and γ parameters by means of grid-search tuning (Hsu et al., 2016), using 10-fold cross-validation to prevent overfitting of the model. Finally, with the parameters that returned the best performance, we finalised the classifier and computed its accuracy and F1 score. The analyses were performed using R 3.6.0 with the caret v6.0-84 and e1071 v1.7-2 packages (R Core Team, 2018).
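The grid search was run in R with caret and e1071; as an illustrative equivalent (not the authors' code), the concatenated-embedding representation and the RBF-SVM tuning can be sketched with scikit-learn. The grid ranges below are assumptions; only the reported best values (C = 10^4, γ = 2^-6 for the full feature set) come from the paper.

```python
# Sketch: concatenated sentence embeddings fed to an RBF SVM, tuned by grid search
# with 10-fold cross-validation. Illustrative scikit-learn stand-in for the R setup.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def pair_features(student_vec, gold_vec):
    # 600-dim representation: concatenation of the two 300-dim sentence embeddings
    return np.concatenate([student_vec, gold_vec])

# X: (n_pairs, 600) matrix of concatenated embeddings, y: "pass"/"fail" labels
X = np.random.rand(200, 600)                     # placeholder; real features come from fastText
y = np.random.choice(["pass", "fail"], size=200)  # placeholder labels

param_grid = {"C": [10.0**k for k in range(0, 5)],       # assumed search range
              "gamma": [2.0**k for k in range(-8, 1)]}   # assumed search range

search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                      scoring="f1_macro", n_jobs=-1)
search.fit(X, y)
print(search.best_params_)  # the paper reports C = 1e4, gamma = 2**-6 with all features
```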
4.2 Results

[Figure 1: Plot for parameter tuning]

Figure 1 shows the plot summarising the tuning process. Within the explored area, the best parameters were found to be C = 10^4 and γ = 2^-6. The resulting tuned model produced the following results:

• Accuracy = 0.891 (balanced accuracy = 0.876);
• F1 score = 0.914.

With a similar approach, we also tuned the classifier when fed with only the concatenated sentence embeddings as features (i.e., without distance-based features). With the best parameters C = 10^3 and γ = 2^-3, the results were:

• Accuracy = 0.885 (balanced accuracy = 0.870);
• F1 score = 0.909.

To evaluate the quality of the models learned with these two configurations, and to make sure they do not overfit, we perform an additional test: we collect a small set of students' answers from a statistics exam different from the one used to create the training set. This is done by collecting students' answers to a small number of new questions and manually creating new gold answers to be used in the pairs. Overall, we obtain 77 new answer pairs, consisting of 14 wrong and 63 correct answers. We then run the best performing models with all features and with only sentence embeddings (same C and γ as before). The results with all features are:

• Accuracy = 0.7838 (balanced accuracy = 0.5965);
• F1 score = 0.8710;

while the results achieved using only sentence embeddings are:

• Accuracy = 0.7973 (balanced accuracy = 0.6349);
• F1 score = 0.8780.
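For reference, the metrics reported above (accuracy, balanced accuracy, F1) can be reproduced from a fitted model's predictions with standard scikit-learn functions; this is a sketch of the evaluation step with toy labels, not the authors' R evaluation code.

```python
# Sketch: computing the reported evaluation metrics from held-out predictions.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = ["pass", "pass", "fail", "pass", "fail"]   # toy gold labels
y_pred = ["pass", "fail", "fail", "pass", "fail"]   # toy predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1 (positive class = 'pass'):", f1_score(y_true, y_pred, pos_label="pass"))
```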
5 Discussion

The results presented in the previous section show only a small increase in performance when using the distance-based features in addition to the sentence embeddings, after tuning both configurations. This outcome highlights the effectiveness of sentence embeddings in representing the semantic content of the answers in tasks where the students' and the gold solutions are very similar to each other. In fact, the sentence pairs in our dataset show a high level of word overlap, and sometimes the only discriminant between a correct and a wrong answer is the presence of "<" instead of ">", or a negation.

The second experiment, where the same configuration is run on a test set taken from a statistics exam on different topics, shows an overall decrease in performance, as expected, but the classification accuracy is still well above the most-frequent-class baseline. In this setting, using only the sentence embeddings yields a slightly better performance than including the other features, showing that they are more robust with respect to a change of topic.

In general terms, despite the accurate parameter tuning, the classification approach seems to be applicable to short answer grading tests different from the data on which the training was done, provided that the student and gold answer types are the same as in our dataset (i.e. limited length, limited lexical variability).

6 Conclusions

In this paper, we have presented a novel dataset for short answer grading taken from a real statistics exam, which we make freely available. To our knowledge, this is the first dataset of this kind. We also introduce a simple approach based on sentence embeddings to automatically identify which answers are correct, which is easy to replicate and not computationally intensive.

In the future, the work could be extended in several directions. First of all, it would be interesting to use deep-learning approaches instead of SVM, but for that more training data are needed. These could be collected in the upcoming exam sessions at the University of L'Aquila. Another refinement of this work would be to grade the tests by assigning a numerical score instead of a pass/fail judgment. Since such scores are already included in the released dataset (the degrees), this would be quite straightforward to achieve. Finally, we plan to test the classifier by integrating it into an online evaluation tool, through which students can submit their tests and the trainer can run an automatic pass/fail assignment.

References

Anna Maria Angelone and Pierpaolo Vittorini. 2019. The Automated Grading of R Code Snippets: Preliminary Results in a Course of Health Informatics. In Proc. of the 9th International Conference in Methodologies and Intelligent Systems for Technology Enhanced Learning. Springer.

Alessio Palmero Aprosio and Giovanni Moretti. 2018. Tint 2.0: an all-inclusive suite for NLP in Italian. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018.

Stacey Bailey and Detmar Meurers. 2008. Diagnosing meaning errors in short answers to reading comprehension questions. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, pages 107–115. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Steven Burrows, Iryna Gurevych, and Benno Stein. 2015. The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1):60–117.

David H. Callear, Jenny Jerrams-Smith, and Victor Soh. 2001. CAA of short non-MCQ answers.

Laurie Cutrone, Maiga Chang, et al. 2011. Auto-Assessor: Computerized assessment system for marking student's short-answers automatically. In 2011 IEEE International Conference on Technology for Education, pages 81–88. IEEE.

Christian Gütl. 2007. e-Examiner: towards a fully-automatic knowledge assessment tool applicable in adaptive e-learning systems. In Proceedings of the 2nd International Conference on Interactive Mobile and Computer Aided Learning, pages 1–10. Citeseer.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2016. A Practical Guide to Support Vector Classification. Technical report, National Taiwan University.

Sally Jordan and Tom Mitchell. 2009. e-Assessment for learning? The potential of short-answer free-text questions with tailored feedback. British Journal of Educational Technology, 40(2):371–385.

Dharmendra Kanejiya, Arun Kumar, and Surendra Prasad. 2003. Automatic evaluation of students' answers using syntactically enhanced LSA. In Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2, pages 53–60. Association for Computational Linguistics.

Jamie Kiros and William Chan. 2018. InferLite: Simple universal sentence representations from natural language inference data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pages 4868–4874.

Richard Klein, Angelo Kyrilov, and Mayya Tokman. 2011. Automated assessment of short free-text responses in computer science using latent semantic analysis. In Proceedings of the 16th Annual Joint Conference on Innovation and Technology in Computer Science Education, pages 158–162. ACM.

Sachin Kumar, Soumen Chakrabarti, and Shourya Roy. 2017. Earth mover's distance pooling over siamese LSTMs for automatic short answer grading. In IJCAI, pages 2046–2052.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Claudia Leacock and Martin Chodorow. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4):389–405.

Nitin Madnani, Jill Burstein, John Sabatini, and Tenaha O'Reilly. 2013. Automated scoring of a summary-writing task designed to measure reading comprehension. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–168.

Detmar Meurers, Ramon Ziai, Niels Ott, and Stacey M. Bailey. 2011. Integrating parallel analysis modules to evaluate the meaning of answers to reading comprehension questions. International Journal of Continuing Engineering Education and Life-Long Learning, 21(4):355–369.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Tom Mitchell, Terry Russell, Peter Broomhead, and Nicola Aldridge. 2002. Towards robust computerised marking of free-text responses.

Michael Mohler, Razvan Bunescu, and Rada Mihalcea. 2011. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 752–762, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ellis B. Page. 1966. The imminence of grading essays by computer. The Phi Delta Kappan, 47(5):238–243.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

R Core Team. 2018. R: A Language and Environment for Statistical Computing.

Keisuke Sakaguchi, Michael Heilman, and Nitin Madnani. 2015. Effective feature integration for automated short answer scoring. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1049–1054.

Bernhard Scholkopf and Alexander J. Smola. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Mark D. Shermis, Jill Burstein, Derrick Higgins, and Klaus Zechner. 2010. Automated essay scoring: Writing assessment and instruction. International Encyclopedia of Education, 4(1):20–26.

Raheel Siddiqi and Christopher Harrison. 2008. A systematic approach to the automated marking of short-answer questions. In 2008 IEEE International Multitopic Conference, pages 329–332. IEEE.

Draylson M. Souza, Katia R. Felizardo, and Ellen F. Barbosa. 2016. A systematic literature review of assessment tools for programming assignments. In 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET), pages 147–156. IEEE.

Md Arafat Sultan, Cristobal Salazar, and Tamara Sumner. 2016. Fast and easy short answer grading with high accuracy. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1070–1075.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 180–189. Association for Computational Linguistics.