=Paper=
{{Paper
|id=Vol-2481/paper48
|storemode=property
|title=Automated Short Answer Grading: A Simple Solution for a Difficult Task
|pdfUrl=https://ceur-ws.org/Vol-2481/paper48.pdf
|volume=Vol-2481
|authors=Stefano Menini,Sara Tonelli,Giovanni De Gasperis,Pierpaolo Vittorini
|dblpUrl=https://dblp.org/rec/conf/clic-it/MeniniTGV19
}}
==Automated Short Answer Grading: A Simple Solution for a Difficult Task==
Stefano Menini†, Sara Tonelli†, Giovanni De Gasperis‡, Pierpaolo Vittorini‡
† Fondazione Bruno Kessler (Trento), ‡ University of L'Aquila
{menini,satonelli}@fbk.eu, {giovanni.degasperis,pierpaolo.vittorini}@univaq.it

Abstract

English. The task of short answer grading is aimed at assessing the outcome of an exam by automatically analysing students' answers in natural language and deciding whether they should pass or fail the exam. In this paper, we tackle this task by training an SVM classifier on real data taken from a University statistics exam, showing that simple concatenated sentence embeddings used as features yield results around 0.90 F1, and that adding more complex distance-based features leads only to a slight improvement. We also release the dataset, which to our knowledge is the first freely available dataset of this kind in Italian.¹

¹ Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Human grading of open-ended questions is a tedious and error-prone task, a problem that has become particularly pressing when such an assessment involves a large number of students, as in an academic setting. One possible solution to this problem is to automate the grading process, so that it can facilitate teachers in the correction and enable students to receive immediate feedback. Research on this task has been active since the '60s (Page, 1966), and several computational methods have been proposed to automatically grade different types of texts, from longer essays to short text answers. The advantages of this kind of automatic assessment do not concern only the limited time and effort required to grade tests compared with a manual assessment, but include also the reduction of mistakes and bias introduced by humans, as well as a better formalisation of assessment criteria.

In this paper, we focus on tests comprising short answers to natural language questions, proposing a novel approach to binary automatic short answer grading (ASAG). This task has proven particularly challenging because an understanding of natural language is required without much textual context, whereas multiple-choice questions can be assessed straightforwardly, given that there is only one possible correct response to each question. Furthermore, the tests considered in this paper are taken from real exams on statistical analyses, with low variability, a limited vocabulary and therefore little lexical difference between correct and wrong answers.

The contribution of this paper is two-fold: we create and release a dataset for short-answer grading containing real examples, which can be freely downloaded at https://zenodo.org/record/3257363#.XRsrn5P7TLY. Besides, we propose a simple approach that, making use only of concatenated sentence embeddings and an SVM classifier, achieves up to 0.90 F1 after parameter tuning.

2 Related Work

In the literature, several works have been presented on automated grading methods to assess the quality of answers in written examinations. Several types of answers have been addressed, from essays (Kanejiya et al., 2003; Shermis et al., 2010) to code (Souza et al., 2016). Here we focus on works related to short answers, which are the target of our tests. With short answers we refer to open questions, answered in natural language, usually with the length of one paragraph and recalling external knowledge (Burrows et al., 2015). When assessing the grading of short answers we face two main issues: i) the grading itself and ii) the presence of appropriate datasets.

ASAG can be tackled with several approaches, including pattern matching (Mitchell et al., 2002), looking for specific concepts or keywords in the answers (Callear et al., 2001; Leacock and Chodorow, 2003; Jordan and Mitchell, 2009), using bags of words and matching terms (Cutrone et al., 2011), or relying on LSA (Klein et al., 2011). Other solutions rely more heavily on NLP techniques, for example by extracting metrics and features that can be used for text classification, such as the overlap of n-grams or POS tags between the student's and the teacher's answers (Bailey and Meurers, 2008; Meurers et al., 2011). Some attempts have also been made to use the similarity between word embeddings as a feature (Sultan et al., 2016; Sakaguchi et al., 2015; Kumar et al., 2017).

Another aspect that can affect the performance of different ASAG approaches is the target of the automated evaluation. We can for instance assess the quality of the text (Yannakoudakis et al., 2011), its comprehension and summarisation (Madnani et al., 2013), or, as in our case, the knowledge of a specific notion. Each task would therefore need a specific dataset as a benchmark. Other dimensions affecting the approach to ASAG and its performance are the school level for which an assessment is required (e.g. primary school vs. university) as well as its domain, e.g. computer science (Gütl, 2007), biology (Siddiqi and Harrison, 2008) or math (Leacock and Chodorow, 2003). As for Italian, we are not aware of existing automated grading approaches, nor of available datasets specifically released to foster research in this direction. These are indeed the main contributions of the current paper.
3 Task and Data Description

The short answer grading task that we analyse in this paper is meant to automatise part of the exam that students of Health Informatics in the degree course of Medicine and Surgery of the University of L'Aquila (Italy) are required to pass. The exam includes two activities: a statistical analysis in R and the explanation of the results in terms of clinical findings. While the evaluation of the first part has already been automatised through automated grading of R code snippets (Angelone and Vittorini, 2019), the second task had been addressed by the same authors using a string similarity approach, which however did not yield satisfying results. Indeed, they used Levenshtein distance to compute the distance between the students' answer and a gold standard (i.e. correct) answer, but the approach failed to capture the semantic equivalence between the two sentences, focusing only on the lexical one.

For example, an exam provided students with data about surgical operations, subjects, scar visibility and hospital stay, and asked them to compute several statistical measures in R, such as the absolute and relative frequencies of the surgical operations. Then, students were required to comment in plain text on some of the analyses, for example to state whether some data are extracted from a normal distribution. For this second part of the exam, the teacher prepared a "gold answer", i.e. the correct answer. Two real examples from the dataset are reported below (English translations in brackets).

Correct answer pair:
(Student) Poiché il p-value è maggiore di 0.05 in entrambi i casi, la distribuzione è normale, procediamo con un test parametrico per variabili appaiate. [Since the p-value is greater than 0.05 in both cases, the distribution is normal, and we proceed with a parametric test for paired variables.]
(Gold) Siccome tutti i test di normalità presentano un p>0.05, posso utilizzare un test parametrico. [Since all normality tests show p>0.05, I can use a parametric test.]

Wrong answer pair:
(Student) Siccome p<0.05, la differenza fra le due variabili è statisticamente significativa. [Since p<0.05, the difference between the two variables is statistically significant.]
(Gold) Siccome il t-test restituisce un p-value > di 0.05, non posso generalizzare alla popolazione il risultato osservato nel mio campione, e quindi non c'è differenza media di peso statisticamente significativa fra i figli maschi e femmine. [Since the t-test returns a p-value > 0.05, I cannot generalise the result observed in my sample to the population, and therefore there is no statistically significant mean weight difference between male and female children.]

The goal of our task is, given each pair, to train a classifier that labels students' answers as correct or wrong.
An important aspect of our task is that the correctness of an answer is not defined with respect to the question, which is not used for classification. For the moment we also focus on binary classification, i.e. determining whether an answer is correct or not, without providing a numeric score of how correct or wrong it is. With the data organised into student–professor answer pairs, the classification considers i) the semantic content of the answers (represented through word embeddings) and ii) features related to the pair structure of the data, such as the overlap or the distance between the two texts. The adopted features are explained in detail in Section 4.1.

3.1 Dataset

The dataset available at https://zenodo.org/record/3257363#.XR5i8ZP7TLY has been partially collected using data from real statistics exams spanning different years, and partially extended by the authors of this paper. The dataset contains the list of sentences written by students, each with a unique sentence ID, the type of statistical analysis it refers to (either a hypothesis test or a normality test), its degree in a range from 0 to 1, and its fail/pass result, flanked by a manually defined gold standard (i.e. the correct answer). The degree is a numerical score manually assigned to each answer, which takes into account whether an answer is partially correct, mostly correct or completely wrong. Based on this degree, the pass/fail decision was taken, i.e. if degree < 0.6 then fail, otherwise pass.
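The pass/fail labels can thus be derived directly from the released degrees. The following is a minimal sketch of that thresholding step; it assumes a CSV export of the dataset with hypothetical column names (`student_answer`, `gold_answer`, `degree`), which may differ from the actual release.

```python
# Sketch: derive binary pass/fail labels from the annotated degree.
# Column names are illustrative and may differ in the released dataset.
import pandas as pd

PASS_THRESHOLD = 0.6  # degree < 0.6 -> fail, otherwise pass (as described above)

pairs = pd.read_csv("asag_dataset.csv")  # hypothetical export of the Zenodo data
pairs["label"] = (pairs["degree"] >= PASS_THRESHOLD).map({True: "pass", False: "fail"})

print(pairs["label"].value_counts())     # should roughly reflect the 663/406 split
```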
In order to increase the number of training instances and achieve a better balance between the two classes, we manually negated a set of correct answers and reversed the corresponding fail/pass result, adding a set of negated gold standard sentences for a total of 332 new pairs. We also manually paraphrased 297 of the original gold standard sentences, thereby creating additional pairs. Overall, the dataset consists of 1,069 student/gold standard answer pairs, 663 of which are labeled as "pass" and 406 as "fail".

4 Classification framework

Although several works have explored the possibility to automatically grade short text answers, these attempts have mainly focused on English. Furthermore, the best performing ones strongly rely on knowledge bases and syntactic analyses (Mohler et al., 2011), which are hard to obtain for Italian. We therefore test for the first time the potential of sentence embeddings to capture pass or fail judgments in a supervised setting, where the only required data are a) a training/test set and b) sentence embeddings (Bojanowski et al., 2017) trained using fastText (https://fasttext.cc/).

4.1 Method

Since we cast the task in a supervised classification framework, we first need to represent the pairs of student/gold standard sentences as features. Two different types of features are tested: distance-based features, which capture the similarity of the two sentences using measures based on lexical and semantic similarity, and sentence embedding features, whose goal is to represent the semantics of the two sentences in a distributional space.

All sentences are first preprocessed by removing stopwords such as articles and prepositions, and by replacing mathematical notation with its transcription in plain language, e.g. ">" with "maggiore di" (greater than). We also perform part-of-speech tagging, lemmatisation and affix recognition using the TINT NLP Suite for Italian (Aprosio and Moretti, 2018). Then, on each pair of sentences, the following distance-based features are computed (a code sketch is given after the list):

• Token overlap: the number of overlapping tokens between the two sentences, normalised by their length. This feature captures the lexical similarity between the two strings.

• Lemma overlap: the number of overlapping lemmas between the two sentences, normalised by their length. Like the previous one, this feature captures the lexical similarity between the two strings.

• Presence of negations: this feature represents whether a content word is negated in one sentence and not in the other. For each sentence, negations are recognised based on the NEG PoS tag or the affixes 'a-' or 'in-' (e.g. indipendente), and then the first content word occurring after the negation is considered. We extract two features, one for each sentence, and the values are normalised by the sentence length.
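The sketch below illustrates the overlap-style features under simplifying assumptions: whitespace tokenisation and a toy stopword list stand in for the TINT pipeline used in the paper, and the normalisation by the student-answer length is one of several reasonable choices rather than the authors' exact formula.

```python
# Sketch of the lexical distance-based features (token/lemma overlap).
# Real preprocessing in the paper uses the TINT NLP suite; this is a simplified stand-in.
ITALIAN_STOPWORDS = {"il", "la", "le", "i", "di", "e", "in", "con", "un", "una", "per"}

def normalize(text):
    # Replace mathematical notation with plain language, as described above.
    text = text.replace(">", " maggiore di ").replace("<", " minore di ")
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [t for t in tokens if t not in ITALIAN_STOPWORDS]

def overlap(student_tokens, gold_tokens):
    # Shared items normalised by the length of the student answer (illustrative choice).
    if not student_tokens:
        return 0.0
    return len(set(student_tokens) & set(gold_tokens)) / len(student_tokens)

student = normalize("Siccome p<0.05, la differenza fra le due variabili è statisticamente significativa")
gold = normalize("Siccome il t-test restituisce un p-value > di 0.05, non posso generalizzare il risultato")
print("token overlap:", overlap(student, gold))
```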
Other distance-based features are computed at sentence level. To this purpose we employ fastText (Bojanowski et al., 2017), an extension of word embeddings (Mikolov et al., 2013; Pennington et al., 2014) developed at Facebook that is able to deal with rare words by including subword information, representing sentences essentially by combining vectors for both words and subwords. To generate these embeddings we start from the pre-computed Italian language model (https://fasttext.cc/docs/en/crawl-vectors.html) trained on Common Crawl and Wikipedia. The latter, in particular, is suitable for our domain, since it also includes scientific content and statistics pages, so the language of the exam should be well represented in our model. The embeddings are created using continuous bag-of-words with position weights, a dimension of 300, character n-grams of length 5, a window of size 5 and 10 negatives.

Then, the embeddings of the sentences written by the students and of the gold standard ones are created by combining the word and the subword embeddings with the fastText library. Each sentence is therefore represented through a 300-dimensional embedding. Based on this, we extract four additional distance-based features (see the sketch after the list):

• Embeddings cosine: the cosine between the two sentence embeddings. The intuition behind this feature is that the embeddings of two sentences with a similar meaning should be close in the multidimensional space.

• Embeddings cosine (lemmatised): the same feature as the previous one, with the only difference that the sentences are first lemmatised before creating the embeddings.

• Word Mover's Distance (WMD): WMD is a similarity measure based on the minimum distance that the embedded words of one document need to travel to reach the embedded words of another document (Kusner et al., 2015) in a multidimensional space. Compared with other existing similarity measures, it works well also when two sentences have a similar meaning despite having few words in common. We apply this algorithm to measure the distance between the solutions proposed by the students and the ones in the gold standard.

• Word Mover's Distance (lemmatised): the same feature as the previous one, with the only difference that the sentences are first lemmatised before creating the embeddings.
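A minimal sketch of the embedding-based distances follows. It assumes the pre-trained Italian fastText files (cc.it.300.bin for sentence vectors, cc.it.300.vec for word vectors) and uses gensim's Word Mover's Distance implementation; the file names and tooling are illustrative, not necessarily what the authors used.

```python
# Sketch: cosine between fastText sentence embeddings and Word Mover's Distance.
# Assumes the pre-trained Italian vectors from https://fasttext.cc/docs/en/crawl-vectors.html
import fasttext
import numpy as np
from gensim.models import KeyedVectors

ft = fasttext.load_model("cc.it.300.bin")                  # 300-dim subword-aware model
kv = KeyedVectors.load_word2vec_format("cc.it.300.vec")    # word vectors for WMD

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

student = "la distribuzione è normale, procediamo con un test parametrico"
gold = "posso utilizzare un test parametrico"

# Embeddings cosine feature
emb_cosine = cosine(ft.get_sentence_vector(student), ft.get_sentence_vector(gold))

# Word Mover's Distance feature (gensim implementation of Kusner et al., 2015)
wmd = kv.wmdistance(student.split(), gold.split())

print("cosine:", emb_cosine, "WMD:", wmd)
```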
The sentence embeddings used to compute the distance features are also tested as features in isolation: a 600-dimensional vector is created by concatenating the two sentence embeddings composing a student answer – gold standard pair. This representation is then directly fed to the classifier. We adopt this solution inspired by recent approaches to natural language inference using the concatenation of premise and hypothesis (Bowman et al., 2015; Kiros and Chan, 2018).

As for the supervised classifier, we use support vector machines (Scholkopf and Smola, 2001), which generally yield satisfying results in classification tasks with a limited number of training instances (as opposed to deep learning approaches). We then found the best C and γ parameters by means of grid-search tuning (Hsu et al., 2016), using 10-fold cross-validation to prevent overfitting of the model. Finally, with the parameters that returned the best performance, we finalised the classifier and computed its accuracy and F1 score. The analyses were performed using R 3.6.0 with the caret v6.0-84 and e1071 v1.7-2 packages (R Core Team, 2018).
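The grid search was run in R with caret and e1071; as an illustrative equivalent (not the authors' code), the concatenated-embedding representation and the RBF-SVM tuning can be sketched with scikit-learn. The grid ranges below are assumptions; only the reported best values (C = 10^4, γ = 2^-6 for the full feature set) come from the paper.

```python
# Sketch: concatenated sentence embeddings fed to an RBF SVM, tuned by grid search
# with 10-fold cross-validation. Illustrative scikit-learn stand-in for the R setup.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def pair_features(student_vec, gold_vec):
    # 600-dim representation: concatenation of the two 300-dim sentence embeddings
    return np.concatenate([student_vec, gold_vec])

# X: (n_pairs, 600) matrix of concatenated embeddings, y: "pass"/"fail" labels
X = np.random.rand(200, 600)                     # placeholder; real features come from fastText
y = np.random.choice(["pass", "fail"], size=200)  # placeholder labels

param_grid = {"C": [10.0**k for k in range(0, 5)],       # assumed search range
              "gamma": [2.0**k for k in range(-8, 1)]}   # assumed search range

search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                      scoring="f1_macro", n_jobs=-1)
search.fit(X, y)
print(search.best_params_)  # the paper reports C = 1e4, gamma = 2**-6 with all features
```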
4.2 Results

[Figure 1: Plot for parameter tuning]

Figure 1 shows the plot summarising the tuning process. Within the explored area, the best parameters were found to be C = 10^4 and γ = 2^-6. The resulting tuned model produced the following results:

• Accuracy = 0.891 (balanced accuracy = 0.876);
• F1 score = 0.914.

With a similar approach, we also tuned the classifier when fed with only the concatenated sentence embeddings as features (i.e., without distance-based features). With the best parameters C = 10^3 and γ = 2^-3, the results were:

• Accuracy = 0.885 (balanced accuracy = 0.870);
• F1 score = 0.909.

To evaluate the quality of the models learned with these two configurations, and to make sure they do not overfit, we perform an additional test: we collect a small set of students' answers from a statistics exam different from the one used to create the training set. This is done by collecting students' answers to a small number of new questions and manually creating new gold answers to be used in the pairs. Overall, we obtain 77 new answer pairs, consisting of 14 wrong and 63 correct answers. We then run the best performing models with all features and with only sentence embeddings (same C and γ as before). The results with all features are:

• Accuracy = 0.7838 (balanced accuracy = 0.5965);
• F1 score = 0.8710;

while the results achieved using only sentence embeddings are:

• Accuracy = 0.7973 (balanced accuracy = 0.6349);
• F1 score = 0.8780.
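For reference, the metrics reported above (accuracy, balanced accuracy, F1) can be reproduced from a fitted model's predictions with standard scikit-learn functions; this is a sketch of the evaluation step with toy labels, not the authors' R evaluation code.

```python
# Sketch: computing the reported evaluation metrics from held-out predictions.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = ["pass", "pass", "fail", "pass", "fail"]   # toy gold labels
y_pred = ["pass", "fail", "fail", "pass", "fail"]   # toy predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1 (positive class = 'pass'):", f1_score(y_true, y_pred, pos_label="pass"))
```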
5 Discussion

The results presented in the previous section show only a small increase in performance when using the distance-based features in addition to the sentence embeddings, after tuning both configurations. This outcome highlights the effectiveness of sentence embeddings in representing the semantic content of the answers in tasks where the students' and the gold solutions are very similar to each other. In fact, the sentence pairs in our dataset show a high level of word overlap, and sometimes the only discriminant between a correct and a wrong answer is the presence of "<" instead of ">", or a negation.

The second experiment, where the same configuration is run on a test set taken from a statistics exam on different topics, shows an overall decrease in performance, as expected, but the classification accuracy is still well above the most-frequent-class baseline. In this setting, using only the sentence embeddings yields a slightly better performance than including the other features, showing that they are more robust with respect to a change of topic.

In general terms, despite the accurate parameter tuning, the classification approach seems to be applicable to short answer grading tests different from the data on which the training was done, provided that the student and gold answer types are the same as in our dataset (i.e. limited length, limited lexical variability).

6 Conclusions

In this paper, we have presented a novel dataset for short answer grading taken from a real statistics exam, which we make freely available. To our knowledge, this is the first dataset of this kind. We also introduce a simple approach based on sentence embeddings to automatically identify which answers are correct, which is easy to replicate and not computationally intensive.

In the future, the work could be extended in several directions. First of all, it would be interesting to use deep-learning approaches instead of SVM, but for that more training data are needed. These could be collected in the upcoming exam sessions at the University of L'Aquila. Another refinement of this work would be to grade the tests by assigning a numerical score instead of a pass/fail judgment. Since such scores are already included in the released dataset (the degrees), this would be quite straightforward to achieve. Finally, we plan to test the classifier by integrating it into an online evaluation tool, through which students can submit their tests and the trainer can run an automatic pass/fail assignment.

References

Anna Maria Angelone and Pierpaolo Vittorini. 2019. The Automated Grading of R Code Snippets: Preliminary Results in a Course of Health Informatics. In Proc. of the 9th International Conference in Methodologies and Intelligent Systems for Technology Enhanced Learning. Springer.

Alessio Palmero Aprosio and Giovanni Moretti. 2018. Tint 2.0: an all-inclusive suite for NLP in Italian. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018.

Stacey Bailey and Detmar Meurers. 2008. Diagnosing meaning errors in short answers to reading comprehension questions. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, pages 107–115. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Steven Burrows, Iryna Gurevych, and Benno Stein. 2015. The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1):60–117.

David H. Callear, Jenny Jerrams-Smith, and Victor Soh. 2001. CAA of short non-MCQ answers.

Laurie Cutrone, Maiga Chang, et al. 2011. Auto-Assessor: Computerized assessment system for marking student's short-answers automatically. In 2011 IEEE International Conference on Technology for Education, pages 81–88. IEEE.

Christian Gütl. 2007. e-Examiner: towards a fully-automatic knowledge assessment tool applicable in adaptive e-learning systems. In Proceedings of the 2nd International Conference on Interactive Mobile and Computer Aided Learning, pages 1–10. Citeseer.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2016. A Practical Guide to Support Vector Classification. Technical report, National Taiwan University.

Sally Jordan and Tom Mitchell. 2009. e-Assessment for learning? The potential of short-answer free-text questions with tailored feedback. British Journal of Educational Technology, 40(2):371–385.

Dharmendra Kanejiya, Arun Kumar, and Surendra Prasad. 2003. Automatic evaluation of students' answers using syntactically enhanced LSA. In Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2, pages 53–60. Association for Computational Linguistics.

Jamie Kiros and William Chan. 2018. InferLite: Simple universal sentence representations from natural language inference data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pages 4868–4874.

Richard Klein, Angelo Kyrilov, and Mayya Tokman. 2011. Automated assessment of short free-text responses in computer science using latent semantic analysis. In Proceedings of the 16th Annual Joint Conference on Innovation and Technology in Computer Science Education, pages 158–162. ACM.

Sachin Kumar, Soumen Chakrabarti, and Shourya Roy. 2017. Earth mover's distance pooling over siamese LSTMs for automatic short answer grading. In IJCAI, pages 2046–2052.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Claudia Leacock and Martin Chodorow. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4):389–405.

Nitin Madnani, Jill Burstein, John Sabatini, and Tenaha O'Reilly. 2013. Automated scoring of a summary-writing task designed to measure reading comprehension. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–168.

Detmar Meurers, Ramon Ziai, Niels Ott, and Stacey M. Bailey. 2011. Integrating parallel analysis modules to evaluate the meaning of answers to reading comprehension questions. International Journal of Continuing Engineering Education and Life-Long Learning, 21(4):355–369.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Tom Mitchell, Terry Russell, Peter Broomhead, and Nicola Aldridge. 2002. Towards robust computerised marking of free-text responses.

Michael Mohler, Razvan Bunescu, and Rada Mihalcea. 2011. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 752–762, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ellis B. Page. 1966. The imminence of grading essays by computer. The Phi Delta Kappan, 47(5):238–243.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

R Core Team. 2018. R: A Language and Environment for Statistical Computing.

Keisuke Sakaguchi, Michael Heilman, and Nitin Madnani. 2015. Effective feature integration for automated short answer scoring. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1049–1054.

Bernhard Scholkopf and Alexander J. Smola. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Mark D. Shermis, Jill Burstein, Derrick Higgins, and Klaus Zechner. 2010. Automated essay scoring: Writing assessment and instruction. International Encyclopedia of Education, 4(1):20–26.

Raheel Siddiqi and Christopher Harrison. 2008. A systematic approach to the automated marking of short-answer questions. In 2008 IEEE International Multitopic Conference, pages 329–332. IEEE.

Draylson M. Souza, Katia R. Felizardo, and Ellen F. Barbosa. 2016. A systematic literature review of assessment tools for programming assignments. In 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET), pages 147–156. IEEE.

Md Arafat Sultan, Cristobal Salazar, and Tamara Sumner. 2016. Fast and easy short answer grading with high accuracy. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1070–1075.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 180–189. Association for Computational Linguistics.