                               Automated Short Answer Grading:
                              A Simple Solution for a Difficult Task

         Stefano Menini† , Sara Tonelli† , Giovanni De Gasperis‡ , Pierpaolo Vittorini‡
                 †
                   Fondazione Bruno Kessler (Trento), ‡ University of L’Aquila
                               {menini,satonelli}@fbk.eu
               {giovanni.degasperis,pierpaolo.vittorini}@univaq.it


Abstract

English. The task of short answer grading is aimed at assessing the outcome of an exam by automatically analysing students' answers in natural language and deciding whether they should pass or fail the exam. In this paper, we tackle this task by training an SVM classifier on real data taken from a University statistics exam, showing that simple concatenated sentence embeddings used as features yield results around 0.90 F1, and that adding more complex distance-based features leads only to a slight improvement. We also release the dataset, which to our knowledge is the first freely available dataset of this kind in Italian.1

1 Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

Human grading of open-ended questions is a tedious and error-prone task, a problem that has become particularly pressing when such an assessment involves a large number of students, as in an academic setting. One possible solution to this problem is to automate the grading process, so that it can facilitate teachers in the correction and enable students to receive immediate feedback. Research on this task has been active since the '60s (Page, 1966), and several computational methods have been proposed to automatically grade different types of texts, from longer essays to short text answers. The advantages of this kind of automatic assessment do not concern only the limited time and effort required to grade tests compared with a manual assessment, but include also the reduction of mistakes and bias introduced by humans, as well as a better formalization of assessment criteria.

In this paper, we focus on tests comprising short answers to natural language questions, proposing a novel approach to binary automatic short answer grading (ASAG). This task has proven particularly challenging because an understanding of natural language is required, without having much textual context, whereas multiple-choice questions can be graded straightforwardly, given that there is only one possible correct response to each question. Furthermore, the tests considered in this paper are taken from real exams on statistical analyses, with low variability, a limited vocabulary and therefore little lexical difference between correct and wrong answers.

The contribution of this paper is two-fold: we create and release a dataset for short-answer grading containing real examples, which can be freely downloaded at https://zenodo.org/record/3257363#.XRsrn5P7TLY. Besides, we propose a simple approach that, making use only of concatenated sentence embeddings and an SVM classifier, achieves up to 0.90 F1 after parameter tuning.

2   Related Work

In the literature, several works have been presented on automated grading methods to assess the quality of answers in written examinations. Several types of answers have been addressed, from essays (Kanejiya et al., 2003; Shermis et al., 2010) to code (Souza et al., 2016). Here we focus on works related to short answers, which are the target of our tests. With short answers we refer to open questions, given in natural language, usually with the length of one paragraph, recalling external knowledge (Burrows et al., 2015). When assessing the grading of short answers we face two main issues: i) the grading itself and ii) the availability of appropriate datasets.

ASAG can be tackled with several approaches, including pattern matching (Mitchell et al., 2002), looking for specific concepts or keywords in the answers (Callear et al., 2001; Leacock and Chodorow, 2003; Jordan and Mitchell, 2009), using bags of words and matching terms (Cutrone et al., 2011), or relying on LSA (Klein et al., 2011). Other solutions rely more heavily on NLP techniques, for example by extracting metrics and features that can be used for text classification, such as the overlap of n-grams or POS between the student's and the teacher's answers (Bailey and Meurers, 2008; Meurers et al., 2011). Some attempts have also been made to use the similarity between word embeddings as a feature (Sultan et al., 2016; Sakaguchi et al., 2015; Kumar et al., 2017).

Another aspect that can affect the performance of different ASAG approaches is the target of the automated evaluation. We can for instance assess the quality of the text (Yannakoudakis et al., 2011), its comprehension and summarization (Madnani et al., 2013), or, as in our case, the knowledge of a specific notion. Each task would therefore need a specific dataset as a benchmark. Other dimensions affecting the approach to ASAG and its performance are the school level for which an assessment is required (e.g. primary school vs. university) as well as its domain, e.g. computer science (Gütl, 2007), biology (Siddiqi and Harrison, 2008) or math (Leacock and Chodorow, 2003). As for Italian, we are not aware of existing automated grading approaches, nor of available datasets specifically released to foster research in this direction. These are indeed the main contributions of the current paper.

3   Task and Data Description

The short answer grading task that we analyse in this paper is meant to automatize part of the exam that students of Health Informatics in the degree course of Medicine and Surgery of the University of L'Aquila (Italy) are required to pass. It includes two activities: a statistical analysis in R and the explanation of the results in terms of clinical findings. While the evaluation of the first part has already been automatized through automated grading of R code snippets (Angelone and Vittorini, 2019), the second task had been addressed by the same authors using a string similarity approach, which however did not yield satisfying results. Indeed, they used Levenshtein distance to compute the distance between the students' answers and a gold standard (i.e. correct) answer, but the approach failed to capture the semantic equivalence between the two sentences, focusing only on the lexical one.

For example, an exam provided students with data about surgical operations, subjects, scar visibility and hospital stay, and asked them to compute several statistical measures in R, such as the absolute and relative frequencies of the surgical operations. Then, students were required to comment in plain text on some of the analyses, for example stating whether some data are drawn from a normal distribution. For this second part of the exam, the teacher prepared a "gold answer", i.e. the correct answer. Two real examples from the dataset are reported below.

Correct answer pair:

    (Student) Poiché il p-value è maggiore di 0.05 in entrambi i casi, la distribuzione è normale, procediamo con un test parametrico per variabili appaiate. ("Since the p-value is greater than 0.05 in both cases, the distribution is normal, and we proceed with a parametric test for paired variables.")

    (Gold) Siccome tutti i test di normalità presentano un p>0.05, posso utilizzare un test parametrico. ("Since all normality tests show p>0.05, I can use a parametric test.")

Wrong answer pair:

    (Student) Siccome p<0.05, la differenza fra le due variabili è statisticamente significativa. ("Since p<0.05, the difference between the two variables is statistically significant.")

    (Gold) Siccome il t-test restituisce un p-value > di 0.05, non posso generalizzare alla popolazione il risultato osservato nel mio campione, e quindi non c'è differenza media di peso statisticamente significativa fra i figli maschi e femmine. ("Since the t-test returns a p-value > 0.05, I cannot generalize the result observed in my sample to the population, and therefore there is no statistically significant mean weight difference between male and female children.")

The goal of our task is, given each pair, to train a classifier and label students' answers as correct or wrong. An important aspect of our task is that the correctness of an answer is not defined with respect to the question, which is not used for classification. For the moment we also focus on binary classification, i.e. determining whether an answer is correct or not, without providing a numeric score expressing how correct or wrong it is. With the data organized into student-professor answer pairs, the classification is done considering i) the semantic content of the answers (represented through word embeddings) and ii) features related to the pair structure of the data, such as the overlap or the distance between the two texts. The adopted features are explained in detail in Section 4.1.

3.1   Dataset

The dataset available at https://zenodo.org/record/3257363#.XR5i8ZP7TLY has been partially collected using data from real statistics exams spanning different years, and partially extended by the authors of this paper. The dataset contains the list of sentences written by students, each with a unique sentence ID, the type of statistical analysis it refers to (i.e. whether it was given for the hypothesis test or for the normality test), its degree in a range from 0 to 1, and its fail/pass result, flanked by a manually defined gold standard (i.e. the correct answer). The degree is a numerical score manually assigned to each answer, which takes into account whether an answer is partially correct, mostly correct or completely wrong. Based on this degree, the pass/fail decision was taken, i.e. if degree < 0.6 then fail, otherwise pass.

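Purely for illustration, the pass/fail rule can be applied programmatically. The following Python sketch assumes a CSV export of the released dataset and uses hypothetical file and column names, which may differ from the actual fields in the Zenodo release:

    import csv

    def degree_to_outcome(degree):
        # Rule used in the dataset: degree < 0.6 -> "fail", otherwise "pass".
        return "fail" if degree < 0.6 else "pass"

    # "asag_dataset.csv", "id", "degree" and "student_answer" are hypothetical names.
    with open("asag_dataset.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            print(row["id"], degree_to_outcome(float(row["degree"])), row["student_answer"][:40])
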
In order to increase the number of training instances and achieve a better balance between the two classes, we manually negated a set of correct answers and reversed the corresponding fail/pass result, adding a set of negated gold standard sentences for a total of 332 new pairs. We also manually paraphrased 297 of the original gold standard sentences, so as to create additional pairs. Overall, the dataset consists of 1,069 student/gold standard answer pairs, 663 of which are labeled as "pass" and 406 as "fail".

4   Classification framework

Although several works have explored the possibility to automatically grade short text answers, these attempts have mainly focused on English. Furthermore, the best performing ones rely strongly on knowledge bases and syntactic analyses (Mohler et al., 2011), which are hard to obtain for Italian. We therefore test for the first time the potential of sentence embeddings to capture pass or fail judgments in a supervised setting, where the only required data are a) a training/test set and b) sentence embeddings (Bojanowski et al., 2017) trained using fastText (https://fasttext.cc/).

4.1   Method

Since we cast the task in a supervised classification framework, we first need to represent the pairs of student/gold standard sentences as features. Two different types of features are tested: distance-based features, which capture the similarity of the two sentences using measures based on lexical and semantic similarity, and sentence embedding features, whose goal is to represent the semantics of the two sentences in a distributional space.

All sentences are first preprocessed by removing stopwords such as articles and prepositions, and by replacing mathematical notation with its transcription in plain language, e.g. ">" with "maggiore di" (greater than). We also perform part-of-speech tagging, lemmatisation and affix recognition using the TINT NLP Suite for Italian (Aprosio and Moretti, 2018). Then, on each pair of sentences, the following distance-based features are computed:

   • Token overlap: a feature representing the number of overlapping tokens between the two sentences, normalised by their length. This feature captures the lexical similarity between the two strings.

   • Lemma overlap: a feature representing the number of overlapping lemmas between the two sentences, normalised by their length. Like the previous one, this feature captures the lexical similarity between the two strings.

   • Presence of negations: this feature represents whether a content word is negated in one sentence and not in the other. For each sentence, negations are recognised based on the NEG PoS tag or the affixes 'a-' and 'in-' (e.g. indipendente), and then the first content word occurring after the negation is considered. We extract two features, one for each sentence, and the values are normalised by sentence length.

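For illustration, the overlap features can be sketched in a few lines of Python. The sketch below is a simplification of the pipeline described above: it uses regex tokenisation and a tiny stopword list instead of TINT, and it normalises by the average token count of the two answers (one possible reading of "normalised by their length"):

    import re

    STOPWORDS = {"il", "la", "le", "i", "di", "e", "un", "una", "in", "con", "per"}  # tiny illustrative list
    SYMBOLS = {">": " maggiore di ", "<": " minore di ", "=": " uguale a "}          # plain-language transcription

    def preprocess(sentence):
        # Replace mathematical notation, lowercase, tokenise, and drop stopwords.
        for symbol, transcription in SYMBOLS.items():
            sentence = sentence.replace(symbol, transcription)
        tokens = re.findall(r"\w+", sentence.lower())
        return [t for t in tokens if t not in STOPWORDS]

    def token_overlap(student_answer, gold_answer):
        # Shared tokens, normalised by the average length of the two token lists.
        s, g = preprocess(student_answer), preprocess(gold_answer)
        if not s or not g:
            return 0.0
        return len(set(s) & set(g)) / ((len(s) + len(g)) / 2)

    print(token_overlap("Siccome p<0.05, la differenza è statisticamente significativa",
                        "Siccome il t-test restituisce un p-value > di 0.05, non c'è differenza"))

The lemma overlap and negation features follow the same pattern, but operate on the lemmas and PoS tags produced by TINT.
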
Other distance-based features are computed at sentence level. To this purpose we employ fastText (Bojanowski et al., 2017), an extension of word embeddings (Mikolov et al., 2013; Pennington et al., 2014) developed at Facebook that is able to deal with rare words by including subword information, and that represents sentences by combining the vectors of both words and subwords. To generate these embeddings we start from the pre-computed Italian language model (https://fasttext.cc/docs/en/crawl-vectors.html) trained on Common Crawl and Wikipedia. The latter, in particular, is suitable for our domain, since it also includes scientific content and statistics pages; therefore the language of the exam should be well represented in our model. The embeddings are created using continuous bag-of-words with position weights, a dimension of 300, character n-grams of length 5, a window of size 5 and 10 negatives.

Then, the embeddings of the sentences written by the students and of the gold standard ones are created by combining the word and subword embeddings with the fastText library. Each sentence is therefore represented through a 300-dimensional embedding. Based on this, we extract four additional distance-based features:

   • Embeddings cosine: the cosine between the two sentence embeddings is computed. The intuition behind this feature is that the embeddings of two sentences with a similar meaning should be close in the multidimensional space.

   • Embeddings cosine (lemmatized): the same feature as the previous one, with the only difference that the sentences are first lemmatised before creating the embeddings.

   • Word Mover's Distance (WMD): WMD is a similarity measure based on the minimum amount of distance that the embedded words of one document need to move to reach the embedded words of another document (Kusner et al., 2015) in a multidimensional space. Compared with other existing similarity measures, it works well also when two sentences have a similar meaning despite having few words in common. We apply this algorithm to measure the distance between the solutions proposed by the students and the ones in the gold standard.

   • Word Mover's Distance (lemmatized): the same feature as the previous one, with the only difference that the sentences are first lemmatised before creating the embeddings.

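As an illustration, the sentence-level features above can be computed with the official fastText Python bindings and gensim. This is only a sketch, not the authors' implementation, and it assumes that the pre-trained Italian vectors (cc.it.300.bin and cc.it.300.vec from fasttext.cc) have already been downloaded:

    import numpy as np
    import fasttext                              # official fastText Python bindings
    from gensim.models import KeyedVectors

    ft = fasttext.load_model("cc.it.300.bin")    # pre-trained Italian model (Common Crawl + Wikipedia)

    def sentence_embedding(sentence):
        # fastText combines word and subword vectors into a 300-dimensional sentence vector.
        return ft.get_sentence_vector(sentence)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Word Mover's Distance over the same word vectors; gensim's wmdistance requires its
    # optional optimal-transport dependency (POT/pyemd). Tokenisation here is naive.
    wv = KeyedVectors.load_word2vec_format("cc.it.300.vec")

    student = "Siccome p<0.05, la differenza fra le due variabili è statisticamente significativa."
    gold = "Siccome il t-test restituisce un p-value maggiore di 0.05, non c'è differenza significativa."
    print(cosine(sentence_embedding(student), sentence_embedding(gold)))
    print(wv.wmdistance(student.lower().split(), gold.lower().split()))
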
                                                          based features). With best parameters C = 103
The sentence embeddings used to compute the distance features are also tested as features in isolation: a 600-dimensional vector is created by concatenating the two sentence embeddings composing a student answer–gold standard pair. This representation is then directly fed to the classifier. We adopt this solution inspired by recent approaches to natural language inference using the concatenation of premise and hypothesis (Bowman et al., 2015; Kiros and Chan, 2018).

As for the supervised classifier, we use support vector machines (Scholkopf and Smola, 2001), which generally yield satisfying results in classification tasks with a limited number of training instances (as opposed to deep learning approaches).

[Figure 1: Plot for parameter tuning]

We then proceeded to find the best C and γ parameters by means of grid-search tuning (Hsu et al., 2016), through a 10-fold cross-validation to prevent overfitting. Finally, with the parameters that returned the best performance, we finalised the classifier and computed its accuracy and F1 score. The analyses were performed using R 3.6.0 with the caret v6.0-84 and e1071 v1.7-2 packages (R Core Team, 2018).

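The tuning itself was carried out in R with caret and e1071; purely as an illustrative sketch, an equivalent pipeline for the concatenated-embedding configuration can be written in Python with scikit-learn. The toy arrays below merely stand in for the real 300-dimensional sentence embeddings and pass/fail labels, and the grid bounds are an assumption (the paper only reports the best values):

    import numpy as np
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    # Stand-ins for the real data: one 300-dimensional embedding per student and gold answer.
    rng = np.random.default_rng(0)
    emb_student = rng.normal(size=(100, 300))
    emb_gold = rng.normal(size=(100, 300))
    y = rng.choice(["pass", "fail"], size=100)

    # Concatenate the two embeddings of each pair into a single 600-dimensional feature vector.
    X = np.hstack([emb_student, emb_gold])

    # Grid search over C and gamma with 10-fold cross-validation, mirroring the tuning above.
    param_grid = {"C": [10.0**k for k in range(0, 6)], "gamma": [2.0**k for k in range(-8, 0)]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                          cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                          scoring="f1_macro")
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))

Accuracy and F1 on held-out data would then be computed with the tuned estimator, as reported in Section 4.2.
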
4.2   Results

Figure 1 shows the plot summarising the tuning process. Within the explored area, the best parameters were found to be C = 10⁴ and γ = 2⁻⁶. The resulting tuned model produced the following results:

   • Accuracy = 0.891 (balanced accuracy = 0.876);

   • F1 score = 0.914.

With a similar approach, we also tuned the classifier when fed with only the concatenated sentence embeddings as features (i.e., without distance-based features). With the best parameters C = 10³ and γ = 2⁻³, the results were:

   • Accuracy = 0.885 (balanced accuracy = 0.870);

   • F1 score = 0.909.

To evaluate the quality of the model learned with these two configurations, and to make sure that it does not overfit, we perform an additional test: we collect a small set of students' answers from a different statistics exam than the one used to create the training set. This is done by collecting students' answers to a small number of new questions and manually creating new gold answers to be used in the pairs. Overall, we obtain 77 new answer pairs, consisting of 14 wrong and 63 correct answers. We then run the best performing models with all features and with only the sentence embeddings (same C and γ as before). The results are the following:

   • Accuracy using all features = 0.7838 (balanced accuracy = 0.5965);

   • F1 score = 0.8710;

while the results achieved using only the sentence embeddings are:

   • Accuracy = 0.7973 (balanced accuracy = 0.6349);

   • F1 score = 0.8780.

5   Discussion

The results presented in the previous section show only a small increase in performance when using the distance-based features in addition to the sentence embeddings, after tuning both configurations. This outcome highlights the effectiveness of sentence embeddings in representing the semantic content of the answers in tasks where students' and gold solutions are very similar to each other. In fact, the sentence pairs in our dataset show a high level of word overlap, and the only discriminant between a correct and a wrong answer is sometimes just the presence of "<" instead of ">", or a negation.

The second experiment, where the same configurations are run on a test set taken from a statistics exam on different topics, shows an overall decrease in performance, as expected, but the classification accuracy is still well above the most frequent baseline. In this setting, using only the sentence embeddings yields a slightly better performance than including the other features, showing that they are more robust with respect to a change of topic.

In general terms, despite the accurate parameter tuning, the classification approach seems to be applicable to short answer grading tests different from the data on which the training was done, provided that the student and gold answer types are the same as in our dataset (i.e. limited length, limited lexical variability).

6   Conclusions

In this paper, we have presented a novel dataset for short answer grading taken from a real statistics exam, which we make freely available. To our knowledge, this is the first dataset of this kind. We also introduce a simple approach based on sentence embeddings to automatically identify whether an answer is correct or not, which is easy to replicate and not computationally intensive.

In the future, the work could be extended in several directions. First of all, it would be interesting to use deep-learning approaches instead of SVMs, but for that more training data are needed. These could be collected in the upcoming exam sessions at the University of L'Aquila. Another refinement of this work would be to grade the tests by assigning a numerical score instead of a pass/fail judgment. Since such scores are already included in the released dataset (the degrees), this would be quite straightforward to achieve. Finally, we plan to test the classifier by integrating it into an online evaluation tool, through which students can submit their tests and the trainer can run an automatic pass/fail assignment.

References

Anna Maria Angelone and Pierpaolo Vittorini. 2019. The Automated Grading of R Code Snippets: Preliminary Results in a Course of Health Informatics. In Proc. of the 9th International Conference in Methodologies and Intelligent Systems for Technology Enhanced Learning. Springer.

Alessio Palmero Aprosio and Giovanni Moretti. 2018. Tint 2.0: an all-inclusive suite for NLP in Italian. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018.

Stacey Bailey and Detmar Meurers. 2008. Diagnosing meaning errors in short answers to reading comprehension questions. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, pages 107–115. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, September. Association for Computational Linguistics.

Steven Burrows, Iryna Gurevych, and Benno Stein. 2015. The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1):60–117.

David H. Callear, Jenny Jerrams-Smith, and Victor Soh. 2001. CAA of short non-MCQ answers.

Laurie Cutrone, Maiga Chang, et al. 2011. Auto-assessor: Computerized assessment system for marking student's short-answers automatically. In 2011 IEEE International Conference on Technology for Education, pages 81–88. IEEE.

Christian Gütl. 2007. e-examiner: towards a fully-automatic knowledge assessment tool applicable in adaptive e-learning systems. In Proceedings of the 2nd International Conference on Interactive Mobile and Computer Aided Learning, pages 1–10. Citeseer.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2016. A Practical Guide to Support Vector Classification. Technical report, National Taiwan University.

Sally Jordan and Tom Mitchell. 2009. e-assessment for learning? The potential of short-answer free-text questions with tailored feedback. British Journal of Educational Technology, 40(2):371–385.

Dharmendra Kanejiya, Arun Kumar, and Surendra Prasad. 2003. Automatic evaluation of students' answers using syntactically enhanced LSA. In Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2, pages 53–60. Association for Computational Linguistics.

Jamie Kiros and William Chan. 2018. InferLite: Simple universal sentence representations from natural language inference data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4868–4874.

Richard Klein, Angelo Kyrilov, and Mayya Tokman. 2011. Automated assessment of short free-text responses in computer science using latent semantic analysis. In Proceedings of the 16th Annual Joint Conference on Innovation and Technology in Computer Science Education, pages 158–162. ACM.

Sachin Kumar, Soumen Chakrabarti, and Shourya Roy. 2017. Earth mover's distance pooling over Siamese LSTMs for automatic short answer grading. In IJCAI, pages 2046–2052.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Claudia Leacock and Martin Chodorow. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4):389–405.

Nitin Madnani, Jill Burstein, John Sabatini, and Tenaha O'Reilly. 2013. Automated scoring of a summary-writing task designed to measure reading comprehension. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–168.

Detmar Meurers, Ramon Ziai, Niels Ott, and Stacey M. Bailey. 2011. Integrating parallel analysis modules to evaluate the meaning of answers to reading comprehension questions. International Journal of Continuing Engineering Education and Life-Long Learning, 21(4):355–369.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Tom Mitchell, Terry Russell, Peter Broomhead, and Nicola Aldridge. 2002. Towards robust computerised marking of free-text responses.

Michael Mohler, Razvan Bunescu, and Rada Mihalcea. 2011. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 752–762, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ellis B. Page. 1966. The imminence of grading essays by computer. The Phi Delta Kappan, 47(5):238–243.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

R Core Team. 2018. R: A Language and Environment for Statistical Computing.

Keisuke Sakaguchi, Michael Heilman, and Nitin Madnani. 2015. Effective feature integration for automated short answer scoring. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1049–1054.

Bernhard Scholkopf and Alexander J. Smola. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Mark D. Shermis, Jill Burstein, Derrick Higgins, and Klaus Zechner. 2010. Automated essay scoring: Writing assessment and instruction. International Encyclopedia of Education, 4(1):20–26.

Raheel Siddiqi and Christopher Harrison. 2008. A systematic approach to the automated marking of short-answer questions. In 2008 IEEE International Multitopic Conference, pages 329–332. IEEE.

Draylson M. Souza, Katia R. Felizardo, and Ellen F. Barbosa. 2016. A systematic literature review of assessment tools for programming assignments. In 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET), pages 147–156. IEEE.

Md Arafat Sultan, Cristobal Salazar, and Tamara Sumner. 2016. Fast and easy short answer grading with high accuracy. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1070–1075.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 180–189. Association for Computational Linguistics.