<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Methods and perspectives for the automated analytic assessment of free-text responses in formative scenarios</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Gombert</string-name>
          <email>gombert@dipf.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Assessment, Automated Assessment, Analytic Assessment, Short Answer Grading, Essay Grading</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz Institute for Research and Information in Education</institution>
          ,
          <addr-line>Frankfurt am Main</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Assessment is the process of testing learners' skills and knowledge. Free-text response items are well suited for the assessment of learners' active knowledge and writing skills. However, the automatic assessment of respective responses is not trivial and requires the application of natural language processing. Accordingly, the automatic assessment of free-text responses is a widely researched topic in educational natural language processing. Most past work targets holistic scoring, the process of assigning overall scores or grades to responses. This is problematic in formative scenarios because learners require feedback rather than summative scores in such scenarios. Such feedback ideally targets specific aspects of responses, and, accordingly, automated systems which only predict holistic scores cannot be used as a basis for providing the same. What is instead needed are systems which implement analytic scoring approaches. Analytic scoring targets specific aspects of responses and scores them according to corresponding criteria. This requires diferent systems than addressed by the broad research on automated holistic scoring. In my PhD work which is outlined by this paper, I want to explore approaches for implementing analytic scoring systems by means of state-of-the-art natural language processing. These systems are targeted at providing a basis for feedback generation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Assessment is the process of measuring and documenting learners’ skills and knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is conducted through tests composed of various kinds of test items. Assessing learners’ knowledge and skills is also the basis for providing them with appropriate content-related feedback in formative scenarios [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the context of technology-based assessment, multiple-choice items have grown to be a popular choice for implementing tests [3, 4]. This is mostly because evaluating multiple-choice items is rather trivial: test creators simply define a set of response options of which one or more are marked as correct. When test-takers select options during testing, the computer only needs to determine which of them were among the correct ones. Moreover, multiple-choice items take only a short time to answer, which makes it possible to include many different items within a test and to test for a broad range of knowledge [4].
      </p>
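      <p>The evaluation logic just described, checking the selected options against an answer key, can be sketched in a few lines of Python. This is a minimal illustration; the item identifiers and option labels are hypothetical:</p>
      <preformat>
```python
# Minimal sketch of multiple-choice scoring: an item is counted as correct
# when the set of selected options exactly matches the set of options the
# answer key marks as correct. Item identifiers and options are examples.

def score_item(selected, correct):
    """Return 1 if the selected options match the answer key, else 0."""
    return 1 if set(selected) == set(correct) else 0

def score_test(responses, answer_key):
    """Sum the item scores of one test-taker over all items."""
    return sum(score_item(responses.get(item, []), key)
               for item, key in answer_key.items())

answer_key = {"item_1": ["b"], "item_2": ["a", "c"]}
responses = {"item_1": ["b"], "item_2": ["a"]}
print(score_test(responses, answer_key))  # 1 (item_2 misses option "c")
```
      </preformat>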
    </sec>
    <sec id="sec-2">
      <title>However, not every skill and every kind of knowledge can be assessed through multiple-choice items. “A multiple‑choice test for history students can test their factual</title>
      <p>https://edutec.science/team/sebastian-gombert/ (S. Gombert)
0000-0001-5598-9547 (S. Gombert)</p>
      <sec id="sec-2-1">
        <title>2. Constructed Responses and their Automatic Assessment</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>To test skills such as the ones described by [4], con</title>
      <p>structed response items are needed instead multiple
choice items. In their most common form, they require
students to enter a free text as response into a text field.</p>
      <p>However, this drastically increases the complexity of assessing learners’ responses in an automated fashion, as the computer-based analysis of human language is far from trivial. With natural language processing, respectively computational linguistics, a whole interdisciplinary field of research building upon various methods from linguistics, logic, psychology, cognitive science, software engineering and philosophy is dedicated to this issue, and the automatic processing of many aspects of language remains open research. What makes the automatic analysis of free text difficult are the properties of language itself. Humans can generate an unlimited set of different linguistic utterances, and often, there are many ways to express the same or similar semantics, i.e., through different synonyms, the usage of passive vs. active constructions, or ways of paraphrasing. In past research, many different methods were applied to the automatic assessment of free-text responses. These range from simpler keyword, pattern and regular expression searches, and methods building upon distributional vector space semantics, to fully-fledged machine learning systems [5, 6].</p>
      <p>Most recently, transformer language models such as BERT [7] were successfully applied to the problem of free-text assessment [8, 9, 10, 11]. The application of transformers to the assessment of constructed responses promises major advancements in the field, but nonetheless, most of the systems available are built to predict only holistic scores [5, 6], ergo scores aimed at denoting the overall quality of a response [4]. Most of the established datasets, especially the ones focused on short answers, also cater towards this approach [5, 9, 6]. While holistic scores reflect how well learners were able to solve a given task overall, they do not necessarily denote which aspects of their response were of good quality and in which regards they could improve. However, especially in formative scenarios, providing students with feedback is crucial, which puts the application of holistic scoring systems in such scenarios into question.</p>
      <p>There is a second scoring approach in constructed response assessment which can be seen as a better basis for providing detailed, personalized feedback: analytic scoring. In analytic scoring, rather than judging responses as a whole, they are assessed for multiple different aspects which need to be specifically defined in a coding rubric [4]. For instance, “[o]n a science question, the scorer may award two points for providing a correct explanation of a phenomenon, one point for correctly stating the general principle that it illustrates, and one point for providing another valid example of that principle in action” [4]. Drawing such distinctions and coding responses for multiple different aspects allows for more detailed and concise feedback, as the feedback can specifically address these aspects.</p>
      <sec id="sec-3-1">
        <title>3. Research Questions</title>
        <p>The two most common types of free-text responses are short answers and essays. While short answers are used to test students’ ability to explain phenomena or demonstrate their active knowledge, essays are used for analysing the writing skills of students, e.g., their skill to clearly and coherently discuss or communicate a given issue or argue against or in favour of an opinion. Accordingly, approaches for the analytic assessment of both text forms must inevitably differ. For short answers, it presumably should be sufficient to simply assess whole responses for the different aspects, as short answers are rather condensed texts. From a formal point of view, this can be interpreted as a (multi-label) text classification task [5]. For essays, on the other hand, the respective coding can require more varied approaches. Are the coded aspects related to content or writing style? Does a content-specific code apply to the whole text or to specific sections? These questions need to be addressed in order to come to appropriate operationalisations. E.g., if it is likely that each code corresponds to a specific part of an essay, one needs to first semantically segment it into the respective parts. One could then, in a second step, separately classify these parts for the actual codes. If, on the other hand, a code corresponds to a whole essay, such separation is not needed.</p>
        <p>I plan my PhD to be paper-based, with the single papers connected by the overarching topic of analytic constructed response coding. First and foremost, I want to explore what has already been done in past work and how my own work can benefit from these insights. The acquired knowledge is then to be used for the practical implementation of constructed response scoring systems in a range of case studies. For these case studies, I plan to leverage data sets from several research projects I am involved in. In the projects AFLEK and ALICE, I have access to a set of short answers to different science-related tasks with detailed coding rubrics focusing on scientific knowledge and argumentative skills. The project HIKOF, on the other hand, provides a data set of essays in which students discuss learning tips from a YouTube video with respect to their grounding in educational psychology. Both data sets are coded in a way which allows for the implementation of automated analytic assessment systems.</p>
        <p>
          Another important aspect of my work is the question of how codes from response scoring systems can be transformed into concrete learner feedback. Feedback can be given on an item-specific level as well as on a more global one. It can focus on the content or the form of concrete responses, and it can also target the overall domain knowledge of a student across multiple items. For the former case, generative language models could be promising [
          <xref ref-type="bibr" rid="ref3">9, 12</xref>
          ]. For the latter case, a way of modeling learners’ domain knowledge is required. A conceptual framework which goes in this direction was provided by [13] with the expanded evidence-centred design model, which adds multiple feedback-related aspects to the well-known evidence-centred design [
          <xref ref-type="bibr" rid="ref5">14</xref>
          ]. However, to the best of my knowledge, this conceptual framework has not been operationalised into a concrete feedback-driven assessment system so far.
        </p>
        <p>
          The last aspect I want to address is explainability. Ethical frameworks in learning analytics and educational technology such as [
          <xref ref-type="bibr" rid="ref6">15</xref>
          ] often call for the application of transparent and explainable models where possible. It is likely that providing learners with simple explanations of why models made a given prediction, which, in turn, led to a particular feedback outcome, can increase their acceptance of respective systems. For natural language processing models, a wide range of methods for providing such explanations has been developed [
          <xref ref-type="bibr" rid="ref7">16</xref>
          ]. Research on making state-of-the-art methodology explainable also shows promising results, e.g. [
          <xref ref-type="bibr" rid="ref8">17</xref>
          ]. For this reason, I want to leverage this potential and explore if providing learners with explanations for their feedback can increase trust.
        </p>
        <p>To summarize, I want to address the following research questions:
1. What were the main methods, characteristics and results of past work in constructed response scoring?
2. What techniques were applied for coding constructed responses in an analytic fashion in past work?
3. What machine learning-based pipelines and approaches are effective for the automated analytic assessment of constructed responses, and to what extent can they be generalized?
4. How can the predictions of automated analytic assessment systems be transformed into useful learner feedback?
5. To what extent can explaining model outputs make learners trust the provided feedback?</p>
      </sec>
      <sec id="sec-3-2">
        <title>4. Design</title>
        <p>From a technical perspective, the intention behind my PhD work is to implement and evaluate respective methods for the analytic assessment of free-text responses for exemplary use cases, drawing from state-of-the-art NLP research. I plan to study and summarize what methods were applied to the assessment of free-text responses in past work via a literature review to address RQ1 and RQ2. For this literature review, I plan to draw from past reviews on the topic, in particular [5] for the text type of short answers and [6] for the text type of essays, but primarily with a focus on work which was not covered by them. The main goal behind the literature review is to provide a concise overview of the methods and features which can be successfully applied to the task. The review by [5] is, owing to its publication date, fairly outdated. Moreover, in my opinion, it fails to function as a lookup guide for possible techniques to use and rather focuses on summarizing papers from past work. The review by [6], on the other hand, is well structured but also fairly short, owing to it being published in conference proceedings. My literature review is primarily planned as a guide which practitioners can refer to when they plan to build their own free-text assessment systems, rather than as a pure overview of past work. It shall equip interested researchers with a clear plan for approaching their own free-text response assessment system in a structured manner.</p>
        <p>The next papers deal with the implementation of respective systems themselves to address RQ3. The most recent achievements in holistic free-text response assessment, in line with the general developments in natural language processing, were achieved using transformer language models [8, 9, 10, 11]. For this reason, my plan is to also apply transformer language models to the task of analytic assessment, although [5] and [6] document a wide range of methods from the pre-transformer era which raise interesting questions. In this context, my plan is to implement and evaluate exemplary systems for assessing both short answers and essays in an analytic fashion.</p>
        <p>In a first research paper, which is currently under review, I implemented and evaluated multiple systems aimed at assessing German middle school students’ knowledge about energy physics. In particular, the systems classify whether students mentioned certain concepts related to energy transformation, i.e., different manifestations of energy, indicators for the same, and whether energy is transformed, in a meaningful manner. For this purpose, data was first collected and coded using a coding rubric which targeted the different categories of knowledge. I then implemented and evaluated multiple text classification systems, transformer- and feature-based, trained to replicate the coding. The systems are given the response, a provided sample solution and the item prompt. Moreover, using different methods for generating model explanations, I evaluated the descriptive accuracy of the implemented models. Overall, a transformer-based model built upon GBERT achieved superior results. In subsequent research, I want to explore how well the predictions of such systems can be concretely translated into feedback.</p>
        <p>In another research paper, I want to implement systems targeting essays. In particular, I aim to use a data set of essays collected throughout the HIKOF project. These essays discuss ten different learning tips presented in a YouTube video with respect to their grounding in educational and psychological research. For each tip, ten different codes were assigned. Moreover, it was coded which sentences within an essay correspond to which tips. This results in two problems which need to be solved.</p>
      </sec>
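      <p>The two problems just named can be sketched as a minimal pipeline in which a trivial keyword matcher stands in for the trained sentence classifier of the first step; the tip names and patterns are invented for illustration:</p>
      <preformat>
```python
import re

# Step 1: assign each sentence of an essay to a learning tip (a sentence
# classification task). In practice a trained classifier would be used;
# here a keyword matcher stands in. Tip names and keywords are invented.
TIP_KEYWORDS = {
    "spaced_practice": re.compile(r"\b(spacing|sessions|cramming)\b", re.I),
    "self_testing": re.compile(r"\b(quiz|recall|rereading)\b", re.I),
}

def segment_by_tip(sentences):
    """Group the sentences of an essay into one section per detected tip."""
    sections = {}
    for sentence in sentences:
        for tip, pattern in TIP_KEYWORDS.items():
            if pattern.search(sentence):
                sections.setdefault(tip, []).append(sentence)
    return sections

essay = [
    "Spreading study sessions over weeks works better than cramming.",
    "You should also quiz yourself instead of only rereading.",
]
print(sorted(segment_by_tip(essay)))  # ['self_testing', 'spaced_practice']
```
      </preformat>
      <p>Each resulting section would then be passed to the second, tip-specific classification system.</p>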
    </sec>
    <sec id="sec-4">
      <title>First, unseen essays must be segmented into sections cor</title>
      <p>responding to the diferent tips. This can be approached
as a sentence classification task. In a second step, the
resulting sections must then be given to a second text
classification system which classifies the sections with
respect to the analytic codes corresponding to each tip.</p>
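      <p>As a stand-in for this second step, the tip-specific coding can be framed as a multi-label decision over a section. The baseline below uses the kind of keyword and pattern searches mentioned in Section 2; the codes and patterns are invented for illustration, and a trained multi-label classifier would replace the pattern table:</p>
      <preformat>
```python
import re

# Multi-label coding of one essay section: every analytic code whose
# pattern fires is assigned. Codes and patterns are invented examples.
CODE_PATTERNS = {
    "names_tip": re.compile(r"\bspaced (practice|repetition)\b", re.I),
    "cites_evidence": re.compile(r"\b(study|studies|research|evidence)\b", re.I),
    "gives_example": re.compile(r"\b(for example|for instance)\b", re.I),
}

def code_section(section_text):
    """Return the set of analytic codes assigned to one section."""
    return {code for code, pattern in CODE_PATTERNS.items()
            if pattern.search(section_text)}

section = "Research supports spaced practice, for example revising weekly."
print(sorted(code_section(section)))
# ['cites_evidence', 'gives_example', 'names_tip']
```
      </preformat>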
      <p>
        In the next step, feedback needs to be generated from the predicted codes. For this purpose, I use content-related feedback templates which are assembled dynamically depending on the predicted codes. In particular, the predicted codes are matched with ground truth codes, and discrepancies between the two lead to corresponding feedback. The generated feedback will be tested within a university lecture in an A/B setup. In a follow-up study, I plan to add aspects of explainability to this feedback. In particular, I plan to present learners with highlighted text showing what exactly in their response led to a concrete feedback, again in an A/B setup. This shall then be combined with questionnaires evaluating whether showing these explanations to learners increases acceptance. For educational recommender systems, findings from [
        <xref ref-type="bibr" rid="ref9">18</xref>
        ] suggest that showing explanations to learners can increase the acceptance of respective systems. I want to find out if this is also the case for assessment-driven feedback systems.
      </p>
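      <p>The template assembly described above can be sketched as a comparison between predicted and expected code sets, where every missing code triggers its feedback template. The code names and template texts are invented for illustration:</p>
      <preformat>
```python
# Assemble feedback by comparing the codes predicted for a response with
# the codes an ideal response would receive; every missing code triggers
# its template. Code names and template texts are invented examples.
TEMPLATES = {
    "names_tip": "Try to name the learning tip explicitly.",
    "cites_evidence": "Back your discussion up with research findings.",
    "gives_example": "A concrete example would strengthen your point.",
}

def assemble_feedback(predicted, expected):
    """Return one template per expected code missing from the prediction."""
    return [TEMPLATES[code] for code in sorted(expected - predicted)]

print(assemble_feedback(
    predicted={"names_tip"},
    expected={"names_tip", "cites_evidence", "gives_example"},
))
# ['Back your discussion up with research findings.',
#  'A concrete example would strengthen your point.']
```
      </preformat>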
      <sec id="sec-4-1">
        <title>5. Conclusion</title>
        <p>In this document, I presented my PhD project, which deals with systems for the automatic assessment of constructed responses in formative scenarios, implemented through machine learning-based natural language processing. In particular, I explore the implementation and evaluation of respective systems for multiple use cases. Moreover, I plan to write a literature review on constructed response scoring in the form of a practitioner lookup guide. Finally, I want to explore how codes predicted by automatic assessment systems can be translated into actionable feedback, and whether explaining the model predictions behind this feedback can contribute to the acceptance of these systems.</p>
        <p>[3] S. K. Mangal, S. Mangal, Assessment for Learning, PHI Learning, Delhi, India, 2019.</p>
        <p>[4] S. A. Livingston, Constructed-response test questions: Why we use them; how we score them. R&amp;D Connections, Number 11, Educational Testing Service (2009).</p>
        <p>[5] S. Burrows, I. Gurevych, B. Stein, The eras and trends of automatic short answer grading, International Journal of Artificial Intelligence in Education 25 (2014) 60–117. doi:10.1007/s40593-014-0026-8.</p>
        <p>[6] Z. Ke, V. Ng, Automated essay scoring: A survey of the state of the art, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, 2019. doi:10.24963/ijcai.2019/879.</p>
        <p>[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.</p>
        <p>[8] L. Camus, A. Filighera, Investigating transformers for automatic short answer grading, in: Lecture Notes in Computer Science, Springer International Publishing, 2020, pp. 43–48. doi:10.1007/978-3-030-52240-7_8.</p>
        <p>[9] A. Filighera, S. Parihar, T. Steuer, T. Meuser, S. Ochs, Your answer is incorrect... would you like to know why? Introducing a bilingual short answer feedback dataset, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 8577–8591. doi:10.18653/v1/2022.acl-long.587.</p>
        <p>[10] A. Poulton, S. Eliens, Explaining transformer-based models for automatic short answer grading, in: 2021 5th International Conference on Digital Technology in Education, ACM, 2021. doi:10.1145/3488466.3488479.</p>
        <p>[11] C. Sung, T. Dhamecha, S. Saha, T. Ma, V. Reddy, R. Arora, Pre-training BERT on domain resources for short answer grading, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6071–6075. doi:10.18653/v1/D19-1628.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dawson</surname>
          </string-name>
          ,
          <article-title>A contribution to the history of assessment: how a conversation simulator redeems socratic method</article-title>
          ,
          <source>Assessment &amp; Evaluation in Higher Education</source>
          <volume>39</volume>
          (
          <year>2013</year>
          )
          <fpage>195</fpage>
          -
          <lpage>204</lpage>
          . doi:10.1080/02602938.2013.798394.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wiliam</surname>
          </string-name>
          ,
          <article-title>Assessment and classroom learning</article-title>
          ,
          <source>Assessment in Education: Principles, Policy &amp; Practice</source>
          <volume>5</volume>
          (
          <year>1998</year>
          )
          <fpage>7</fpage>
          -
          <lpage>74</lpage>
          . doi:10.1080/0969595980050102.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Filighera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tschesche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Steuer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tregel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wernet</surname>
          </string-name>
          ,
          <article-title>Towards generating counterfactual examples as automatic short answer feedback</article-title>
          ,
          <source>in: Lecture Notes in Computer Science</source>
          , Springer International Publishing,
          <year>2022</year>
          , pp.
          <fpage>206</fpage>
          -
          <lpage>217</lpage>
          . doi:10.1007/978-3-031-11644-5_17.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arieli-Attali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deonovic</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. A. von Davier</surname>
          </string-name>
          ,
          <article-title>The expanded evidence-centered design (e-ECD) for learning and assessment systems: A framework for incorporating learning goals and processes within assessment design</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>10</volume>
          (
          <year>2019</year>
          ). doi:10.3389/fpsyg.2019.00853.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mislevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Almond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Lukas</surname>
          </string-name>
          ,
          <article-title>A brief introduction to evidence-centred design</article-title>
          ,
          <source>ETS Research Report Series</source>
          <year>2003</year>
          (
          <year>2003</year>
          )
          <fpage>i</fpage>
          -
          <lpage>29</lpage>
          . doi:10.1002/j.2333-8504.2003.tb01908.x.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Slade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tait</surname>
          </string-name>
          ,
          <article-title>Global guidelines: Ethics in learning analytics</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Danilevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aharonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Katsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kawas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <article-title>A survey of the state of explainable AI for natural language processing</article-title>
          ,
          <source>in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</source>
          , Association for Computational Linguistics, Suzhou, China,
          <year>2020</year>
          , pp.
          <fpage>447</fpage>
          -
          <lpage>459</lpage>
          . URL: https://aclanthology.org/2020.aacl-main.46.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>Transformer interpretability beyond attention visualization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>782</fpage>
          -
          <lpage>791</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Takami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Flanagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ogata</surname>
          </string-name>
          ,
          <article-title>Educational explainable recommender usage and its effectiveness in high school summer vacation assignment</article-title>
          ,
          <source>in: LAK22: 12th International Learning Analytics and Knowledge Conference</source>
          , LAK22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>458</fpage>
          -
          <lpage>464</lpage>
          . doi:10.1145/3506860.3506882.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>