<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Methods and perspectives for the automated analytic assessment of free-text responses in formative scenarios</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Gombert</string-name>
          <email>gombert@dipf.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Assessment, Automated Assessment, Analytic Assessment, Short Answer Grading, Essay Grading</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz Institute for Research and Information in Education</institution>
          ,
          <addr-line>Frankfurt am Main</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Assessment is the process of testing learners' skills and knowledge. Free-text response items are well suited for the assessment of learners' active knowledge and writing skills. However, the automatic assessment of respective responses is not trivial and requires the application of natural language processing. Accordingly, the automatic assessment of free-text responses is a widely researched topic in educational natural language processing. Most past work targets holistic scoring, the process of assigning overall scores or grades to responses. This is problematic in formative scenarios because learners require feedback rather than summative scores in such scenarios. Such feedback ideally targets specific aspects of responses, and, accordingly, automated systems which only predict holistic scores cannot be used as a basis for providing the same. What is instead needed are systems which implement analytic scoring approaches. Analytic scoring targets specific aspects of responses and scores them according to corresponding criteria. This requires diferent systems than addressed by the broad research on automated holistic scoring. In my PhD work which is outlined by this paper, I want to explore approaches for implementing analytic scoring systems by means of state-of-the-art natural language processing. These systems are targeted at providing a basis for feedback generation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Assessment is the process of measuring and documenting learners’ skills and knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is conducted through tests composed of various kinds of test items. Assessing learners’ knowledge and skills is also the basis for providing them with appropriate content-related feedback in formative scenarios [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the context of technology-based assessment, multiple-choice items have grown to be a popular choice for implementing tests [3, 4]. This is mostly because evaluating multiple-choice items is rather trivial: test creators simply define a set of response options of which one or more are marked as correct. When test-takers select options during testing, the computer only needs to determine which of them were among the correct ones. Moreover, multiple-choice items take only a short time to answer, which makes it possible to include many different items within a test and to test for a broad range of knowledge [4].
      </p>
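      <p>The evaluation logic just described, checking the selected options against an answer key, can be sketched in a few lines of Python. This is a minimal illustration; the item identifiers and option labels are hypothetical:</p>
      <preformat>
```python
# Minimal sketch of multiple-choice scoring: an item is counted as correct
# when the set of selected options exactly matches the set of options the
# answer key marks as correct. Item identifiers and options are examples.

def score_item(selected, correct):
    """Return 1 if the selected options match the answer key, else 0."""
    return 1 if set(selected) == set(correct) else 0

def score_test(responses, answer_key):
    """Sum the item scores of one test-taker over all items."""
    return sum(score_item(responses.get(item, []), key)
               for item, key in answer_key.items())

answer_key = {"item_1": ["b"], "item_2": ["a", "c"]}
responses = {"item_1": ["b"], "item_2": ["a"]}
print(score_test(responses, answer_key))  # 1 (item_2 misses option "c")
```
      </preformat>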
    </sec>
    <sec id="sec-2">
      <title>However, not every skill and every kind of knowledge can be assessed through multiple-choice items. “A multiple‑choice test for history students can test their factual</title>
      <p>https://edutec.science/team/sebastian-gombert/ (S. Gombert)
0000-0001-5598-9547 (S. Gombert)</p>
      <sec id="sec-2-1">
        <title>2. Constructed Responses and their Automatic Assessment</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>To test skills such as the ones described by [4], con</title>
      <p>structed response items are needed instead multiple
choice items. In their most common form, they require
students to enter a free text as response into a text field.</p>
      <p>However, this drastically increases the complexity of assessing learners’ responses in an automated fashion, as the computer-based analysis of human language is far from trivial. With natural language processing, respectively computational linguistics, a whole interdisciplinary field of research building upon various methods from linguistics, logic, psychology, cognitive science, software engineering and philosophy is dedicated to this issue, and the automatic processing of many aspects of language remains open research. What makes the automatic analysis of free text difficult are the properties of language itself. Humans can generate an unlimited set of different linguistic utterances, and often, there are many ways to express the same or similar semantics, i.e., through different synonyms, the usage of passive vs. active constructions, or ways of paraphrasing. In past research, many different methods were applied to the automatic assessment of free-text responses. These range from simpler keyword, pattern and regular expression searches, and methods building upon distributional vector space semantics, to fully-fledged machine learning systems [5, 6].</p>
      <p>Most recently, transformer language models such as BERT [7] were successfully applied to the problem of free-text assessment [8, 9, 10, 11]. The application of transformers to the assessment of constructed responses promises major advancements in the field, but nonetheless, most of the systems available are built to predict only holistic scores [5, 6], ergo scores aimed at denoting the overall quality of a response [4]. Most of the established datasets, especially the ones focused on short answers, also cater towards this approach [5, 9, 6]. While holistic scores reflect how well learners were able to solve a given task overall, they do not necessarily denote which aspects of their response were of good quality and in which regards they could improve. However, especially in formative scenarios, providing students with feedback is crucial, which puts the application of holistic scoring systems in such scenarios into question.</p>
      <p>There is a second scoring approach in constructed response assessment which can be seen as a better basis for providing detailed, personalized feedback: analytic scoring. In analytic scoring, rather than judging responses as a whole, they are assessed for multiple different aspects which need to be specifically defined in a coding rubric [4]. For instance, “[o]n a science question, the scorer may award two points for providing a correct explanation of a phenomenon, one point for correctly stating the general principle that it illustrates, and one point for providing another valid example of that principle in action” [4]. Drawing such distinctions and coding responses for multiple different aspects allows for more detailed and concise feedback, as the feedback can specifically address these aspects.</p>
      <sec id="sec-3-1">
        <title>3. Research Questions</title>
        <p>The two most common types of free-text responses are short answers and essays. While short answers are used to test students’ ability to explain phenomena or demonstrate their active knowledge, essays are used for analysing the writing skills of students, e.g., their skill to clearly and coherently discuss or communicate a given issue or argue against or in favour of an opinion. Accordingly, approaches for the analytic assessment of both text forms must inevitably differ. For short answers, it presumably should be sufficient to simply assess whole responses for the different aspects, as short answers are rather condensed texts. From a formal point of view, this can be interpreted as a (multi-label) text classification task [5]. For essays, on the other hand, the respective coding can require more varied approaches. Are the coded aspects related to content or writing style? Does a content-specific code apply to the whole text or to specific sections? These questions need to be addressed in order to come to appropriate operationalisations. E.g., if it is likely that each code corresponds to a specific part of an essay, one needs to first semantically segment it into the respective parts. One could then, in a second step, separately classify these parts for the actual codes. If, on the other hand, a code corresponds to a whole essay, such separation is not needed.</p>
        <p>I plan my PhD to be paper-based, with the single papers connected by the overarching topic of analytic constructed response coding. First and foremost, I want to explore what has already been done in past work and how my own work can benefit from these insights. The acquired knowledge is then to be used for the practical implementation of constructed response scoring systems in a range of case studies. For these case studies, I plan to leverage data sets from several research projects I am involved in. In the projects AFLEK and ALICE, I have access to a set of short answers to different science-related tasks with detailed coding rubrics focusing on scientific knowledge and argumentative skills. The project HIKOF, on the other hand, provides a data set of essays in which students discuss learning tips from a YouTube video with respect to their grounding in educational psychology. Both data sets are coded in a way which allows for the implementation of automated analytic assessment systems.</p>
        <p>
          Another important aspect of my work is the question of how codes from response scoring systems can be transformed into concrete learner feedback. Feedback can be given on an item-specific level as well as on a more global one. It can focus on the content or the form of concrete responses, and it can also target the overall domain knowledge of a student across multiple items. For the former case, generative language models could be promising [
          <xref ref-type="bibr" rid="ref3">9, 12</xref>
          ]. For the latter case, a way of modeling learners’ domain knowledge is required. A conceptual framework which goes in this direction was provided by [13] with the expanded evidence-centred design model, which adds multiple feedback-related aspects to the well-known evidence-centred design [
          <xref ref-type="bibr" rid="ref5">14</xref>
          ]. However, to the best of my knowledge, this conceptual framework has not been operationalised into a concrete feedback-driven assessment system so far.
        </p>
        <p>
          The last aspect I want to address is explainability. Ethical frameworks in learning analytics and educational technology such as [
          <xref ref-type="bibr" rid="ref6">15</xref>
          ] often call for the application of transparent and explainable models where possible. It is likely that providing learners with simple explanations of why models made a given prediction, which, in turn, led to a particular feedback outcome, can increase their acceptance of respective systems. For natural language processing models, a wide range of methods for providing such explanations has been developed [
          <xref ref-type="bibr" rid="ref7">16</xref>
          ]. Research on making state-of-the-art methodology explainable also shows promising results, e.g. [
          <xref ref-type="bibr" rid="ref8">17</xref>
          ]. For this reason, I want to leverage this potential and explore if providing learners with explanations for their feedback can increase trust.
        </p>
        <p>To summarize, I want to address the following research questions:
1. What were the main methods, characteristics and results of past work in constructed response scoring?
2. What techniques were applied for coding constructed responses in an analytic fashion in past work?
3. What machine learning-based pipelines and approaches are effective for the automated analytic assessment of constructed responses, and to what extent can they be generalized?
4. How can the predictions of automated analytic assessment systems be transformed into useful learner feedback?
5. To what extent can explaining model outputs make learners trust the provided feedback?</p>
      </sec>
      <sec id="sec-3-2">
        <title>4. Design</title>
        <p>From a technical perspective, the intention behind my PhD work is to implement and evaluate respective methods for the analytic assessment of free-text responses for exemplary use cases, drawing from state-of-the-art NLP research. I plan to study and summarize what methods were applied to the assessment of free-text responses in past work via a literature review to address RQ1 and RQ2. For this literature review, I plan to draw from past reviews on the topic, in particular [5] for the text type of short answers and [6] for the text type of essays, but primarily with a focus on work which was not covered by them. The main goal behind the literature review is to provide a concise overview of the methods and features which can be successfully applied to the task. The review by [5] is, owing to its publication date, fairly outdated. Moreover, in my opinion, it fails to function as a lookup guide for possible techniques to use and rather focuses on summarizing papers from past work. The review by [6], on the other hand, is well structured but also fairly short, owing to it being published in conference proceedings. My literature review is primarily planned as a guide which practitioners can refer to when they plan to build their own free-text assessment systems, rather than as a pure overview of past work. It shall equip interested researchers with a clear plan for approaching their own free-text response assessment system in a structured manner.</p>
        <p>The next papers deal with the implementation of respective systems themselves to address RQ3. The most recent achievements in holistic free-text response assessment, in line with the general developments in natural language processing, were achieved using transformer language models [8, 9, 10, 11]. For this reason, my plan is to also apply transformer language models to the task of analytic assessment, although [5] and [6] document a wide range of methods from the pre-transformer era which raise interesting questions. In this context, my plan is to implement and evaluate exemplary systems for assessing both short answers and essays in an analytic fashion.</p>
        <p>In a first research paper, which is currently under review, I implemented and evaluated multiple systems aimed at assessing German middle school students’ knowledge about energy physics. In particular, the systems classify whether students mentioned certain concepts related to energy transformation, i.e., different manifestations of energy, indicators for the same, and whether energy is transformed, in a meaningful manner. For this purpose, data was first collected and coded using a coding rubric which targeted the different categories of knowledge. I then implemented and evaluated multiple text classification systems, transformer- and feature-based, trained to replicate the coding. The systems are given the response, a provided sample solution and the item prompt. Moreover, using different methods for generating model explanations, I evaluated the descriptive accuracy of the implemented models. Overall, a transformer-based model built upon GBERT achieved superior results. In subsequent research, I want to explore how well the predictions of such systems can be concretely translated into feedback.</p>
        <p>In another research paper, I want to implement systems targeting essays. In particular, I aim to use a data set of essays collected throughout the HIKOF project. These essays discuss ten different learning tips presented in a YouTube video with respect to their grounding in educational and psychological research. For each tip, ten different codes were assigned. Moreover, it was coded which sentences within an essay correspond to which tips. This results in two problems which need to be solved.</p>
      </sec>
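      <p>The two problems just named can be sketched as a minimal pipeline in which a trivial keyword matcher stands in for the trained sentence classifier of the first step; the tip names and patterns are invented for illustration:</p>
      <preformat>
```python
import re

# Step 1: assign each sentence of an essay to a learning tip (a sentence
# classification task). In practice a trained classifier would be used;
# here a keyword matcher stands in. Tip names and keywords are invented.
TIP_KEYWORDS = {
    "spaced_practice": re.compile(r"\b(spacing|sessions|cramming)\b", re.I),
    "self_testing": re.compile(r"\b(quiz|recall|rereading)\b", re.I),
}

def segment_by_tip(sentences):
    """Group the sentences of an essay into one section per detected tip."""
    sections = {}
    for sentence in sentences:
        for tip, pattern in TIP_KEYWORDS.items():
            if pattern.search(sentence):
                sections.setdefault(tip, []).append(sentence)
    return sections

essay = [
    "Spreading study sessions over weeks works better than cramming.",
    "You should also quiz yourself instead of only rereading.",
]
print(sorted(segment_by_tip(essay)))  # ['self_testing', 'spaced_practice']
```
      </preformat>
      <p>Each resulting section would then be passed to the second, tip-specific classification system.</p>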
    </sec>
    <sec id="sec-4">
      <title>First, unseen essays must be segmented into sections cor</title>
      <p>responding to the diferent tips. This can be approached
as a sentence classification task. In a second step, the
resulting sections must then be given to a second text
classification system which classifies the sections with
respect to the analytic codes corresponding to each tip.</p>
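      <p>As a stand-in for this second step, the tip-specific coding can be framed as a multi-label decision over a section. The baseline below uses the kind of keyword and pattern searches mentioned in Section 2; the codes and patterns are invented for illustration, and a trained multi-label classifier would replace the pattern table:</p>
      <preformat>
```python
import re

# Multi-label coding of one essay section: every analytic code whose
# pattern fires is assigned. Codes and patterns are invented examples.
CODE_PATTERNS = {
    "names_tip": re.compile(r"\bspaced (practice|repetition)\b", re.I),
    "cites_evidence": re.compile(r"\b(study|studies|research|evidence)\b", re.I),
    "gives_example": re.compile(r"\b(for example|for instance)\b", re.I),
}

def code_section(section_text):
    """Return the set of analytic codes assigned to one section."""
    return {code for code, pattern in CODE_PATTERNS.items()
            if pattern.search(section_text)}

section = "Research supports spaced practice, for example revising weekly."
print(sorted(code_section(section)))
# ['cites_evidence', 'gives_example', 'names_tip']
```
      </preformat>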
      <p>
        In the next step, feedback needs to be generated from the predicted codes. For this purpose, I use content-related feedback templates which are assembled dynamically depending on the predicted codes. In particular, the predicted codes are matched with ground truth codes, and discrepancies between the two lead to corresponding feedback. The generated feedback will be tested within a university lecture in an A/B setup. In a follow-up study, I plan to add aspects of explainability to this feedback. In particular, I plan to present learners with highlighted text showing what exactly in their response led to a concrete feedback, again in an A/B setup. This shall then be combined with questionnaires evaluating whether showing these explanations to learners increases acceptance. For educational recommender systems, findings from [
        <xref ref-type="bibr" rid="ref9">18</xref>
        ] suggest that showing explanations to learners can increase the acceptance of respective systems. I want to find out if this is also the case for assessment-driven feedback systems.
      </p>
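      <p>The template assembly described above can be sketched as a comparison between predicted and expected code sets, where every missing code triggers its feedback template. The code names and template texts are invented for illustration:</p>
      <preformat>
```python
# Assemble feedback by comparing the codes predicted for a response with
# the codes an ideal response would receive; every missing code triggers
# its template. Code names and template texts are invented examples.
TEMPLATES = {
    "names_tip": "Try to name the learning tip explicitly.",
    "cites_evidence": "Back your discussion up with research findings.",
    "gives_example": "A concrete example would strengthen your point.",
}

def assemble_feedback(predicted, expected):
    """Return one template per expected code missing from the prediction."""
    return [TEMPLATES[code] for code in sorted(expected - predicted)]

print(assemble_feedback(
    predicted={"names_tip"},
    expected={"names_tip", "cites_evidence", "gives_example"},
))
# ['Back your discussion up with research findings.',
#  'A concrete example would strengthen your point.']
```
      </preformat>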
      <sec id="sec-4-1">
        <title>5. Conclusion</title>
        <p>In this document, I presented my PhD project, which deals with systems for the automatic assessment of constructed responses in formative scenarios, implemented through machine learning-based natural language processing. In particular, I explore the implementation and evaluation of respective systems for multiple use cases. Moreover, I plan to write a literature review on constructed response scoring in the form of a practitioner lookup guide. Finally, I want to explore how codes predicted by automatic assessment systems can be translated into actionable feedback, and whether explaining the model predictions behind this feedback can contribute to the acceptance of these systems.</p>
        <p>[3] S. K. Mangal, S. Mangal, Assessment for Learning, PHI Learning, Delhi, India, 2019.</p>
        <p>[4] S. A. Livingston, Constructed-response test questions: Why we use them; how we score them. R&amp;D Connections, Number 11, Educational Testing Service (2009).</p>
        <p>[5] S. Burrows, I. Gurevych, B. Stein, The eras and trends of automatic short answer grading, International Journal of Artificial Intelligence in Education 25 (2014) 60–117. doi:10.1007/s40593-014-0026-8.</p>
        <p>[6] Z. Ke, V. Ng, Automated essay scoring: A survey of the state of the art, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, 2019. doi:10.24963/ijcai.2019/879.</p>
        <p>[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.</p>
        <p>[8] L. Camus, A. Filighera, Investigating transformers for automatic short answer grading, in: Lecture Notes in Computer Science, Springer International Publishing, 2020, pp. 43–48. doi:10.1007/978-3-030-52240-7_8.</p>
        <p>[9] A. Filighera, S. Parihar, T. Steuer, T. Meuser, S. Ochs, Your answer is incorrect... would you like to know why? Introducing a bilingual short answer feedback dataset, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 8577–8591. doi:10.18653/v1/2022.acl-long.587.</p>
        <p>[10] A. Poulton, S. Eliens, Explaining transformer-based models for automatic short answer grading, in: 2021 5th International Conference on Digital Technology in Education, ACM, 2021. doi:10.1145/3488466.3488479.</p>
        <p>[11] C. Sung, T. Dhamecha, S. Saha, T. Ma, V. Reddy, R. Arora, Pre-training BERT on domain resources for short answer grading, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6071–6075. doi:10.18653/v1/D19-1628.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dawson</surname>
          </string-name>
          ,
          <article-title>A contribution to the history of assessment: how a conversation simulator redeems socratic method</article-title>
          ,
          <source>Assessment &amp; Evaluation in Higher Education</source>
          <volume>39</volume>
          (
          <year>2013</year>
          )
          <fpage>195</fpage>
          -
          <lpage>204</lpage>
          . doi:10.1080/02602938.2013.798394.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wiliam</surname>
          </string-name>
          ,
          <article-title>Assessment and classroom learning</article-title>
          ,
          <source>Assessment in Education: Principles, Policy &amp; Practice</source>
          <volume>5</volume>
          (
          <year>1998</year>
          )
          <fpage>7</fpage>
          -
          <lpage>74</lpage>
          . doi:10.1080/0969595980050102.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Filighera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tschesche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Steuer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tregel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wernet</surname>
          </string-name>
          ,
          <article-title>Towards generating counterfactual examples as automatic short answer feedback</article-title>
          ,
          <source>in: Lecture Notes in Computer Science</source>
          , Springer International Publishing,
          <year>2022</year>
          , pp.
          <fpage>206</fpage>
          -
          <lpage>217</lpage>
          . doi:10.1007/978-3-031-11644-5_17.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arieli-Attali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Deonovic</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. A. von Davier</surname>
          </string-name>
          ,
          <article-title>The expanded evidence-centered design (e-ECD) for learning and assessment systems: A framework for incorporating learning goals and processes within assessment design</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>10</volume>
          (
          <year>2019</year>
          ). doi:10.3389/fpsyg.2019.00853.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mislevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Almond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Lukas</surname>
          </string-name>
          ,
          <article-title>A brief introduction to evidence-centred design</article-title>
          ,
          <source>ETS Research Report Series</source>
          <year>2003</year>
          (
          <year>2003</year>
          )
          <fpage>i</fpage>
          -
          <lpage>29</lpage>
          . doi:10.1002/j.2333-8504.2003.tb01908.x.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Slade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tait</surname>
          </string-name>
          ,
          <article-title>Global guidelines: Ethics in learning analytics</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Danilevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aharonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Katsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kawas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <article-title>A survey of the state of explainable AI for natural language processing</article-title>
          ,
          <source>in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</source>
          , Association for Computational Linguistics, Suzhou, China,
          <year>2020</year>
          , pp.
          <fpage>447</fpage>
          -
          <lpage>459</lpage>
          . URL: https://aclanthology.org/2020.aacl-main.46.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>Transformer interpretability beyond attention visualization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>782</fpage>
          -
          <lpage>791</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Takami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Flanagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ogata</surname>
          </string-name>
          ,
          <article-title>Educational explainable recommender usage and its effectiveness in high school summer vacation assignment</article-title>
          ,
          <source>in: LAK22: 12th International Learning Analytics and Knowledge Conference</source>
          , LAK22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>458</fpage>
          -
          <lpage>464</lpage>
          . doi:10.1145/3506860.3506882.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>