Evaluating Correctness of Student Code Explanations:
                         Challenges and Solutions⋆
                         Arun-Balajiee Lekshmi-Narayanan1,∗ , Peter Brusilvosky
                         Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA


                                           Abstract
                                           Educational data used by data mining approaches in the domain of Computer Science education primarily focused on working with
                                           code-based data, such as student homework submissions. However, the increased use of natural language techniques and Large
                                           Language models (LLM) in all domains of learning including Computer Science education is now producing an abundance of natural
                                           language data, such as code explanations generated by students and LLMs as well as feedback and hints produced by instructors, TAs,
                                           and LLMs. These data represent new challenges for CSEDM research and need new creative approaches to leverage. In this paper, we
                                           present a first attempt to analyze one type of these new data, student explanations of worked code examples. The main challenge in
                                           working with these data is to evaluate the correctness of self-explanations. Using a dataset of student explanations collected in our
                                           previous work, we demonstrate the difficulty of this problem and discuss a possible way to solve it.

                                           Keywords
                                           code explanations, worked examples, automated assessment


                         1. Introduction                                                                                              approaches.
                                                                                                                                        More specifically, the remaining part of the paper focuses
                         The majority of the work in computer science educational                                                     on two groups of approaches:
                         data mining (CSEDM) relied so far on datasets that col-
                         lected traces of learner work with various learning content                                                      1. Use surface-level features: this group of approaches
                         or datasets with student submissions to programming as-                                                             use “surface–level” lexical and readability features
                         signments 1 Using these datasets, researchers were able                                                             that could be easily extracted from the text of student
                         to explore a range of novel approaches including finding                                                            or expert explanations. This is discussed further in
                         knowledge components [1], debugging [2, 3], and detecting                                                           Section 3.1.
                         cheating [4]. However, as newer types of dataset become                                                          2. Use expert explanations: this group of approaches
                         openly available for analysis, new methods need to be de-                                                           attempts to calculate various similarity metrics be-
                         veloped to leverage this data 2 .                                                                                   tween student explanations and expert explanation
                            With recent research on student self-explanation of code                                                         and use the obtained similarity to distinguish correct
                         fragments [5] as well as the use of LLMs and students to                                                            and incorect explanations. This is described further
                         generate code explanations automatically [6], an increasing                                                         in Section 3.2.
                         number of datasets contain free–form code explanations. In
                         this work, we consider one such dataset with code explana-
                         tions generated by students and instructors [7]. This dataset                                                2. Related Work
                         was annotated to mark the correctness of each student’s
                                                                                                                                      Corpora for free-form student answers such as reflective es-
                         explanation and to assess the similarity between students
                                                                                                                                      says [8] and argumentative writing [9], provide interesting
                         and instructors’ explanations for the same code lines. The
                                                                                                                                      examples of use cases that are different from traditional log
                         goal we want to achieve by working with this dataset is to
                                                                                                                                      data. The ability to analyze this data is important to pro-
                         distinguish correct and incorect explanations. This goal has
                                                                                                                                      vide feedback when assessing students’ free–form responses.
                         practical value. An approach that could reliably identify
                                                                                                                                      Tools, such as COH–METRIX [10] and EDU-Convokit [11],
                         incorrect explanations could be used to build an intelligent
                                                                                                                                      offer several options to analyze textual educational data;
                         tutor to support the self-explanation process [5].
                                                                                                                                      however, our dataset of free–form code explanations needs
                            Starting with a review of relevant work, the paper dis-
                                                                                                                                      slightly different methods to evaluate correctness.
                         cusses several approaches to distinguish correct and incorect
                                                                                                                                         Some examples in the natural language processing do-
                         explanations. Since our dataset contains “ground truth”, i.e.,
                                                                                                                                      main offer encouraging examples of using surface-level fea-
                         human expert annotation of each explanations as as cor-
                                                                                                                                      tures to models for various tasks. Schwartz and colleagues
                         rect or incorrect (including inter–rater reliability) we are
                                                                                                                                      explore surface features such as the word or character n–
                         able to use the dataset to illustrate the feasibility of these
                                                                                                                                      gram and length of sentences to build a classifier to identify
                                                                                                                                      author writing styles in a CLOZE story task [12]. Some
                         CSEDM’24: 8th Educational Data Mining in Computer Science Education                                          examples in the case of language inference tasks consider
                         (CSEDM) Workshop, June 14, 2024, Atlanta, GA
                         ⋆
                            You can use this document as the template for preparing your publica-
                                                                                                                                      word-level similarity–based approaches [13]. When con-
                             tion. We recommend using the latest version of the ceurart style.                                        structing adversarial examples for natural language infer-
                         ∗
                              Corresponding author.                                                                                   ence tasks, another work considers surface–level cues such
                         Envelope-Open arl122@pitt.edu (A. Lekshmi-Narayanan); peterb@pitt.edu                                        as words that contradict or are negative (”not”). Negative
                         (P. Brusilvosky)                                                                                             sampling is another approach that uses surface–level fea-
                         GLOBE https://a2un.github.io (A. Lekshmi-Narayanan);
                         https://sites.pitt.edu/~peterb/ (P. Brusilvosky)
                                                                                                                                      tures to construct synthetic examples that can help build
                         Orcid 0000-0002-7735-5008 (A. Lekshmi-Narayanan); 0000-0002-1902-1464                                        robust classifiers [14].
                         (P. Brusilvosky)                                                                                                Li and colleagues [15] discuss the student perceptions
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                       Attribution 4.0 International (CC BY 4.0).                                                     on the potential errors autograder may provide as feedback
                         1
                             https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=3458                                              to their submissions. This emphasizes the need to develop
                         2
                             https://the-learning-agency.com/learning-engineering-hub/build/


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
better automated assessment techniques, for newer kinds           for a given line of code.
of data such as code explanations discussed in our work.
The earlier work of the same team [16] discuss the use of an      3.1. Surface Features
autograding system that evaluates student explanations to
code in plain English. They show the potential limitations         We try to assess the correctness of student explanations
of using finetuned AI models for autograding accuracy by          using the following easily extracted features.
comparing with TAs at different levels of grading expertise
and do not find statistically significant results – either ow-            1. Explanation Length is calculated as the number of
ing to sample size or the AI model actually not performing                   words to check that longer student explanations are
better than TAs at the task. This necessatitates the possibil-               correct. This is a useful metric for tasks such as
ities to explore better finetuned AI models that have higher                 persuasive essay evaluation [9] and we expect this
accuracy, with lower rates of false positives and negatives.                 could work for assessing code explanations also.
In this work, we consider a context quite similar to their                2. Lexical Density is calculated by the ratio of the num-
work, however, we use student self-explanations produced                     ber of nouns, adjectives, verbs, and adverbs (tagged
as a part of the learning process rather then explanations                   in the sentence using a Spacy POS Tagger 3 ) over the
produced for grading. We also start from scratch by identi-                  overall number of words in the sentence (Ure LD for-
fying features in student explanations to classify them as                   mula 4 ). We expect that correct student explanations
correct or incorrect. Like this work, we are observing that                  are lexically denser.
the use of surface–level linguistic features may not help in              3. Gunning Fog Readability is the metric to evaluate
differentiating student explanations by correctness. Possible                the grade level to understand a text. We hypothesise
extensions would involve the use of contextual emebeddings                   that the correct student explanations might have
and LLMs which we also currently explore as a follow-up                      higher scores (require more technical knowledge to
to this current work–in–progress.                                            understand) than incorrect explanations.
   Denny and colleagues [17] explore student code explana-
tions in plain English in a different context. In this work,      3.2. Expert–Student Similarity Features
code explanations are used to encourage students to think
deeply about the problem so that using their code expla-           We consider the pairwise similarity between expert and
nations LLM can generate a code equivalent to the code            student explanations to assess correctness of student expla-
the student is trying to explain. Additionally, they evaluate     nations. Following our previous work [20], METEOR [21],
the student explanations progress up the classifications of       BERTScore [22] and chrF [23] are considered to evaluate
SOLO taxonomy. They also conduct a user study to evalu-           the pairwise similarity between student and expert expla-
ate students’ perceptions on traditional approaches such as       nations for a given line of code. We expect correct student
code–writing in comparison to approahces like code expla-         explanations to be more similar to expert explanations than
nations.                                                          incorrect explanations. We choose this combination of met-
   Haller and colleagues [18] survey automated assessments        rics because METEOR and chrF scores measure character
tools that are used to evaluate short answer essays. They         and token level similarities, while BERTScore estimates se-
discuss hand–engineered appraoches in combination SVM             mantic similarity by using cosine similarities between the
or KNN based classifiers. In our case, the goal is not to build   contextual word embeddings of the two explanations. The
best classifiers for the task, but to evaluate if the features    similarity scores are between 0 and 1.
themselves reveal differences between correct and incorrect
student explanations.                                             4. Dataset
   Lin and colleagues [19] explore the use of a finet–tuned
GPT model that can be used to provide personalized, adap-         We use a dataset of line–by–line explanations provided by
tive feedback to students. They use new metric that is an         students in an study in which they were asked to explain
extension to the precision / recall –based Intersection-Over-     worked examples [24, 7]. The study included four Java
Union metric to evluate the LLM–based feedback and com-           worked code examples: some basic examples focused on ar-
pare with human feedback in a user study. For current             ray search and print statements and more difficult examples
work–in–progress this idea is the next target to achieve in       focused on object-oriented principles. Among about all ex-
the context to code explanations of worked examples in            pert explanations in the dataset, we considered upto 2 expert
programming.                                                      explanations for every line of code. In the original dataset,
   Leinonen and colleagues [6] compare ChatGPT-generated          the majority of the student explanations were provided in
explanations with student explanations. In our work, we           a single sentence; however, a fraction of explanations in-
are interested in classifying student explanations as correct     cluded two or more sentences. For the purpose of this study,
or incorrect. In an ongoing extension to this work, we also       we excluded these multi-sentence explanations retaining
focus on using ChatGPT–based interventions to solve this          between 23 and 26 single-sentence student explanations per
challenging problem.                                              line of code. The key datset parameters are shown in Table 1
                                                                  and sample explanations are provided in Figure 1 (metadata
                                                                  columns are omitted). There is a known imbalance in the
3. Method                                                         dataset between correct and incorrect examples (1234 in-
Inspired by previous work [20], we extract surface-level          stances of single sentence student explanations annotated
features from student and expert explanations alone and           as correct and 70 instances of single sentence student expla-
generate pairwise similarity scores between student and           nations annotated as incorrect). We calculated the average
expert explanations for the same code lines. This data is ap-
                                                                  3
plied to evaluate the correctness of the student’s explanation        https://spacy.io/usage/linguistic-features
                                                                  4
                                                                      https://en.wikipedia.org/wiki/Lexical_density
                         Dataset Property                                                         Value
                         # Single Sentence Student–Expert (All Experts) Pairs                     1854
                         # All Sentence Student–Expert (All Experts) Pairs                        3019
                         # All Sentence Student–Expert (All Experts) Pairs Annotation Agreement   88.24%
                         # Worked code examples                                                   4
                         # Lines per example                                                      ≈8
                         # Single Sentence Student–Expert (Expert 1 & 2) Pairs                    1304
                         # Student–Expert (1 & 2) Explanation Pairs with Student Correct          1234
                         # Student–Expert (1 & 2) Explanation Pairs with Student Incorrect        70
     Table 1
     A summary of the properties our dataset.


percentage agreement for the annotation of correctness all          5.1.3. Lexical Density
sentence all experts student pairs of explanations (see Ta-
                                                                    We observe that the lexical density also may not differen-
ble 1). More details on the dataset is available in our past
                                                                    tiate good and bad student explanations. (see Figure 2c).
work [7].
                                                                    The lexical density measures a more linguistic aspect of the
                                                                    explanations by the parts of speech, which may not neces-
   Program: PointTester.java Line number: 14 Line code:
   x += dx;
                                                                    sarily evaluate the conceptual aspects of the explanations.
   Expert1: To shift the x-coordinate of the point, we need         This is because the concepts may not be associated with a
   to add dx to the value of the x-coordinate of the point.         particular kind of speech and are more connected with the
   Student1: move the x coord the amount that the argument          ontology of concepts in computing.
   specified
   Student2: Adds the first inputted value to X.                    5.1.4. Vocabulary
   Student3: increases the value of x by the amount of the
   first parameter in the function.                                 Correctness also does not seem to depend on the vocabulary
   ...                                                              of the student explanations (see Figure 2d). The vocabulary
   Student23: The value of dx is added to variable x.               in the sentence is more of a linguistic measure. Hence, this
                                                                    may not necessarily capture the conceptual ontology in
Figure 1: A slice of the dataset showing a subset of expert and     computing such as those discussed in an earlier work of
student explanations for the same line of code.                     JAVA Parser [25].

                                                                    5.2. Expert–Student Explanation Similarity
5. Results                                                          5.2.1. ChrF score
                                                                    We observe that the differences considering the class imbal-
 In this section, we group the results by the surface–level         ance between the correct and incorrect explanations could
metrics used to evaluate the correctness of student explana-        create an issue with the threshold to differentiate the expla-
tions and similarity–based metrics where pairs of student           nations using this score (see Figure 3a). Further inspection
and expert explanations were used. While extensive statis-          of similarity scores at a line–by–line level shows that irre-
tical results could be performed (such as t-tests to compare        spective of the expert explanation that is used to calculate
means), we preferred to perform exploratory analysis before         the similarity, the correct and incorrect student explana-
digging deeper with our analysis.                                   tions cannot be separated easily by their ChrF score (see
                                                                    Figure 4a).
5.1. Lexical Based Surface Metrics
5.1.1. Explanation Length                                           5.2.2. METEOR Metric

Using the length of the explanation, we observe if the expert       There is a more noticeable difference in the METEOR sim-
and student explanations can be distinguished, we observe           ilarity scores between the correct and incorrect student
that the explanations marked correct have lengths of dif-           explanations. This could be due to n-gram level word align-
ferent words. Some lengths for correct explanations are             ment. The density plots of the METEOR similarity scores
the same as those for incorrect explanations (see Figure 2a).       distribution show that most incorrect explanations have a
This may be because the student explanations are generally          METEOR score below 0.3, as shown in Figure 3b. However,
similar in length, regardless of whether they are annotated         more than 50% of the correct explanations also have a ME-
as correct or incorrect.                                            TEOR similarity score below 0.3. Thus, irrespective of the
                                                                    expert explanation that is used to calculate the similarity,
                                                                    the correct and incorrect student explanations cannot be
5.1.2. Readability Metrics
                                                                    easily separated using the METEOR score (see Figure 4b).
We observe that it is impossible to differentiate correct from
incorrect student explanations using lexical surface met-           5.2.3. BERTScore
rics(see Figure 2b). This may be because the student ex-
planations are not technically different in their explanation       While we expected better performance of BERTScore in
levels but the concepts in computing that are used to explain       separation of correct and incorrect explanations, the den-
the line of code, which we observed when annotating the             sity plots of the BERTScore distribution for correct and
dataset.                                                            incorrect explanations show very little differences. We may
            —
                     (a) Student and Expert Explanation Length          (b) Student and Expert Explanation Gunning-Fog


                (c) Student and Expert Explanation Lexical Diversity    (d) Student and Expert Explanation Vocabulary
          Figure 2: Scatter plots of various text linguistic (“surface”) metrics. The x and y represent the student and expert sentence
          values, respectively. The colors, shapes, and sizes represent the annotation of the students’ explanations for their correctness
          or incorrectness. Correct and incorrect student explanations are not differentiable by the lexical surface metrics.


       Similarity Metric        Incorrect       Correct                 the student explanation per line of code per solution with
                               (Mean,SD)      (Mean,SD)
                                                                        the two expert explanations. This, we can calculate this
             chrF              0.305, 0.114   0.361, 0.140
                                                                        for all the lines of all the programs and observe that the
            METEOR             0.140, 0.091   0.283, 0.170
           BERTScore           0.874, 0.028   0.894, 0.024              while the range of values of the three similarity scoring
                                                                        metrics are different, they are highly correlated (𝑝 < 1𝑒 − 6,
Table 2                                                                 0.5 ≤ 𝑐𝑜𝑟𝑟 ≤ 0.6), as shown in Figures 5a and 5b.
The mean similarity scores between expert and student explana-
tions are not different for the student explanations annotated as
correct from incorrect. The most difference is observable with          6. Conclusion
the METEOR metric, also observed with the plots
                                                                        In this work, we present the challenges of analyzing new
                                                                        kinds of datasets such as the code explanations dataset in
observe differences if we pre-trained a RoBERTa model over              this paper. We observe that we need more sophisticated met-
instances from the dataset. (See Figure 3c). As before, the             rics to evaluate student explanations as “good” or “bad” and
inspection of the similarity scores at a line–by-line level             surface-level metrics are mostly ineffective in evaluating stu-
shows that regardless of the expert explanation that is used            dent explanation correctness. We present similarity-based
to calculate similarity, the correct and incorrect student              metrics also not performing well in separating the “good”
explanations cannot be separated easily (see Figure 4c).                from the “bad” student explanations.
                                                                           Our work has several limitations. We did not consider
5.2.4. Similarity Correlations                                          the use of combinations of lexical and similarity-based fea-
                                                                        tures to classify student explanation correctness. The goal
When drawing similarity scores between the student and                  of this paper is not to present the best possible classifier,
expert explanations, we can take the average similarity of              rather to show the difficulty in identifying useful features
    —
              (a) Character F Similarity                  (b) Meteor Similarity                   (c) BERTScore Similarity
          Figure 3: In the Figures 3a, 3b, 3c, we observe that student and expert explanations are similar irrespective of whether the
          student explanations are annotated as correct or incorrect.


    —
              (a) Character F Similarity                  (b) Meteor Similarity                   (c) BERTScore Similarity
          Figure 4: The scatter plots for the 3 similarity metrics used between expert and student explanations for a given line of code
          in a given program separated by correct vs incorrect presents some differences in scale across the different similarity metrics
          (refer Figures 4a, 4b, 4c)
                                                                    .


to differentiate correct from incorrect student explanations.          of the dataset. We thank Dr. Xiang Lorraine Li for her con-
We are addressing this in our ongoing work, with the use of            tributions to writing this draft and for the natural language
LLMs to assess the correctness and provide feedback to stu-            processing topics of the paper.
dent explanations. Our expert explanations may not have
sufficient and diverse correct (positive) and incorrect (nega-
tive) examples to build robust classifiers. We are developing          References
cross–validation techniques to build better classifiers that
                                                                         [1] Y. Shi, R. Schmucker, M. Chi, T. Barnes, T. Price, Kc-
are exposed to various synthetic and real–world examples
                                                                             finder: Automated knowledge component discovery
to evaluate student explanation correctness. In this work,
                                                                             for programming problems., International Educational
we did not present similar results by considering multiple
                                                                             Data Mining Society (2023).
sentences for both student and expert explanations. While
                                                                         [2] A. M. Kazerouni, R. S. Mansur, S. H. Edwards, C. A.
this is important, we chose to present in this work a proto-
                                                                             Shaffer, Student debugging practices and their rela-
type for single sentence evaluation which is scalable with
                                                                             tionships with project outcomes, in: Proceedings of
aggregation techniques, which we will be presenting in an
                                                                             the 50th ACM Technical Symposium on Computer
upcoming future work. This dataset does not cover the
                                                                             Science Education, 2019, pp. 1263–1263.
cases where students improve over time with providing cor-
                                                                         [3] P. Denny, J. Prather, B. A. Becker, Error message read-
rect explanations to lines of code as they progress through
                                                                             ability and novice debugging performance, in: Pro-
harder programming solutions. We will explore this in a
                                                                             ceedings of the 2020 ACM conference on innovation
longitudinal study as we build a system which will have the
                                                                             and technology in computer science education, 2020,
option to present harder examples for students to explain as
                                                                             pp. 480–486.
they are evaluted correct with a better classifier that utilizes
                                                                         [4] M. Hoq, Y. Shi, J. Leinonen, D. Babalola, C. Lynch,
several of the current analyses presented in this work as
                                                                             T. Price, B. Akram, Detecting chatgpt-generated code
evidence.
                                                                             submissions in a cs1 course using machine learning
                                                                             models, in: Proceedings of the 55th ACM Technical
Acknowledgments                                                              Symposium on Computer Science Education V. 1, 2024,
                                                                             pp. 526–532.
We thank Jeevan Chapagain for his efforts in the annotation              [5] P. Oli, R. Banjade, A. B. Lekshmi Narayanan, J. Cha-
           —
                        (a) BERTScore ChrF Correlation                        (b) METEOR ChrF Correlation
          Figure 5: When we draw correlations, we observe that the three similiarity metrics are correlated to one another (refer 5a
          and 5b).


     pagain, L. J. Tamang, P. Brusilovsky, V. Rus, Im-                    A. M. Rush, Don’t take the premise for granted: Miti-
     proving code comprehension through scaffolded self-                  gating artifacts in natural language inference, arXiv
     explanations, in: Proceedings of 24th International                  preprint arXiv:1907.04380 (2019).
     Conference on Artificial Intelligence in Education, Part        [15] T. W. Li, S. Hsu, M. Fowler, Z. Zhang, C. Zilles, K. Kara-
     2, Springer, 2023, pp. 478–483. URL: ttps://doi.org/10.              halios, Am i wrong, or is the autograder wrong? ef-
     1007/978-3-031-36272-9_75.                                           fects of ai grading mistakes on learning, in: Proceed-
 [6] J. Leinonen, P. Denny, S. MacNeil, S. Sarsa, S. Bernstein,           ings of the 2023 ACM Conference on International
     J. Kim, A. Tran, A. Hellas, Comparing Code Explana-                  Computing Education Research-Volume 1, 2023, pp.
     tions Created by Students and Large Language Models,                 159–176.
     2023. arXiv:2304.03938 , arXiv:2304.03938.                      [16] M. Fowler, B. Chen, S. Azad, M. West, C. Zilles, Au-
 [7] A.-B.      Lekshmi-Narayanan,           J.    Chapagain,             tograding ”explain in plain english” questions using
     P. Brusilovsky, V. Rus, SelfCode 2.0: Annotated                      nlp, in: Proceedings of the 52nd ACM Technical Sym-
     Corpus of Student Self- Explanations to Introduc-                    posium on Computer Science Education, SIGCSE ’21,
     tory JAVA Programs in Computer Science, 2024.                        Association for Computing Machinery, New York, NY,
     URL:          https://doi.org/10.5281/zenodo.10912669.               USA, 2021, p. 1163–1169. URL: https://doi.org/10.1145/
     doi:10.5281/zenodo.10912669 .                                        3408877.3432539. doi:10.1145/3408877.3432539 .
 [8] X. Fan, W. Luo, M. Menekse, D. Litman, J. Wang,                 [17] P. Denny, D. H. Smith IV, M. Fowler, J. Prather, B. A.
     Coursemirror: Enhancing large classroom instructor-                  Becker, J. Leinonen, Explaining code with a pur-
     student interactions via mobile interfaces and natural               pose: An integrated approach for developing code
     language processing, in: Proceedings of the 33rd An-                 comprehension and prompting skills, arXiv preprint
     nual ACM Conference Extended Abstracts on Human                      arXiv:2403.06050 (2024).
     Factors in Computing Systems, 2015, pp. 1473–1478.              [18] S. Haller, A. Aldea, C. Seifert, N. Strisciuglio, Survey on
 [9] S. A. Crossley, P. Baffour, Y. Tian, A. Picou, M. Benner,            automated short answer grading with deep learning:
     U. Boser, The persuasive essays for rating, selecting,               from word embeddings to transformers, arXiv preprint
     and understanding argumentative and discourse el-                    arXiv:2204.03503 (2022).
     ements (persuade) corpus 1.0, Assessing Writing 54              [19] J. Lin, Z. Han, D. R. Thomas, A. Gurung, S. Gupta,
     (2022) 100667.                                                       V. Aleven, K. R. Koedinger, How can i get it right? us-
[10] A. C. Graesser, D. S. McNamara, M. M. Louwerse,                      ing gpt to rephrase incorrect trainee responses, arXiv
     Z. Cai, Coh-metrix: Analysis of text on cohesion and                 preprint arXiv:2405.00970 (2024).
     language, Behavior research methods, instruments, &             [20] A. B. L. Narayanan, P. Oli, J. Chapagain, M. Hassany,
     computers 36 (2004) 193–202.                                         R. Banjade, P. Brusilovsky, V. Rus, Explaining code
[11] R. E. Wang, D. Demszky, Edu-convokit: An open-                       examples in introductory programming courses: LLM
     source library for education conversation data, arXiv                vs humans, in: AI for Education: Bridging Innovation
     preprint arXiv:2402.05111 (2024).                                    and Responsibility at the 38th AAAI Annual Confer-
[12] R. Schwartz, M. Sap, I. Konstas, L. Zilles, Y. Choi, N. A.           ence on AI, 2024. URL: https://openreview.net/forum?
     Smith, The effect of different writing tasks on linguis-             id=zImjfZG3mw.
     tic style: A case study of the roc story cloze task, arXiv      [21] S. Banerjee, A. Lavie, Meteor: An automatic metric for
     preprint arXiv:1702.01841 (2017).                                    mt evaluation with improved correlation with human
[13] M. Glockner, V. Shwartz, Y. Goldberg, Breaking nli                   judgments, in: Proceedings of the acl workshop on in-
     systems with sentences that require simple lexical in-               trinsic and extrinsic evaluation measures for machine
     ferences, arXiv preprint arXiv:1805.02266 (2018).                    translation and/or summarization, 2005, pp. 65–72.
[14] Y. Belinkov, A. Poliak, S. M. Shieber, B. Van Durme,            [22] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi,
     Bertscore: Evaluating text generation with bert, arXiv
     preprint arXiv:1904.09675 (2019).
[23] M. Popović, chrf: character n-gram f-score for au-
     tomatic mt evaluation, in: Proceedings of the tenth
     workshop on statistical machine translation, 2015, pp.
     392–395.
[24] R. Hosseini, K. Akhuseyinoglu, P. Brusilovsky,
     L. Malmi, K. Pollari-Malmi, C. Schunn, T. Sirkiä,
     Improving engagement in program construc-
     tion examples for learning python program-
     ming,     International Journal of Artificial Intel-
     ligence in Education 30 (2020) 299–336. URL:
     https://doi.org/10.1007/s40593-020-00197-0.
     doi:10.1007/s40593- 020- 00197- 0 .
[25] R. Hosseini, P. Brusilovsky, Javaparser: A fine-grain
     concept indexing tool for java problems, in: CEUR
     Workshop Proceedings, volume 1009, University of
     Pittsburgh, 2013, pp. 60–63.