<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Computer Science Educational Data Mining (CSEDM) Workshop</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Challenges and Solutions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arun-Balajiee Lekshmi-Narayanan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Brusilovsky</string-name>
          <email>peterb@pitt.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Program, University of Pittsburgh</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>14</volume>
      <issue>2024</issue>
      <abstract>
        <p>Educational data mining approaches in the domain of Computer Science education have primarily focused on working with code-based data, such as student homework submissions. However, the increased use of natural language techniques and large language models (LLMs) in all domains of learning, including Computer Science education, is now producing an abundance of natural language data, such as code explanations generated by students and LLMs as well as feedback and hints produced by instructors, TAs, and LLMs. These data present new challenges for CSEDM research and need new creative approaches to leverage. In this paper, we present a first attempt to analyze one type of these new data: student explanations of worked code examples. The main challenge in working with these data is to evaluate the correctness of self-explanations. Using a dataset of student explanations collected in our previous work, we demonstrate the difficulty of this problem and discuss a possible way to solve it.</p>
      </abstract>
      <kwd-group>
        <kwd>code explanations</kwd>
        <kwd>worked examples</kwd>
        <kwd>automated assessment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The majority of the work in computer science educational
data mining (CSEDM) has so far relied on datasets that collect
traces of learner work with various learning content or
datasets with student submissions to programming
assignments. 1 Using these datasets, researchers were able
to explore a range of novel approaches, including finding
knowledge components [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], debugging [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], and detecting
cheating [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, as newer types of datasets become
openly available for analysis, new methods need to be
developed to leverage these data. 2
      </p>
      <p>
        With recent research on student self-explanation of code
fragments [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as well as the use of both LLMs and students to
generate code explanations automatically [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], an increasing
number of datasets contain free–form code explanations. In
this work, we consider one such dataset with code
explanations generated by students and instructors [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This dataset
was annotated to mark the correctness of each student’s
explanation and to assess the similarity between students’
and instructors’ explanations for the same code lines. The
goal we want to achieve by working with this dataset is to
distinguish correct and incorrect explanations. This goal has
practical value: an approach that could reliably identify
incorrect explanations could be used to build an intelligent
tutor to support the self-explanation process [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Starting with a review of relevant work, the paper
discusses several approaches to distinguish correct and incorrect
explanations. Since our dataset contains “ground truth”, i.e.,
human expert annotation of each explanation as
correct or incorrect (including inter–rater reliability), we are
able to use the dataset to illustrate the feasibility of these
approaches.</p>
      <p>1 https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=3458
2 https://the-learning-agency.com/learning-engineering-hub/build/</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related</title>
    </sec>
    <sec id="sec-3">
      <title>Work</title>
      <p>
        Corpora for free-form student answers such as reflective
essays [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and argumentative writing [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], provide interesting
examples of use cases that are different from traditional log
data. The ability to analyze these data is important for
providing feedback when assessing students’ free–form responses.
Tools such as COH–METRIX [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and EDU-Convokit [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
offer several options for analyzing textual educational data;
however, our dataset of free–form code explanations needs
slightly different methods to evaluate correctness.
      </p>
      <p>
        Several examples from the natural language processing
domain offer encouraging evidence for using surface-level
features to build models for various tasks. Schwartz and colleagues
explore surface features such as word or character n–
grams and sentence length to build a classifier that identifies
author writing styles in a CLOZE story task [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Some
examples in the case of language inference tasks consider
word-level similarity–based approaches [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. When
constructing adversarial examples for natural language
inference tasks, another work considers surface–level cues, such
as words that contradict or negate (“not”). Negative
sampling is another approach that uses surface–level
features to construct synthetic examples that can help build
robust classifiers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Li and colleagues [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] discuss student perceptions of the potentially erroneous
feedback an autograder may provide on their submissions. This
emphasizes the need to develop better automated assessment
techniques for newer kinds of data, such as the code
explanations discussed in our work. Earlier work by the same
team [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] discusses an autograding system that evaluates student
explanations of code in plain English. They show the potential
limitations of using fine-tuned AI models for autograding
accuracy by comparing them with TAs at different levels of
grading expertise and do not find statistically significant
results, either owing to sample size or to the AI model not
actually performing better than TAs at the task. This
necessitates exploring better fine-tuned AI models with higher
accuracy and lower rates of false positives and negatives.
In this work, we consider a context quite similar to theirs;
however, we use student self-explanations produced as part
of the learning process rather than explanations produced
for grading. We also start from scratch by identifying
features in student explanations to classify them as correct
or incorrect. Like this work, we observe that the use of
surface–level linguistic features may not help in
differentiating student explanations by correctness. Possible
extensions involve the use of contextual embeddings and LLMs,
which we are currently exploring as a follow-up to this
work in progress.
      </p>
      <p>
        Denny and colleagues [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] explore student code
explanations in plain English in a different context. In their work,
code explanations are used to encourage students to think
deeply about the problem so that, using these code
explanations, an LLM can generate code equivalent to the code
the student is trying to explain. Additionally, they evaluate
how student explanations progress up the levels of the
SOLO taxonomy. They also conduct a user study to
evaluate students’ perceptions of traditional approaches, such as
code–writing, in comparison to approaches like code
explanation.
      </p>
      <p>
        Haller and colleagues [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] survey automated assessment
tools that are used to evaluate short answer essays. They
discuss hand–engineered approaches in combination with SVM–
or KNN–based classifiers. In our case, the goal is not to build
the best classifier for the task, but to evaluate whether the features
themselves reveal differences between correct and incorrect
student explanations.
      </p>
      <p>
        Lin and colleagues [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] explore the use of a fine–tuned
GPT model that can provide personalized,
adaptive feedback to students. They use a new metric, an
extension of the precision/recall–based
Intersection-Over-Union metric, to evaluate the LLM–based feedback and
compare it with human feedback in a user study. For the current
work–in–progress, this idea is the next target to achieve in
the context of code explanations of worked examples in
programming.
      </p>
      <p>
        Leinonen and colleagues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] compare ChatGPT-generated
explanations with student explanations. In our work, we
are interested in classifying student explanations as correct
or incorrect. In an ongoing extension of this work, we also
focus on using ChatGPT–based interventions to solve this
challenging problem.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Method</title>
      <p>
        Inspired by previous work [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], we extract surface-level
features from student and expert explanations alone and
generate pairwise similarity scores between student and
expert explanations for the same code lines. These data are
used to evaluate the correctness of a student’s explanation
for a given line of code.
      </p>
      <sec id="sec-4-1">
        <title>3.1. Surface Features</title>
        <p>We try to assess the correctness of student explanations
using the following easily extracted features.</p>
        <p>
          1. Explanation Length is calculated as the number of
words, to check whether longer student explanations tend to
be correct. This is a useful metric for tasks such as
persuasive essay evaluation [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and we expect it
could also work for assessing code explanations.
2. Lexical Density is calculated as the ratio of the
number of nouns, adjectives, verbs, and adverbs (tagged
in the sentence using a spaCy POS tagger 3) over the
overall number of words in the sentence (Ure’s LD
formula 4). We expect correct student explanations
to be lexically denser.
3. Gunning Fog Readability is a metric that estimates
the grade level required to understand a text. We hypothesise
that correct student explanations might have
higher scores (require more technical knowledge to
understand) than incorrect explanations.
        </p>
        <p>3 https://spacy.io/usage/linguistic-features
4 https://en.wikipedia.org/wiki/Lexical_density</p>
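        <p>As a concrete illustration, the sketch below computes these
three surface features for a single explanation. It is a minimal
sketch, assuming spaCy’s en_core_web_sm model for POS tagging and
the textstat package for the Gunning Fog score; the function name
is illustrative.</p>
        <preformat>
# Minimal sketch: surface features for one explanation (assumes
# spaCy and textstat are installed; names are illustrative).
import spacy
import textstat

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "ADJ", "VERB", "ADV"}  # content-word tags for Ure's LD

def surface_features(explanation):
    doc = nlp(explanation)
    words = [t for t in doc if not t.is_punct and not t.is_space]
    length = len(words)  # 1. explanation length in words
    # 2. lexical density: content words over all words
    density = sum(t.pos_ in CONTENT_POS for t in words) / max(length, 1)
    # 3. Gunning Fog grade level: 0.4 * (avg sentence length + % complex words)
    fog = textstat.gunning_fog(explanation)
    return {"length": length, "lexical_density": density, "gunning_fog": fog}
        </preformat>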
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Expert–Student Similarity Features</title>
        <p>
          We consider the pairwise similarity between expert and
student explanations to assess the correctness of student
explanations. Following our previous work [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], METEOR [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ],
BERTScore [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and chrF [23] are considered to evaluate
the pairwise similarity between student and expert
explanations for a given line of code. We expect correct student
explanations to be more similar to expert explanations than
incorrect explanations. We choose this combination of
metrics because METEOR and chrF scores measure character
and token level similarities, while BERTScore estimates
semantic similarity by using cosine similarities between the
contextual word embeddings of the two explanations. The
similarity scores are between 0 and 1.
        </p>
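        <p>The sketch below shows one way to compute the three
pairwise scores for a student–expert explanation pair. It is a
minimal sketch, assuming the nltk, sacrebleu, and bert-score
packages; the function name and the normalization of chrF from
sacrebleu’s 0–100 scale to the 0–1 range are illustrative
choices.</p>
        <preformat>
# Minimal sketch: pairwise similarity between a student and an
# expert explanation (assumes nltk with wordnet data downloaded,
# sacrebleu, and bert-score).
from nltk.translate.meteor_score import meteor_score
from sacrebleu.metrics import CHRF
from bert_score import score as bert_score

chrf_metric = CHRF()

def similarity_features(student, expert):
    # METEOR: token-level alignment with stemming/synonym matching
    meteor = meteor_score([expert.split()], student.split())
    # chrF: character n-gram F-score (sacrebleu reports 0-100)
    chrf_val = chrf_metric.sentence_score(student, [expert]).score / 100.0
    # BERTScore: cosine similarity of contextual embeddings (F1)
    _, _, f1 = bert_score([student], [expert], lang="en")
    return {"meteor": meteor, "chrf": chrf_val, "bertscore": float(f1[0])}
        </preformat>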
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Dataset</title>
      <p>
        We use a dataset of line–by–line explanations provided by
students in a study in which they were asked to explain
worked examples [
        <xref ref-type="bibr" rid="ref7">24, 7</xref>
        ]. The study included four Java
worked code examples: some basic examples focused on
array search and print statements, and more difficult examples
focused on object-oriented principles. Among all
expert explanations in the dataset, we considered up to two expert
explanations for every line of code. In the original dataset,
the majority of the student explanations were provided in
a single sentence; however, a fraction of explanations
included two or more sentences. For the purpose of this study,
we excluded these multi-sentence explanations, retaining
between 23 and 26 single-sentence student explanations per
line of code. The key dataset parameters are shown in Table 1
and sample explanations are provided in Figure 1 (metadata
columns are omitted). There is a known imbalance in the
dataset between correct and incorrect examples (1234
instances of single-sentence student explanations annotated
as correct and 70 instances annotated as incorrect). We
calculated the average percentage agreement for the
correctness annotation of all-sentence, all-experts
student explanation pairs (see Table 1). More details on the
dataset are available in our past work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>[Table 1. Dataset properties: number of single-sentence
student–expert (all experts) pairs; number of all-sentence
student–expert (all experts) pairs; annotation agreement for
all-sentence pairs; number of worked code examples; lines per
example; number of single-sentence student–expert (experts 1
and 2) pairs; and numbers of student–expert (1 and 2)
explanation pairs with the student explanation annotated
correct or incorrect.]</p>
      <sec id="sec-5-1">
        <title>Figure 1: Sample explanations for PointTester.java, line 14: x += dx;</title>
        <p>Expert1: To shift the x-coordinate of the point, we need
to add dx to the value of the x-coordinate of the point.</p>
        <p>Student1: move the x coord the amount that the argument
specified</p>
        <p>Student2: Adds the first inputted value to X.</p>
        <p>Student3: increases the value of x by the amount of the
first parameter in the function.</p>
        <p>...</p>
        <p>Student23: The value of dx is added to variable x.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <p>In this section, we group the results into surface–level
metrics used to evaluate the correctness of student
explanations and similarity–based metrics where pairs of student
and expert explanations were used. While extensive
statistical tests could be performed (such as t-tests to compare
means), we preferred to perform exploratory analysis before
digging deeper with our analysis.</p>
      <sec id="sec-6-1">
        <title>5.1. Lexical Based Surface Metrics</title>
        <sec id="sec-6-1-1">
          <title>5.1.1. Explanation Length</title>
          <p>Using the length of the explanation, we examine whether
correct and incorrect explanations can be distinguished. We observe
that the explanations marked correct vary in length and that
some lengths for correct explanations are
the same as those for incorrect explanations (see Figure 2a).
This may be because the student explanations are generally
similar in length, regardless of whether they are annotated
as correct or incorrect.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>5.1.2. Readability Metrics</title>
          <p>We observe that it is impossible to differentiate correct from
incorrect student explanations using readability
metrics (see Figure 2b). This may be because the student
explanations do not differ in their technical reading level
but in the computing concepts used to explain
the line of code, which we observed when annotating the
dataset.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>5.1.3. Lexical Density</title>
          <p>We observe that lexical density also may not
differentiate good from bad student explanations (see Figure 2c).
Lexical density measures a more linguistic aspect of the
explanations through parts of speech, which may not
necessarily capture the conceptual aspects of the explanations.
This is because the concepts may not be associated with a
particular part of speech and are more connected with the
ontology of concepts in computing.</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>5.1.4. Vocabulary</title>
          <p>Correctness also does not seem to depend on the vocabulary
of the student explanations (see Figure 2d). The vocabulary
of a sentence is more of a linguistic measure. Hence, it
may not necessarily capture the conceptual ontology of
computing, such as that discussed in earlier work on
JavaParser [25].</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Expert–Student Explanation Similarity</title>
        <sec id="sec-6-2-1">
          <title>5.2.1. ChrF score</title>
          <p>We observe that, given the class imbalance
between the correct and incorrect explanations, the differences could
create an issue with choosing a threshold to differentiate the
explanations using this score (see Figure 3a). Further inspection
of similarity scores at a line–by–line level shows that,
irrespective of the expert explanation that is used to calculate
the similarity, the correct and incorrect student
explanations cannot be separated easily by their ChrF score (see
Figure 4a).</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>5.2.2. METEOR Metric</title>
          <p>There is a more noticeable difference in the METEOR
similarity scores between the correct and incorrect student
explanations. This could be due to n-gram level word
alignment. The density plots of the METEOR similarity score
distributions show that most incorrect explanations have a
METEOR score below 0.3, as shown in Figure 3b. However,
more than 50% of the correct explanations also have a
METEOR similarity score below 0.3. Thus, irrespective of the
expert explanation that is used to calculate the similarity,
the correct and incorrect student explanations cannot be
easily separated using the METEOR score (see Figure 4b).</p>
        </sec>
        <sec id="sec-6-2-3">
          <title>5.2.3. BERTScore</title>
          <p>While we expected better performance of BERTScore in
separating correct and incorrect explanations, the
density plots of the BERTScore distribution for correct and
incorrect explanations show very little difference (see
Figure 3c). We might observe differences if we pre-trained a
RoBERTa model on instances from the dataset. As before, the
inspection of the similarity scores at a line–by-line level
shows that regardless of the expert explanation that is used
to calculate similarity, the correct and incorrect student
explanations cannot be separated easily (see Figure 4c).</p>
          <p>[Figure 2. Student and expert explanation (a) length,
(b) Gunning Fog, (c) lexical diversity, and (d) vocabulary.]</p>
        </sec>
        <sec id="sec-6-2-4">
          <title>5.2.4. Similarity Correlations</title>
          <p>When computing similarity scores between the student and
expert explanations, we can take the average similarity of
the student explanation per line of code per solution with
the two expert explanations. Thus, we can calculate this
for all the lines of all the programs and observe that,
while the ranges of values of the three similarity scoring
metrics are different, they are highly correlated (p &lt; 1e-6,
0.5 ≤ r ≤ 0.6), as shown in Figures 5a and 5b.</p>
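          <p>A minimal sketch of this correlation analysis follows,
assuming a pandas DataFrame with one row per student explanation
per line of code and per-metric similarity columns already
averaged over the two expert explanations (the column and
function names are illustrative).</p>
          <preformat>
# Minimal sketch: pairwise Pearson correlations between the
# similarity metrics (assumes pandas and scipy).
import pandas as pd
from scipy.stats import pearsonr

def metric_correlations(scores: pd.DataFrame):
    # scores has columns "meteor", "chrf", "bertscore", one row per
    # line of code, averaged over the two expert explanations
    metrics = ["meteor", "chrf", "bertscore"]
    for i, a in enumerate(metrics):
        for b in metrics[i + 1:]:
            r, p = pearsonr(scores[a], scores[b])
            print(f"{a} vs {b}: r={r:.2f}, p={p:.2e}")
          </preformat>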
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>In this work, we present the challenges of analyzing new
kinds of datasets, such as the code explanations dataset in
this paper. We observe that we need more sophisticated
metrics to evaluate student explanations as “good” or “bad”, and
surface-level metrics are mostly ineffective in evaluating
student explanation correctness. We show that similarity-based
metrics also do not perform well in separating the “good”
from the “bad” student explanations.</p>
      <p>Our work has several limitations. We did not consider
the use of combinations of lexical and similarity-based
features to classify student explanation correctness. The goal
of this paper is not to present the best possible classifier,
but rather to show the difficulty of identifying useful features
to differentiate correct from incorrect student explanations.
We are addressing this in our ongoing work with the use of
LLMs to assess correctness and provide feedback on
student explanations. Our expert explanations may not have
sufficiently diverse correct (positive) and incorrect
(negative) examples to build robust classifiers. We are developing
cross–validation techniques to build better classifiers that
are exposed to various synthetic and real–world examples
to evaluate student explanation correctness. In this work,
we did not present similar results considering multiple
sentences for both student and expert explanations. While
this is important, we chose to present in this work a
prototype for single-sentence evaluation that is scalable with
aggregation techniques, which we will be presenting in
upcoming future work. This dataset does not cover the
cases where students improve over time at providing
correct explanations to lines of code as they progress through
harder programming solutions. We will explore this in a
longitudinal study as we build a system that will have the
option to present harder examples for students to explain as
they are evaluated correct, with a better classifier that utilizes
several of the current analyses presented in this work as
evidence.</p>
      <p>[Figure 3. Density plots of (a) character F (chrF),
(b) METEOR, and (c) BERTScore similarity.]</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank Jeevan Chapagain for his efforts in the annotation
of the dataset. We thank Dr. Xiang Lorraine Li for her
contributions to writing this draft and for the natural language
processing topics of the paper.</p>
      <p>[Figure 5. (a) BERTScore–ChrF correlation;
(b) METEOR–ChrF correlation.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schmucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Barnes</surname>
          </string-name>
          , T. Price, KC-Finder:
          <article-title>Automated knowledge component discovery for programming problems</article-title>
          .,
          <source>International Educational Data Mining Society</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kazerouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Mansur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Shafer</surname>
          </string-name>
          ,
          <article-title>Student debugging practices and their relationships with project outcomes</article-title>
          ,
          <source>in: Proceedings of the 50th ACM Technical Symposium on Computer Science Education</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1263</fpage>
          -
          <lpage>1263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prather</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <article-title>Error message readability and novice debugging performance</article-title>
          ,
          <source>in: Proceedings of the 2020 ACM conference on innovation and technology in computer science education</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>480</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Babalola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lynch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Akram</surname>
          </string-name>
          ,
          <article-title>Detecting chatgpt-generated code submissions in a cs1 course using machine learning models</article-title>
          ,
          <source>in: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          <issue>2024</issue>
          , pp.
          <fpage>526</fpage>
          -
          <lpage>532</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Oli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Banjade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Lekshmi Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chapagain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Tamang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rus</surname>
          </string-name>
          ,
          <article-title>Improving code comprehension through scaffolded self-explanations</article-title>
          ,
          <source>in: Proceedings of 24th International Conference on Artificial Intelligence in Education, Part 2</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>483</lpage>
          . URL: https://doi.org/10.1007/978-3-031-36272-9_75.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          , S. MacNeil, S. Sarsa,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <source>Comparing Code Explanations Created by Students and Large Language Models</source>
          ,
          <year>2023</year>
          . arXiv:2304.03938.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>A.-B. Lekshmi-Narayanan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chapagain</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brusilovsky</surname>
          </string-name>
          , V. Rus, SelfCode 2.0: Annotated Corpus of Student Self-Explanations to Introductory JAVA Programs in Computer Science,
          <year>2024</year>
          . URL: https://doi.org/10.5281/zenodo.10912669. doi:10.5281/zenodo.10912669.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Menekse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Litman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Coursemirror:
          <article-title>Enhancing large classroom instructorstudent interactions via mobile interfaces and natural language processing</article-title>
          ,
          <source>in: Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1473</fpage>
          -
          <lpage>1478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bafour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Picou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Benner</surname>
          </string-name>
          , U. Boser,
          <article-title>The persuasive essays for rating, selecting, and understanding argumentative and discourse elements (persuade) corpus 1.0</article-title>
          ,
          <source>Assessing Writing</source>
          54 (
          <year>2022</year>
          )
          <fpage>100667</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Graesser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Louwerse</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Coh-metrix: Analysis of text on cohesion and language</article-title>
          , Behavior research methods, instruments, &amp; computers
          <volume>36</volume>
          (
          <year>2004</year>
          )
          <fpage>193</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demszky</surname>
          </string-name>
          , Edu-convokit:
          <article-title>An opensource library for education conversation data</article-title>
          ,
          <source>arXiv preprint arXiv:2402.05111</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Konstas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zilles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith,</surname>
          </string-name>
          <article-title>The efect of diferent writing tasks on linguistic style: A case study of the roc story cloze task</article-title>
          ,
          <source>arXiv preprint arXiv:1702.01841</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Glockner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Breaking nli systems with sentences that require simple lexical inferences</article-title>
          ,
          <source>arXiv preprint arXiv:1805.02266</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poliak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Shieber</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Van Durme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Don't take the premise for granted: Mitigating artifacts in natural language inference</article-title>
          ,
          <source>arXiv preprint arXiv:1907.04380</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fowler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zilles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Karahalios</surname>
          </string-name>
          ,
          <article-title>Am i wrong, or is the autograder wrong? effects of ai grading mistakes on learning</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM Conference on International Computing Education Research-Volume</source>
          <volume>1</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fowler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Azad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zilles</surname>
          </string-name>
          ,
          <article-title>Autograding ”explain in plain english” questions using nlp</article-title>
          ,
          <source>in: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education, SIGCSE '21</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , p.
          <fpage>1163</fpage>
          -
          <lpage>1169</lpage>
          . URL: https://doi.org/10.1145/3408877.3432539. doi:10.1145/3408877.3432539.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Smith</surname>
          </string-name>
          <string-name>
            <given-names>IV</given-names>
            ,
            <surname>M. Fowler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prather</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Leinonen,</surname>
          </string-name>
          <article-title>Explaining code with a purpose: An integrated approach for developing code comprehension and prompting skills</article-title>
          ,
          <source>arXiv preprint arXiv:2403.06050</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Haller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aldea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seifert</surname>
          </string-name>
          ,
          <string-name>
            <surname>N. Strisciuglio,</surname>
          </string-name>
          <article-title>Survey on automated short answer grading with deep learning: from word embeddings to transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2204.03503</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gurung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aleven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Koedinger</surname>
          </string-name>
          ,
          <article-title>How can i get it right? using gpt to rephrase incorrect trainee responses</article-title>
          ,
          <source>arXiv preprint arXiv:2405.00970</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>A. B. L. Narayanan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Oli</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chapagain</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Hassany</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Banjade</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brusilovsky</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Rus</surname>
          </string-name>
          ,
          <article-title>Explaining code examples in introductory programming courses: LLM vs humans</article-title>
          ,
          <source>in: AI for Education: Bridging Innovation and Responsibility at the 38th AAAI Annual Conference on AI</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=zImjfZG3mw.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>Meteor: An automatic metric for mt evaluation with improved correlation with human judgments</article-title>
          ,
          <source>in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>Bertscore: Evaluating text generation with bert</article-title>
          ,
          <source>arXiv preprint arXiv:1904.09675</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Popović</surname>
          </string-name>
          ,
          <article-title>chrf: character n-gram f-score for automatic mt evaluation</article-title>
          ,
          <source>in: Proceedings of the Tenth Workshop on Statistical Machine Translation</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Akhuseyinoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Malmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pollari-Malmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sirkiä</surname>
          </string-name>
          ,
          <article-title>Improving engagement in program construction examples for learning python programming</article-title>
          ,
          <source>International Journal of Artificial Intelligence in Education</source>
          <volume>30</volume>
          (
          <year>2020</year>
          )
          <fpage>299</fpage>
          -
          <lpage>336</lpage>
          . URL: https://doi.org/10.1007/s40593-020-00197-0. doi:10.1007/s40593-020-00197-0.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <article-title>Javaparser: A fine-grain concept indexing tool for java problems</article-title>
          ,
          <source>in: CEUR Workshop Proceedings, volume 1009</source>
          , University of Pittsburgh,
          <year>2013</year>
          , pp.
          <fpage>60</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>