Evaluating Correctness of Student Code Explanations: Challenges and Solutions⋆ Arun-Balajiee Lekshmi-Narayanan1,∗ , Peter Brusilvosky Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA Abstract Educational data used by data mining approaches in the domain of Computer Science education primarily focused on working with code-based data, such as student homework submissions. However, the increased use of natural language techniques and Large Language models (LLM) in all domains of learning including Computer Science education is now producing an abundance of natural language data, such as code explanations generated by students and LLMs as well as feedback and hints produced by instructors, TAs, and LLMs. These data represent new challenges for CSEDM research and need new creative approaches to leverage. In this paper, we present a first attempt to analyze one type of these new data, student explanations of worked code examples. The main challenge in working with these data is to evaluate the correctness of self-explanations. Using a dataset of student explanations collected in our previous work, we demonstrate the difficulty of this problem and discuss a possible way to solve it. Keywords code explanations, worked examples, automated assessment 1. Introduction approaches. More specifically, the remaining part of the paper focuses The majority of the work in computer science educational on two groups of approaches: data mining (CSEDM) relied so far on datasets that col- lected traces of learner work with various learning content 1. Use surface-level features: this group of approaches or datasets with student submissions to programming as- use “surface–level” lexical and readability features signments 1 Using these datasets, researchers were able that could be easily extracted from the text of student to explore a range of novel approaches including finding or expert explanations. This is discussed further in knowledge components [1], debugging [2, 3], and detecting Section 3.1. cheating [4]. However, as newer types of dataset become 2. Use expert explanations: this group of approaches openly available for analysis, new methods need to be de- attempts to calculate various similarity metrics be- veloped to leverage this data 2 . tween student explanations and expert explanation With recent research on student self-explanation of code and use the obtained similarity to distinguish correct fragments [5] as well as the use of LLMs and students to and incorect explanations. This is described further generate code explanations automatically [6], an increasing in Section 3.2. number of datasets contain free–form code explanations. In this work, we consider one such dataset with code explana- tions generated by students and instructors [7]. This dataset 2. Related Work was annotated to mark the correctness of each student’s Corpora for free-form student answers such as reflective es- explanation and to assess the similarity between students says [8] and argumentative writing [9], provide interesting and instructors’ explanations for the same code lines. The examples of use cases that are different from traditional log goal we want to achieve by working with this dataset is to data. The ability to analyze this data is important to pro- distinguish correct and incorect explanations. This goal has vide feedback when assessing students’ free–form responses. practical value. An approach that could reliably identify Tools, such as COH–METRIX [10] and EDU-Convokit [11], incorrect explanations could be used to build an intelligent offer several options to analyze textual educational data; tutor to support the self-explanation process [5]. however, our dataset of free–form code explanations needs Starting with a review of relevant work, the paper dis- slightly different methods to evaluate correctness. cusses several approaches to distinguish correct and incorect Some examples in the natural language processing do- explanations. Since our dataset contains “ground truth”, i.e., main offer encouraging examples of using surface-level fea- human expert annotation of each explanations as as cor- tures to models for various tasks. Schwartz and colleagues rect or incorrect (including inter–rater reliability) we are explore surface features such as the word or character n– able to use the dataset to illustrate the feasibility of these gram and length of sentences to build a classifier to identify author writing styles in a CLOZE story task [12]. Some CSEDM’24: 8th Educational Data Mining in Computer Science Education examples in the case of language inference tasks consider (CSEDM) Workshop, June 14, 2024, Atlanta, GA ⋆ You can use this document as the template for preparing your publica- word-level similarity–based approaches [13]. When con- tion. We recommend using the latest version of the ceurart style. structing adversarial examples for natural language infer- ∗ Corresponding author. ence tasks, another work considers surface–level cues such Envelope-Open arl122@pitt.edu (A. Lekshmi-Narayanan); peterb@pitt.edu as words that contradict or are negative (”not”). Negative (P. Brusilvosky) sampling is another approach that uses surface–level fea- GLOBE https://a2un.github.io (A. Lekshmi-Narayanan); https://sites.pitt.edu/~peterb/ (P. Brusilvosky) tures to construct synthetic examples that can help build Orcid 0000-0002-7735-5008 (A. Lekshmi-Narayanan); 0000-0002-1902-1464 robust classifiers [14]. (P. Brusilvosky) Li and colleagues [15] discuss the student perceptions © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). on the potential errors autograder may provide as feedback 1 https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=3458 to their submissions. This emphasizes the need to develop 2 https://the-learning-agency.com/learning-engineering-hub/build/ CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings better automated assessment techniques, for newer kinds for a given line of code. of data such as code explanations discussed in our work. The earlier work of the same team [16] discuss the use of an 3.1. Surface Features autograding system that evaluates student explanations to code in plain English. They show the potential limitations We try to assess the correctness of student explanations of using finetuned AI models for autograding accuracy by using the following easily extracted features. comparing with TAs at different levels of grading expertise and do not find statistically significant results – either ow- 1. Explanation Length is calculated as the number of ing to sample size or the AI model actually not performing words to check that longer student explanations are better than TAs at the task. This necessatitates the possibil- correct. This is a useful metric for tasks such as ities to explore better finetuned AI models that have higher persuasive essay evaluation [9] and we expect this accuracy, with lower rates of false positives and negatives. could work for assessing code explanations also. In this work, we consider a context quite similar to their 2. Lexical Density is calculated by the ratio of the num- work, however, we use student self-explanations produced ber of nouns, adjectives, verbs, and adverbs (tagged as a part of the learning process rather then explanations in the sentence using a Spacy POS Tagger 3 ) over the produced for grading. We also start from scratch by identi- overall number of words in the sentence (Ure LD for- fying features in student explanations to classify them as mula 4 ). We expect that correct student explanations correct or incorrect. Like this work, we are observing that are lexically denser. the use of surface–level linguistic features may not help in 3. Gunning Fog Readability is the metric to evaluate differentiating student explanations by correctness. Possible the grade level to understand a text. We hypothesise extensions would involve the use of contextual emebeddings that the correct student explanations might have and LLMs which we also currently explore as a follow-up higher scores (require more technical knowledge to to this current work–in–progress. understand) than incorrect explanations. Denny and colleagues [17] explore student code explana- tions in plain English in a different context. In this work, 3.2. Expert–Student Similarity Features code explanations are used to encourage students to think deeply about the problem so that using their code expla- We consider the pairwise similarity between expert and nations LLM can generate a code equivalent to the code student explanations to assess correctness of student expla- the student is trying to explain. Additionally, they evaluate nations. Following our previous work [20], METEOR [21], the student explanations progress up the classifications of BERTScore [22] and chrF [23] are considered to evaluate SOLO taxonomy. They also conduct a user study to evalu- the pairwise similarity between student and expert expla- ate students’ perceptions on traditional approaches such as nations for a given line of code. We expect correct student code–writing in comparison to approahces like code expla- explanations to be more similar to expert explanations than nations. incorrect explanations. We choose this combination of met- Haller and colleagues [18] survey automated assessments rics because METEOR and chrF scores measure character tools that are used to evaluate short answer essays. They and token level similarities, while BERTScore estimates se- discuss hand–engineered appraoches in combination SVM mantic similarity by using cosine similarities between the or KNN based classifiers. In our case, the goal is not to build contextual word embeddings of the two explanations. The best classifiers for the task, but to evaluate if the features similarity scores are between 0 and 1. themselves reveal differences between correct and incorrect student explanations. 4. Dataset Lin and colleagues [19] explore the use of a finet–tuned GPT model that can be used to provide personalized, adap- We use a dataset of line–by–line explanations provided by tive feedback to students. They use new metric that is an students in an study in which they were asked to explain extension to the precision / recall –based Intersection-Over- worked examples [24, 7]. The study included four Java Union metric to evluate the LLM–based feedback and com- worked code examples: some basic examples focused on ar- pare with human feedback in a user study. For current ray search and print statements and more difficult examples work–in–progress this idea is the next target to achieve in focused on object-oriented principles. Among about all ex- the context to code explanations of worked examples in pert explanations in the dataset, we considered upto 2 expert programming. explanations for every line of code. In the original dataset, Leinonen and colleagues [6] compare ChatGPT-generated the majority of the student explanations were provided in explanations with student explanations. In our work, we a single sentence; however, a fraction of explanations in- are interested in classifying student explanations as correct cluded two or more sentences. For the purpose of this study, or incorrect. In an ongoing extension to this work, we also we excluded these multi-sentence explanations retaining focus on using ChatGPT–based interventions to solve this between 23 and 26 single-sentence student explanations per challenging problem. line of code. The key datset parameters are shown in Table 1 and sample explanations are provided in Figure 1 (metadata columns are omitted). There is a known imbalance in the 3. Method dataset between correct and incorrect examples (1234 in- Inspired by previous work [20], we extract surface-level stances of single sentence student explanations annotated features from student and expert explanations alone and as correct and 70 instances of single sentence student expla- generate pairwise similarity scores between student and nations annotated as incorrect). We calculated the average expert explanations for the same code lines. This data is ap- 3 plied to evaluate the correctness of the student’s explanation https://spacy.io/usage/linguistic-features 4 https://en.wikipedia.org/wiki/Lexical_density Dataset Property Value # Single Sentence Student–Expert (All Experts) Pairs 1854 # All Sentence Student–Expert (All Experts) Pairs 3019 # All Sentence Student–Expert (All Experts) Pairs Annotation Agreement 88.24% # Worked code examples 4 # Lines per example ≈8 # Single Sentence Student–Expert (Expert 1 & 2) Pairs 1304 # Student–Expert (1 & 2) Explanation Pairs with Student Correct 1234 # Student–Expert (1 & 2) Explanation Pairs with Student Incorrect 70 Table 1 A summary of the properties our dataset. percentage agreement for the annotation of correctness all 5.1.3. Lexical Density sentence all experts student pairs of explanations (see Ta- We observe that the lexical density also may not differen- ble 1). More details on the dataset is available in our past tiate good and bad student explanations. (see Figure 2c). work [7]. The lexical density measures a more linguistic aspect of the explanations by the parts of speech, which may not neces- Program: PointTester.java Line number: 14 Line code: x += dx; sarily evaluate the conceptual aspects of the explanations. Expert1: To shift the x-coordinate of the point, we need This is because the concepts may not be associated with a to add dx to the value of the x-coordinate of the point. particular kind of speech and are more connected with the Student1: move the x coord the amount that the argument ontology of concepts in computing. specified Student2: Adds the first inputted value to X. 5.1.4. Vocabulary Student3: increases the value of x by the amount of the first parameter in the function. Correctness also does not seem to depend on the vocabulary ... of the student explanations (see Figure 2d). The vocabulary Student23: The value of dx is added to variable x. in the sentence is more of a linguistic measure. Hence, this may not necessarily capture the conceptual ontology in Figure 1: A slice of the dataset showing a subset of expert and computing such as those discussed in an earlier work of student explanations for the same line of code. JAVA Parser [25]. 5.2. Expert–Student Explanation Similarity 5. Results 5.2.1. ChrF score We observe that the differences considering the class imbal- In this section, we group the results by the surface–level ance between the correct and incorrect explanations could metrics used to evaluate the correctness of student explana- create an issue with the threshold to differentiate the expla- tions and similarity–based metrics where pairs of student nations using this score (see Figure 3a). Further inspection and expert explanations were used. While extensive statis- of similarity scores at a line–by–line level shows that irre- tical results could be performed (such as t-tests to compare spective of the expert explanation that is used to calculate means), we preferred to perform exploratory analysis before the similarity, the correct and incorrect student explana- digging deeper with our analysis. tions cannot be separated easily by their ChrF score (see Figure 4a). 5.1. Lexical Based Surface Metrics 5.1.1. Explanation Length 5.2.2. METEOR Metric Using the length of the explanation, we observe if the expert There is a more noticeable difference in the METEOR sim- and student explanations can be distinguished, we observe ilarity scores between the correct and incorrect student that the explanations marked correct have lengths of dif- explanations. This could be due to n-gram level word align- ferent words. Some lengths for correct explanations are ment. The density plots of the METEOR similarity scores the same as those for incorrect explanations (see Figure 2a). distribution show that most incorrect explanations have a This may be because the student explanations are generally METEOR score below 0.3, as shown in Figure 3b. However, similar in length, regardless of whether they are annotated more than 50% of the correct explanations also have a ME- as correct or incorrect. TEOR similarity score below 0.3. Thus, irrespective of the expert explanation that is used to calculate the similarity, the correct and incorrect student explanations cannot be 5.1.2. Readability Metrics easily separated using the METEOR score (see Figure 4b). We observe that it is impossible to differentiate correct from incorrect student explanations using lexical surface met- 5.2.3. BERTScore rics(see Figure 2b). This may be because the student ex- planations are not technically different in their explanation While we expected better performance of BERTScore in levels but the concepts in computing that are used to explain separation of correct and incorrect explanations, the den- the line of code, which we observed when annotating the sity plots of the BERTScore distribution for correct and dataset. incorrect explanations show very little differences. We may — (a) Student and Expert Explanation Length (b) Student and Expert Explanation Gunning-Fog (c) Student and Expert Explanation Lexical Diversity (d) Student and Expert Explanation Vocabulary Figure 2: Scatter plots of various text linguistic (“surface”) metrics. The x and y represent the student and expert sentence values, respectively. The colors, shapes, and sizes represent the annotation of the students’ explanations for their correctness or incorrectness. Correct and incorrect student explanations are not differentiable by the lexical surface metrics. Similarity Metric Incorrect Correct the student explanation per line of code per solution with (Mean,SD) (Mean,SD) the two expert explanations. This, we can calculate this chrF 0.305, 0.114 0.361, 0.140 for all the lines of all the programs and observe that the METEOR 0.140, 0.091 0.283, 0.170 BERTScore 0.874, 0.028 0.894, 0.024 while the range of values of the three similarity scoring metrics are different, they are highly correlated (𝑝 < 1𝑒 − 6, Table 2 0.5 ≤ 𝑐𝑜𝑟𝑟 ≤ 0.6), as shown in Figures 5a and 5b. The mean similarity scores between expert and student explana- tions are not different for the student explanations annotated as correct from incorrect. The most difference is observable with 6. Conclusion the METEOR metric, also observed with the plots In this work, we present the challenges of analyzing new kinds of datasets such as the code explanations dataset in observe differences if we pre-trained a RoBERTa model over this paper. We observe that we need more sophisticated met- instances from the dataset. (See Figure 3c). As before, the rics to evaluate student explanations as “good” or “bad” and inspection of the similarity scores at a line–by-line level surface-level metrics are mostly ineffective in evaluating stu- shows that regardless of the expert explanation that is used dent explanation correctness. We present similarity-based to calculate similarity, the correct and incorrect student metrics also not performing well in separating the “good” explanations cannot be separated easily (see Figure 4c). from the “bad” student explanations. Our work has several limitations. We did not consider 5.2.4. Similarity Correlations the use of combinations of lexical and similarity-based fea- tures to classify student explanation correctness. The goal When drawing similarity scores between the student and of this paper is not to present the best possible classifier, expert explanations, we can take the average similarity of rather to show the difficulty in identifying useful features — (a) Character F Similarity (b) Meteor Similarity (c) BERTScore Similarity Figure 3: In the Figures 3a, 3b, 3c, we observe that student and expert explanations are similar irrespective of whether the student explanations are annotated as correct or incorrect. — (a) Character F Similarity (b) Meteor Similarity (c) BERTScore Similarity Figure 4: The scatter plots for the 3 similarity metrics used between expert and student explanations for a given line of code in a given program separated by correct vs incorrect presents some differences in scale across the different similarity metrics (refer Figures 4a, 4b, 4c) . to differentiate correct from incorrect student explanations. of the dataset. We thank Dr. Xiang Lorraine Li for her con- We are addressing this in our ongoing work, with the use of tributions to writing this draft and for the natural language LLMs to assess the correctness and provide feedback to stu- processing topics of the paper. dent explanations. Our expert explanations may not have sufficient and diverse correct (positive) and incorrect (nega- tive) examples to build robust classifiers. We are developing References cross–validation techniques to build better classifiers that [1] Y. Shi, R. Schmucker, M. Chi, T. Barnes, T. Price, Kc- are exposed to various synthetic and real–world examples finder: Automated knowledge component discovery to evaluate student explanation correctness. In this work, for programming problems., International Educational we did not present similar results by considering multiple Data Mining Society (2023). sentences for both student and expert explanations. While [2] A. M. Kazerouni, R. S. Mansur, S. H. Edwards, C. A. this is important, we chose to present in this work a proto- Shaffer, Student debugging practices and their rela- type for single sentence evaluation which is scalable with tionships with project outcomes, in: Proceedings of aggregation techniques, which we will be presenting in an the 50th ACM Technical Symposium on Computer upcoming future work. This dataset does not cover the Science Education, 2019, pp. 1263–1263. cases where students improve over time with providing cor- [3] P. Denny, J. Prather, B. A. Becker, Error message read- rect explanations to lines of code as they progress through ability and novice debugging performance, in: Pro- harder programming solutions. We will explore this in a ceedings of the 2020 ACM conference on innovation longitudinal study as we build a system which will have the and technology in computer science education, 2020, option to present harder examples for students to explain as pp. 480–486. they are evaluted correct with a better classifier that utilizes [4] M. Hoq, Y. Shi, J. Leinonen, D. Babalola, C. Lynch, several of the current analyses presented in this work as T. Price, B. Akram, Detecting chatgpt-generated code evidence. submissions in a cs1 course using machine learning models, in: Proceedings of the 55th ACM Technical Acknowledgments Symposium on Computer Science Education V. 1, 2024, pp. 526–532. We thank Jeevan Chapagain for his efforts in the annotation [5] P. Oli, R. Banjade, A. B. Lekshmi Narayanan, J. Cha- — (a) BERTScore ChrF Correlation (b) METEOR ChrF Correlation Figure 5: When we draw correlations, we observe that the three similiarity metrics are correlated to one another (refer 5a and 5b). pagain, L. J. Tamang, P. Brusilovsky, V. Rus, Im- A. M. Rush, Don’t take the premise for granted: Miti- proving code comprehension through scaffolded self- gating artifacts in natural language inference, arXiv explanations, in: Proceedings of 24th International preprint arXiv:1907.04380 (2019). Conference on Artificial Intelligence in Education, Part [15] T. W. Li, S. Hsu, M. Fowler, Z. Zhang, C. Zilles, K. Kara- 2, Springer, 2023, pp. 478–483. URL: ttps://doi.org/10. halios, Am i wrong, or is the autograder wrong? ef- 1007/978-3-031-36272-9_75. fects of ai grading mistakes on learning, in: Proceed- [6] J. Leinonen, P. Denny, S. MacNeil, S. Sarsa, S. Bernstein, ings of the 2023 ACM Conference on International J. Kim, A. Tran, A. Hellas, Comparing Code Explana- Computing Education Research-Volume 1, 2023, pp. tions Created by Students and Large Language Models, 159–176. 2023. arXiv:2304.03938 , arXiv:2304.03938. [16] M. Fowler, B. Chen, S. Azad, M. West, C. Zilles, Au- [7] A.-B. Lekshmi-Narayanan, J. Chapagain, tograding ”explain in plain english” questions using P. Brusilovsky, V. Rus, SelfCode 2.0: Annotated nlp, in: Proceedings of the 52nd ACM Technical Sym- Corpus of Student Self- Explanations to Introduc- posium on Computer Science Education, SIGCSE ’21, tory JAVA Programs in Computer Science, 2024. Association for Computing Machinery, New York, NY, URL: https://doi.org/10.5281/zenodo.10912669. USA, 2021, p. 1163–1169. URL: https://doi.org/10.1145/ doi:10.5281/zenodo.10912669 . 3408877.3432539. doi:10.1145/3408877.3432539 . [8] X. Fan, W. Luo, M. Menekse, D. Litman, J. Wang, [17] P. Denny, D. H. Smith IV, M. Fowler, J. Prather, B. A. Coursemirror: Enhancing large classroom instructor- Becker, J. Leinonen, Explaining code with a pur- student interactions via mobile interfaces and natural pose: An integrated approach for developing code language processing, in: Proceedings of the 33rd An- comprehension and prompting skills, arXiv preprint nual ACM Conference Extended Abstracts on Human arXiv:2403.06050 (2024). Factors in Computing Systems, 2015, pp. 1473–1478. [18] S. Haller, A. Aldea, C. Seifert, N. Strisciuglio, Survey on [9] S. A. Crossley, P. Baffour, Y. Tian, A. Picou, M. Benner, automated short answer grading with deep learning: U. Boser, The persuasive essays for rating, selecting, from word embeddings to transformers, arXiv preprint and understanding argumentative and discourse el- arXiv:2204.03503 (2022). ements (persuade) corpus 1.0, Assessing Writing 54 [19] J. Lin, Z. Han, D. R. Thomas, A. Gurung, S. Gupta, (2022) 100667. V. Aleven, K. R. Koedinger, How can i get it right? us- [10] A. C. Graesser, D. S. McNamara, M. M. Louwerse, ing gpt to rephrase incorrect trainee responses, arXiv Z. Cai, Coh-metrix: Analysis of text on cohesion and preprint arXiv:2405.00970 (2024). language, Behavior research methods, instruments, & [20] A. B. L. Narayanan, P. Oli, J. Chapagain, M. Hassany, computers 36 (2004) 193–202. R. Banjade, P. Brusilovsky, V. Rus, Explaining code [11] R. E. Wang, D. Demszky, Edu-convokit: An open- examples in introductory programming courses: LLM source library for education conversation data, arXiv vs humans, in: AI for Education: Bridging Innovation preprint arXiv:2402.05111 (2024). and Responsibility at the 38th AAAI Annual Confer- [12] R. Schwartz, M. Sap, I. Konstas, L. Zilles, Y. Choi, N. A. ence on AI, 2024. URL: https://openreview.net/forum? Smith, The effect of different writing tasks on linguis- id=zImjfZG3mw. tic style: A case study of the roc story cloze task, arXiv [21] S. Banerjee, A. Lavie, Meteor: An automatic metric for preprint arXiv:1702.01841 (2017). mt evaluation with improved correlation with human [13] M. Glockner, V. Shwartz, Y. Goldberg, Breaking nli judgments, in: Proceedings of the acl workshop on in- systems with sentences that require simple lexical in- trinsic and extrinsic evaluation measures for machine ferences, arXiv preprint arXiv:1805.02266 (2018). translation and/or summarization, 2005, pp. 65–72. [14] Y. Belinkov, A. Poliak, S. M. Shieber, B. Van Durme, [22] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, arXiv preprint arXiv:1904.09675 (2019). [23] M. Popović, chrf: character n-gram f-score for au- tomatic mt evaluation, in: Proceedings of the tenth workshop on statistical machine translation, 2015, pp. 392–395. [24] R. Hosseini, K. Akhuseyinoglu, P. Brusilovsky, L. Malmi, K. Pollari-Malmi, C. Schunn, T. Sirkiä, Improving engagement in program construc- tion examples for learning python program- ming, International Journal of Artificial Intel- ligence in Education 30 (2020) 299–336. URL: https://doi.org/10.1007/s40593-020-00197-0. doi:10.1007/s40593- 020- 00197- 0 . [25] R. Hosseini, P. Brusilovsky, Javaparser: A fine-grain concept indexing tool for java problems, in: CEUR Workshop Proceedings, volume 1009, University of Pittsburgh, 2013, pp. 60–63.