Evaluating Correctness of Student Code Explanations: Challenges and Solutions

Arun-Balajiee Lekshmi-Narayanan∗, Peter Brusilovsky
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA


Abstract

Educational data used by data mining approaches in the domain of Computer Science education has primarily focused on code-based data, such as student homework submissions. However, the increased use of natural language techniques and Large Language Models (LLMs) in all domains of learning, including Computer Science education, is now producing an abundance of natural language data, such as code explanations generated by students and LLMs, as well as feedback and hints produced by instructors, TAs, and LLMs. These data present new challenges for CSEDM research and require new, creative approaches to leverage. In this paper, we present a first attempt to analyze one type of these new data: student explanations of worked code examples. The main challenge in working with these data is evaluating the correctness of self-explanations. Using a dataset of student explanations collected in our previous work, we demonstrate the difficulty of this problem and discuss a possible way to solve it.

                                           Keywords
                                           code explanations, worked examples, automated assessment



CSEDM'24: 8th Educational Data Mining in Computer Science Education (CSEDM) Workshop, June 14, 2024, Atlanta, GA
∗ Corresponding author.
Email: arl122@pitt.edu (A. Lekshmi-Narayanan); peterb@pitt.edu (P. Brusilovsky)
URL: https://a2un.github.io (A. Lekshmi-Narayanan); https://sites.pitt.edu/~peterb/ (P. Brusilovsky)
ORCID: 0000-0002-7735-5008 (A. Lekshmi-Narayanan); 0000-0002-1902-1464 (P. Brusilovsky)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

The majority of the work in computer science educational data mining (CSEDM) has so far relied on datasets that collect traces of learner work with various learning content, or on datasets of student submissions to programming assignments¹. Using these datasets, researchers were able to explore a range of novel approaches, including finding knowledge components [1], debugging [2, 3], and detecting cheating [4]. However, as newer types of datasets become openly available for analysis, new methods need to be developed to leverage this data².

With recent research on student self-explanation of code fragments [5], as well as the use of LLMs and students to generate code explanations automatically [6], an increasing number of datasets contain free-form code explanations. In this work, we consider one such dataset with code explanations generated by students and instructors [7]. This dataset was annotated to mark the correctness of each student's explanation and to assess the similarity between student and instructor explanations for the same code lines. The goal we want to achieve by working with this dataset is to distinguish correct and incorrect explanations. This goal has practical value: an approach that could reliably identify incorrect explanations could be used to build an intelligent tutor to support the self-explanation process [5].

Starting with a review of relevant work, the paper discusses several approaches to distinguishing correct and incorrect explanations. Since our dataset contains "ground truth", i.e., a human expert annotation of each explanation as correct or incorrect (including inter-rater reliability), we are able to use the dataset to illustrate the feasibility of these approaches.

More specifically, the remaining part of the paper focuses on two groups of approaches:

    1. Use surface-level features: this group of approaches uses "surface-level" lexical and readability features that can be easily extracted from the text of student or expert explanations. This is discussed further in Section 3.1.
    2. Use expert explanations: this group of approaches calculates various similarity metrics between student and expert explanations and uses the obtained similarity to distinguish correct and incorrect explanations. This is described further in Section 3.2.

¹ https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=3458
² https://the-learning-agency.com/learning-engineering-hub/build/


2. Related Work

Corpora of free-form student answers, such as reflective essays [8] and argumentative writing [9], provide interesting examples of use cases that differ from traditional log data. The ability to analyze such data is important to provide feedback when assessing students' free-form responses. Tools such as COH-METRIX [10] and EDU-Convokit [11] offer several options for analyzing textual educational data; however, our dataset of free-form code explanations needs slightly different methods to evaluate correctness.

Some work in the natural language processing domain offers encouraging examples of using surface-level features in models for various tasks. Schwartz and colleagues explore surface features such as word or character n-grams and sentence length to build a classifier that identifies author writing styles in a cloze story task [12]. Some work on language inference tasks considers word-level similarity-based approaches [13]. When constructing adversarial examples for natural language inference tasks, another work considers surface-level cues such as words that contradict or negate ("not"). Negative sampling is another approach that uses surface-level features to construct synthetic examples that can help build robust classifiers [14].

Li and colleagues [15] discuss student perceptions of the potential errors an autograder may provide as feedback on their submissions. This emphasizes the need to develop


better automated assessment techniques for newer kinds of data, such as the code explanations discussed in our work. The earlier work of the same team [16] discusses an autograding system that evaluates student explanations of code in plain English. They show the potential limitations of using fine-tuned AI models for autograding accuracy by comparing them with TAs at different levels of grading expertise and do not find statistically significant results, either owing to sample size or to the AI model actually not performing better than the TAs at the task. This motivates exploring better fine-tuned AI models with higher accuracy and lower rates of false positives and negatives. In this work, we consider a context quite similar to theirs; however, we use student self-explanations produced as a part of the learning process rather than explanations produced for grading. We also start from scratch by identifying features in student explanations to classify them as correct or incorrect. Like this work, we observe that the use of surface-level linguistic features may not help in differentiating student explanations by correctness. Possible extensions would involve the use of contextual embeddings and LLMs, which we currently explore as a follow-up to this work in progress.

Denny and colleagues [17] explore student code explanations in plain English in a different context. In their work, code explanations are used to encourage students to think deeply about the problem, so that an LLM can use the students' explanations to generate code equivalent to the code the student is trying to explain. Additionally, they evaluate how student explanations progress up the levels of the SOLO taxonomy. They also conduct a user study to evaluate students' perceptions of traditional approaches, such as code writing, in comparison to approaches like code explanation.

Haller and colleagues [18] survey automated assessment tools that are used to evaluate short answer essays. They discuss hand-engineered approaches in combination with SVM- or KNN-based classifiers. In our case, the goal is not to build the best classifier for the task, but to evaluate whether the features themselves reveal differences between correct and incorrect student explanations.

Lin and colleagues [19] explore the use of a fine-tuned GPT model that can provide personalized, adaptive feedback to students. They use a new metric that extends the precision/recall-based Intersection-Over-Union metric to evaluate the LLM-based feedback and compare it with human feedback in a user study. For our current work in progress, this idea is the next target to achieve in the context of code explanations of worked examples in programming.

Leinonen and colleagues [6] compare ChatGPT-generated explanations with student explanations. In our work, we are interested in classifying student explanations as correct or incorrect. In an ongoing extension of this work, we also focus on using ChatGPT-based interventions to solve this challenging problem.


3. Method

Inspired by previous work [20], we extract surface-level features from student and expert explanations alone and generate pairwise similarity scores between student and expert explanations for the same code lines. These data are used to evaluate the correctness of the student's explanation for a given line of code.

3.1. Surface Features

We try to assess the correctness of student explanations using the following easily extracted features.

    1. Explanation Length is calculated as the number of words, to check whether longer student explanations are correct. This is a useful metric for tasks such as persuasive essay evaluation [9], and we expect it could also work for assessing code explanations.
    2. Lexical Density is calculated as the ratio of the number of nouns, adjectives, verbs, and adverbs (tagged in the sentence using the spaCy POS tagger³) over the overall number of words in the sentence (Ure LD formula⁴). We expect that correct student explanations are lexically denser.
    3. Gunning Fog Readability is a metric that estimates the grade level required to understand a text. We hypothesise that correct student explanations might have higher scores (require more technical knowledge to understand) than incorrect explanations.

³ https://spacy.io/usage/linguistic-features
⁴ https://en.wikipedia.org/wiki/Lexical_density

3.2. Expert–Student Similarity Features

We consider the pairwise similarity between expert and student explanations to assess the correctness of student explanations. Following our previous work [20], METEOR [21], BERTScore [22], and chrF [23] are used to evaluate the pairwise similarity between student and expert explanations for a given line of code. We expect correct student explanations to be more similar to expert explanations than incorrect explanations. We choose this combination of metrics because METEOR and chrF measure character- and token-level similarities, while BERTScore estimates semantic similarity using cosine similarities between the contextual word embeddings of the two explanations. The similarity scores are between 0 and 1.


4. Dataset

We use a dataset of line-by-line explanations provided by students in a study in which they were asked to explain worked examples [24, 7]. The study included four Java worked code examples: some basic examples focused on array search and print statements, and more difficult examples focused on object-oriented principles. Among all expert explanations in the dataset, we considered up to two expert explanations for every line of code. In the original dataset, the majority of the student explanations were provided in a single sentence; however, a fraction of explanations included two or more sentences. For the purpose of this study, we excluded these multi-sentence explanations, retaining between 23 and 26 single-sentence student explanations per line of code. The key dataset parameters are shown in Table 1, and sample explanations are provided in Figure 1 (metadata columns are omitted). There is a known imbalance in the dataset between correct and incorrect examples (1234 instances of single-sentence student explanations annotated as correct and 70 instances annotated as incorrect). We calculated the average percentage agreement for the correctness annotation of all-sentence, all-expert student explanation pairs (see Table 1). More details on the dataset are available in our past work [7].
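To make the filtering and pairing step concrete, the sketch below shows one way the single-sentence subset could be derived. It is an illustration rather than our released preprocessing code, and the file name and column names (program, line_number, student_explanation, expert_id, is_correct) are hypothetical.

```python
# A minimal sketch (not the authors' released pipeline) of deriving the
# single-sentence subset, assuming the annotated corpus has been exported to a
# CSV with hypothetical columns: program, line_number, student_explanation,
# expert_id, expert_explanation, is_correct.
import pandas as pd

df = pd.read_csv("selfcode_annotations.csv")  # hypothetical export of the corpus

def is_single_sentence(text: str) -> bool:
    # Crude sentence count: split on '.' and keep non-empty chunks.
    sentences = [s for s in str(text).split(".") if s.strip()]
    return len(sentences) == 1

# Keep only single-sentence student explanations.
single = df[df["student_explanation"].apply(is_single_sentence)]

# Restrict to at most two expert explanations per line of code.
single = single[single["expert_id"].isin([1, 2])]

# Inspect the class imbalance between correct and incorrect explanations.
print(single.groupby("is_correct").size())
```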
                         Dataset Property                                                         Value
                         # Single Sentence Student–Expert (All Experts) Pairs                     1854
                         # All Sentence Student–Expert (All Experts) Pairs                        3019
                         # All Sentence Student–Expert (All Experts) Pairs Annotation Agreement   88.24%
                         # Worked code examples                                                   4
                         # Lines per example                                                      ≈8
                         # Single Sentence Student–Expert (Expert 1 & 2) Pairs                    1304
                         # Student–Expert (1 & 2) Explanation Pairs with Student Correct          1234
                         # Student–Expert (1 & 2) Explanation Pairs with Student Incorrect        70
Table 1
A summary of the properties of our dataset.
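Before turning to the results, the following sketch illustrates how the surface-level features from Section 3.1 could be computed for a single explanation. It assumes the spaCy en_core_web_sm model and the textstat package are available, and it is not necessarily the exact feature-extraction code used for our analysis.

```python
# A sketch of the Section 3.1 surface features, assuming spaCy
# (with en_core_web_sm installed) and the textstat package.
import spacy
import textstat

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "ADJ", "VERB", "ADV"}  # Ure lexical density content words

def surface_features(explanation: str) -> dict:
    doc = nlp(explanation)
    words = [t for t in doc if t.is_alpha]
    content = [t for t in words if t.pos_ in CONTENT_POS]
    return {
        # 1. Explanation length: number of words.
        "length": len(words),
        # 2. Lexical density: content words over all words.
        "lexical_density": len(content) / len(words) if words else 0.0,
        # 3. Gunning Fog readability grade level.
        "gunning_fog": textstat.gunning_fog(explanation),
    }

print(surface_features("The value of dx is added to variable x."))
```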



   Program: PointTester.java   Line number: 14   Line code: x += dx;
   Expert1: To shift the x-coordinate of the point, we need to add dx to the value of the x-coordinate of the point.
   Student1: move the x coord the amount that the argument specified
   Student2: Adds the first inputted value to X.
   Student3: increases the value of x by the amount of the first parameter in the function.
   ...
   Student23: The value of dx is added to variable x.

Figure 1: A slice of the dataset showing a subset of expert and student explanations for the same line of code.


5. Results

In this section, we group the results by the surface-level metrics used to evaluate the correctness of student explanations and by the similarity-based metrics, where pairs of student and expert explanations were used. While extensive statistical testing could be performed (such as t-tests to compare means), we preferred to perform an exploratory analysis before digging deeper.

5.1. Lexical-Based Surface Metrics

5.1.1. Explanation Length

Using the length of the explanations, we examine whether expert and student explanations can be distinguished, and we observe that the explanations marked correct vary in length. Some correct explanations have the same length as incorrect explanations (see Figure 2a). This may be because student explanations are generally similar in length, regardless of whether they are annotated as correct or incorrect.

5.1.2. Readability Metrics

We observe that it is impossible to differentiate correct from incorrect student explanations using this readability metric (see Figure 2b). This may be because the student explanations do not differ in the technical level of their writing but rather in the computing concepts used to explain the line of code, which we observed when annotating the dataset.

5.1.3. Lexical Density

We observe that lexical density also may not differentiate good from bad student explanations (see Figure 2c). Lexical density measures a more linguistic aspect of the explanations through parts of speech, which may not necessarily evaluate the conceptual aspects of the explanations. This is because the concepts may not be associated with a particular part of speech and are more connected with the ontology of concepts in computing.

5.1.4. Vocabulary

Correctness also does not seem to depend on the vocabulary of the student explanations (see Figure 2d). The vocabulary of a sentence is more of a linguistic measure. Hence, it may not necessarily capture the conceptual ontology of computing, such as the one discussed in the earlier work on JavaParser [25].

Figure 2: Scatter plots of various text linguistic ("surface") metrics: (a) student and expert explanation length, (b) student and expert explanation Gunning Fog, (c) student and expert explanation lexical diversity, (d) student and expert explanation vocabulary. The x and y axes represent the student and expert sentence values, respectively. The colors, shapes, and sizes represent the annotation of the students' explanations for their correctness or incorrectness. Correct and incorrect student explanations are not differentiable by the lexical surface metrics.

5.2. Expert–Student Explanation Similarity

5.2.1. ChrF Score

We observe that the differences, considering the class imbalance between the correct and incorrect explanations, could create an issue with setting a threshold to differentiate the explanations using this score (see Figure 3a). Further inspection of similarity scores at a line-by-line level shows that, irrespective of the expert explanation used to calculate the similarity, the correct and incorrect student explanations cannot be separated easily by their chrF score (see Figure 4a).

5.2.2. METEOR Metric

There is a more noticeable difference in the METEOR similarity scores between the correct and incorrect student explanations. This could be due to n-gram level word alignment. The density plots of the METEOR similarity score distribution show that most incorrect explanations have a METEOR score below 0.3, as shown in Figure 3b. However, more than 50% of the correct explanations also have a METEOR similarity score below 0.3. Thus, irrespective of the expert explanation used to calculate the similarity, the correct and incorrect student explanations cannot be easily separated using the METEOR score (see Figure 4b).

5.2.3. BERTScore

While we expected better performance of BERTScore in separating correct and incorrect explanations, the density plots of the BERTScore distribution for correct and incorrect explanations show very little difference. We might observe differences if we pre-trained a RoBERTa model on instances from the dataset (see Figure 3c). As before, inspection of the similarity scores at a line-by-line level shows that, regardless of the expert explanation used to calculate similarity, the correct and incorrect student explanations cannot be separated easily (see Figure 4c).
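The pairwise similarity scores from Section 3.2, summarized in Table 2 below, could be computed along the following lines. This is a sketch assuming the nltk, sacrebleu, and bert-score packages; the exact configurations behind the reported numbers may differ.

```python
# A sketch of the pairwise expert-student similarity scores (Section 3.2).
# METEOR requires the NLTK 'wordnet' and 'punkt' resources (nltk.download(...)).
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
from sacrebleu.metrics import CHRF
from bert_score import score as bertscore

expert = ("To shift the x-coordinate of the point, we need to add dx "
          "to the value of the x-coordinate of the point.")
student = "The value of dx is added to variable x."

# METEOR works on tokenized reference/hypothesis pairs (0..1).
meteor = meteor_score([word_tokenize(expert)], word_tokenize(student))

# chrF is a character n-gram F-score; sacrebleu reports it on a 0..100 scale.
chrf = CHRF().sentence_score(student, [expert]).score / 100.0

# BERTScore returns precision, recall, and F1 tensors; we keep F1.
_, _, f1 = bertscore([student], [expert], lang="en")

print(f"METEOR={meteor:.3f}  chrF={chrf:.3f}  BERTScore-F1={f1.item():.3f}")
```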



 Similarity Metric    Incorrect (Mean, SD)    Correct (Mean, SD)
 chrF                 0.305, 0.114            0.361, 0.140
 METEOR               0.140, 0.091            0.283, 0.170
 BERTScore            0.874, 0.028            0.894, 0.024

Table 2
The mean similarity scores between expert and student explanations are not very different for student explanations annotated as correct versus incorrect. The largest difference is observable with the METEOR metric, which is also visible in the plots.

Figure 3: Density plots of (a) character F (chrF) similarity, (b) METEOR similarity, and (c) BERTScore similarity. We observe that student and expert explanations are similar irrespective of whether the student explanations are annotated as correct or incorrect.

Figure 4: Scatter plots of the three similarity metrics, (a) character F (chrF) similarity, (b) METEOR similarity, and (c) BERTScore similarity, between expert and student explanations for a given line of code in a given program, separated by correct vs. incorrect annotations; the plots show some differences in scale across the different similarity metrics.

5.2.4. Similarity Correlations

When computing similarity scores between the student and expert explanations, we can take the average similarity of each student explanation for a line of code in a solution with respect to the two expert explanations. We can calculate this for all the lines of all the programs and observe that, while the ranges of values of the three similarity metrics differ, they are highly correlated (p < 1e-6, 0.5 ≤ corr ≤ 0.6), as shown in Figures 5a and 5b.

Figure 5: (a) BERTScore–chrF correlation and (b) METEOR–chrF correlation. When we draw correlations, we observe that the three similarity metrics are correlated with one another.
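A minimal sketch of the correlation check described above is shown below. The input file and column names are hypothetical, and Pearson correlation is used as one reasonable choice; the paper does not specify which correlation coefficient was used.

```python
# A sketch of the Section 5.2.4 correlation check, assuming a per-pair score
# table with hypothetical columns chrf, meteor, and bertscore (one row per
# student explanation, averaged over the two expert explanations for that line).
import pandas as pd
from scipy.stats import pearsonr

scores = pd.read_csv("pairwise_similarity_scores.csv")  # hypothetical file

for a, b in [("bertscore", "chrf"), ("meteor", "chrf"), ("bertscore", "meteor")]:
    r, p = pearsonr(scores[a], scores[b])
    print(f"corr({a}, {b}) = {r:.2f} (p = {p:.1e})")
```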


6. Conclusion

In this work, we present the challenges of analyzing new kinds of datasets, such as the code explanation dataset used in this paper. We observe that we need more sophisticated metrics to evaluate student explanations as "good" or "bad": surface-level metrics are mostly ineffective in evaluating student explanation correctness. We show that similarity-based metrics also do not perform well in separating the "good" from the "bad" student explanations.

Our work has several limitations. We did not consider combinations of lexical and similarity-based features to classify student explanation correctness. The goal of this paper is not to present the best possible classifier, but rather to show the difficulty in identifying useful features to differentiate correct from incorrect student explanations. We are addressing this in our ongoing work with the use of LLMs to assess correctness and to provide feedback on student explanations. Our expert explanations may not offer sufficiently diverse correct (positive) and incorrect (negative) examples to build robust classifiers. We are developing cross-validation techniques to build better classifiers that are exposed to a variety of synthetic and real-world examples for evaluating student explanation correctness. In this work, we did not present similar results for multi-sentence student and expert explanations. While this is important, we chose to present a prototype for single-sentence evaluation that is scalable with aggregation techniques, which we will present in upcoming future work. This dataset also does not cover cases where students improve over time in providing correct explanations to lines of code as they progress through harder programming solutions. We will explore this in a longitudinal study as we build a system that can present harder examples for students to explain once their explanations are evaluated as correct by a better classifier that uses several of the analyses presented in this work as evidence.


Acknowledgments

We thank Jeevan Chapagain for his efforts in the annotation of the dataset. We thank Dr. Xiang Lorraine Li for her contributions to writing this draft and for the natural language processing topics of the paper.


References

 [1] Y. Shi, R. Schmucker, M. Chi, T. Barnes, T. Price, KC-Finder: Automated knowledge component discovery for programming problems, International Educational Data Mining Society (2023).
 [2] A. M. Kazerouni, R. S. Mansur, S. H. Edwards, C. A. Shaffer, Student debugging practices and their relationships with project outcomes, in: Proceedings of the 50th ACM Technical Symposium on Computer Science Education, 2019, pp. 1263–1263.
 [3] P. Denny, J. Prather, B. A. Becker, Error message readability and novice debugging performance, in: Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, 2020, pp. 480–486.
 [4] M. Hoq, Y. Shi, J. Leinonen, D. Babalola, C. Lynch, T. Price, B. Akram, Detecting ChatGPT-generated code submissions in a CS1 course using machine learning models, in: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, 2024, pp. 526–532.
 [5] P. Oli, R. Banjade, A. B. Lekshmi Narayanan, J. Chapagain, L. J. Tamang, P. Brusilovsky, V. Rus, Improving code comprehension through scaffolded self-explanations, in: Proceedings of the 24th International Conference on Artificial Intelligence in Education, Part 2, Springer, 2023, pp. 478–483. URL: https://doi.org/10.1007/978-3-031-36272-9_75.
 [6] J. Leinonen, P. Denny, S. MacNeil, S. Sarsa, S. Bernstein, J. Kim, A. Tran, A. Hellas, Comparing code explanations created by students and large language models, 2023. arXiv:2304.03938.
 [7] A.-B. Lekshmi-Narayanan, J. Chapagain, P. Brusilovsky, V. Rus, SelfCode 2.0: Annotated corpus of student self-explanations to introductory Java programs in computer science, 2024. URL: https://doi.org/10.5281/zenodo.10912669. doi:10.5281/zenodo.10912669.
 [8] X. Fan, W. Luo, M. Menekse, D. Litman, J. Wang, CourseMirror: Enhancing large classroom instructor-student interactions via mobile interfaces and natural language processing, in: Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, 2015, pp. 1473–1478.
 [9] S. A. Crossley, P. Baffour, Y. Tian, A. Picou, M. Benner, U. Boser, The persuasive essays for rating, selecting, and understanding argumentative and discourse elements (PERSUADE) corpus 1.0, Assessing Writing 54 (2022) 100667.
[10] A. C. Graesser, D. S. McNamara, M. M. Louwerse, Z. Cai, Coh-Metrix: Analysis of text on cohesion and language, Behavior Research Methods, Instruments, & Computers 36 (2004) 193–202.
[11] R. E. Wang, D. Demszky, Edu-ConvoKit: An open-source library for education conversation data, arXiv preprint arXiv:2402.05111 (2024).
[12] R. Schwartz, M. Sap, I. Konstas, L. Zilles, Y. Choi, N. A. Smith, The effect of different writing tasks on linguistic style: A case study of the ROC Story Cloze task, arXiv preprint arXiv:1702.01841 (2017).
[13] M. Glockner, V. Shwartz, Y. Goldberg, Breaking NLI systems with sentences that require simple lexical inferences, arXiv preprint arXiv:1805.02266 (2018).
[14] Y. Belinkov, A. Poliak, S. M. Shieber, B. Van Durme, A. M. Rush, Don't take the premise for granted: Mitigating artifacts in natural language inference, arXiv preprint arXiv:1907.04380 (2019).
[15] T. W. Li, S. Hsu, M. Fowler, Z. Zhang, C. Zilles, K. Karahalios, Am I wrong, or is the autograder wrong? Effects of AI grading mistakes on learning, in: Proceedings of the 2023 ACM Conference on International Computing Education Research, Volume 1, 2023, pp. 159–176.
[16] M. Fowler, B. Chen, S. Azad, M. West, C. Zilles, Autograding "explain in plain English" questions using NLP, in: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education, SIGCSE '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1163–1169. URL: https://doi.org/10.1145/3408877.3432539. doi:10.1145/3408877.3432539.
[17] P. Denny, D. H. Smith IV, M. Fowler, J. Prather, B. A. Becker, J. Leinonen, Explaining code with a purpose: An integrated approach for developing code comprehension and prompting skills, arXiv preprint arXiv:2403.06050 (2024).
[18] S. Haller, A. Aldea, C. Seifert, N. Strisciuglio, Survey on automated short answer grading with deep learning: From word embeddings to transformers, arXiv preprint arXiv:2204.03503 (2022).
[19] J. Lin, Z. Han, D. R. Thomas, A. Gurung, S. Gupta, V. Aleven, K. R. Koedinger, How can I get it right? Using GPT to rephrase incorrect trainee responses, arXiv preprint arXiv:2405.00970 (2024).
[20] A. B. L. Narayanan, P. Oli, J. Chapagain, M. Hassany, R. Banjade, P. Brusilovsky, V. Rus, Explaining code examples in introductory programming courses: LLM vs humans, in: AI for Education: Bridging Innovation and Responsibility at the 38th AAAI Annual Conference on AI, 2024. URL: https://openreview.net/forum?id=zImjfZG3mw.
[21] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[22] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, arXiv preprint arXiv:1904.09675 (2019).
[23] M. Popović, chrF: Character n-gram F-score for automatic MT evaluation, in: Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015, pp. 392–395.
[24] R. Hosseini, K. Akhuseyinoglu, P. Brusilovsky, L. Malmi, K. Pollari-Malmi, C. Schunn, T. Sirkiä, Improving engagement in program construction examples for learning Python programming, International Journal of Artificial Intelligence in Education 30 (2020) 299–336. URL: https://doi.org/10.1007/s40593-020-00197-0. doi:10.1007/s40593-020-00197-0.
[25] R. Hosseini, P. Brusilovsky, JavaParser: A fine-grain concept indexing tool for Java problems, in: CEUR Workshop Proceedings, volume 1009, University of Pittsburgh, 2013, pp. 60–63.