<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reasoning on OIE Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nisrine Ait Khayi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasile Rus</string-name>
          <email>vrus@memphis.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Memphis</institution>
          ,
          <addr-line>375 Dunn Hall, Memphis, TN 38152</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Assessing self-explanations of code is a critical task in educational Natural Language Processing (NLP), essential for providing automated feedback in programming education. While recent studies demonstrate that LLMs exhibit analogical reasoning capabilities, their application to educational code explanation assessment remains underexplored. To address this gap, we propose a novel three-stage approach: (1) prompting the LLM to extract Open Information Extraction (OIE) units from student and expert explanations, (2) prompting the LLM to construct semantic graphs from these units, and (3) employing LLM-based analogical reasoning to assess explanation similarity. We evaluate our approach on the Self-code corpus using Pearson and Spearman correlations. Our method achieves correlations of 0.8 and 0.76, respectively, significantly outperforming supervised models like BERT (0.74/0.75) and unsupervised approaches like text-embedding-ada-002 (0.67/0.64). These results demonstrate the effectiveness of structured semantic representation combined with analogical reasoning for educational assessment tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>analogical reasoning</kwd>
        <kwd>OIE graphs</kwd>
        <kwd>automated assessment</kwd>
        <kwd>large language models</kwd>
        <kwd>code comprehension</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Assessing students' self-explanations of learning content is a critical task in educational Natural
Language Processing (NLP) and in education technologies that prompt students for natural language
input. Accurate assessment of learners' self-explanations is especially critical in highly adaptive education
technologies such as Intelligent Tutoring Systems (ITSs), which prompt learners for such explanations.
Accurate assessment enables these ITSs to provide tailored feedback and hints to each individual student,
which is needed to maximize student learning outcomes [1, 2, 3].</p>
      <p>One approach to automatically assessing student self-explanations consists of evaluating the semantic
equivalence between, for instance, two explanations of a code statement: a student explanation versus
a benchmark or gold explanation, e.g., generated by an expert. That is, a student self-explanation is
assessed against a benchmark explanation whose correctness value is known. Specifically, a semantic
similarity score is computed, for instance, on a scale from 0 (no semantic similarity) to 5 (complete
semantic similarity) [4]. Other scoring scales have been used, such as a normalized continuous score
between 0 and 1 (1=semantic equivalence). It should be noted that this semantic similarity approach is
the most widely used approach for assessing free-text self-explanations.</p>
      <p>Many approaches have been explored to address the semantic similarity task over the past several
decades [5, 6, 7]. Recent advancements in foundation models have leveraged fine-tuned pre-trained
language models to capture similarity at the sentence level, yielding state-of-the-art performance.
For instance, Lee and colleagues [8] introduced a contrastive learning framework that optimizes the
weighted sum of contextualized token distances, enhancing both similarity scores and interpretability.
Rep and colleagues [9] improved performance by applying a novel Truncated Model Fine-Tuning (TMFT)
method to ELECTRA [10]. The TMFT method uses mean pooling over the last layer's embeddings and
improves the Spearman correlation by over 8 points while increasing parameter efficiency on the
STS benchmark. Despite these successes, Gatto et al. [11] showed that generative LLMs outperform
existing encoder-based models, revealing limitations of these encoder-based approaches in their learning
capabilities. These findings suggest that while LLMs excel at pattern recognition, they may benefit from
incorporating more structured reasoning mechanisms, such as analogical reasoning, which humans
naturally use to understand and compare complex conceptual relationships.</p>
      <p>Analogical reasoning, a cornerstone of human cognitive intelligence, enables knowledge adaptation
[12], decision-making [13], and problem-solving [14]. According to the Structure Mapping Theory
(SMT) [15], humans conduct analogical reasoning using the shared relational structure between two
domains. Mental Models Theory [16] focuses on how people create mental models of the world to reason
about structures and functions. This theory supports functional analogical reasoning by suggesting
that humans can simulate mental models of complex dynamic systems based on how they function in
real life. For example, the function of the 'heart pumping blood' is analogous to the function of a 'pump
pumping water' despite differences in their structures. The multi-constraint theory [17] introduces the
concept of constraint satisfaction, where analogical reasoning is guided by: (1) direct similarity
between the elements involved, (2) the structural parallels in the roles between the source and target
domains, and (3) the purpose constraint, implying that analogical thinking is guided by the goal of the
reasoner.</p>
      <p>Recent studies have shown that LLMs exhibit analogical reasoning capabilities, particularly through
in-context learning [18]. For example, Yasunaga et al. [14] demonstrated that analogical prompting, which
instructs LLMs to self-generate relevant exemplars in context before solving the problem, improves their
reasoning process. Qin et al. [19] demonstrated that self-generated random examples can achieve
comparable or even better performance. However, the LLMs' potential to perform analogical reasoning
in semantic similarity contexts remains underexplored for the following reasons: (1) existing LLM work
on semantic similarity focuses on surface-level similarity, and (2) more robust evaluation is needed to
validate the LLMs' analogical reasoning capabilities. To address this gap, we propose a novel approach,
grounded in the Structure Mapping and Mental Models theories, that combines analogical reasoning with
LLM-based prompting to enhance the assessment of code self-explanations. More specifically, our
approach uses LLMs to infer structured representations of code explanations in the form of Open
Information Extraction (OIE) graphs, which use subject-predicate-object triplets as their basic unit of
structure. The proposed approach differs from existing LLM-based prompting approaches by extracting
OIE units from self-explanations. Constructing OIE graphs as structured representations enables more
interpretable similarity assessment and also allows us to leverage the analogical reasoning abilities of
LLMs over these structured representations.</p>
      <p>Our main contributions are as follows:
• Explore the potential of LLMs to extract OIE triplets to represent the semantics of self-explanations
as structured components.
• Explore the potential of LLMs to construct OIE graphs from extracted triplets that capture
relational and semantic roles reflecting relationships within self-explanations.
• Explore the potential of LLMs to conduct functional and structural analogical reasoning for
the purpose of assessing explanations.</p>
      <p>The remainder of the paper is organized as follows: Section 2 reviews related works, Section 3
describes the proposed method, Section 4 presents the experiments and results, and Section 5 concludes
this work with future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Large Language Models (LLMs) have recently gained prominence in education, particularly for the
automated assessment task. For instance, Xie et al. [20] proposed an LLM-based grading system built on
three stages: (1) rubric generation, where students' answers are incorporated into the rubric design;
(2) grading, where the generated rubric guides the LLM to grade students' answers using various
prompting strategies (one-shot prompts, self-reflection prompts where the LLM is asked to reflect on its
previous output, and batching prompts where students' answers are processed in batches and graded
together by the LLM); and (3) post-grading review, where the LLM identifies unreasonable results and
sends them back for re-grading. Their empirical experiments demonstrated the effectiveness of their
approach, providing new insights for the assessment task using LLMs. Similarly, Cohn et al. [21] employed
Chain-of-Thought (CoT) prompting and an active learning approach for grading and explaining the
assessment of responses in science. The model's reasoning was used to elicit explanations for its responses,
enabling teachers to provide feedback to students. Xiao et al. [22] showcased the effectiveness of LLMs in
the Automated Essay Scoring (AES) task, emphasizing the importance of providing comprehensive
contexts, clear rubrics, and high-quality examples. Their research highlighted the capability of LLMs to
provide explanations and feedback to students in addition to the essay's score. Additionally, Katuka et
al. [23] explored the potential of quantized LLMs for improving the performance of the automatic grading
and feedback generation downstream tasks. They fine-tuned a 4-bit quantized Llama-2 for automatic
grading and feedback generation using supervised instruction fine-tuning. Their empirical results showed
a reduction in grading error rates of 3%, and their approach outperformed traditional LLMs like GPT-2 in
generating feedback, with improvements of 0.19, 0.2, and 0.12 in BLEU, ROUGE-1, and ROUGE-2 scores,
respectively, on the SAR dataset. On the Proprietary dataset, they achieved improvements of 0.64, 0.46,
and 0.65 in BLEU, ROUGE-1, and ROUGE-2 scores, respectively. Oli et al. [24] used LLMs to compare
students' self-explanations and experts' self-explanations in the context of line-by-line explanation of
computer programs. Their results demonstrated the superior performance of the LLMs, when prompted
in few-shot and Chain-of-Thought settings, over fine-tuned encoder-based models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>Our approach relies on LLMs at different phases of its process. First, we prompt the LLM to extract Open
Information Extraction (OIE) triplets from the student and expert explanations, capturing their semantics.
Second, we prompt the LLM to construct OIE graphs, reflecting relationships in the explanations. Third,
we prompt the LLM to perform structural and functional analogical reasoning on OIE graphs to assess
code self-explanations.</p>
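      <p>As a concrete illustration of this process, the sketch below chains the three prompting stages using an
OpenAI-style chat client. The prompt wording and the helper names (ask, assess) are simplified,
hypothetical stand-ins rather than the exact prompts used in our experiments.</p>
      <p>
# Illustrative end-to-end sketch of the three prompting stages (prompt wording is hypothetical).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt, model="gpt-4o"):
    """Send a single user message to the LLM and return the text of its reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic outputs, as in our GPT experiments
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def assess(student_explanation, expert_explanation):
    # Stage 1: extract OIE triplets from each explanation.
    student_triplets = ask("Extract OIE triplets (arg1, rel, arg2) from: " + student_explanation)
    expert_triplets = ask("Extract OIE triplets (arg1, rel, arg2) from: " + expert_explanation)
    # Stage 2: construct OIE graphs (nodes = arguments, edges = relations) in JSON.
    student_graph = ask("Build a JSON OIE graph with nodes and edges from these triplets: " + student_triplets)
    expert_graph = ask("Build a JSON OIE graph with nodes and edges from these triplets: " + expert_triplets)
    # Stage 3: functional and structural analogical reasoning over the two graphs.
    return ask("Compare the following two OIE graphs using functional and structural analogical "
               "reasoning and output a similarity score between 0 and 1.\n"
               "Student graph: " + student_graph + "\nExpert graph: " + expert_graph)
</p>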
      <sec id="sec-3-1">
        <title>3.1. The Open Information Extraction Phase</title>
        <p>Open Information Extraction (OIE) systems map a sentence S = w1, ..., wn to relational triplets
(arg1, rel, arg2), where arg1 and arg2 are noun phrases built from the tokens wi in S, and rel represents
the semantic relationship between arg1 and arg2 [25]. In this phase, we prompt the LLM to extract OIE
triplets from both student and expert self-explanations. The OIE prompt was refined through an iterative
refinement process to optimize its effectiveness. We started with a simple prompt, which did not yield
very good results. Providing some examples, i.e., few-shot prompting, where the model is given a few
examples to facilitate its understanding of the task [26], helped. Additionally, we incorporated
a specialized role within the prompting framework [27], and we systematically varied instruction
phrasing and example selection to optimize triplet extraction quality.</p>
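        <p>As an illustration of this phase, a few-shot OIE extraction prompt might be structured as follows. The
role description, examples, and wording below are simplified assumptions, not the exact refined prompt;
the in-context examples reuse explanations that appear later in this paper.</p>
        <p>
# Hypothetical few-shot OIE extraction prompt (illustrative wording only).
OIE_EXTRACTION_PROMPT = """You are an expert in Open Information Extraction for explanations of code.
Extract all (arg1, rel, arg2) triplets that capture the meaning of the explanation.

Example 1
Explanation: The variable count is incremented by one each time the loop runs.
Triplets: (count, incremented by, one); (loop, runs, each time)

Example 2
Explanation: The name field in user_info dictionary is updated with input provided by the user.
Triplets: (name field, in, user_info dictionary); (user_info, updated with, input); (input, provided by, user)

Explanation: {explanation}
Triplets:"""

def build_oie_prompt(explanation):
    """Fill the few-shot template with the explanation to be processed."""
    return OIE_EXTRACTION_PROMPT.format(explanation=explanation)
</p>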
      </sec>
      <sec id="sec-3-2">
        <title>3.2. The OIE Graph Construction Phase</title>
        <p>In this phase, the LLM is prompted to construct OIE graphs for both student and expert explanations,
representing the key relationships and interactions within the explanations.</p>
        <p>Using the OIE triplets of the form (arg1, rel, arg2), we instruct the LLM to identify the most significant
OIE triplets, prompting it to extract those representing core conceptual relationships and main
operations in the explanations. These triplets are then structured into nodes, representing the arguments
(arg1 and arg2), and edges, representing their semantic relationships (rel). Additionally, we incorporated
a few examples in JSON format to facilitate the LLM's understanding of OIE graph construction.</p>
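        <p>To make the target representation concrete, the helper below deterministically arranges triplets into the
node/edge JSON structure that the LLM is asked to produce; it is an illustrative stand-in, since in our
approach the graph itself is generated by the prompted LLM.</p>
        <p>
# Illustrative conversion of (arg1, rel, arg2) triplets into an OIE graph in JSON.
import json

def triplets_to_graph(triplets):
    """Arrange (arg1, rel, arg2) triplets into nodes (arguments) and edges (relations)."""
    nodes, edges = {}, []
    for arg1, rel, arg2 in triplets:
        nodes.setdefault(arg1, {"id": arg1})
        nodes.setdefault(arg2, {"id": arg2})
        edges.append({"source": arg1, "target": arg2, "relation": rel})
    return json.dumps({"nodes": list(nodes.values()), "edges": edges}, indent=2)

# Example with the student explanation triplets from Section 3.4:
print(triplets_to_graph([("user_info", "stores", "data"), ("data", "entered by", "user")]))
</p>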
      </sec>
      <sec id="sec-3-3">
        <title>3.3. The Analogical Reasoning Phase</title>
        <p>Functional analogical reasoning looks for analogous functions and roles that nodes or edges play in
the OIE student and expert graphs. It assesses whether the explanations describe the same purpose and
behavior of the code, even if their phrasing or level of detail differs. For example:
• Code statement: count += 1.
• User explanation: The variable count is incremented by one each time the loop runs.
• Expert explanation: Each loop iteration increases count by one.
• OIE graph generation: Using Open Information Extraction, key entities (nodes) and
relationships (edges) are automatically extracted from each explanation.
• OIE graph 1:
– Node1: "count", Node2: "one", Edge: "incremented by".
– Node1: "loop", Node2: "each time", Edge: "runs".
• OIE graph 2:
– Node1: "count", Node2: "one", Edge: "increases by".
– Node1: "loop iteration", Node2: "count", Edge: "increases".
These graphs are functionally analogous, both capturing the essential relationship between loop
execution and count incrementing.</p>
        <p>Structural analogical reasoning compares the organizational structure between two OIE graphs,
such as the arrangement of nodes and edges, even if they describe different code functions.</p>
        <p>The functional and structural analogical reasoning prompt was refined through an iterative process,
where we systematically varied instruction phrasing, role definitions, and example selections, evaluating
each version’s ability to perform accurate analogical reasoning.</p>
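        <p>For illustration, an analogical reasoning prompt covering both the functional and the structural
constraints might look as follows; the wording is a simplified assumption rather than the refined prompt
used in our experiments.</p>
        <p>
# Hypothetical analogical-reasoning prompt over two OIE graphs (illustrative wording only).
ANALOGICAL_REASONING_PROMPT = """You are an expert in analogical reasoning over semantic graphs.
You are given two OIE graphs extracted from a student and an expert explanation of the same line of code.

1. Functional reasoning: do nodes and edges play analogous roles, i.e., do the two graphs describe the
   same purpose and behavior of the code, even if phrasing or level of detail differs?
2. Structural reasoning: is the arrangement of nodes and edges parallel between the two graphs?

Student graph: {student_graph}
Expert graph: {expert_graph}

Output a similarity score between 0 and 1 and a confidence score between 0 and 1."""
</p>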
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Methodology Illustration</title>
        <p>To illustrate our complete methodology, consider the following example:
• Code statement: user_info['name'] = input("Enter your name: ").
• Student explanation: The user_info dictionary stores data entered by the user.
• Expert explanation: The name field in the user_info dictionary is updated with input provided by
the user.
• Student explanation OIE triplets:
– (user_info, stores, data)
– (data, entered by, user)
• Expert explanation OIE triplets:
– (name field, in, user_info dictionary)
– (user_info, updated with, input)
– (input, provided by, user)
• OIE student explanation graph, shown in Listing 1:</p>
        <p>Listing 1: JSON Student Graph Representation
{
  "nodes": [
    {"id": "user_info", "label": "dictionary"},
    {"id": "data", "label": "value"},
    {"id": "user", "label": "source"}
  ],
  "edges": [
    {"source": "user_info", "target": "data", "relation": "stores"},
    {"source": "data", "target": "user", "relation": "entered by"}
  ]
}</p>
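        <p>For completeness of the illustration, the expert explanation triplets can be arranged into an analogous
graph. The JSON below is an illustrative reconstruction in the same format as Listing 1 (the node labels
are assumptions), not the exact graph produced in our experiments.</p>
        <p>{
  "nodes": [
    {"id": "name field", "label": "field"},
    {"id": "user_info", "label": "dictionary"},
    {"id": "input", "label": "value"},
    {"id": "user", "label": "source"}
  ],
  "edges": [
    {"source": "name field", "target": "user_info", "relation": "in"},
    {"source": "user_info", "target": "input", "relation": "updated with"},
    {"source": "input", "target": "user", "relation": "provided by"}
  ]
}</p>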
        <p>• Similarity Score: The graphs show different focuses - the student emphasizes data storage while
the expert emphasizes field updating. These represent different functional aspects of the same
code. Structural patterns differ in node relationships. Similarity score: 0.45. The OIE triplets, OIE
graphs, and the semantic score are generated by the LLM.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>To evaluate the performance of our proposed method, we have conducted several experiments with the
Self-code corpus.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The Self-code corpus [28] consists of sentence pairs of student and expert self-explanations of Java code
examples, along with semantic similarity judgments on a 1-5 scale, where 1 means the student and
expert self-explanations are not similar, whereas 5 means they are semantically equivalent. The expert
explanations come from a collection of expert-annotated examples from a catalog of interactive learning
content [29]. A crowdsourcing approach was followed to collect explanations from students with
different backgrounds and skills. Examples of student and expert explanations of code lines are shown
in Table 1.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Examples of student and expert explanations of code lines.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Student explanation</th>
                <th>Expert explanation</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>We define a string variable to hold the name.</td>
                <td>Sets the fullName string to John Smith.</td>
              </tr>
              <tr>
                <td>This line invokes the method translate of the point.</td>
                <td>moves the X, and Y points by adding (11,6) point.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Setup</title>
        <p>LLM-based prompting: We evaluated our proposed analogical reasoning approach against
Chain-of-Thought (CoT) prompting in a few-shot setting using 3 examples per prompt. We randomly
selected 100 instances from the Self-code dataset to balance analytical depth with practical constraints
of cost and computational time. The selected sample’s distribution across similarity scores is similar to
the original distribution. We experimented with multiple state-of-the-art LLMs: GPT-4o, GPT-4-turbo
[30], mistral-large-latest [31], and llama-3.1-70b [32]. For the GPT models, temperature was set to 0 to
ensure deterministic outputs. Default temperature settings were used for Mistral and LLaMA models.
The CoT baseline prompting approach asks the LLMs to compute the similarity scores between student
and expert self-explanations step by step, using the phrase “Let’s think step by step”.</p>
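        <p>For concreteness, a minimal CoT baseline prompt of this kind might be written as follows; the exact
wording of the baseline prompt is an assumption.</p>
        <p>
# Illustrative CoT baseline prompt (wording is an assumption, not the exact prompt used).
COT_BASELINE_PROMPT = """Student explanation: {student_explanation}
Expert explanation: {expert_explanation}

Let's think step by step and compute a semantic similarity score between 0 and 1 for the two
explanations. End your answer with a line of the form "Score: 0.xx"."""
</p>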
        <p>Supervised STS approaches: We compare our analogical prompting-based approach to: (1) ALBERT [33],
(2) RoBERTa [34], (3) BERT [35], and (4) CodeBERT [36]. We used the base versions of the
pretrained models, fine-tuned for 3 epochs. The AdamW optimizer with a learning rate of
2e-5 was used. The Self-code dataset contains 1770 instances. The models were tested on the
same test data as the analogical reasoning prompting-based approach. The remaining dataset (1670
instances) was used for training (80%) and validation (20%).</p>
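        <p>A minimal fine-tuning sketch for these supervised baselines is given below. The hyperparameters follow
the setup above, while the toy one-row dataset and its column names are assumptions standing in for the
80%/20% train/validation split of the 1670 non-test instances.</p>
        <p>
# Illustrative fine-tuning sketch for the supervised STS baselines (regression on sentence pairs).
from datasets import Dataset, Value
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # likewise albert-base-v2, roberta-base, microsoft/codebert-base
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)  # regression head

pairs = Dataset.from_dict({
    "student": ["The user_info dictionary stores data entered by the user."],
    "expert": ["The name field in user_info dictionary is updated with input provided by the user."],
    "labels": [0.45],  # normalized similarity score
}).cast_column("labels", Value("float32"))

def tokenize(batch):
    # Encode each student/expert explanation pair as a single input sequence.
    return tokenizer(batch["student"], batch["expert"], truncation=True, padding="max_length", max_length=128)

pairs = pairs.map(tokenize, batched=True)
args = TrainingArguments(output_dir="sts-baseline", learning_rate=2e-5, num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=pairs).train()
</p>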
        <p>Unsupervised STS approaches: Student and expert self-explanations are encoded using state-of-the-art
models for the STS task and then compared using cosine similarity. We experimented with:
(1) OpenAI embeddings (openai-ada-002 and openai-embedding-3-large), (2) the pretrained
all-mpnet-base-v2 model fine-tuned on a 1B sentence-pair dataset, and (3) SimCSE [5], which leverages
enhanced sentence representations learned through contrastive learning. We used SimCSE-BERT and
SimCSE-RoBERTa.</p>
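        <p>The unsupervised scoring can be sketched as follows, assuming the sentence-transformers library and
the all-mpnet-base-v2 checkpoint; the OpenAI and SimCSE encoders are used analogously.</p>
        <p>
# Unsupervised STS scoring sketch: encode both explanations and take the cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def unsupervised_similarity(student_explanation, expert_explanation):
    embeddings = encoder.encode([student_explanation, expert_explanation])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

print(unsupervised_similarity(
    "The user_info dictionary stores data entered by the user.",
    "The name field in user_info dictionary is updated with input provided by the user."))
</p>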
        <p>All approaches produce similarity scores on the same 100-instance test dataset. LLM outputs are
normalized to the 0-1 range. We report Spearman's rank correlation and Pearson's correlation between
the predicted scores and the ground-truth similarity scores (normalized from the 1-5 scale to the 0-1
scale) on this subset of the dataset.</p>
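        <p>The evaluation step can be sketched as follows, assuming the predicted and gold scores are stored in two
parallel lists and that the 1-5 judgments are mapped linearly onto the 0-1 range.</p>
        <p>
# Evaluation sketch: normalize 1-5 gold judgments to 0-1 and compute both correlations.
from scipy.stats import pearsonr, spearmanr

def evaluate(predicted, gold_1_to_5):
    gold = [(score - 1) / 4 for score in gold_1_to_5]  # linear mapping of the 1-5 scale onto 0-1
    return pearsonr(predicted, gold)[0], spearmanr(predicted, gold)[0]

# Toy usage with three hypothetical score pairs:
print(evaluate([0.45, 0.9, 0.2], [3, 5, 1]))
</p>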
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Results</title>
        <p>LLM-based Approaches
According to Table 2, our analogical reasoning approach on OIE graphs consistently outperforms the
CoT baseline across most LLMs. GPT-4o achieves the strongest performance with Pearson correlation
of 0.8 and Spearman correlation of 0.76, representing improvements of 8.1% and 8.6% respectively
over CoT. LLaMA shows substantial improvements of 10.8% in both metrics (0.72 vs 0.65). Notably,
GPT-4-Turbo shows mixed results, with slightly lower Pearson correlation (0.72 vs 0.73) but improved
Spearman correlation (0.7 vs 0.66). This suggests that our approach may be sensitive to some LLMs’
architectures and training procedures. Mistral demonstrates the most modest improvements, achieving
4.9% and 3.3% gains in Pearson and Spearman correlations, respectively, suggesting limitations in
analogical reasoning and information extraction capabilities for this task. The findings also underscore
the superior analogical reasoning and information extraction abilities of GPT-4 models compared to
the open-source models, Mistral and LLaMA. These results align with prior findings on the analogical
reasoning performance of GPT-4 models [37].</p>
        <p>Table 2: Pearson and Spearman correlations of the OIE-analogical reasoning and CoT prompting
approaches (GPT-4o, GPT-4-Turbo, Mistral, and LLaMA experiments), of the supervised STS baselines
(ALBERT, RoBERTa, BERT, CodeBERT), and of the unsupervised STS baselines (OpenAI-ada-002,
OpenAI-embedding-3-large, all-mpnet-base-v2, SimCSE-BERT, SimCSE-RoBERTa).</p>
        <p>Comparison with Supervised and Unsupervised Baselines
As evidenced by Table 2, our approach significantly outperforms both supervised and unsupervised
baselines. Compared to the best-performing supervised model (BERT: 0.74, 0.75), GPT-4o with our
            approach achieves improvements of 8.1% and 1.3% in Pearson and Spearman correlations. Against
other supervised models, the improvements are even more substantial: 19.4% and 5.6% over ALBERT,
15.9% and 10.1% over RoBERTa, and 12.7% and 7.0% over CodeBERT. Unsupervised approaches
perform considerably worse, with the best OpenAI embedding (embedding-3-large) achieving only 0.67
and 0.64 correlations, representing a performance gap of 19.4% and 18.8% compared to our GPT-4o
implementation. The superior performance of the LLMs can be explained by several factors: (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) the
limited size of the training dataset of the fine-tuned supervised models, which may be insuficient for
capturing semantic relationships in the programming self-explanations, (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) LLMs leverage extensive
pretraining and complex reasoning capabilities, (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) our structured OIE-graph approach captures more
nuanced semantic understanding in comparison to the surface-level cosine similarity in unsupervised
models.
          </p>
        <p>OIE Graphs Ablation
Table 3 demonstrates the critical role of OIE graph construction in our approach. When performing
analogical reasoning directly on self-explanations without OIE graphs, all models show performance
degradation. GPT-4o experiences drops of 5.0% and 2.6% in Pearson and Spearman correlations,
respectively, while LLaMA shows more substantial decreases of 5.6% and 9.7%. These results confirm that
structured knowledge representation through OIE graphs is essential for effective analogical reasoning
in this domain.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Error Analysis</title>
        <p>We analyzed failed predictions with low confidence scores (&lt;= 0.6) to understand where LLMs fail in
performing accurate similarity assessment.</p>
        <p>Confidence Computation
For any student self-explanation and expert self-explanation pair, we prompt the LLM to generate a
verbalized confidence score.</p>
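        <p>As a sketch, low-confidence predictions can then be selected with a simple threshold on the verbalized
confidence; the "Confidence:" response format assumed here is illustrative.</p>
        <p>
# Illustrative filter for low-confidence predictions (assumed "Confidence: x" response format).
import re

def is_low_confidence(llm_response, threshold=0.6):
    """Flag a prediction for error analysis when its verbalized confidence is at most the threshold."""
    match = re.search(r"Confidence:\s*([01](?:\.\d+)?)", llm_response)
    return match is not None and float(match.group(1)) &lt;= threshold

print(is_low_confidence("Similarity score: 0.45\nConfidence: 0.55"))  # True
</p>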
        <p>Failure Analysis
We analyzed 24 low-confidence predictions (out of 100 instances) from GPT-4o to understand where our
proposed approach fails in performing an accurate similarity assessment. The analysis revealed three
error types: OIE extraction errors (33.3%), structural analogical reasoning errors (29.2%), and functional
analogical reasoning errors (37.5%). Functional reasoning errors were more prevalent, indicating the
LLMs' challenges in recognizing semantic equivalence between different expressions of the same
concept. For example, the model may fail to map "loop processes numbers" to "iteration over values,"
despite both describing the same algorithmic behavior. OIE extraction errors include incorrect or
incomplete triplet extraction from explanations. For instance, extracting ’(variable, set to, 0)’ when the
explanation describes initialization with array values. While less frequent, these errors had a significant
impact on performance. Structural analogical errors involve difficulty in aligning graph structures, even
when the extracted triplets are correct. The model may struggle to identify corresponding nodes and
edges between student and expert graphs with different organizational patterns, leading to incorrect
similarity assessment.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We proposed a novel approach that combines analogical reasoning with Open Information Extraction
(OIE) graphs to enhance LLM-based code self-explanation assessment. Experimental evaluation on the
Self-code corpus demonstrates significant improvements: 8.1% and 8.6% gains over Chain-of-Thought
prompting using GPT-4o, and up to 19.4% improvement over both supervised and unsupervised baselines,
highlighting the effectiveness of structured analogical reasoning for this task. The ablation study
confirms that OIE graph construction is essential, with performance drops of 5.0% when it is removed from
the pipeline. Error analysis of low-confidence failed predictions identified three categories of errors:
functional analogical reasoning errors (37.5%), OIE extraction errors (33.3%), and structural analogical
reasoning errors (29.2%). These findings suggest that while our approach is effective, there remain
challenges in semantic mapping and robust triplet extraction for programming-domain explanations.
Building on these findings, we plan to: (1) explore LLM-generated explanations to provide tailored
hints and feedback to students, aiming to improve their learning outcomes, (2) implement advanced
prompt optimization strategies to enhance the LLMs' capabilities in open information extraction, (3)
enhance the LLM prompting for analogical reasoning, and (4) add richer visualizations to convey the
results more efficiently.</p>
      <p>This work contributes to the AI-education field by demonstrating how structured reasoning
approaches can enhance LLM capabilities for programming-based assessment tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>Generative AI Statement</title>
      <p>The authors acknowledge the use of generative artificial intelligence (AI) tools during the preparation
of this research paper. Claude was used to rephrase selected paragraphs, polish sentences, and generate
illustrative percentages presented in the results section. In addition, the authors discussed the approach
structure with the AI tool, which proposed to include an example that was subsequently adapted
and verified by the authors. All research ideas, conceptual development, data analysis, interpretation
of findings, and final writing were carried out and validated solely by the authors, who take full
responsibility for the accuracy and integrity of the paper.</p>
      <p>[14] M. Yasunaga, X. Chen, Y. Li, P. Pasupat, J. Leskovec, P. Liang, E. H. Chi, D. Zhou, Large language
models as analogical reasoners, arXiv preprint arXiv:2310.01714 (2023).
[15] D. Gentner, Structure-mapping: A theoretical framework for analogy, Cognitive science 7 (1983)
155–170.
[16] P. N. Johnson-Laird, Mental models, Cambridge University Press, Cambridge, UK, 1983.
[17] K. J. Holyoak, P. Thagard, The analogical mind: Perspective and scope, Psychological Review 102
(1995) 428–444.
[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, others, D. Amodei, Language
models are few-shot learners, Advances in neural information processing systems 33 (2020)
1877–1901.
[19] C. Qin, W. Xia, T. Wang, F. Jiao, Y. Hu, B. Ding, R. Chen, S. R. Joty, Relevant or random: Can llms
truly perform analogical reasoning?, 2024. arXiv:abs/2404.12728.
[20] W. Xie, J. Niu, C. J. Xue, N. Guan, Grade like a human: Rethinking automated assessment with
large language models, 2024. arXiv:2405.19694, arXiv preprint.
[21] C. Cohn, N. Hutchins, T. Le, G. Biswas, A chain-of-thought prompting approach with llms
for evaluating students’ formative assessment responses in science, Proceedings of the AAAI
Conference on Artificial Intelligence 38 (2024) 23182–23190.
[22] C. Xiao, W. Ma, S. X. Xu, K. Zhang, Y. Wang, Q. Fu, From automation to augmentation: Large
language models elevating essay scoring landscape, 2024. arXiv:2401.06431, arXiv preprint
(2024).
[23] G. A. Katuka, A. Gain, Y. Y. Yu, Investigating automatic scoring and feedback using large language
models, 2024. arXiv:2405.00602, arXiv preprint.
[24] P. Oli, R. Banjade, J. Chapagain, V. Rus, Proceedings of the 2024 AAAI Conference on Artificial
Intelligence, PMLR 257 (2024) 118–128.
[25] P. Gamallo, An overview of open information extraction (invited talk), in: 3rd Symposium on
Languages, Applications and Technologies, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik,
2014.
[26] Y. Wang, Q. Yao, J. T. Kwok, L. M. Ni, Generalizing from a few examples: A survey on few-shot
learning, ACM computing surveys (csur) 53 (2020) 1–34.
[27] B. Cabral, D. Claro, M. Souza, Exploring open information extraction for portuguese using large
language models, in: Proceedings of the 16th International Conference on Computational
Processing of Portuguese - Vol. 1, Santiago de Compostela, Galicia/Spain, Association for Computational
Linguistics, 2024, pp. 127–136.
[28] J. Chapagain, Z. Risha, R. Banjade, P. Oli, L. Tamang, P. Brusilovsky, V. Rus, Selfcode: An
annotated corpus and a model for automated assessment of self-explanation during source code
comprehension, The International FLAIRS Conference Proceedings 36 (2023).
[29] A. Hicks, K. Akhuseyinoglu, C. Shafer, P. Brusilovsky, Live catalog of smart learning objects for
computer science education, in: Sixth SPLICE Workshop, 2020.
[30] OpenAI, Learning to reason with llms, 2024.
[31] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. D. L. Casas, others, W. E.
Sayed, Mistral 7b, 2023. arXiv:2310.06825, arXiv preprint.
[32] H. Touvron, et al., Llama: Open and efficient foundation language models, 2023.
arXiv:2302.13971, arXiv preprint.
[33] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for
self-supervised learning of language representations, ICLR, 2020.
[34] L. Zhuang, L. Wayne, S. Ya, Z. Jun, A robustly optimized bert pre-training approach with
posttraining, in: Proceedings of the 20th Chinese National Conference on Computational Linguistics,
Huhhot, China, Chinese Information Processing Society of China, 2021, pp. 1218–1227.
[35] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 conference of the North American chapter
of the association for computational linguistics: human language technologies, volume 1 (long
and short papers), 2019, pp. 4171–4186.
[36] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou,
Codebert: A pre-trained model for programming and natural languages, Findings of the Association
for Computational Linguistics: EMNLP 2020, , Online. Association for Computational Linguistic
(2020) 1536–1547.
[37] Z. Sourati, F. Ilievski, P. Sommerauer, Y. Jiang, Arn: Analogical reasoning on narratives,
Transactions of the Association for Computational Linguistics 12 (2024) 1063–1086.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Boud</surname>
          </string-name>
          , E. Molloy (Eds.),
          <source>Feedback in Higher and Professional Education, Routledge</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Shute</surname>
          </string-name>
          , Focus on formative feedback,
          <source>Review of educational research 78</source>
          (
          <year>2008</year>
          )
          <fpage>153</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Azaiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kiesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Strickroth</surname>
          </string-name>
          ,
          <article-title>Feedback-generation for programming exercises with gpt-4, in: Proceedings of the 2024 on Innovation</article-title>
          and Technology in Computer Science Education V,
          <year>2024</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Banea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          , G. Rigau,
          <string-name>
            <surname>J. Wiebe,</surname>
          </string-name>
          <article-title>Semeval2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation</article-title>
          ,
          <source>in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)</source>
          , , San Diego, California, Association for Computational Linguistics,
          <year>2016</year>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>511</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Simcse:
          <article-title>Simple contrastive learning of sentence embeddings</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , , Online and
          <string-name>
            <given-names>Punta</given-names>
            <surname>Cana</surname>
          </string-name>
          , Dominican Republic,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6894</fpage>
          -
          <lpage>6910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Generate, discriminate and contrast: A semi-supervised sentence representation learning framework</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , ,
          <string-name>
            <surname>Abu</surname>
            <given-names>Dhabi</given-names>
          </string-name>
          , United Arab Emirates,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>8150</fpage>
          -
          <lpage>8161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohebbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Razavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Balafar</surname>
          </string-name>
          ,
          <article-title>Computing semantic similarity of texts based on deep graph learning with ability to use semantic role label information</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Toward interpretable semantic textual similarity via optimal transportbased contrastive sentence learning</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</source>
          (Volume
          <volume>1</volume>
          , Association for Computational Linguistics,
          <year>2022</year>
          , pp.
          <fpage>5969</fpage>
          -
          <lpage>5979</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Rep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dukić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Šnajder</surname>
          </string-name>
          ,
          <article-title>Are electra's sentence embeddings beyond repair? the case of semantic textual similarity, Findings of the Association for Computational Linguistics: EMNLP 2024, Miami</article-title>
          , Florida, USA. Association for Computational Linguistics (
          <year>2024</year>
          )
          <fpage>9159</fpage>
          -
          <lpage>9169</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Electra: pretraining text encoders as discriminators rather than generators</article-title>
          ,
          <source>in: 8th International Conference on Learning Representations, ICLR</source>
          <year>2020</year>
          ,
          <string-name>
            <given-names>Addis</given-names>
            <surname>Ababa</surname>
          </string-name>
          , Ethiopia,
          <source>April 26-30</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gatto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sharif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Seegmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bohlman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Preum</surname>
          </string-name>
          ,
          <article-title>Text encoders lack knowledge: Leveraging generative llms for domain-specific semantic textual similarity</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Natural Language Generation</source>
          , Evaluation, and
          <string-name>
            <surname>Metrics</surname>
          </string-name>
          (GEM), Singapore, Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Hofstadter</surname>
          </string-name>
          ,
          <article-title>Analogy as the core of cognition, The analogical mind: Perspectives from cognitive science (</article-title>
          <year>2001</year>
          )
          <fpage>499</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hansen-Estruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Bisimulation makes analogies in goalconditioned reinforcement learning</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>8407</fpage>
          -
          <lpage>8426</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>