<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using LLMs to extract UML class diagrams from Java and Python programs: an empirical study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hanan Abdulwahab Siala</string-name>
          <email>hanan.siala@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Lano</string-name>
          <email>kevin.lano@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Joint Proceedings of the STAF 2025 Workshops: OCL, OOPSLE, LLM4SE, ICMM, AgileMDE, AI4DPS, and TTC</institution>
          ,
          <addr-line>Koblenz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>King's College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>0</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>In this paper, we present a comprehensive study of the capabilities of five large language models (LLMs), namely StarCoder2, LLaMA, CodeLlama, Mistral, and DeepSeek, for abstracting UML class diagrams from code, with the aim to provide researchers and developers with insights into the capabilities and limitations of using various LLMs in a model-driven reverse engineering process. We evaluate the LLMs by prompting them to generate UML class diagrams for both Java and Python programs, with the key focus on accuracy, consistency, and F1 score. Our findings reveal that all LLMs have higher accuracy and F1 scores for Python than for Java. DeepSeek and Mistral perform best overall, while LLaMA consistently performs the lowest in all metrics and for both languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Unified Modeling Language (UML)</kwd>
        <kwd>UML Class Diagram</kwd>
        <kwd>Model-driven Reverse Engineering (MDRE)</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Java programs</kwd>
        <kwd>Python programs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Organizations depend on software systems to carry out their operations. Over time, these
systems can become legacy systems if they are not adequately maintained and evolved to meet new
requirements and needs. Reverse engineering aims to recover an understanding of software systems and to facilitate their
maintenance and evolution.</p>
      <p>
        Model-Driven Engineering (MDE) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] can be used together with the reverse engineering process by
providing modeling representations and high-level abstractions of software systems. These derived
models can then be used with forward engineering to re-engineer the software systems or to enable
reuse of the abstracted software functions within new MDE developments. It is also possible to use
reverse engineering to integrate code and models within a round-trip engineering process, whereby
developers can work in an agile manner at either modelling or code levels.
      </p>
      <p>
        The Object Management Group (OMG) has defined open standards for the design and development
of software using numerous modelling notations, including the Unified Modeling Language (UML) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>UML is the most widely used standard notation for modeling software systems. It includes several
diagrams, the most popular of which is the UML class diagram, which is used to represent the classes
and relationships in a system.</p>
      <p>
        Large Language Models (LLMs) are a type of machine learning (ML) technology that is initially
trained (pre-trained) on massive amounts of textual data, to acquire deep implicit knowledge of the
language(s) of the data, including software languages. Once pre-trained, LLMs can be fine-tuned to
carry out specific downstream tasks by further training on demonstration examples. Common examples
of LLMs include GPT-3 and GPT-4 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Bard [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], LLaMA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and Mistral [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. LLMs have had a major impact on
various fields, including software engineering, where LLMs have been used in a variety of software
engineering tasks [7].
      </p>
      <p>During the last twenty years, a large amount of research has been carried out on abstracting various
representations from existing software systems. Modern integrated development environments (IDEs)
such as IntelliJ and PyCharm support generating class diagrams through plugins and libraries. These
tools often rely on complete, syntactically correct, and fully compilable source code. In contrast, an
LLM-based approach offers greater flexibility and automation: it can process code and generate
structural representations even when the code is incomplete or does not compile.</p>
      <p>A recent systematic literature review (SLR) [8] of model-driven reverse engineering (MDRE)
approaches over this period identified that using LLMs to abstract different representations from source
code is a novel concept. Hence, our research investigates the capabilities and limitations of five
open-source LLMs—StarCoder2, LLaMA, CodeLlama, Mistral, and DeepSeek—for abstracting UML class
diagrams from both Java and Python programs. The LLMs are evaluated by prompting them to generate
UML class diagrams, which are then assessed using accuracy, consistency, and the F1 score metrics.
Our goals are:
• To investigate the abstraction of UML class diagrams from Java and Python code using LLMs.
• To compare the generated UML class diagrams with reference models to assess accuracy and
completeness.
• To study how LLMs deal with programming languages of different kinds: statically-typed (Java)
and dynamically-typed (Python).</p>
      <p>The structure of the paper is as follows: section 2 presents related work, while our methodology is
explained in section 3. Section 4 presents the evaluation of the various LLMs. Threats to validity are outlined
in section 5, and section 6 provides the conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Here we describe related work in the areas of model-driven reverse engineering approaches and the
generation of code representations with LLMs.</p>
      <sec id="sec-2-1">
        <title>2.1. Model-driven Reverse Engineering Approaches</title>
        <p>Raibulet et al. [9] present a comprehensive analysis of MDRE approaches in the literature from 2003 up
to 2017. They compared fifteen MDRE approaches and presented their different features, such as the
level of automation, extensibility, and genericity.</p>
        <p>Siala et al. [8] provide an SLR of MDRE over the period 2000–2023, which surveys 55 distinct MDRE
approaches. They found that the majority of MDRE research has concentrated on developing code
visualisations such as class, sequence, and activity diagrams.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Generating Code Representations with LLMs</title>
        <p>Boronat and Mustafa [10] introduce a tool, MDRE-LLM, that integrates LLMs with MDRE to automate
and improve domain model recovery from source code. The tool supports diverse use cases, including
analyzing undocumented legacy systems, understanding large-scale codebases, validating LLM
performance, and generating reproducible datasets. They use retrieval augmented generation (RAG) to
improve the accuracy and relevance of LLM responses.</p>
        <p>Siala [11] introduces a new approach for using LLMs to abstract UML class diagrams and object
constraint language (OCL) from Java and Python programs. In [12], Siala and Lano present the
LLM4Models LLM, based on the fine-tuning of the Mistral LLM, which abstracts OCL specifications
from Java and Python programs. In [13], LLM4Models is also used to abstract UML class
diagrams from Java programs, while in [14] it is used to abstract UML class diagrams
from Python programs. The evaluation results of these papers indicate that LLM4Models can
effectively abstract UML and OCL specifications from Java and Python programs.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset Creation</title>
        <p>This section details the methodology used in our study, including dataset creation, selection of LLMs,
definition of UML abstraction criteria, prompt formulation, and the evaluation metrics employed.
We selected Java and Python cases from established and large-scale program datasets: CoTran [15] and
AVATAR [16].</p>
        <p>1. Collecting sample programs: We selected 14 Java programs and 16 Python programs from the
datasets for our experiment to analyze the reverse-engineering performance of various LLMs
on Java and Python. When choosing these programs, we were not concerned with the length of
the programs, but rather with representing all the possible elements and all relationships to test
the capability of the abstraction process. The selected examples cover all the program elements
considered by our evaluation: interfaces, classes, various attributes and methods, visibility, inner
classes, and abstract and static attributes and methods. The examples include cases with one to
six classes and with various relationships.
2. Automatically generating ground-truth UML class diagrams for these programs: We used the
Java2JSON [13] and Python2JSON [14] parsers and rulesets to precisely abstract expected UML
models from the selected programs to serve as reference models for comparison with the
LLM-generated UML models.</p>
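        <p>To illustrate the idea behind such rule-based ground-truth abstraction, the sketch below extracts a minimal class model (class names, base classes, and methods) from Python source using the standard ast module. It is only an illustrative stand-in, not the actual Python2JSON ruleset, and the example program and dict layout are invented for demonstration.</p>

```python
import ast

def abstract_classes(source: str) -> dict:
    """Extract a minimal class model (classes, bases, methods) from Python
    source. Illustrative sketch in the spirit of rule-based parsers such as
    Python2JSON; not the actual tool."""
    tree = ast.parse(source)
    model = {"classes": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            model["classes"].append({
                "name": node.name,
                # names of base classes give inheritance edges
                "bases": [b.id for b in node.bases if isinstance(b, ast.Name)],
                "methods": [f.name for f in node.body
                            if isinstance(f, (ast.FunctionDef, ast.AsyncFunctionDef))],
            })
    return model

example = """
class MedicalData:
    def load(self): pass

class VitalSigns(MedicalData):
    def record(self): pass
"""
model = abstract_classes(example)
```

        <p>Such a parser-derived model is deterministic, which is what makes it usable as a reference against which the LLM output can be scored.</p>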
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Selection</title>
        <p>To investigate the capability of existing pre-trained LLMs to abstract UML class diagrams from Java
and Python programs, we experiment with five open-source LLMs, all from Hugging Face [17], which
have achieved promising results in various code-related tasks. We selected LLMs with parameters
from 7B to 8B to maintain a balance between performance and accessibility. Larger models require
significantly more computational resources and inference time, while models in the 7B-8B range can run
efficiently on commonly available hardware while still providing sufficient capabilities and high-quality
output [18].</p>
        <p>
          1. StarCoder2: StarCoder2 [19] is a family of LLMs for code (Code LLMs) trained on 3.3 to 4.3
trillion tokens and evaluated on various Code LLM benchmarks. The smallest model (3B) and the
largest model (15B) outperform comparable models [19]. StarCoder2-7B is used in our experiment.
2. LLaMA: LLaMA is a collection of LLMs in 8B, 70B, and 405B sizes, trained on up to 15 trillion
tokens of publicly accessible data from different sources. Our experiment uses Llama-3.1-8B,
which was released on July 23, 2024.
3. CodeLlama: CodeLlama [20] is a code LLM based on the LLaMA 2 architecture. A large dataset
of code and natural language related to code has been used to train CodeLlama to support code
generation and infilling tasks in several programming languages, including Java, Python, C#,
PHP, and C++. CodeLlama-7b-hf is used in our experiment.
4. Mistral: Mistral [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a decoder-only transformer. It performs well in natural language
understanding and generation when compared to other LLMs, and it may also be utilized for code-related
tasks. Mistral-7B-v0.3 is used in our experiment.
5. DeepSeek: DeepSeek-R1 [21] is trained through extensive reinforcement learning (RL) to tackle
the challenges associated with DeepSeek-R1-Zero. Multi-stage training and cold-start data are
integrated before RL. Additionally, DeepSeek is distilled into smaller, dense models based on
Qwen and Llama, which deliver outstanding performance on benchmarks.
DeepSeek-R1-Distill-Llama-8B is used in our experiment.
        </p>
        <p>A temperature hyperparameter of 0.2 was chosen for all the selected LLMs. This limits variation
between responses and makes output more deterministic, while retaining some capacity for diverse responses.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Identifying Criteria for UML Class Diagram abstraction</title>
        <p>The following criteria are used to evaluate the capabilities of various LLMs to abstract UML class
diagram representations from Java and Python programs and to identify any potential shortcomings:
C1: Are all classes and interfaces in the reference class diagram correctly abstracted by the LLM?
C2: Are all the attributes and their types in the reference class diagram correctly abstracted?
C3: Are all the methods in the reference class diagram correctly abstracted?
C4: Is each expected inheritance relationship between classes generated in the LLM output?
C5: Is each expected realization relationship between classes/interfaces generated in the LLM output?
C6: Is each correct association relationship between classes generated in the LLM output?
C7: Are correct aggregation/composition relationships between classes included in the generated
output?
C8: Are correct abstract classes and abstract methods identified in the generated output?
C9: Are the correct static classes and static methods included in the generated output?
C10: Are correct association multiplicities included in the generated output?
C11: Are correct association role names included in the generated output?
C12: Are correct dependency relationships between classes, identified by examining method parameter
types, local variables, or return types, included in the generated output?</p>
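        <p>In principle, a check like criterion C1 can be expressed as a simple set comparison between the class names of the reference and generated models; the sketch below shows this, using a simplified dict layout and invented class names. As discussed next, in practice such automatic checks broke down due to variation in the LLM output formats.</p>

```python
def check_c1(reference: dict, generated: dict) -> dict:
    """Illustrative check for criterion C1 (class/interface coverage):
    compare class names in a reference model against an LLM-generated one.
    The dict layout is a simplified stand-in for the JSON models used."""
    ref = {c["name"] for c in reference["classes"]}
    gen = {c["name"] for c in generated["classes"]}
    return {
        "matched": sorted(ref & gen),   # true positives for C1
        "missing": sorted(ref - gen),   # false negatives (not abstracted)
        "spurious": sorted(gen - ref),  # false positives (invented classes)
    }

reference = {"classes": [{"name": "MedicalData"}, {"name": "VitalSigns"}]}
generated = {"classes": [{"name": "VitalSigns"}, {"name": "Patient"}]}
result = check_c1(reference, generated)
```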
        <p>Although, in principle, these properties could be automatically checked by comparing the reference
and generated UML models, in practice, we found this was not possible due to the variations in the
model format produced by the LLMs. The generated output often includes additional text and varies
both between LLMs and between runs of the same LLM. Thus, we used a manual comparison of the reference and generated
models to check the above criteria.</p>
        <p>Furthermore, automatic checking alone would be unreliable in many circumstances. For example,
distinguishing between association, aggregation, and composition relationships is challenging for
automated checking: if an LLM identifies an association where the reference model has an aggregation,
this should not simply be counted as wrong, since aggregation is a special form of association, yet a
strict automatic comparison could overlook the partial match completely.</p>
        <p>The variation in the model format generated by the LLMs can be seen in the processing of Java
example 3. Consider the class VitalSigns, which inherits from the class MedicalData, as an example
of this variation in response. The CodeLlama LLM defines the relationship within the class blocks and
generates the output shown in Fig. 1a.</p>
        <p>It assigns a unique numerical id to each class, allowing references to that class in other classes. For
example, the field target refers to class number 1, which corresponds to the MedicalData class. In
contrast, the DeepSeek LLM defines all classes, including the VitalSigns class, shown in Fig. 1b, and
then it generates all the relationships together, as shown in Fig. 1c.</p>
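        <p>The two styles can be reconciled by resolving numeric class ids to names before comparison. The sketch below is a hypothetical reconstruction of the id-based format from the description above (the exact JSON shapes produced by the LLMs are not shown in this paper), normalizing it to name-based relationship triples.</p>

```python
def resolve_ids(model: dict) -> list:
    """Normalize an id-based class model (CodeLlama-style, as described
    above) into (source, relation, target) triples by class name. The JSON
    shape here is a hypothetical reconstruction, not an actual LLM output."""
    by_id = {c["id"]: c["name"] for c in model["classes"]}
    edges = []
    for c in model["classes"]:
        for rel in c.get("relationships", []):
            # numeric target id is resolved to the referenced class's name
            edges.append((c["name"], rel["type"], by_id[rel["target"]]))
    return edges

codellama_style = {"classes": [
    {"id": 1, "name": "MedicalData"},
    {"id": 2, "name": "VitalSigns",
     "relationships": [{"type": "inheritance", "target": 1}]},
]}
edges = resolve_ids(codellama_style)
```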
        <p>Likewise, the other LLMs each have their own variations on the result format. For example, Fig. 2a
shows a Python example that contains a composition relationship between classes. While both
StarCoder2 and Mistral LLMs abstract attributes and methods for the given example, Mistral correctly
identifies the composition relationship, as shown in Fig. 2b, whereas StarCoder2 instead abstracts a
dependency relationship, as shown in Fig. 2c.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Prompt Engineering for UML Class Diagram abstraction</title>
        <p>We use an Alpaca-style prompt format. The prompt schema used to abstract the UML class diagram
for Java input is shown in Fig. 3. This prompt was engineered to obtain the elements and relationships
of UML class diagrams, with the required quality properties of non-redundancy and non-duplication.
We found that this version produced the most accurate and consistent results. A corresponding prompt
was applied for the Python abstraction.</p>
        <p>We requested each LLM to generate output in JSON format, from which the abstracted information
could be graphically converted into UML class diagrams using libraries like Graphviz or PlantUML.</p>
        <p>Figure 1: (a) Output (elements and relationships) of the class diagram generated by CodeLlama.
(b) Elements of the class diagram generated by DeepSeek. (c) Relationships of the class diagram
generated by DeepSeek.</p>
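        <p>The JSON-to-diagram step can be sketched as follows. The dict layout and class names are assumed examples for illustration, not the exact schema used in the study; the resulting text can be rendered with the standard PlantUML tool.</p>

```python
def json_to_plantuml(model: dict) -> str:
    """Render a simple class-model dict as PlantUML source. The dict layout
    is an assumed example, not the study's exact JSON schema."""
    lines = ["@startuml"]
    for cls in model["classes"]:
        lines.append(f'class {cls["name"]} {{')
        for attr in cls.get("attributes", []):
            lines.append(f'  {attr["name"]} : {attr["type"]}')
        for meth in cls.get("methods", []):
            lines.append(f'  {meth}()')
        lines.append("}")
    # map relationship kinds to PlantUML arrow notation
    arrows = {"inheritance": "--|>", "composition": "*--", "association": "--"}
    for rel in model.get("relationships", []):
        lines.append(f'{rel["source"]} {arrows[rel["type"]]} {rel["target"]}')
    lines.append("@enduml")
    return "\n".join(lines)

model = {
    "classes": [
        {"name": "MedicalData", "methods": ["load"]},
        {"name": "VitalSigns",
         "attributes": [{"name": "heartRate", "type": "int"}],
         "methods": ["record"]},
    ],
    "relationships": [
        {"type": "inheritance", "source": "VitalSigns", "target": "MedicalData"}],
}
uml = json_to_plantuml(model)
```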
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Evaluation Metrics</title>
        <p>The evaluation was carried out based on three key metrics: accuracy, consistency, and F1 score. We
first evaluate the accuracy of each LLM, where accuracy is defined as the proportion of source code
elements that are accurately represented as UML class diagram elements. Accuracy, also referred to as
recall, is calculated using Equation (1), where a true positive (TP) is an element accurately translated
from code to UML class diagram elements, and a false negative (FN) is an element that is not translated
or is translated inaccurately.</p>
        <p>Recall = TP / (TP + FN)   (1)</p>
        <p>Next, we evaluate the consistency of each LLM, where consistency is defined as the proportion of
elements in the generated UML class diagram that are accurately derived from the source programs.
Consistency is important for ensuring traceability and alignment of the derived class diagram elements
with the source code. Consistency, also referred to as precision, is calculated using Equation (2),
where a false positive (FP) is an element that appears in the UML class diagram but is not correctly
derived from a source code element.</p>
        <p>Precision = TP / (TP + FP)   (2)</p>
        <p>Accuracy evaluates the completeness and correctness of the abstraction process, while consistency
evaluates the quality of the generated UML class diagram elements in terms of the absence of spurious
elements not derived from the source code.</p>
        <p>Figure 2: (a) Python example containing a composition relationship. (b) Class diagram generated
by Mistral for the given Python example. (c) Class diagram generated by StarCoder2 for the given
Python example.</p>
        <p>Note that we do not consider any extra elements found outside the generated UML class diagrams.
Any extra explanations or examples provided by the LLM do not affect consistency. Consistency is
affected only by extra elements that are part of the generated UML class diagram and are not present in
the Java or Python source code.</p>
        <p>The F1 score provides a balance between precision and recall, as shown in Equation (3). It considers
both false positives and false negatives and can be especially beneficial when the LLM produces
conflicting results; for example, high precision but low recall, or vice versa.</p>
        <p>F1 score = 2 * (Precision * Recall) / (Precision + Recall)   (3)</p>
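        <p>As a small worked example, the three metrics follow directly from element counts; the counts below are invented for illustration.</p>

```python
def metrics(tp: int, fn: int, fp: int) -> dict:
    """Compute the three evaluation metrics from element counts, following
    Equations (1)-(3): TP = correctly abstracted elements, FN = missing or
    mistranslated elements, FP = spurious generated elements."""
    recall = tp / (tp + fn)        # "accuracy" in this paper's terminology
    precision = tp / (tp + fp)     # "consistency" in this paper's terminology
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": recall, "consistency": precision, "f1": f1}

# e.g. 8 of 10 reference elements correctly abstracted, with 2 spurious ones
m = metrics(tp=8, fn=2, fp=2)
```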
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>In this section, we present a comprehensive analysis of the performance of the LLMs in terms of
accuracy, consistency, and F1 score.</p>
      <sec id="sec-4-1">
        <title>4.1. Evaluation of LLMs to Generate UML Class Diagrams from Java Programs</title>
        <p>Accuracy Fig. 4a presents the accuracy of various LLMs in abstracting UML class diagrams from 14
Java examples. DeepSeek, Mistral, and StarCoder2 achieve higher peaks, with accuracy scores reaching
1.0 for certain examples. However, StarCoder2 has the lowest accuracy for both examples one and
twelve, while Mistral has the lowest accuracy score for example five. CodeLlama follows them in its
accuracy scores. Although the accuracy scores of the LLaMA LLM are generally lower compared to the
other LLMs, these scores are also relatively stable across different examples.</p>
        <p>Consistency Fig. 4b shows an overview of the consistency scores for abstracting UML class diagrams
from Java programs. Overall, Mistral has higher consistency scores compared to the other LLMs,
followed by both StarCoder2 and DeepSeek LLMs. Both LLaMA and CodeLlama show moderate
consistency, with consistency scores hovering around 0.6.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation of LLMs to Generate UML Class Diagrams from Python Programs</title>
        <p>Accuracy Fig. 4c shows the accuracy of the selected LLMs in abstracting UML class diagrams from
16 Python examples. Both DeepSeek and Mistral achieve higher accuracy scores for abstracting UML
class diagrams from Python examples. StarCoder2 follows DeepSeek and Mistral in accuracy scores,
while CodeLlama tends to stay stable between 0.8 and 1.0 scores. However, LLaMA shows a significant
drop in example 10, where accuracy dips to slightly above 0.2.</p>
        <p>Consistency Fig. 4d presents the consistency scores for the selected LLMs in generating UML class
diagrams for Python code. Apart from a drop in performance on Python example two,
DeepSeek has in general higher consistency scores than the other LLMs. StarCoder2 and Mistral
follow DeepSeek, maintaining generally consistent performance, with a slight decline observed in two
examples for each. Meanwhile, LLaMA shows marked inconsistency, with multiple drops, particularly
in example 16, whereas CodeLlama performs slightly better than LLaMA but experiences a significant
drop in example 6.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Overall Results</title>
        <p>Accuracy Fig. 5a presents the overall accuracy for UML class diagrams generated by the selected
LLMs. The results show that all LLMs have higher accuracy for Python than for Java. This could seem to
be a surprising result, because Java programs usually contain more precise typing information about
program features and variables compared to Python programs. On the other hand, Java programs tend
to be more structurally complex than Python programs.</p>
        <p>Figure 4: (a) Accuracy for Java examples, numbered 1 to 14. (b) Consistency for Java examples,
numbered 1 to 14. (c) Accuracy for Python examples, numbered 1 to 16. (d) Consistency for Python
examples, numbered 1 to 16.</p>
        <p>DeepSeek has the highest accuracy score for Java, while both Mistral and DeepSeek share the highest
accuracy for Python. By contrast, LLaMA has the lowest accuracy scores in both languages. In addition,
the difference in accuracy between the two languages is most noticeable for LLaMA and CodeLlama.
Consistency Fig. 5b presents the overall consistency scores for the selected LLMs. Mistral achieves
the highest consistency for Java, while DeepSeek has the highest consistency score for Python. In comparison,
CodeLlama and LLaMA share the lowest consistency for Python, whereas CodeLlama has the lowest
consistency score for Java. Additionally, the consistency differences between Java and Python are minor
compared to the accuracy differences.</p>
        <p>F1 score The analysis of Python outperforms that of Java across all LLMs, according to the F1 score,
with DeepSeek achieving the highest F1 score for Python and Mistral achieving the highest value for
Java. The F1 scores for Mistral and CodeLlama closely follow those of DeepSeek for Python. In contrast,
LLaMA has the lowest F1 score in both languages, as shown in Fig. 5c.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Threats to Validity</title>
      <p>In this section, we outline potential threats that could impact the findings of our experimental study.</p>
      <p>Figure 5: (a) Overall accuracy for Java and Python examples. (b) Overall consistency for Java and
Python examples. (c) Overall F1 score for Java and Python examples.</p>
      <sec id="sec-5-1">
        <title>5.1. Response of models</title>
        <p>There are cases where we receive responses from LLMs that are unrelated to UML class diagrams. We mitigated
this by performing multiple iterations of the inference process (up to five attempts). This approach allowed
us to improve the results and select the most accurate class diagram generated by each LLM.</p>
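        <p>The retry strategy can be sketched as a loop that keeps the first response parsing as a JSON class model. Here generate stands in for the actual LLM call, which is not shown in the paper, and the sanity check on the parsed shape is an assumption for illustration.</p>

```python
import json

def best_effort_diagram(generate, max_attempts: int = 5):
    """Query the model up to max_attempts times and return the first response
    that parses as a JSON object containing a "classes" key. Sketch of the
    mitigation described above; `generate` stands in for the real LLM call."""
    for _ in range(max_attempts):
        raw = generate()
        try:
            model = json.loads(raw)
        except json.JSONDecodeError:
            continue                    # unrelated or malformed response
        if isinstance(model, dict) and "classes" in model:
            return model                # minimal sanity check on the shape
    return None                         # no usable diagram within the budget

# stub generator: fails twice, then returns a valid (empty) class model
responses = iter(["I cannot help with that.", "{broken", '{"classes": []}'])
result = best_effort_diagram(lambda: next(responses))
```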
        <sec id="sec-5-1-1">
          <title>5.1.1. Evaluation of the responses</title>
          <p>Human error and bias can occur when manually evaluating the generated UML class diagrams. This
manual evaluation process may produce inconsistent or inaccurate results. However, this risk is mitigated
by involving the second author to check cases where the first author was unsure at any stage of the
experiment.</p>
          <p>As we noted in subsection 3.3 above, due to variability in the LLM result formats, it was not possible
to perform an automated comparison of the reference and generated UML models.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Scope of the experiment</title>
          <p>Here, we have only used open-source LLMs, and there is a risk that the most powerful LLMs are not
considered. However, this risk is partially mitigated by including the most powerful open-source LLMs.
DeepSeek is the latest powerful open-source LLM included in our experiment.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Representativeness</title>
        <p>We aimed to consider real-world Java and Python programs, typical of those created by practitioners.
Thus, we chose cases from well-known and established datasets of such programs, which have been used
in other LLM research. We selected cases to cover all the modelling aspects that should be represented
in generated class diagrams.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This paper conducted an empirical study of using LLMs to abstract UML class diagrams from Java
and Python code. We found that DeepSeek LLM and Mistral are the most reliable LLMs in abstracting
UML class diagrams for both Java and Python. Generally, LLMs generate better results with Python
than with Java, which is contrary to expectations. In future work, we will aim to expand our work to
include other programming languages like C++, C#, COBOL, and others. We also aim to fine-tune an
open-source LLM (e.g., Mistral or DeepSeek) on a large-scale dataset of these programming languages
with their corresponding class diagram representations to improve the performance of LLMs in reverse
engineering tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Data Availability</title>
      <p>The data used in this study - including Java and Python programs and their corresponding UML class
diagrams generated by the selected open-source LLMs - are available in our online repository [22].</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgment</title>
      <p>Hanan Siala acknowledges the financial support provided by the Libyan Ministry of Higher Education
and Scientific Research. She also acknowledges the use of resources provided by King's Computational
Research, Engineering and Technology Environment (CREATE).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The authors used MISTRAL to extract UML class diagrams from program code. All outputs were
reviewed and validated by the authors, who take full responsibility. No proprietary/third-party
confidential data were provided to these tools.
[7] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, LLMs for
software engineering: a systematic literature review, arXiv 2308.10620 (2023).
[8] H. A. Siala, K. Lano, H. Alfraihi, Model-driven approaches for reverse engineering – a systematic
literature review, IEEE Access (2024). doi:10.1109/ACCESS.2024.3394732.
[9] C. Raibulet, F. A. Fontana, M. Zanoni, Model-driven reverse engineering approaches: A systematic
literature review, Ieee Access 5 (2017) 14516–14542. doi:10.1109/ACCESS.2017.2733518.
[10] A. Boronat, J. Mustafa, MDRE-LLM: A tool for analyzing and applying LLMs in software reverse
engineering, in: Proceedings of the IEEE International Conference on Software Analysis, Evolution,
and Reengineering (SANER), 2025. URL: https://conf.researchr.org/home/saner-2025.
[11] H. A. Siala, Enhancing model-driven reverse engineering using machine learning, in: Proceedings
of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion
Proceedings, ICSE-Companion ’24, Association for Computing Machinery, New York, NY, USA,
2024, p. 173–175. URL: https://doi.org/10.1145/3639478.3639797. doi:10.1145/3639478.3639797.
[12] H. A. Siala, K. Lano, Towards using LLMs in the reverse engineering of software systems to object
constraint language, in: Proceedings of the IEEE International Conference on Software Analysis,
Evolution, and Reengineering (SANER), 2025. URL: https://conf.researchr.org/home/saner-2025.
[13] H. A. Siala, K. Lano, Using large language models to extract UML class diagrams from Java
programs, in: Proceedings of the International Conference on Software and System Engineering
(ICoSSE), 2025. URL: http://www.icsse.org/.
[14] H. A. Siala, K. Lano, Leveraging large language models for abstracting UML class diagrams from</p>
      <p>Python programs, 2025. Under review.
[15] P. Jana, P. Jha, H. Ju, G. Kishore, A. Mahajan, V. Ganesh, CoTran: An LLM-based code translator
using reinforcement learning with feedback from compiler and symbolic execution, IOS Press,
2024. URL: http://dx.doi.org/10.3233/FAIA240968. doi:10.3233/faia240968.
[16] W. Ahmad, M. Tushar, S. Chakraborty, K.-W. Chang, AVATAR: a parallel corpus for Java-Python
program translation, in: Annual Meeting of the Association for Computational Linguistics, 2021.</p>
      <p>URL: https://api.semanticscholar.org/CorpusID:237304035.
[17] Hugging Face, Hugging face, Online, 2016. Available at: https://huggingface.co/ [Accessed Mar.</p>
      <p>2025].
[18] M. Hassid, T. Remez, J. Gehring, R. Schwartz, Y. Adi, The larger the better? improved
LLM code-generation via budget reallocation, 2024. URL: https://arxiv.org/abs/2404.00725.
arXiv:2404.00725.
[19] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, et al., StarCoder 2 and the stack v2: The
next generation, 2024. URL: https://arxiv.org/abs/2402.19173. arXiv:2402.19173.
[20] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin,
et al., Code Llama: Open foundation models for code, 2024. URL: https://arxiv.org/abs/2308.12950.
arXiv:2308.12950.
[21] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, et al., DeepSeek-R1: Incentivizing
reasoning capability in LLMs via reinforcement learning, 2025. URL: https://arxiv.org/abs/2501.12948.
arXiv:2501.12948.
[22] H. A. Siala, K. Lano, Online repository for Java and Python programs with UML diagrams,
https://doi.org/10.5281/zenodo.15108621, 2025. Accessed: May 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kent</surname>
          </string-name>
          , Model driven engineering, in: Integrated Formal Methods: Third International Conference, IFM 2002, Turku, Finland, May 15-18,
          <year>2002</year>
          , Proceedings, Springer,
          <year>2002</year>
          , pp.
          <fpage>286</fpage>
          -
          <lpage>298</lpage>
          . doi:10.1007/3-540-47884-1_16.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] OMG,
          <article-title>OMG Unified Modeling Language (UML)</article-title>
          ,
          <source>Version 2.5.1</source>
          ,
          <year>2017</year>
          . https://www.omg.org/spec/UML.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , et al.,
          <source>A survey of large language models</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Google</surname>
          </string-name>
          , Bard,
          <year>2025</year>
          . URL: https://bard.google.com/, accessed: 04-03-2025.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. de las Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Lavaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Sayed</surname>
          </string-name>
          , Mistral 7B,
          <source>arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>