<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Directing generative AI for Pharo Documentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pascal Zaragoza</string-name>
          <email>pascal.zaragoza@berger-levrault.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Hlad</string-name>
          <email>nicolas.hlad@berger-levrault.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Ilyes Amara</string-name>
          <email>mohamedilyesamara@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Berger-Levrault</institution>
          ,
          <addr-line>64 Rue Jean Rostand, 31670 Labège</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Inadequate code documentation can inhibit program comprehension and increase the time a developer spends on a task. However, for multiple reasons, good documentation is often lacking and rarely updated. This paper investigates how generative AI, specifically Large Language Models (LLMs), can enhance the quality of package-level documentation in the Pharo Smalltalk environment. Despite Pharo's support for embedded documentation and the CRC (Class-Responsibility-Collaborator) methodology, proper package-level comments remain rare and often insufficient. To address this, we introduce three retrieval-augmented strategies for automatic package comment generation: (1) source-based extraction using class source code, (2) comment-based extraction leveraging existing class-level comments, and (3) a hybrid approach combining class comments with inter-class dependency data. We conduct an empirical evaluation across 21 packages of varying sizes using a structured Likert-scale questionnaire centered on CRC criteria, responsibility clarity, collaborator relevance, and comparison to existing comments. Results indicate that while no single strategy significantly outperforms the others, all methods generate comments perceived as more useful, complete, and clear than the originals. Moreover, this study highlights limitations in reliably identifying collaborators, especially in larger packages.</p>
      </abstract>
      <kwd-group>
<kwd>generative AI</kwd>
        <kwd>documentation</kwd>
        <kwd>comment generation</kwd>
        <kwd>Tool</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>An analysis of the Pharo 12 image shows that:
1. 128/768 (16.7%) packages have a comment,
2. 10,196/12,568 (81.1%) classes have a comment, and
3. 115,050/274,282 (41.9%) methods contain a comment.</p>
      <sec id="sec-1-1">
        <title>Furthermore, package comments tend to be small with 76/128 (60,3%) comments are less than 100</title>
        <p>characters (see fig. 1). Package comments of these sizes provide few information that can be used for
code comprehension. An example package comment is given in fig. 2 (a).</p>
      </sec>
      <sec id="sec-1-2">
        <title>In this work, we study how we can increase the quality of generated package comments using</title>
      </sec>
      <sec id="sec-1-3">
        <title>Large Language Models (LLMs). Particularly, we are interested in studying the impact of the source of information during the retrieval process of retrieval-augmented generation has on the generation of</title>
        <p>the package comment. This source of information can range from the source text of the classes to user
descriptions and inter-package dependencies. With this in mind, we have implemented three diferent
strategies to generate a package comment using its underlying classes, their comments (when available),
and their dependencies to study the underlying efect on the source of information.</p>
      </sec>
      <sec id="sec-1-4">
        <title>In the following section, we present the related works. Next, we present the general approach</title>
        <p>towards applying retrieval-augmented strategies, as well as the three proposed strategies. Next, we
present an experimentation with the goal of exploring the impact of retrieval-augmented generation
around package comment generation. Finally, we conclude and ofer insights on the results of the
experimentation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <sec id="sec-2-1">
        <title>Generating documentation for developers is not a recent endevour with [3] exploring comment gen</title>
        <p>eration. In this work the authors, propose a method to automatically generate comments for Java
classes using heuristics and stereotypes. The generated summaries are indicative and abstract, aiming
to provide a brief, informative description of the class content. The summaries are constructed based
on heuristics, such as focusing on accessor methods for Data Provider classes while excluding other
methods which follows a similar principle as the CRC methodology when it comes to the collaborators.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Another study [4] explores the use of large language models (LLMs) for the task of generating</title>
        <p>project-specific code summaries, using the few-shot learning technique. The principle behind this
technique is to provide the LLM, via its prompt, with examples of input/output pairs (usually from
other projects) that are similar to the input we want to process and the expected output.</p>
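      <p>To illustrate the principle, the following Pharo snippet assembles such a few-shot prompt. This is a minimal sketch: the example pairs and the prompt wording are invented for illustration.</p>
      <preformat>"Few-shot prompting: prepend example input/output pairs to the query
 so the LLM imitates their style when summarizing the final input."
| examples targetSource prompt |
examples := {
    'today ^ Date today' -> 'Answers the current date.'.
    'isEmpty ^ self size = 0' -> 'Tests whether the receiver has no elements.' }.
targetSource := 'asUppercase ^ self collect: [ :c | c asUppercase ]'.
prompt := String streamContents: [ :s |
    examples do: [ :pair |
        s nextPutAll: 'Code: '; nextPutAll: pair key; cr;
          nextPutAll: 'Summary: '; nextPutAll: pair value; cr; cr ].
    "The unanswered entry the LLM is asked to complete."
    s nextPutAll: 'Code: '; nextPutAll: targetSource; cr;
      nextPutAll: 'Summary:' ].</preformat>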
      <p>More recently, [5] proposes a new method called Automatic Semantic Augmentation of Prompts (ASAP), which aims to build prompts for software engineering tasks. The hypothesis underlying this research is that an effective prompt is one that provides the LLM with everything a developer would take into consideration as semantic and syntactic facts when manually executing the task. This study deals with the case of code comment generation (code summaries). ASAP adds semantic facts extracted from an analysis of the source code to the prompt. In addition, it uses the few-shot technique. The ASAP approach improves the average performance of LLMs for several commonly used metrics, including BLEU (an n-gram overlap metric), by 1.68% to 18.69%.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Our comment generation solution</title>
      <sec id="sec-3-1">
        <title>To generate a package comment, we have implemented three diferent retrieval-augmented strategies</title>
        <p>
          towards generating packages comments. Each approach relies on a general three-step plan: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) extract
a model representation of the package (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) extract (or retrieve) the relevant data (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) generate a package
comment. Depending on the strategy, we extract diferent types of data from the model. In fig. 3, we
illustrate 3 diferent extraction strategies that rely on the model representation.
        </p>
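      <p>To make this three-step plan concrete, the following Pharo sketch shows a skeleton that the three strategies could share. It is a minimal sketch, not the repository code: the class name, the #relevantDataFor: retrieval hook, and the #queryLLM: client call are hypothetical placeholders, and the package-lookup API may differ across Pharo versions.</p>
      <preformat>"Shared pipeline: (1) model extraction, (2) retrieval, (3) generation."
Object subclass: #PackageCommentGenerator
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'AutoCodeDocumentator-Sketch'.

PackageCommentGenerator >> commentForPackageNamed: aName
    | package data prompt |
    "(1) Extract a model representation of the package."
    package := RPackageOrganizer default packageNamed: aName.
    "(2) Retrieve the relevant data; each strategy overrides this hook."
    data := self relevantDataFor: package.
    "(3) Instruct the LLM to write a CRC-style package comment."
    prompt := 'Write a package comment following the CRC card methodology for ',
        aName, ', based on: ', data.
    ^ self queryLLM: prompt</preformat>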
        <p>Strategy 1: Naive extraction: For the first strategy, we apply a naive strategy where, for each class within a package, we extract the source text (or .st file). From this extraction, we generate a summary with instructions to describe the class’ responsibilities, collaborators, and key implementations. Then, we instruct an LLM to create a package comment following the CRC card methodology based on the sum of the class summaries. This strategy relies on the LLM’s understanding of Smalltalk-based languages to extract quality class summaries. However, there is a risk that including all the source text will increase the context size and therefore the likelihood of hallucinations [6]. It is also a more compute-expensive strategy, as it involves a multi-step generation process. An example generation is given in fig. 2 and compared with its reference package comment. This example highlights the potential strength of package comment generation: in this case, we are able to retrieve a succinct description of the responsibility of the package, its collaborators, and key classes within the package along with their responsibilities. Indeed, with this strategy, the LLM is able to recognize OSPlatform and the fact that it is used by PharoShortcuts to manage platform-specific shortcuts. However, this strategy tends to create hallucinations, which can be seen in the mention of non-existent classes within the package, such as TimeStampMessageConverter, which should be TimeStampMethodConverter.</p>
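      <p>The retrieval step of this naive strategy can be sketched as follows, assuming Pharo’s reflective API (#definedClasses, #methods, #sourceCode; #definition may be #definitionString depending on the Pharo version) and reusing the hypothetical #queryLLM: placeholder from the skeleton above.</p>
      <preformat>"Strategy 1 (naive): rebuild each class' source text, ask the LLM for a
 per-class summary, then concatenate the summaries for the final step."
PackageCommentGenerator >> relevantDataFor: aPackage
    | summaries |
    summaries := aPackage definedClasses asArray collect: [ :cls |
        | source |
        source := String streamContents: [ :s |
            s nextPutAll: cls definition; cr.
            cls methods do: [ :m | s nextPutAll: m sourceCode; cr ] ].
        self queryLLM: 'Summarize the responsibilities, collaborators and
key implementations of this class: ', source ].
    ^ String streamContents: [ :s |
        summaries do: [ :sum | s nextPutAll: sum; cr ] ]</preformat>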
      <p>Strategy 2: Comment-based extraction: For the second strategy, we apply a strategy that relies on the existing class comments. In this context, we extract the list of existing class comments from the package entity. Then, we instruct an LLM to create a package comment following the CRC card methodology from the sum of the existing class comments. This strategy relies on a high rate of class comments to work. In the case where a package has a low rate of class comments, this strategy can generate poor quality package comments. Furthermore, it removes the ability of the LLM to extract information from the classes that would not be included in a class comment, such as dependencies.</p>
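      <p>The corresponding retrieval step is simpler, since it only gathers the non-empty class comments; this sketch uses the standard #definedClasses and #comment accessors.</p>
      <preformat>"Strategy 2: retrieve only the existing class comments."
PackageCommentGenerator >> relevantDataFor: aPackage
    | commented |
    commented := aPackage definedClasses
        reject: [ :cls | cls comment isEmptyOrNil ].
    ^ String streamContents: [ :s |
        commented do: [ :cls |
            s nextPutAll: cls name; nextPutAll: ': ';
              nextPutAll: cls comment; cr ] ]</preformat>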
      <p>Strategy 3: Comment &amp; dependency-based extraction: For the third strategy, we apply a strategy that relies on the entity relationships between the classes. Just like with Strategy 2, we extract the existing comments. Furthermore, we extract the outgoing references from the class methods to identify the potential collaborators. Then, we instruct an LLM to create a package comment following the CRC card methodology from the sum of the existing class comments along with the outgoing references in its context. This strategy has the upside of potentially exploiting user-written class comments, while also having an understanding of the internal and external dependencies of the package. Similarly to Strategy 2, in the case where a package has a low rate of class comments, this strategy can generate poor quality package comments.</p>
      <p>The implementations of these strategies are made in Pharo and can be found on GitHub (https://github.com/pzaragoza93/AutoCodeDocumentator).</p>
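      <p>One way to approximate the outgoing references used by Strategy 3 is to scan compiled-method literals for class bindings, as sketched below. This is an assumption for illustration: the actual implementation may rely on a dedicated dependency analyser, and the sketch ignores classes referenced through nested literals.</p>
      <preformat>"Strategy 3: collect classes referenced by methods but defined outside
 the package; these are the collaborator candidates added to the prompt."
PackageCommentGenerator >> outgoingReferencesOf: aPackage
    | internal refs |
    internal := (aPackage definedClasses collect: [ :c | c name ]) asSet.
    refs := Set new.
    aPackage definedClasses do: [ :cls |
        cls methods do: [ :m |
            m literals do: [ :lit |
                (lit isVariableBinding and: [ lit value isClass ])
                    ifTrue: [ refs add: lit value name ] ] ] ].
    ^ refs reject: [ :name | internal includes: name ]</preformat>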
    </sec>
    <sec id="sec-4">
      <title>4. Experimentation</title>
      <p>We propose a set of package comment generation strategies that use different sources of information, with the aim of highlighting the strengths and weaknesses of each source. To do so, we establish a set of research questions:
• RQ 1: Does the strategy have an impact on the overall CRC structure?
• RQ 2: Does the strategy have an impact on the description of the responsibility of the package?
• RQ 3: Does the strategy have an impact on the description of the collaborators of the package?
• RQ 4: Does the strategy have an impact on the overall quality across the CRC structure, the responsibility, and the collaborator description of the generated comment when it is compared to the source comment?</p>
      <sec id="sec-4-1">
        <title>Furthermore, the longer the context provided to an LLM becomes, the more likely it is that the LLM</title>
        <p>will improperly retrieve relevant information for a task [6]. This raises another question:
• RQ 5: Does the size of a package impact the quality of the generated comment?</p>
      </sec>
      <sec id="sec-4-2">
        <title>2https://github.com/pzaragoza93/AutoCodeDocumentator</title>
        <sec id="sec-4-2-1">
          <title>4.1. Evaluation Dataset</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>To study the impact of package comment generation, we applied the following package selection process</title>
        <p>to create a dataset of packages with existing comments to serve as a reference (see fig. 4). Initially, we
extract all Pharo 12 packages (768) and filter them to only include packages with comments which
can serve as a reference (128). Then, we exclude test and baseline packages since we deem them less
interesting for comment generation (107).</p>
      </sec>
      <sec id="sec-4-4">
        <title>To study the impact of the diferent strategies, as well as the context size on the quality of package</title>
        <p>comment generation, we order the remaining packaged based on their size (nb of classes). We then,
create three distinctive groups of packages by splitting the ordered list into three groups (small, medium,
and large). Finally, from each group we randomly select 7 packages for a total of 21 packages. From
this dataset of 21 packages, we apply the three diferent strategies to generate 63 package comments.</p>
      </sec>
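        <p>The selection and grouping steps can be sketched as a Pharo workspace snippet. Note the assumptions: the #comment accessor on packages and the name-based filtering of test and baseline packages are illustrative and may need adapting to the exact Pharo 12 API.</p>
        <preformat>"Keep commented, non-test, non-baseline packages, order them by the
 number of defined classes, split into terciles, and sample 7 per group."
| selected sorted tercile groups sample |
selected := RPackageOrganizer default packages reject: [ :p |
    p comment isEmptyOrNil
        or: [ (p name includesSubstring: 'Test')
        or: [ p name includesSubstring: 'BaselineOf' ] ] ].
sorted := (selected asSortedCollection: [ :a :b |
    a definedClasses size &lt;= b definedClasses size ]) asArray.
tercile := sorted size // 3.
groups := Array
    with: (sorted first: tercile)                          "small"
    with: (sorted copyFrom: tercile + 1 to: 2 * tercile)   "medium"
    with: (sorted allButFirst: 2 * tercile).               "large"
"Randomly sample 7 packages per size group: 21 packages in total."
sample := groups flatCollect: [ :group | group shuffled first: 7 ].</preformat>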
      <sec id="sec-4-5">
        <title>To generate the set of package comments, we used ’mistral-small-2503’, an open-source LLM model</title>
        <p>available on huggingface3.</p>
        <sec id="sec-4-5-1">
          <title>4.2. Evaluation Metrics</title>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>To evaluate the quality of the generated package comments, we propose a manual evaluation of the 63 generated comments based 4 categories : (1) the overall CRC methodology structure, (2) the description of the package’s responsibility (3) description of the package’s collaborators, and (4) overall comparison with the original comment as a reference.</title>
      </sec>
      <sec id="sec-4-7">
        <title>With this in mind, we propose a set of 12 statements divided into 4 diferent categories corresponding</title>
        <p>to the categories mentioned above (see table 1). For each statement, the user is asked to evaluate the
veracity of a statement on a Likert scale of 7 points. The more they agree with a statement the higher
the score. Inversely, the more they disagree with a statement the lower the score. For each package, we
then ask these statements for each comment generated by the three comment-generation strategies.</p>
      </sec>
      <sec id="sec-4-8">
        <title>This means 36 statements are asked per package. Finally, we divide the 21 packages into three groups, so that each group of participants only evaluates 7 packages (and their 3 generated comments) for a total of 21 generated comments, or 252 statements.</title>
      </sec>
      <sec id="sec-4-9">
        <title>For this experiment, we separated 6 Pharo users into three diferent groups and evaluate the 21 generated comments over 7 packages. To better illustrate the overall evaluation, we propose fig. 5.</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>The results of the manual evaluation, the scripts for the analysis, and the evaluation dataset is available</title>
        <p>in the github project4. First, we present the average score for each statement and strategy across all</p>
      </sec>
      <sec id="sec-5-2">
        <title>3https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503</title>
      </sec>
      <sec id="sec-5-3">
        <title>4https://github.com/pzaragoza93/label-studio-pharo-evaluation</title>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>The 12 evaluation statements, grouped by category.</p></caption>
        <table>
          <thead>
            <tr><th>Category</th><th>Statement ID</th><th>Statement</th></tr>
          </thead>
          <tbody>
            <tr><td>CRC Methodology</td><td>crc_methodology_accuracy_q1</td><td>The CRC format (Class name, Responsibility, Collaborators) is clearly respected.</td></tr>
            <tr><td>CRC Methodology</td><td>crc_methodology_accuracy_q2</td><td>The comment clearly explains both what the package does and with whom it interacts.</td></tr>
            <tr><td>CRC Methodology</td><td>crc_methodology_accuracy_q3</td><td>The generated comment contains a structured title, description of purpose, and list of external dependencies.</td></tr>
            <tr><td>Responsibility</td><td>responsibility_accuracy_q1</td><td>The generated responsibility description correctly describes the core responsibilities of the package.</td></tr>
            <tr><td>Responsibility</td><td>responsibility_accuracy_q2</td><td>The generated description of the package’s responsibility is clear.</td></tr>
            <tr><td>Responsibility</td><td>responsibility_accuracy_q3</td><td>The generated description of the package’s responsibility is succinct.</td></tr>
            <tr><td>Collaborators</td><td>collaborator_accuracy_q1</td><td>The package’s main collaborators are mentioned and described accurately.</td></tr>
            <tr><td>Collaborators</td><td>collaborator_accuracy_q2</td><td>The generated comment DOES NOT omit important external dependencies.</td></tr>
            <tr><td>Collaborators</td><td>collaborator_accuracy_q3</td><td>The reason for the interactions between its collaborators is clearly explained.</td></tr>
            <tr><td>Comparison</td><td>comparison_accuracy_q1</td><td>The generated comment is more complete than the original comment.</td></tr>
            <tr><td>Comparison</td><td>comparison_accuracy_q2</td><td>The generated comment is more clear than the original comment.</td></tr>
            <tr><td>Comparison</td><td>comparison_accuracy_q3</td><td>The generated comment is more useful than the original comment.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>When we apply an analysis of variance (ANOVA) test to evaluate whether the differences between the three strategies across all 12 statements are significant, we obtain a p-value above 0.05 across the board. Therefore, we cannot conclude that the differences are significant. With regard to RQ1, RQ2, RQ3, and RQ4, we cannot conclude that any strategy is particularly better at increasing the quality of the comment, whether based on the CRC methodology structure, the description of the responsibility, the description of the collaborators, or the quality when compared to the reference.</p>
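      <p>For reference, the one-way ANOVA applied to each statement follows the standard formulation below (the textbook definition, not an excerpt from our analysis scripts), with k = 3 strategies, n_g scores in group g, and N scores in total:</p>
      <disp-formula><tex-math>F = \frac{\mathit{MS}_{\mathrm{between}}}{\mathit{MS}_{\mathrm{within}}} = \frac{\sum_{g=1}^{k} n_g\,(\bar{x}_g - \bar{x})^2 / (k - 1)}{\sum_{g=1}^{k}\sum_{i=1}^{n_g} (x_{gi} - \bar{x}_g)^2 / (N - k)}</tex-math></disp-formula>
      <p>The strategies’ mean scores would be judged significantly different only if the p-value associated with F fell below 0.05, which did not occur for any of the 12 statements.</p>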
      <sec id="sec-5-5">
        <title>However, when we look at statements regarding the quality when compared to the reference (comparison_accuracy), we highlight the overall positive score of 6+ out of 7 across all three strategies. This</title>
        <p>denotes that while a strategy is not particularly better than any other, applying any of these strategy
does generate comments that are preferred over the existing comment in terms of completeness (q1),
clarity (q2), and usefulness (q3).</p>
      </sec>
      <sec id="sec-5-6">
        <title>Furthermore, statements related towards the description of the collaborators of a package contains</title>
        <p>the lowest average. This seems to indicate that LLMs have dificulty highlighting key collaborators
within a package. Indeed, when asked for feedback, several evaluators reported that the listing of
collaborators had a tendency to go on and on, to hallucinate certain relationships, and to not provide
key details.</p>
      <sec id="sec-5-7">
        <title>Regarding RQ5, we calculate the average score for each statement based on the size of the</title>
        <p>packages. Furthermore, we apply an ANOVA test to evaluate whether the diference between
the three strategies across all 12 statements are significative. This is the case for 4
statements: crc_methodology_accuracy_q2, collaborator_accuracy_q2, collaborator_accuracy_q3,
comparison_accuracy_q3. In general, there is a tendency for a better score when the package size is smaller.</p>
      </sec>
      <sec id="sec-5-8">
        <title>In the case of statements collaborator_accuracy_q2 and comparison_accuracy_q3, generated comments for medium packages performed better followed by those generated for small packages. Overall, comments generated for large packages received the lowest score, which indicates that LLMs have a tougher time generating accurate comments the larger a package gets.</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <sec id="sec-6-1">
        <title>We proposed three distinct strategies for generating package comments to investigate how the choice</title>
        <p>of context may influence the quality of CRC-based package comments. While no statistically significant
diferences were found between the three strategies, all strategies-generated comments that were
preferred over existing comments in terms of clarity, completeness, and usefulness. These results are
encouraging in terms of LLMs use for better code comprehension and documentation quality.</p>
      </sec>
      <sec id="sec-6-2">
        <title>However, several limitations must be acknowledged. First, the inter-agreement on manual evaluations</title>
        <p>remains to be addressed. While we attempted to reduce bias by introducing multiple ratings for each
package, we were unable to reach a suficient Kappa coeficient to confirm an inter-judge agreement.</p>
      </sec>
      <sec id="sec-6-3">
        <title>This may be caused by having an insuficient number of evaluators per package or a questionnaire that is not explained well enough.</title>
      </sec>
      <sec id="sec-6-4">
        <title>In future works, we wish to address this by proposing a CI-integrated comment generation pipeline</title>
        <p>where the generated comments can be judged by active developers within the community through
a submitted pull request. Furthermore, we propose enhancing the system’s precision by reinforcing
the comment generation with static analysis-based heuristics. For instance, studies in hallucinations
reduction in models suggest graph-oriented data can help with reducing this type of hallucinations [7].</p>
      </sec>
      <sec id="sec-6-5">
        <title>By providing a list of key collaborators through these heuristics we can reduce the hallucinations and anchor the generation to verified class interactions. In this hybrid approach, LLMs would focus on generating the responsibility descriptions of the package, while collaborator information is algorithmically selected to ensure correctness and consistency.</title>
      </sec>
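      <p>A minimal sketch of this hybrid assembly, reusing the hypothetical placeholders introduced in section 3 (#queryLLM:, #relevantDataFor:, #outgoingReferencesOf:):</p>
      <preformat>"Hybrid: the LLM writes only the responsibility text; the collaborator
 list is computed statically and inserted verbatim into the comment."
PackageCommentGenerator >> hybridCommentFor: aPackage
    | responsibility collaborators |
    responsibility := self queryLLM:
        'Describe in a few sentences the responsibility of the package ',
        aPackage name, ' given: ', (self relevantDataFor: aPackage).
    collaborators := self outgoingReferencesOf: aPackage.
    ^ String streamContents: [ :s |
        s nextPutAll: aPackage name; cr;
          nextPutAll: 'Responsibility: '; nextPutAll: responsibility; cr;
          nextPutAll: 'Collaborators: '.
        collaborators do: [ :each | s nextPutAll: each ]
            separatedBy: [ s nextPutAll: ', ' ] ]</preformat>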
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>[5] T. Ahmed, K. S. Pai, P. Devanbu, E. T. Barr, Automatic semantic augmentation of
language model prompts (for code summarization), 2024. URL: https://arxiv.org/abs/2304.06815.
arXiv:2304.06815.
[6] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How
language models use long contexts, Transactions of the Association for Computational Linguistics
12 (2024) 157–173. doi:10.1162/tacl_a_00638.
[7] M. Barry, G. Caillaut, P. Halftermeyer, R. Qader, M. Mouayad, F. Le Deit, D. Cariolaro, J. Gesnouin,</p>
      </sec>
      <sec id="sec-7-2">
        <title>GraphRAG: Leveraging graph-based eficiency to minimize hallucinations in LLM-driven RAG for</title>
        <p>ifnance data, in: G. A. Gesese, H. Sack, H. Paulheim, A. Merono-Penuela, L. Chen (Eds.), Proceedings
of the Workshop on Generative AI and Knowledge Graphs (GenAIK), International Committee on</p>
      </sec>
      <sec id="sec-7-3">
        <title>Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 54–65.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Measuring program comprehension: A large-scale ifeld study with professionals</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>44</volume>
          (
          <year>2018</year>
          )
          <fpage>951</fpage>
          -
          <lpage>976</lpage>
          . doi:
          <volume>10</volume>
          .1109/TSE.
          <year>2017</year>
          .
          <volume>2734091</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <article-title>A laboratory for teaching object oriented thinking</article-title>
          ,
          <source>in: Conference Proceedings on Object-Oriented Programming Systems, Languages and Applications</source>
          , OOPSLA '89,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>1989</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:
          <volume>10</volume>
          .1145/74877. 74879.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aponte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sridhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pollock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vijay-Shanker</surname>
          </string-name>
          ,
          <article-title>Automatic generation of natural language summaries for java classes</article-title>
          ,
          <source>in: 2013 21st International Conference on Program Comprehension (ICPC)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>32</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICPC.
          <year>2013</year>
          .
          <volume>6613830</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Devanbu</surname>
          </string-name>
          ,
          <article-title>Few-shot training llms for project-specific code-summarization</article-title>
          ,
          <source>in: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering</source>
          , ASE '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          . URL: https: //doi.org/10.1145/3551349.3559555. doi:
          <volume>10</volume>
          .1145/3551349.3559555.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] T. Ahmed, K. S. Pai, P. Devanbu, E. T. Barr, Automatic semantic augmentation of language model prompts (for code summarization), 2024. URL: https://arxiv.org/abs/2304.06815. arXiv:2304.06815.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, Transactions of the Association for Computational Linguistics 12 (2024) 157–173. doi:10.1162/tacl_a_00638.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. Barry, G. Caillaut, P. Halftermeyer, R. Qader, M. Mouayad, F. Le Deit, D. Cariolaro, J. Gesnouin, GraphRAG: Leveraging graph-based efficiency to minimize hallucinations in LLM-driven RAG for finance data, in: G. A. Gesese, H. Sack, H. Paulheim, A. Merono-Penuela, L. Chen (Eds.), Proceedings of the Workshop on Generative AI and Knowledge Graphs (GenAIK), International Committee on Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 54–65.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>