<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Assessment of Code Comment Quality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hard Kapadia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Goa, 403401</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The usefulness of code comments in software development varies widely, and recognizing this requires methods that can rigorously measure their benefit. This work seeks to improve the classification of code comment usefulness through a hybrid approach that combines manually retagged datasets with synthetic data augmentation. For the augmentation, we prompted GPT-3.5-turbo, a large language model, to create additional labelled examples of comments. We constructed a Random Forest baseline classifier. Despite the synthetic examples added to the dataset, we observed no drop in the model's performance, with F1 scores remaining around 0.79 before and after augmentation. The findings of this study shed light on some of the benefits and limitations of applying synthetic data augmentation to the classification of code comment usefulness.</p>
      </abstract>
      <kwd-group>
        <kwd>Random Forest</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Comment Classification</kwd>
        <kwd>Qualitative Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Software metadata, such as runtime traces and structural attributes, is integral to code maintenance and
comprehension, leading to the development of numerous extraction tools [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7 ref8">1, 2, 3, 4, 5, 6, 7, 8</xref>
        ]. In terms
of mining code comments, initial quality assessment efforts relied on lexical and structural analysis,
comparing word similarity (e.g., Levenshtein distance) and comment length to filter non-informative
entries [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11, 12, 13, 14</xref>
        ]. More sophisticated approaches used feature engineering based on developer
surveys [15, 16] or semantic interpretation via knowledge graphs [17, 18] to classify comments as useful
or not useful, thereby aiding codebase decluttering. The recent emergence of Large Language Models
(LLMs) [19] introduces the need to evaluate if their automated quality assessments (e.g., using GPT-3.5
or LLaMA) align with human interpretations. The IRSE track at FIRE 2023 [20] addresses this by
extending prior methodology [17, 21, 22, 23, 24, 25, 26, 27], specifically examining the effectiveness of
various vector space models and the impact of incorporating GPT-generated labels on the performance
of models designed for comment utility prediction in open-source software. Similarly, [28, 29]
explore LLMs for tasks related to this topic.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Dataset Description</title>
      <p>This paper addresses a binary classification task aimed at categorizing source code comments as
either Useful or Not Useful. The system takes a code comment along with its surrounding lines of
code as input and outputs a binary label, which ultimately helps developers more effectively understand
the associated code. This classification system is developed using classical machine learning algorithms,
specifically Random Forests.</p>
      <sec id="sec-3-1">
        <title>3.1. Task Definition</title>
        <p>The two categories of source code comments are defined based on their relevance to the surrounding
code:</p>
        <table-wrap>
          <table>
            <thead>
              <tr>
                <th>Label</th>
                <th>Definition</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Useful</td>
                <td>The given comment is relevant to the corresponding source code.</td>
              </tr>
              <tr>
                <td>Not Useful</td>
                <td>The given comment is not relevant to the corresponding source code.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Datasets</title>
        <p>Our study utilizes two distinct datasets for training and analysis:</p>
        <list list-type="order">
          <list-item>
            <p>Primary Manually Annotated Dataset: This dataset comprises over 11,000 code-comment
pairs written in the C programming language. Each instance, sourced from GitHub, includes
the comment text, a corresponding code snippet, and a binary label (Useful/Not Useful). The
entire dataset was meticulously annotated by a team of 14 human annotators. A sample of this
data structure is presented in Table 1.</p>
          </list-item>
          <list-item>
            <p>GPT-Labeled Augmentation Dataset: We created a secondary, similarly structured dataset also
sourced from GitHub. In this case, the binary labels were assigned by the GPT large language
model. This dataset is explicitly used to augment the primary manually annotated dataset during
subsequent analyses to assess model performance with synthetic data.</p>
          </list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Working Principle</title>
      <p>We employ a Random Forest (RF) algorithm to implement the binary classification system. The system
classifies code comments as Useful or Not Useful by taking the comment text and its surrounding
code snippet as input.</p>
      <p>To prepare the data for the model, we use a pre-trained Universal Sentence Encoder to generate
embeddings for both the code snippets and their corresponding comments. These resultant vector
embeddings form the input features for the RF model.</p>
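      <p>The feature construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the text does not say how the comment and code embeddings are combined, so concatenation is assumed here, and <monospace>toy_embed</monospace> is a hypothetical hash-based stand-in for the Universal Sentence Encoder so that the example runs without downloading a model. The names <monospace>build_features</monospace> and <monospace>toy_embed</monospace> are illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import hashlib
from typing import Callable, List, Sequence

Vector = List[float]


def toy_embed(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for a sentence encoder: maps text to a fixed-length vector.

    A real pipeline would call the pre-trained Universal Sentence Encoder here.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Scale the first `dim` bytes into [0, 1] so every text yields `dim` floats.
    return [byte / 255.0 for byte in digest[:dim]]


def build_features(
    comments: Sequence[str],
    snippets: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> List[Vector]:
    """Embed each comment and its code snippet, then concatenate the two vectors."""
    if len(comments) != len(snippets):
        raise ValueError("comments and snippets must be paired one-to-one")
    # Assumption: one feature row = comment embedding followed by code embedding.
    return [embed(c) + embed(s) for c, s in zip(comments, snippets)]


features = build_features(
    ["/*fix issue 404 handler*/"],
    ["handle_error();"],
)
```
]]></preformat>
      <p>Passing a different <monospace>embed</monospace> callable swaps in a real encoder without changing the rest of the sketch; each row then has twice the encoder's output dimension.</p>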
      <p>The complete dataset is partitioned into a training set and a testing set using an 80%/20% split,
respectively, for all experiments.</p>
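      <p>A minimal sketch of such an 80%/20% partition is shown below. The text does not state whether the split is shuffled, seeded, or stratified, so the seeded random shuffle and the function name <monospace>train_test_split</monospace> are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    samples: Sequence[T], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and cut it into train and test parts."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(list(range(100)))
```
]]></preformat>
      <p>Fixing the seed keeps the same held-out set across the two experiments, which makes their scores directly comparable.</p>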
      <sec id="sec-4-1">
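        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr>
                <th>Description</th>
                <th>Context Snippet</th>
                <th />
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>/*fix issue 404 handler*/</td>
                <td><preformat>-10. int err_code = 0;
-9. Request *req = NULL;
...
-1. #ifndef ISSUE404_FIX
/*fix issue 404 handler*/
1. handle_error();</preformat></td>
                <td>Unnecessary</td>
              </tr>
              <tr>
                <td>/*check for carriage return followed by null character*/</td>
                <td><preformat>-1. if (end_of_stream)
/*check for carriage return...*/
1. c = read_char();
2. if (c == '\n') {
3. line_count++;</preformat></td>
                <td>Informative</td>
              </tr>
              <tr>
                <td>/*Process security context message*/</td>
                <td><preformat>-10. do_cleanup();
...
-2. int status = 0;
-1. while(status == 0) {
/*Process security context message*/
1. send_msg(msg_ctx);</preformat></td>
                <td>Useful</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>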
        <sec id="sec-4-1-1">
          <title>4.1. Random Forest Model</title>
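          <p>We leverage Random Forest, an ensemble method based on decision trees, to enhance predictive
accuracy and mitigate overfitting. The RF prediction is based on the majority vote of its constituent
trees:
<disp-formula id="eq-1"><label>(1)</label><tex-math>\hat{y}(\mathbf{x}) = \operatorname{majority}\left(\{h_t(\mathbf{x})\}_{t=1}^{T}\right)</tex-math></disp-formula>
where <inline-formula><tex-math>h_t(\mathbf{x})</tex-math></inline-formula> is the prediction of the <inline-formula><tex-math>t</tex-math></inline-formula>-th tree for input vector <inline-formula><tex-math>\mathbf{x}</tex-math></inline-formula>, and <inline-formula><tex-math>T</tex-math></inline-formula> is the total number of trees.</p>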
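          <p>The majority-vote rule can be illustrated with a short sketch. The three one-rule "trees" below are hypothetical toy functions, not trained decision trees, and the name <monospace>forest_predict</monospace> is illustrative.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import Callable, Hashable, Sequence

Tree = Callable[[Sequence[float]], Hashable]


def forest_predict(trees: Sequence[Tree], x: Sequence[float]) -> Hashable:
    """Return the label predicted by the largest number of trees for input x."""
    if not trees:
        raise ValueError("the forest needs at least one tree")
    votes = Counter(tree(x) for tree in trees)
    # most_common(1) yields the single (label, count) pair with the highest count.
    label, _count = votes.most_common(1)[0]
    return label


# Three hypothetical one-rule "trees" that each threshold simple feature values.
toy_trees = [
    lambda x: "Useful" if x[0] > 0.5 else "Not Useful",
    lambda x: "Useful" if x[1] > 0.5 else "Not Useful",
    lambda x: "Useful" if x[0] + x[1] > 1.2 else "Not Useful",
]

prediction = forest_predict(toy_trees, [0.9, 0.2])  # two of three trees vote "Not Useful"
```
]]></preformat>
          <p>With an odd number of trees and two classes a tie cannot occur; with an even number, this sketch returns whichever tied label was voted first.</p>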
          <p>Each tree in the ensemble is constructed through bootstrapping (sampling the training data with
replacement) and random feature selection at every node before splitting based on a criterion like
Gini impurity.</p>
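          <p>As a rough illustration of these two ingredients, the sketch below computes the Gini impurity of a set of labels and draws one bootstrap sample. The function names are illustrative, and the example omits the per-node random feature selection.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import Counter
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def gini_impurity(labels: Sequence[str]) -> float:
    """Gini impurity 1 - sum_k p_k^2 of the class distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def bootstrap_sample(
    data: Sequence[T], rng: random.Random
) -> Tuple[List[T], List[T]]:
    """Draw len(data) items with replacement; also return the items never drawn."""
    indices = [rng.randrange(len(data)) for _ in data]
    drawn = set(indices)
    in_bag = [data[i] for i in indices]
    out_of_bag = [item for i, item in enumerate(data) if i not in drawn]
    return in_bag, out_of_bag


pure = gini_impurity(["Useful"] * 4)  # a single class gives impurity 0
mixed = gini_impurity(["Useful", "Useful", "Not Useful", "Not Useful"])  # even split gives 0.5
in_bag, out_of_bag = bootstrap_sample(list(range(10)), random.Random(0))
```
]]></preformat>
          <p>A split is preferred when it lowers the weighted impurity of the resulting child nodes; the rows a tree never draws are the ones available for its out-of-bag evaluation.</p>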
          <p>Key advantages of using Random Forest include its ability to handle multi-dimensional feature spaces
without requiring feature scaling and its robustness in dealing with missing values. The out-of-bag
(OOB) error is used during training to provide an unbiased estimate of the generalization error for
hyperparameter tuning. While a default threshold of 0.5 is used for binary classification, this can be
adjusted to prioritize the identification of the Useful comment class.</p>
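          <p>The effect of moving the decision threshold can be sketched as below. The function name and the choice to treat a score exactly at the threshold as Useful are assumptions for illustration; the score is taken to be the fraction of trees voting Useful.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def classify_with_threshold(
    useful_vote_fraction: float, threshold: float = 0.5
) -> str:
    """Label a comment Useful when the share of trees voting Useful reaches the threshold."""
    if not 0.0 <= useful_vote_fraction <= 1.0:
        raise ValueError("vote fraction must lie in [0, 1]")
    return "Useful" if useful_vote_fraction >= threshold else "Not Useful"


default_label = classify_with_threshold(0.42)                 # below 0.5
lenient_label = classify_with_threshold(0.42, threshold=0.4)  # lowered threshold
```
]]></preformat>
          <p>Lowering the threshold labels more comments as Useful, which typically raises recall for that class at some cost in precision.</p>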
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The results for the binary classification task on both datasets are summarized in Table 2.</p>
      <p>The negligible change in performance metrics between the two experiments suggests that, for this
classifier, the synthetically generated data is practically indistinguishable from the human-annotated
original data. This observation supports the utility of a model like GPT-3.5-turbo for effective data
augmentation in this domain.</p>
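      <p>For reference, the reported metrics can be computed from predictions as in the sketch below. The labels are made up for illustration and are not the study's data; the text does not say whether the scores are for the Useful class or averaged over classes, so scoring the Useful class is an assumption, as is the name <monospace>binary_metrics</monospace>.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "Useful"
) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only; these are not the paper's data.
truth = ["Useful", "Useful", "Useful", "Not Useful", "Not Useful"]
preds = ["Useful", "Useful", "Not Useful", "Useful", "Not Useful"]
scores = binary_metrics(truth, preds)
```
]]></preformat>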
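      <table-wrap id="tab-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr>
              <th>Dataset</th>
              <th>Acc. (%)</th>
              <th>Prec.</th>
              <th>Rec.</th>
              <th>F1-Score</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Initial Dataset</td>
              <td>81.12</td>
              <td>0.7915</td>
              <td>0.8020</td>
              <td>0.7955</td>
            </tr>
            <tr>
              <td>GPT-Augmented Set</td>
              <td>81.08</td>
              <td>0.7922</td>
              <td>0.8015</td>
              <td>0.7958</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>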
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This article addresses a binary classification problem: judging the usefulness of comments
embedded in source code written in C. The primary classification method was Random Forest. Two
experiments were completed; the first used only the original dataset, while the second combined the
original dataset with a synthetic dataset created by a Generative Pre-trained Transformer (GPT) model.
The similar results from both experiments indicate that the synthetic data closely resembled the
original data, showing how synthetic data generation can increase the volume of data available for
developing models. The quality of the synthetic data relative to the original dataset is, in part,
evident from those results. Overall, synthetic data generation is useful for data augmentation and can
be applied in various pipelines.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>In the course of preparing this manuscript, the author(s) employed the generative AI tool ChatGPT. Its
use was limited to performing checks for grammar and spelling. Following this, the author(s) conducted
a thorough review and revision of the text and assume full responsibility for the final published content.</p>
      <ref-list>
        <ref id="ref12">
          <mixed-citation>[12] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can we predict useful comments in source codes? - analysis of findings from information retrieval in software engineering track @ fire 2022, in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.</mixed-citation>
        </ref>
        <ref id="ref13">
          <mixed-citation>[13] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Overview of the irse track at fire 2022: Information retrieval in software engineering, in: FIRE (Working Notes), 2022, pp. 1–9.</mixed-citation>
        </ref>
        <ref id="ref14">
          <mixed-citation>[14] J. L. Freitas, D. da Cruz, P. R. Henriques, A comment analysis approach for program comprehension, Annual Software Engineering Workshop (SEW), IEEE, 2012, pp. 11–20.</mixed-citation>
        </ref>
        <ref id="ref15">
          <mixed-citation>[15] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual features and developer experience, International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226.</mixed-citation>
        </ref>
        <ref id="ref16">
          <mixed-citation>[16] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at microsoft, Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156.</mixed-citation>
        </ref>
        <ref id="ref17">
          <mixed-citation>[17] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463.</mixed-citation>
        </ref>
        <ref id="ref18">
          <mixed-citation>[18] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine—a semantic search approach to program comprehension from code comments, in: Advanced Computing and Systems for Security, Springer, 2020, pp. 29–42.</mixed-citation>
        </ref>
        <ref id="ref19">
          <mixed-citation>[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.</mixed-citation>
        </ref>
        <ref id="ref20">
          <mixed-citation>[20] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Generative ai for software metadata: Overview of the information retrieval in software engineering track at fire 2023, arXiv preprint arXiv:2311.03374 (2023).</mixed-citation>
        </ref>
        <ref id="ref21">
          <mixed-citation>[21] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional software code representation using bert and elmo, in: 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.</mixed-citation>
        </ref>
        <ref id="ref22">
          <mixed-citation>[22] S. Majumdar, A. Deshpande, P. P. Das, P. P. Chakrabarti, Comprehending c codes with llms: Effective comment generation through retrieval and reasoning, Pattern Recognition Letters (2025).</mixed-citation>
        </ref>
        <ref id="ref23">
          <mixed-citation>[23] S. Paul, S. Majumdar, R. Shah, S. Das, M. Ghosh, D. Ganguly, G. Calikli, D. Sanyal, P. P. Das, P. D. Clough, et al., Overview of the "information retrieval in software engineering" (irse) track at forum for information retrieval 2024, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation, 2024, pp. 18–21.</mixed-citation>
        </ref>
        <ref id="ref24">
          <mixed-citation>[24] N. Chatterjee, S. Majumdar, P. P. Das, A. Chakrabarti, Parallelc-assist: Productivity accelerator suite based on dynamic instrumentation, IEEE Access (2023).</mixed-citation>
        </ref>
        <ref id="ref25">
          <mixed-citation>[25] P. Chakraborty, S. Dutta, D. K. Sanyal, S. Majumdar, P. P. Das, Bringing order to chaos: Conceptualizing a personal research knowledge graph for scientists, IEEE Data Eng. Bull. 46 (2023) 43–56.</mixed-citation>
        </ref>
        <ref id="ref26">
          <mixed-citation>[26] S. Paul, S. Majumdar, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. Das, P. D. Clough, P. Majumder, Efficiency of large language models to scale up ground truth: Overview of the irse track at forum for information retrieval 2023, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, 2023, pp. 16–18.</mixed-citation>
        </ref>
        <ref id="ref27">
          <mixed-citation>[27] N. Chatterjee, S. Majumdar, P. P. Das, A. Chakrabarti, Tool assisted agile approach for legacy application migration, International Journal of System Assurance Engineering and Management (2025) 1–16.</mixed-citation>
        </ref>
        <ref id="ref28">
          <mixed-citation>[28] A. Deshpande, A. Maji, D. Mondol, P. P. Das, P. D. Clough, S. Majumdar, The code–llm handshake: Smarter maintenance through ai, in: Proceedings of the 17th annual meeting of the Forum for Information Retrieval Evaluation, 2025, pp. 9–12.</mixed-citation>
        </ref>
        <ref id="ref29">
          <mixed-citation>[29] A. Mitra, S. Majumdar, A. Mukhopadhyay, P. P. Das, P. D. Clough, P. P. Chakrabarti, Operationalizing large language models with design-aware contexts for code comment generation, arXiv</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papdeja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Smartkt: a search framework to assist program comprehension using smart knowledge transfer</article-title>
          ,
          <source>in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Debugging multi-threaded applications using pin-augmented gdb (pgdb)</article-title>
          ,
          <source>in: International conference on software engineering research and practice (SERP)</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>D-cube: tool for dynamic design discovery from multi-threaded applications using pin</article-title>
          ,
          <source>in: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Siegmund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Peitek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Parnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Apel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hofmeister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kästner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Begel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bethmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brechmann</surname>
          </string-name>
          ,
          <article-title>Measuring neural efficiency of program comprehension</article-title>
          ,
          <source>in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>140</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers</article-title>
          ,
          <source>Innovations in Systems and Software Engineering</source>
          <volume>17</volume>
          (
          <year>2021</year>
          )
          <fpage>289</fpage>
          -
          <lpage>307</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pratim Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Dcube_nn: Tool for dynamic design discovery from multi-threaded applications using neural sequence models</article-title>
          ,
          <source>Advanced Computing and Systems for Security:</source>
          Volume
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>75</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. C. B.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Anquetil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>de Oliveira</surname>
          </string-name>
          ,
          <article-title>A study of the documentation essential to software maintenance</article-title>
          ,
          <source>Conference on Design of communication, ACM</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Smart knowledge transfer using google-like search</article-title>
          ,
          <source>arXiv preprint arXiv:2308.06653</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Hotcomments: how to make program comments more useful?</article-title>
          ,
          <source>in: Conference on Programming language design and implementation (SIGPLAN)</source>
          , ACM,
          <year>2007</year>
          , pp.
          <fpage>20</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Gotmare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Bui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Codet5+: Open code large language models for code understanding and generation</article-title>
          ,
          <source>arXiv preprint arXiv:2305.07922</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Steidl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hummel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Juergens</surname>
          </string-name>
          ,
          <article-title>Quality analysis of source code comments</article-title>
          ,
          <source>International Conference on Program Comprehension (ICPC)</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>