<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating the Impact of Synthetic Data on Code Comment Quality Prediction: A Logistic Regression Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Subhajit Dutta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Kharagpur, 721302</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The quality of code comments in software development varies widely, which emphasizes the need for trustworthy evaluation methods. By combining manually annotated datasets with artificially generated data, this work aims to improve the classification of comment usefulness. We used GPT-4.1 to label additional comments in order to augment the data. The baseline classifier, a logistic regression model, achieved an F1 score of roughly 0.81 and showed little improvement when the synthetic data was added. The study examines the benefits and drawbacks of using artificial data to assess the relevance of code comments.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>GPT-4.1</kwd>
        <kwd>Comment Classification</kwd>
        <kwd>Logistic Regression</kwd>
        <kwd>Qualitative Analysis</kwd>
        <kwd>Data Augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Software is now essential in many vital industries, including banking, healthcare, and transportation, in
the current digital landscape. Organizations must regularly update their current systems and create
new applications in order to stay up with changing requirements. Code complexity inevitably rises as a
result of this continuous evolution, which is necessary to support new features. Thus, a key component
of the Software Development Life Cycle (SDLC) is efficiently managing these sizable and intricate
codebases.</p>
      <p>Quick fixes, code additions, and updates to already-existing applications are frequently the outcome of
rapid development cycles. These hurried schedules, though, occasionally result in less-than-ideal coding
techniques. Supporting documentation, such as requirement specifications and design documents, may
become obsolete as software develops further, and the original developers are frequently unavailable
for consultation. Program comprehension is a crucial tactic for preserving and enhancing current
codebases, and this scenario emphasizes the significance of implementing organized, quality-focused
processes in software development.</p>
      <p>Because software is always changing, artifacts such as test execution logs, static code analysis
results, and code comments serve as trustworthy sources of information. This study emphasizes code
comments as important
software design indicators for automated tools and developers alike. They help with understanding
and upkeep by encapsulating the goals and reasoning behind the code. Their uneven quality, however,
emphasizes the necessity for automated techniques to assess their value.</p>
      <p>The lack of well-annotated datasets that represent a variety of programming contexts presents a
significant challenge when assessing comment utility. To overcome this, new methods are needed to
enlarge existing datasets and improve model performance on real-world data. To that end, our work
combines synthetic data augmentation produced by GPT-4.1 with manual annotation.</p>
      <p>The challenge of dividing source code comments in the C programming language into two
groups—useful and not useful—is the focus of this paper. As the baseline classifier, we trained a
logistic regression model on a dataset of 11,500 manually annotated comments. We added more than
300 GPT-labeled samples to the dataset in order to investigate possible enhancements. With an F1 score
of 0.81 on both the original and augmented datasets, the model’s performance stayed consistent.</p>
      <p>By combining synthetic data augmentation with manual annotations, our study improves our
understanding of code comment utility classification. Our goal is to solve current issues and facilitate the
creation of more flexible models for contemporary software engineering.</p>
      <p>The structure of this paper is as follows: Section 2 reviews related work on comment classification.
Section 3 describes the task and dataset, while Section 4 outlines the proposed methodology. Section 5
presents the results, and Section 6 provides the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Literature Review</title>
      <p>
        Within the field of software engineering, automatic program comprehension is a well-established
research topic. Using sources like runtime execution traces and structural code attributes, a number of
tools have been developed to make it easier to extract knowledge from software metadata [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4, 5,
6, 7, 8</xref>
        ].
      </p>
      <p>In order to comprehend the code flow, novice programmers usually rely on the comments that are
already there. Not all comments, though, successfully aid in program comprehension, so it’s important
to consider their applicability before using them. Numerous scholars have investigated the automatic
categorization of source code comments, emphasizing the assessment of their quality. For instance,
Oman et al. [9] suggested a hierarchical structure for the variables affecting software maintainability.
They made it possible to evaluate software features that can be combined into a single maintainability
index by introducing quantifiable attributes for each factor in the form of metrics. Fluri et al. [10]
looked into whether source code and related comments change together over time. They discovered
that 97% of comment changes happened in the same revision as the corresponding code changes in
three open-source projects: ArgoUML, Azureus, and JDT Core. The quality model was transformed
into a structured knowledge base appropriate for industrial applications by Deissenboeck et al. [11],
who proposed a two-dimensional maintainability model that explicitly links system properties with
maintenance activities. An empirical study on task annotations embedded in source code was carried
out by Storey et al. [12], with a focus on their function in task management for developers. They noted
that task management entails striking a balance between the manual annotations developers add to
their code and official issue-tracking systems. In order to improve the readability of PL/I programs,
Tenny [13] conducted a 3 × 2 experiment comparing procedure formatting with comments.
Programs without comments were the least readable, according to the results of a survey given to
student participants after they had read the program.</p>
      <p>Yu et al. [14] categorized source code comments into four groups: unqualified, qualified, good, and
excellent. They also showed that classification performance was enhanced by combining several basic
classifiers. Majumdar et al. [15] presented CommentProbe, an automated classification system for
assessing the quality of comments in C codebases. Even though there has been a lot of progress in
examining source code comments from a variety of angles [8, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26], automated quality evaluation of comments is still an important and developing field that needs
more investigation.</p>
      <p>With the emergence of large language models [27], it is crucial to compare how models such as
GPT-4.1 assess the quality of code comments against human interpretation. The
IRSE track at FIRE 2025 [28, 22] expands on the approaches put forth in [15, 23, 29] to explore different
vector space models [30] and features for binary classification and evaluation of comments in relation
to code comprehension. Incorporating GPT-generated labels for code quality and comment snippets
taken from open-source software, this track also evaluates the predictive model’s performance.
Example comments from the task include:</p>
      <p>/*Transform a minor status code (the underlying routine error) into text.*/</p>
      <p>/*cr to cr,nul*/</p>
      <p>/*test 637*/</p>
    </sec>
    <sec id="sec-3">
      <title>3. Classification Goal and Data Characteristics</title>
      <p>The task of developing a binary classification system to differentiate source code comments as either
useful or not useful is the focus of this paper. After processing a comment and the code that goes with it,
the system forecasts a label that accurately reflects the comment’s significance. For this classification,
conventional machine learning techniques like logistic regression can be used. The following is a
definition of the comment categories:
• Useful - The comment offers precise or pertinent code information.</p>
      <p>• Not Useful - The comment doesn’t provide any useful details about the code.</p>
      <p>11,500 C code-comment pairs make up our primary dataset. A team of 15 people annotated the
pairs to indicate whether or not the comments were helpful. Additionally, by gathering code-comment
pairs from GitHub and labeling them with GPT-4.1, we produced an extra dataset. Our primary dataset is
supplemented by this secondary dataset, which is formatted identically to the original.</p>
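As a sketch, the comment categories above can be represented as labeled code-comment pairs. The example comments below are the paper’s own samples (Section 2); the label assignments are hypothetical illustrations, not entries taken from the annotated dataset.

```python
# Sketch of the labeled code-comment pair format; labels are illustrative.
pairs = [
    {"comment": "/*Transform a minor status code "
                "(the underlying routine error) into text.*/",
     "label": "useful"},       # explains intent, so plausibly useful
    {"comment": "/*cr to cr,nul*/",
     "label": "not useful"},   # terse, adds little information
    {"comment": "/*test 637*/",
     "label": "not useful"},   # gives no insight into the code
]

# The classification task maps each pair to one of the two labels.
useful = [p for p in pairs if p["label"] == "useful"]
print(len(pairs), len(useful))
```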
    </sec>
    <sec id="sec-4">
      <title>4. Model Implementation and Experimental Setup</title>
      <p>The binary classification functionality is implemented using logistic regression. The system accepts
surrounding code snippets and comments as input. We use a pre-trained Universal Sentence Encoder to
generate embeddings of each code segment and the corresponding comment. The logistic regression
model is trained on the output of the embedding process. The training dataset contains 80% of the
data instances and their labels; in both experiments, the remainder is used for testing. The model is
described next.</p>
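A minimal sketch of this setup, with a deterministic placeholder standing in for the pre-trained Universal Sentence Encoder (the real encoder, typically loaded from TensorFlow Hub, produces 512-dimensional vectors) and the 80/20 train/test split described above. The corpus here is hypothetical.

```python
import numpy as np

def embed(texts, dim=32):
    """Placeholder for the Universal Sentence Encoder: deterministic
    hash-seeded vectors standing in for real sentence embeddings."""
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        vecs[i] = rng.standard_normal(dim)
    return vecs

# Hypothetical corpus: comments with labels (1 = useful, 0 = not useful).
comments = [f"comment {i}" for i in range(100)]
labels = np.array([i % 2 for i in range(100)])

X = embed(comments)

# 80/20 train/test split, as in the experimental setup.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, y_train = X[train_idx], labels[train_idx]
X_test, y_test = X[test_idx], labels[test_idx]

print(X_train.shape, X_test.shape)
```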
      <p>The pipeline combines the original dataset (11,500 manually labeled samples) with the synthetic
data (311 GPT-labeled samples), extracts features with the Universal Sentence Encoder, and trains a
logistic regression model for binary classification (useful / not useful), evaluated with the F1 score
(≈ 0.81).</p>
      <sec id="sec-4-1">
        <title>4.1. Logistic Regression</title>
        <p>For the binary comment classification task, we employ logistic regression, which constrains its
output to the interval (0, 1) by applying the logistic function. The model is defined as:</p>
        <p>z = w · x + b (1)</p>
        <p>σ(z) = 1 / (1 + e^(−z)) (2)</p>
        <p>The logistic function (equation 2) receives the output of the linear combination (equation 1).
The probability produced by the logistic function is compared against an acceptance threshold to
predict the binary class; for the useful comment class, we set the threshold to 0.6. Every training
instance yields a three-dimensional input feature that is fed into the regression function. The model
parameters are learned by minimizing the cross-entropy loss.</p>
      </sec>
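The logistic function, the 0.6 acceptance threshold, and the cross-entropy loss can be sketched directly in NumPy. The weights and the input below are hypothetical illustrations, not the trained parameters.

```python
import numpy as np

def logistic(z):
    # Equation (2): sigma(z) = 1 / (1 + e^(-z)), keeps output in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b, threshold=0.6):
    # Equation (1): z = w . x + b, then threshold the probability
    # at 0.6 for the "useful" class.
    p = logistic(np.dot(w, x) + b)
    return ("useful" if p >= threshold else "not useful"), p

def cross_entropy(y_true, p):
    # Loss minimized during training (per instance).
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical weights for a three-dimensional input feature.
w = np.array([0.5, -0.2, 0.8])
b = 0.1
label, p = predict(np.array([1.0, 1.0, 1.0]), w, b)
print(label, round(float(p), 3))  # z = 1.2, sigma(1.2) ~ 0.769 -> "useful"
```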
    </sec>
    <sec id="sec-5">
      <title>5. Classification Performance Metrics</title>
      <p>We use both datasets to train our logistic regression model. There are 311 samples in the GPT-generated
data compared to 11,500 samples in the original dataset. The first experiment uses only the original
data; in the second experiment, the GPT-generated data is added to the original dataset. The resulting
accuracies are shown below.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Classification accuracy on the original and augmented datasets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>Accuracy (%)</th></tr>
          </thead>
          <tbody>
            <tr><td>Original Dataset</td><td>82.6374</td></tr>
            <tr><td>Augmented Dataset</td><td>81.7362</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The very slight change in scores across metrics suggests that the GPT-generated data was largely
indistinguishable from the original dataset, supporting its validity for data augmentation even though
it did not improve classification performance.</p>
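The reported accuracy and F1 score are computed from confusion-matrix counts on the test split. The sketch below uses hypothetical counts chosen only to illustrate the computation, not the paper’s actual predictions.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy and F1 for the 'useful' class from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical counts for a 2,300-sample test split (20% of 11,500).
acc, f1 = metrics(tp=950, fp=200, fn=200, tn=950)
print(round(acc, 3), round(f1, 3))
```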
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study looked into the classification of code comment quality using synthetic data. The main goal
was to enhance the classification of comment utility by combining preexisting manually annotated data
with synthetic samples. For the binary classification task, we used a baseline logistic regression model,
which achieved an F1 score of approximately 0.81. One of the main conclusions of the study was that
classification performance was not improved by the augmentation of synthetic data: even after the
augmentation, the model’s results held steady, retaining the F1 score of 0.81, suggesting that the
addition of the synthetic data had little effect. This result implies that in some situations, traditional
annotated data remains highly effective. The study suggests investigating different strategies for future
research, such as employing more intricate models or integrating domain-specific expertise.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>In the course of preparing this manuscript, the author(s) employed the generative AI tool ChatGPT. Its
use was limited to performing checks for grammar and spelling. Following this, the author(s) conducted
a thorough review and revision of the text and assume full responsibility for the final published content.</p>
      <p>[5] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, A mathematical framework for design
discovery from multi-threaded applications using neural sequence solvers, Innovations in Systems
and Software Engineering 17 (2021) 289–307.
[6] S. Majumdar, N. Chatterjee, P. Pratim Das, A. Chakrabarti, Dcube_nn: Tool for dynamic
design discovery from multi-threaded applications using neural sequence models, Advanced
Computing and Systems for Security: Volume 14 (2021) 75–92.
[7] J. Siegmund, N. Peitek, C. Parnin, S. Apel, J. Hofmeister, C. Kästner, A. Begel, A. Bethmann,
A. Brechmann, Measuring neural efficiency of program comprehension, in: Proceedings of the
2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 140–150.
[8] N. Chatterjee, S. Majumdar, P. P. Das, A. Chakrabarti, Parallelc-assist: Productivity accelerator
suite based on dynamic instrumentation, IEEE Access 11 (2023) 73599–73612.
[9] P. Oman, J. Hagemeister, Metrics for assessing a software system’s maintainability, in: Proceedings</p>
      <p>Conference on Software Maintenance 1992, IEEE Computer Society, 1992, pp. 337–338.
[10] B. Fluri, M. Wursch, H. C. Gall, Do code and comments co-evolve? on the relation between source
code and comment changes, in: 14th Working Conference on Reverse Engineering (WCRE 2007),
IEEE, 2007, pp. 70–79.
[11] F. Deissenboeck, S. Wagner, M. Pizka, S. Teuchert, J.-F. Girard, An activity-based quality model for
maintainability, in: 2007 IEEE International Conference on Software Maintenance, IEEE, 2007, pp.
184–193.
[12] M.-A. Storey, J. Ryall, R. I. Bull, D. Myers, J. Singer, Todo or to bug, in: 2008 ACM/IEEE 30th</p>
      <p>International Conference on Software Engineering, IEEE, 2008, pp. 251–260.
[13] T. Tenny, Program readability: Procedures versus comments, IEEE Transactions on Software</p>
      <p>Engineering 14 (1988) 1271.
[14] H. Yu, B. Li, P. Wang, D. Jia, Y. Wang, Source code comments quality assessment method based on
aggregation of classification algorithms, Journal of Computer Applications 36 (2016) 3448.
[15] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of
comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022)
e2463.
[16] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine—a semantic search approach to
program comprehension from code comments, in: Advanced Computing and Systems for Security,
Springer, 2020, pp. 29–42.
[17] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Overview
of the irse track at fire 2022: Information retrieval in software engineering., in: FIRE (Working
Notes), 2022, pp. 1–9.
[18] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can
we predict useful comments in source codes?-analysis of findings from information retrieval in
software engineering track@ fire 2022, in: Proceedings of the 14th Annual Meeting of the Forum
for Information Retrieval Evaluation, 2022, pp. 15–17.
[19] S. Majumdar, P. P. Das, Smart knowledge transfer using google-like search, arXiv preprint
arXiv:2308.06653 (2023).
[20] P. Chakraborty, S. Dutta, D. K. Sanyal, S. Majumdar, P. P. Das, Bringing order to chaos:
Conceptualizing a personal research knowledge graph for scientists., IEEE Data Eng. Bull. 46 (2023)
43–56.
[21] S. Majumdar, A. Deshpande, P. P. Das, P. P. Chakrabarti, Comprehending c codes with llms:</p>
      <p>Efective comment generation through retrieval and reasoning, Pattern Recognition Letters (2025).
[22] S. Paul, S. Majumdar, R. Shah, S. Das, M. Ghosh, D. Ganguly, G. Calikli, D. Sanyal, P. P. Das, P. D.</p>
      <p>Clough, et al., Overview of the “information retrieval in software engineering”(irse) track at forum
for information retrieval 2024, in: Proceedings of the 16th Annual Meeting of the Forum for
Information Retrieval Evaluation, 2024, pp. 18–21.
[23] S. Paul, S. Majumdar, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. Das, P. D. Clough, P.
Majumder, Efficiency of large language models to scale up ground truth: Overview of the irse track
at forum for information retrieval 2023, in: Proceedings of the 15th Annual Meeting of the Forum
for Information Retrieval Evaluation, 2023, pp. 16–18.
[24] N. Chatterjee, S. Majumdar, P. P. Das, A. Chakrabarti, Tool assisted agile approach for legacy
application migration, International Journal of System Assurance Engineering and Management
(2025) 1–16.
[25] A. Deshpande, A. Maji, D. Mondol, P. P. Das, P. D. Clough, S. Majumdar, The code–llm handshake:
Smarter maintenance through ai, in: Proceedings of the 17th annual meeting of the Forum for
Information Retrieval Evaluation, 2025, pp. 9–12.
[26] A. Mitra, S. Majumdar, A. Mukhopadhyay, P. P. Das, P. D. Clough, P. P. Chakrabarti,
Operationalizing large language models with design-aware contexts for code comment generation, arXiv
preprint arXiv:2510.22338 (2025).
[27] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information
processing systems 33 (2020) 1877–1901.
[28] S. Paul, S. Majumdar, R. Shah, S. Das, M. Ghosh, D. Ganguly, G. Calikli, D. Sanyal, P. P. Das,
P. D Clough, A. Bandyopadhyay, S. Chattopadhyay, Generative ai for code metadata quality
assessment, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval
Evaluation, 2024.
[29] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough,
P. Majumder, Generative ai for software metadata: Overview of the information retrieval in
software engineering track at fire 2023, arXiv preprint arXiv:2311.03374 (2023).
[30] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional
software code representation using bert and elmo, in: 2022 IEEE 22nd International Conference
on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>C. B. de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Anquetil</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. M. de Oliveira</surname>
          </string-name>
          ,
          <article-title>A study of the documentation essential to software maintenance</article-title>
          ,
          <source>Conference on Design of communication, ACM</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papdeja</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. P. Das</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          <string-name>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Smartkt: a search framework to assist program comprehension using smart knowledge transfer</article-title>
          ,
          <source>in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. P. Das</surname>
          </string-name>
          ,
          <article-title>Debugging multi-threaded applications using pin-augmented gdb (pgdb)</article-title>
          ,
          <source>in: International conference on software engineering research and practice (SERP)</source>
          . Springer,
          <year>2015</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. P. Das</surname>
          </string-name>
          ,
          <article-title>D-cube: tool for dynamic design discovery from multi-threaded applications using pin</article-title>
          ,
          <source>in: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>