<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Code Readability: A Machine Learning Approach to Classifying Valuable Source Code Comments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Subhash Agrawal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Kharagpur</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces a framework designed to classify source code comments based on their practical value, helping new developers better understand the code. We applied three machine learning models-logistic regression, support vector machine, and multinomial naive Bayes-to an initial training dataset to sort comments into two groups: useful and not useful. The initial training results showed accuracies of 83%, 84.12%, and 51% for each model, respectively. Next, we expanded the dataset with additional code-comment pairs from various online sources, using the ChatGPT large language model (LLM) to label the new data. After retraining the models with this combined dataset, we found that the models' accuracy decreased slightly, suggesting that the new data introduced some level of noise and bias.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Machine Learning</kwd>
        <kwd>Comment classification</kwd>
        <kwd>Qualitative analysis</kwd>
        <kwd>Logistic Regression</kwd>
        <kwd>Support vector machine</kwd>
        <kwd>Multinomial naive Bayes</kwd>
        <kwd>Large language model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In today’s technology-driven landscape, programming plays an integral role across industries—from
healthcare and communications to transportation. As the demand for new applications grows, so does
the volume of source code. Frequent updates to existing code bases only add to this volume, making it
increasingly challenging to manage and debug substantial code repositories within tight timelines. This
pressure often leads to poor code documentation, making it dificult to reuse older code efectively for
future projects.</p>
      <p>Without clear documentation, developers may need to repeatedly execute and analyze the code to
understand its functionality, which is tedious and time-consuming. To speed up the process, developers
might skip crucial steps, increasing the risk of errors and reducing the overall efectiveness of the new
software. A common way to address these challenges is through code comments, which give insights
into the original developer’s thought process. However, variations in individual commenting styles can
create inconsistency, making code dificult to interpret. For this reason, standard practices for writing
comments are essential.</p>
      <p>When dealing with existing source code that lacks clear comments or documentation, a diferent
solution is needed. Developing tools that enhance the readability of legacy code has become a research
focus, as such tools can save developers valuable time and help in building more robust applications.</p>
      <p>In this paper, we present a framework to classify code comments based on their usefulness, aiming to
categorize comments as either useful or not useful in helping developers understand the code. To build
this framework, we used a dataset of approximately 10,000 code-comment pairs written in C, training
three machine learning models—support vector machine, logistic regression, and multinomial naive
Bayes—on this initial dataset. Additionally, we expanded our dataset by gathering more code-comment
pairs from online sources, again in C language, and used ChatGPT-4 to label the new data as useful
or not useful. With this combined dataset, we re-trained each model and compared the new accuracy
and F1 scores to the initial results. This comparison helped us assess the impact of the additional data,
highlighting both the benefits and potential drawbacks of using large language model-generated labels.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        New programmers often rely on existing comments to comprehend code flow. However, not all
comments contribute efectively to program comprehension, necessitating a relevancy assessment
of source code comments prior to their use. Numerous researchers have focused on the automatic
classification of source code comments in terms of quality evaluation. For instance, Omal et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] noted
that factors influencing software maintainability can be organized into hierarchical structures. The
authors defined measurable attributes in the form of metrics for each factor, enabling the assessment of
software characteristics, which can then be consolidated into a single index of software maintainability.
Fluri et al.[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] examined whether the source code and associated comments are changed together along
the multiple versions. They investigated three open source systems, such as ArgoUML, Azureus, and JDT
Core, and found that 97% of the comment changes are done in the same revision as the associated source
code changes. Another work[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] published in 2007 which proposed a two-dimensional maintainability
model that explicitly associates system properties with the activities carried out during maintenance.
The author claimed that this approach transforms the quality model into a structured quality knowledge
base that is usable in industrial environments. Storey et al. did an empirical study on task annotations
embedding within a source code and how it plays a vital role in a developer’s task management[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The paper described how task management is negotiated between formal issue tracking systems and
manual annotations that programmers include within their source code. Ted et al.[5] performed a 3 × 2
experiment to compare the eforts of procedure format with those of comments on the readability of a
PL/I program. The readability accuracy was checked by questioning students about the program after
reading it. The result said that the program without comment was the least readable. Yu Hai et al.[6]
classified source code comments into four classes - unqualified, qualified, good, and excellent. The
aggregation of basic classification algorithms further improved the classification result. Another work
published in [7] in which author proposed an automatic classification mechanism "CommentProbe"
for quality evaluation of code comments of C codebases. We see that people worked on source code
comments with diferent aspects[ 7, 8, 9, 10, 11, 12], but still, automatic quality evaluation of source code
comments is an important area and demands more research.
      </p>
      <p>With the advent of large language models [13], it is important to compare the quality assessment
of code comments by the standard models like GPT 3.5 or llama with the human interpretation. The
IRSE track at FIRE 2024 [14, 15] builds upon the methodologies proposed in [7, 16, 17, 10] to investigate
various vector space models [18] and features for binary classification and evaluation of comments
in relation to code comprehension. This track also assesses the performance of the predictive model
by incorporating GPT-generated labels for the quality of code and comment snippets extracted from
open-source software.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Dataset Description</title>
      <p>This section outlines the focus of our research, which is organized into three main components: designing
a classification framework with a seed dataset, enhancing this dataset using a large language model, and
retraining the classification framework with the newly augmented data. Our objective is to establish a
binary classification system that categorizes pairs of source code and comments as either - useful and
not useful. The input for this process includes a comment and its corresponding lines of code, while the
output will be a label indicating its relevance either useful or not useful. This framework is intended to
assist developers in better understanding the associated code. We employ traditional machine learning
algorithms, including Naive Bayes, logistic regression, and support vector machines (SVM), to create
the classification system.</p>
      <p>The two classes of source code comments can be described as follows:
• Useful - The comment is relevant and provides insight into the corresponding source code.
• Not Useful - The comment does not relate meaningfully to the source code it references.
#
1
/*test 529*/
2</p>
      <p>/*cr to cr,nul*/
3
/*convert minor status code
(underlying routine error) to text*/</p>
      <p>Code
-10. int res = 0;
-9. CURL *curl = NULL;
-8. FILE *hd_src = NULL;
-7. int hd;
-6. struct_stat file_info;
-5. CURLM *m = NULL;
-4. int running;
-3. start_test_timing();
-2. if(!libtest_arg2) {
-1. #ifdef LIB529
/*test 529*/
1. fprin
-1. else
/*cr to cr,nul*/
1. newline = 0;
2. }
3. else {
4. if(test-&gt;rcount) {
5. c = test-&gt;rptr[0];
6. test-&gt;rptr++;
7. test-&gt;rcount–;
8. }
9. else
10. break;
-10. break;
-9. }
-8. gss_release_bufer(&amp;min_stat, &amp;status_string);
-7. }
-6. if(sizeof(buf) &gt; len + 3) {
-5. strcpy(buf + len, ".\n");
-4. len += 2;
-3. }
-2. msg_ctx = 0;
-1. while(!msg_ctx) {
/*con</p>
      <p>Label
Not Useful
Not Useful
Useful</p>
      <p>For our study, we utilize a seed dataset comprising around than 9,500 pairs of code and comments,
all written in C. Each entry contains the comment text, an accompanying code snippet, and a label
indicating whether the comment is useful or not. This dataset was sourced from GitHub and meticulously
annotated by a team of 11 individuals. A sample data is illustrated in table 1. To enhance our dataset, we
collected additional code-comment pairs from various online platforms. These new pairs were classified
into the aforementioned categories using a large language model. We then merged this new subset with
our original seed dataset.</p>
      <p>The augmented dataset is utilized to retrain the classification model, allowing us to evaluate the
impact of the augmentation process. We also investigate the introduction of noise, the distribution of
the dataset, and several other factors that may influence changes in accuracy during training with the
enriched dataset.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Working Principle</title>
      <p>We aim to train three diferent machine learning models—Logistic Regression, Support Vector Machine
(SVM), and Multinomial Naïve Bayes—to achieve the binary classification task. It’s important to note
that we are not utilizing any deep learning frameworks due to the specific constraints of our project. The
system processes both comments and their corresponding code segments as input. Before converting
the comments into a numerical vector space, we preprocess the text by lemmatizing and tokenizing it
using an English word tokenizer. To transform each English keyword into numerical representations,
we employ a TF-IDF vectorizer. This approach produces a TF-IDF matrix encompassing all keywords
found within the comment dataset, which serves as our feature set for the classification framework. We
split the entire dataset into two distinct subsets: a training set comprising 90% of the data and a test set
containing the remaining 10%. Features are extracted from the training data using the aforementioned
methods. The TF-IDF matrix along with corresponding class levels are then fed into machine learning
models for training. These models will subsequently classify each code-comment pair into one of the
two categories. In the following subsections, we will briefly discuss overview of the three classification
models employed in our study.</p>
      <sec id="sec-4-1">
        <title>4.1. Logistic Regression</title>
        <p>For our binary comment classification task, we employ Logistic Regression. To ensure that the regression
output remains within the range of 0 to 1, we utilize a logistic function defined as follows:
(1)
(2)
(3)
 =  + 
() =</p>
        <p>1
1 + (− )</p>
        <p>In this process, the output from the linear regression equation (see the first equation) is fed into the
logistic function (see the second equation). The logistic function generates a probability value, which
we use to determine the binary class based on a specified acceptance threshold. We set this threshold at
0.6 to favor classifying comments as "useful."</p>
        <p>From each training example, we extract a three-dimensional input feature that is then processed
through the regression function. During training, we apply the Cross-Entropy Loss function for
hyperparameter tuning in the Logistic Regression model.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Support Vector Machine</title>
        <p>Next, we implement a Support Vector Machine (SVM) model for our binary classification task. The
classification process begins with the output of a linear function. If this output exceeds 1, we categorize
it as one class; conversely, if the output falls below -1, we assign it to the other class. To train the SVM
model, we utilize the Hinge Loss function, defined as follows:
(, , ) =</p>
        <p>0   *  ≥ 1
= 1 −  * , ℎ</p>
        <p>This loss function indicates that there is no cost when the predicted and actual values align (i.e.,
they have the same sign). However, when they difer in sign, we compute a loss value. The Hinge Loss
function plays a crucial role in fine-tuning the hyperparameters of the Support Vector Machine model.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Multinomial Naive Bayes</title>
        <p>We also employ the Multinomial Naïve Bayes model, which is particularly efective for text classification
tasks, to perform our binary classification. This model uses the Bayes’ theorem mentioned as below:
(b) Enter Caption
where,
 (|) is the posterior probability of class y given features X.
 (|) is the likelihood, which indicates the probability of observing the features X under class y.
 () is the prior probability of class y.
 () The overall probability of observing the features X, serving as a normalization factor.</p>
        <p>The Multinomial Naïve Bayes model operates under the assumption that each feature is conditionally
independent of others when conditioned on a particular class.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The implementation of our task was carried out on a system equipped with an Intel i7 processor and 32
GB of RAM. As outlined in Section 3, the entire process comprises three stages. We began by splitting
the seed dataset into two segments: 90% for training and 10% for validation. We then trained three
classification models—logistic regression, support vector machine, and multinomial naive Bayes—using
the same training dataset. For testing, we utilized a dataset containing 1,001 instances, which were
labeled as 719 instances of not useful and 282 instances of useful. Upon evaluation, the models achieved
overall accuracies of 83%, 84.12%, and 51%, respectively. The confusion matrices for these models can
be found in Figure1. Notably, the naive Bayes algorithm struggled to accurately classify not useful
instances, leading to a lower accuracy compared to the other two models.</p>
      <p>To enhance our dataset, we incorporated additional data generated by a large language model, which
included 309 useful samples and 25 not useful samples. The complete dataset was then split again into
training and validation subsets. We retrained the same classification models with this newly formed
dataset and tested them using the original test data. The updated models yielded overall accuracies of
83.92%, 83.92%, and 50%, respectively. The individual confusion matrices for these models are shown in
Figure 2. It is evident that the models trained on the augmented dataset experienced a slight decline
in accuracy compared to those trained on the seed data. This indicates that the introduction of data
from large language models contributed some noise to the original seed data, negatively impacting the
overall accuracy of all three models. This noise stemmed from the limitations of the large language
model used, specifically ChatGPT-4 in this instance. Nonetheless, we can assert that the augmented
dataset remains well-balanced for training machine learning models.
(b) Support vector machine</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this study, we presented the development of three binary classification models aimed at categorizing
code and comment pairs. These models were trained using a seed dataset containing two distinct
classes, where all source code was written in C and the comments were in English. We employed
tokenization and TF-IDF vectorization to transform the comments into numerical vectors, which served
as the features for our classification frameworks.</p>
      <p>Additionally, we enhanced the initial seed dataset by incorporating new data sourced from online
platforms. This newly collected data was classified into the same two categories using a large language
model, specifically ChatGPT. Upon retraining the classification models with the augmented dataset, we
noted a slight decline in accuracy compared to the models trained solely on the seed data. This suggests
that the inclusion of noise from the large language model-generated dataset may have impacted the
performance.</p>
      <p>Furthermore, we conducted a comparative analysis of the results from both datasets. Our findings
indicate that the biases and noise present in the augmented dataset contributed to the reduced accuracy
of our classification models.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Grammar and spelling
check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.
[5] T. Tenny, Program readability: Procedures versus comments, IEEE Transactions on Software</p>
      <p>Engineering 14 (1988) 1271.
[6] H. Yu, B. Li, P. Wang, D. Jia, Y. Wang, Source code comments quality assessment method based on
aggregation of classification algorithms, Journal of Computer Applications 36 (2016) 3448.
[7] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of
comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022)
e2463.
[8] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine—a semantic search approach to
program comprehension from code comments, in: Advanced Computing and Systems for Security,
Springer, 2020, pp. 29–42.
[9] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Overview
of the irse track at fire 2022: Information retrieval in software engineering., in: FIRE (Working
Notes), 2022, pp. 1–9.
[10] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can
we predict useful comments in source codes?-analysis of findings from information retrieval in
software engineering track@ fire 2022, in: Proceedings of the 14th Annual Meeting of the Forum
for Information Retrieval Evaluation, 2022, pp. 15–17.
[11] S. Majumdar, P. P. Das, Smart knowledge transfer using google-like search, arXiv preprint
arXiv:2308.06653 (2023).
[12] P. Chakraborty, S. Dutta, D. K. Sanyal, S. Majumdar, P. P. Das, Bringing order to chaos:
Conceptualizing a personal research knowledge graph for scientists., IEEE Data Eng. Bull. 46 (2023)
43–56.
[13] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information
processing systems 33 (2020) 1877–1901.
[14] S. Paul, S. Majumdar, R. Shah, S. Das, M. Ghosh, D. Ganguly, G. Calikli, D. Sanyal, P. P. Das,
P. D Clough, A. Bandyopadhyay, S. Chattopadhyay, Generative ai for code metadata quality
assessment, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval
Evaluation, 2024.
[15] S. Paul, S. Majumdar, R. Shah, S. Das, M. Ghosh, D. Ganguly, G. Calikli, D. Sanyal, P. P. Das,
P. D Clough, A. Bandyopadhyay, S. Chattopadhyay, Overview of the irse track at fire 2024:
Information retrieval in software engineering, in: FIRE (Working Notes), 2024.
[16] S. Paul, S. Majumdar, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. Das, P. D. Clough, P.
Majumder, Eficiency of large language models to scale up ground truth: Overview of the irse track
at forum for information retrieval 2023, in: Proceedings of the 15th Annual Meeting of the Forum
for Information Retrieval Evaluation, 2023, pp. 16–18.
[17] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough,
P. Majumder, Generative ai for software metadata: Overview of the information retrieval in
software engineering track at fire 2023, arXiv preprint arXiv:2311.03374 (2023).
[18] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An efective low-dimensional
software code representation using bert and elmo, in: 2022 IEEE 22nd International Conference
on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Oman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hagemeister</surname>
          </string-name>
          ,
          <article-title>Metrics for assessing a software system's maintainability</article-title>
          ,
          <source>in: Proceedings Conference on Software Maintenance</source>
          <year>1992</year>
          , IEEE Computer Society,
          <year>1992</year>
          , pp.
          <fpage>337</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wursch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Gall</surname>
          </string-name>
          ,
          <article-title>Do code and comments co-evolve? on the relation between source code and comment changes</article-title>
          ,
          <source>in: 14th Working Conference on Reverse Engineering (WCRE</source>
          <year>2007</year>
          ), IEEE,
          <year>2007</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Deissenboeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pizka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Teuchert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Girard</surname>
          </string-name>
          ,
          <article-title>An activity-based quality model for maintainability</article-title>
          ,
          <source>in: 2007 IEEE International Conference on Software Maintenance, IEEE</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>184</fpage>
          -
          <lpage>193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Storey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ryall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. I.</given-names>
            <surname>Bull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singer</surname>
          </string-name>
          , Todo or to bug,
          <source>in: 2008 ACM/IEEE 30th International Conference on Software Engineering</source>
          , IEEE,
          <year>2008</year>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>260</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>