Binary Classification of Source Code Comments using Machine Learning Models

Lisa Sarkar
Indian Institute of Technology Kharagpur, West Bengal, 721302

Abstract
This paper reports a detailed analysis of the viability of a classification framework that classifies a comment based on its usefulness within the source code. Such classification helps new developers comprehend source code correctly. Three machine learning models, namely logistic regression, support vector machine, and multinomial naive Bayes, are trained on an initial dataset called the seed dataset. Each comment is classified into one of two categories, useful and not useful. Accuracies of 82.92%, 83.92%, and 50.75%, respectively, are achieved from the initial training of the three models. The dataset is then augmented with a new set of data extracted from several online resources; the class labels for the new set are generated using the ChatGPT large language model (LLM). The augmented dataset is then used to retrain the three machine learning models. It is observed that, on the augmented dataset, the accuracy does not improve for any model and drops for naive Bayes, owing to noise and bias introduced by the LLM-generated labels.

Keywords
Logistic regression, Support vector machine, Comment classification, Qualitative analysis

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
* Corresponding author: lisasarkar11@gmail.com (L. Sarkar), ORCID 0009-0006-3456-2409
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Software is emerging as the backbone of modern technology, enabling many promising applications through its integration into electronics and appliances. It simplifies the challenges of our daily life in all aspects; for example, GPS software facilitates driving from one place to another. Constant modification of existing software and building of new software are the keys to improving software functionality, which leads to steady growth of the source code. Maintaining this large amount of source code is a crucial phase of the Software Development Life Cycle (SDLC). In most cases, developers face numerous challenges during source code maintenance, including comprehending a large code base in a short period of time, outdated or incomplete documentation, and unavailability of knowledge from previous developers, to name a few. Such scenarios can be tackled by following a systematic process flow. New developers generally have the source code, sample test cases, requirement documents, and a debugger at hand to implement new functionality. To modify the code further, the developer must understand the existing source code, so they repeatedly run the current application on the sample test cases to identify execution patterns, understand the design, and comprehend the program. But this whole process is time-consuming, effort-intensive, monotonous, and sometimes unmanageable. To overcome these bottlenecks, developers often follow shortcut methods, which further introduce errors that are difficult to filter out. This results in the degradation of software quality and developer efficiency.
These types of situations demand a systematic, quality-controlled development process for ease of use by developers. Program comprehension is one such process for maintaining existing source code in a better way. This reverse engineering process is beneficial for reuse, inspection, maintenance, and many other activities in the context of software engineering [1]. Inserting comments within a program is essential for better understanding of the source code. Many developers work on the same code base and describe it differently when inserting comments. Inconsistent descriptions of the same code base decrease the program's readability. Therefore, a standardized procedure for writing code and comments is imperative in order to enhance readability. Still, this approach is not effective for understanding code that has already been written. An intuitive understanding of source code comments may be a good way to address the readability problem. In recent years, researchers have been exploring this domain to develop new applications aimed at enhancing the efficiency of new programmers through understanding of existing code.

In this paper, we present a classification framework applied to a dataset of code and comment pairs written in the C language. The work is done in three stages: training the classification framework on a seed dataset, augmenting the dataset using a large language model, and retraining the classification framework on the augmented dataset. In the first stage, the framework takes a code-comment pair as input and classifies it into one of two classes, Useful and Not Useful. Logistic regression, support vector machine (SVM), and multinomial naive Bayes techniques are employed for comment classification. A training dataset of 9000 samples and a test dataset of 1001 samples are used for this purpose. The models are validated using a five-fold cross-validation process. We employ a linear kernel for the SVM and L2 regularization for logistic regression. In the next stage, another set of code-comment pairs is gathered from online sources such as GitHub. The ChatGPT-4 large language model is then used to categorize the newly gathered code-comment pairs into the two classes, useful and not useful. This generated dataset is merged with the previous seed dataset, and the resulting dataset is used to retrain the classification frameworks. The retrained models show no improvement, and the naive Bayes model shows a small reduction in F1 score and accuracy, which can be attributed to noise included in the newly generated dataset.

The rest of the paper is organized as follows. Section 2 discusses background work in the domain of comment classification. Section 3 describes the task and the dataset. We discuss the proposed method in Section 4. Results are presented in Section 5. Section 6 concludes the paper.

2. Related Work

Software metadata [2] plays a crucial role in the maintenance of code and its subsequent understanding. Numerous tools have been developed to assist in extracting knowledge from software metadata, which includes runtime traces and structural attributes of code [3, 4, 5, 6, 7, 8, 9, 10, 11]. In the realm of mining code comments and assessing their quality, several authors have conducted research. Steidl et al. [12] employ techniques such as Levenshtein distance and comment length to gauge the similarity of words in code-comment pairs, effectively filtering out trivial and non-informative comments.
Rahman et al. [13] focus on distinguishing useful from non-useful code review comments within review portals, drawing insights from attributes identified in a survey conducted with Microsoft developers [14]. Majumdar et al. [15, 16, 17, 18] have introduced a framework for evaluating comments based on concepts crucial for code comprehension. Their approach involves the development of textual and code correlation features, utilizing a knowledge graph to semantically interpret the information within comments. These approaches employ both semantic and structural features to address the prediction problem of distinguishing useful from non-useful comments, ultimately contributing to the process of decluttering codebases.

In light of the emergence of large language models such as GPT-3.5 or LLaMA [19], it becomes crucial to assess the quality of code comments and compare them to human interpretation. The IRSE track at FIRE 2023 [20] expands upon the approach presented in a prior work [15]. It explores various vector space models [21] and features for binary classification and evaluation of comments, specifically in the context of their role in comprehending code. Furthermore, this track conducts a comparative analysis of the prediction model's performance when GPT-generated labels for code and comment quality, extracted from open-source software, are included.

3. Task and Dataset Description

The task of implementing the binary classification framework is accomplished in three consecutive steps: designing the classification framework with a seed dataset, augmenting the seed dataset using a large language model, and retraining the same framework on the new augmented dataset. Source code and comment pairs are classified into two classes, useful and not useful, using the trained framework. The procedure takes a comment description with the associated lines of code as input and generates a label, useful or not useful, for each code-comment pair. The classification system was developed using classical machine learning models such as logistic regression, naive Bayes, and SVM.

• Useful - The specified comment is appropriate for the corresponding source code.
• Not Useful - The specified comment is not appropriate for the corresponding source code.

The seed dataset has 9000 code-comment pairs written in the C language. Each instance contains the comment text, the surrounding code snippet, and a label that describes its usefulness. The whole dataset is gathered from GitHub and annotated by a group of 14 annotators. Sample instances from the dataset are presented in Table 1.

Table 1: Sample data instances from the seed dataset

#1 - Comment: /*enable verbose*/ - Label: Not Useful
  Code:
    -1. test_setopt(curl, CURLOPT_UPLOAD, 1L);
     1. test_setopt(curl, CURLOPT_VERBOSE, 1L);

#2 - Comment: /*cr to cr,nul*/ - Label: Not Useful
  Code:
    -1. else /*cr to cr,nul*/
     1.   newline = 0;
     2. }
     3. else {
     4.   if(test->rcount) {
     5.     c = test->rptr[0];
     6.     test->rptr++;
     7.     test->rcount--;
     8.   }
     9.   else
    10.     break;

#3 - Comment: /*See if this is an UIDVALIDITY response*/ - Label: Useful
  Code:
    -1. if(imapcode == '*') {
     1.   char tmp[20];
     2.   if(sscanf(line + 2, "OK [UIDVALIDITY %19[0123456789]]", tmp) == 1) {
     3.     Curl_safefree(imapc->mailbox_uidvalidity);
     4.     imapc->mailbox_uidvalidity = strdup(tmp); } } else if(imapcode == IMAP_RESP_OK) {

Another set of code-comment pairs is collected from different online resources and merged with the above-mentioned dataset. This set of code-comment pairs is categorized into the two above-mentioned classes using a large language model; a sketch of how such labeling could be scripted is shown below. The newly generated dataset is then added to the seed dataset, and the classification models are retrained on this augmented dataset in order to understand the effect of augmentation. Different factors that cause the change in accuracy when training with the augmented dataset, including noise inclusion and dataset distribution, are analyzed.
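The paper does not publish the exact prompt or tooling used for the ChatGPT labeling step; the following is a minimal sketch of how such labeling could be scripted with the OpenAI Python client. The prompt wording, model name, and file names are illustrative assumptions, not the authors' actual setup.

# Sketch of LLM-assisted labeling of scraped code-comment pairs.
# Assumes the OpenAI Python client (openai >= 1.0) and OPENAI_API_KEY in the
# environment; the prompt and file names below are hypothetical.
import csv
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are given a C code snippet and an associated comment.\n"
    "Answer with exactly one word: Useful if the comment helps a new "
    "developer comprehend the code, otherwise Not-Useful.\n\n"
    "Comment: {comment}\nCode:\n{code}"
)

def label_pair(comment: str, code: str) -> str:
    """Ask the LLM to classify one code-comment pair."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT.format(comment=comment, code=code)}],
        temperature=0,  # deterministic labels
    )
    answer = response.choices[0].message.content.strip()
    return "Useful" if answer.lower().startswith("useful") else "Not Useful"

# Label newly scraped pairs and append them to the seed data (hypothetical files,
# each row holding a comment and its surrounding code snippet).
with open("scraped_pairs.csv") as src, \
     open("augmented_dataset.csv", "a", newline="") as dst:
    writer = csv.writer(dst)
    for comment, code in csv.reader(src):
        writer.writerow([comment, code, label_pair(comment, code)])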
4. Working Principle

A binary classification system was implemented with the help of three machine learning models: logistic regression, support vector machine, and multinomial naive Bayes. The system takes both the source code and the corresponding comments as input. Considering the task criteria, we did not use any deep learning frameworks in our classification system. The comments are first tokenized, and each token is normalized using an English word lemmatizer. The resulting tokens are then vectorized using a TF-IDF vectorizer. The TF-IDF matrix generated in the vectorization step, along with the class labels, is fed as features into the classification models. These models are trained on the primary seed data and tested on the test dataset. Variability arising from the training split is controlled using five-fold cross-validation. We briefly describe each machine learning model in the subsequent subsections; a sketch of the full pipeline is given at the end of this section.

4.1. Logistic Regression

We use logistic regression for the binary comment classification task, where a logistic function keeps the output between 0 and 1. The functions are defined as follows:

$$Z = Ax + B \tag{1}$$

$$\mathrm{logistic}(Z) = \frac{1}{1 + \exp(-Z)} \tag{2}$$

Equation (1) is referred to as the linear regression equation, whose output Z is passed to the logistic function defined in equation (2). The binary class is predicted from the probability value generated by the logistic function, based on an acceptance threshold. The threshold is kept at 0.6, which is in favor of the useful comment class. A three-dimensional input feature is extracted from each training instance and passed to the regression function. During training, the cross-entropy loss function is used for hyperparameter tuning.

4.2. Support Vector Machine

In the next step, a support vector machine model is implemented for the binary classification task. Classification is done based on the output of the linear function (equation (1)): if the output is greater than 1, the instance is assigned to one class, and if the output is less than -1, it is assigned to the other class. We train the SVM model using the hinge loss function, defined as:

$$H(x, y, Z) = \begin{cases} 0 & \text{if } y \cdot Z \geq 1 \\ 1 - y \cdot Z & \text{otherwise} \end{cases} \tag{3}$$

As the loss function shows, the cost is 0 when the predicted and actual values have the same sign and the margin is satisfied; when the predicted and actual values have different signs, a loss of 1 - y·Z is incurred. The hinge loss function is used for hyperparameter tuning of the SVM model.

4.3. Multinomial Naive Bayes

The multinomial naive Bayes model is also used in this task, as it is well suited to text classification. This model uses Bayes' theorem:

$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)} \tag{4}$$

where P(y|X) is the posterior probability of class y given features X; P(X|y) is the likelihood, representing the probability of observing features X given class y; P(y) is the prior probability of class y; and P(X) is the probability of observing features X, which acts as a normalization constant. Multinomial naive Bayes operates on the assumption that the features are conditionally independent of one another given the class.
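The pipeline described in this section (lemmatization, TF-IDF vectorization, and five-fold cross-validation over the three models of Sections 4.1-4.3) can be sketched with scikit-learn and NLTK. This is a minimal illustration under our own choice of hyperparameters and a hypothetical dataset layout, not the authors' exact configuration.

# Sketch of the Section 4 pipeline using scikit-learn and NLTK.
# Requires: nltk.download("punkt"); nltk.download("wordnet").
# The CSV layout and hyperparameters are illustrative assumptions.
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

lemmatizer = WordNetLemmatizer()

def lemma_tokenize(text):
    """Tokenize a comment and lemmatize each token (Section 4 preprocessing)."""
    return [lemmatizer.lemmatize(tok.lower()) for tok in word_tokenize(text)]

# Hypothetical seed dataset: comment text plus a Useful / Not Useful label.
data = pd.read_csv("seed_dataset.csv")
X, y = data["comment"], (data["label"] == "Useful").astype(int)

models = {
    "logistic regression": LogisticRegression(penalty="l2", max_iter=1000),
    "support vector machine": LinearSVC(loss="hinge"),  # linear kernel, hinge loss
    "multinomial naive Bayes": MultinomialNB(),
}

for name, clf in models.items():
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(tokenizer=lemma_tokenize, token_pattern=None)),
        ("clf", clf),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)  # five-fold cross-validation
    print(f"{name}: mean CV accuracy = {scores.mean():.4f}")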
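Section 4.1's 0.6 acceptance threshold differs from scikit-learn's default 0.5 cutoff, so it would have to be applied manually to predicted probabilities. The paper states that the threshold favors the useful class; in the sketch below, which reuses the imports and lemma_tokenize from the previous listing, we assume it is applied to the useful-class probability, and the train/test split is our own hypothetical one.

# Applying the 0.6 acceptance threshold of Section 4.1 (sketch).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=0)
logit_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=lemma_tokenize, token_pattern=None)),
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
])
logit_pipe.fit(X_train, y_train)
probs = logit_pipe.predict_proba(X_test)[:, 1]  # probability of the useful class
y_pred = (probs >= 0.6).astype(int)             # predict "useful" only above 0.6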
5. Results

A system with an Intel i5 processor and 32 GB of RAM is used for the implementation. The whole task has three steps, as mentioned earlier. At first, the seed dataset is divided into two segments: training data (90%) and validation data (10%). The training data is used to train the three ML models: logistic regression, support vector machine, and multinomial naive Bayes. The test dataset contains 1001 instances, of which 719 are labeled not useful and 282 useful. All three models are tested on this test dataset, and overall accuracies of 82.92%, 83.92%, and 50.75% are achieved for logistic regression, support vector machine, and multinomial naive Bayes, respectively. The corresponding confusion matrices are shown in Figure 1. We notice that the naive Bayes model fails to predict the not useful class effectively, which degrades its overall accuracy.

[Figure 1: Confusion matrices for the classification models trained on the seed data. (a) Logistic regression, (b) support vector machine, (c) multinomial naive Bayes.]

Another dataset, consisting of 311 useful samples and 21 not useful samples, is generated using the large language model and merged with the seed data. This augmented dataset is again divided into training and validation parts, which are used to retrain the same classification models. The retrained models are tested on the same test data, and overall accuracies of 82.92%, 84.12%, and 50.05% are achieved for the three models, respectively. The confusion matrices for all three models are displayed in Figure 2.

[Figure 2: Confusion matrices for the classification models trained on the seed + LLM-generated data. (a) Logistic regression, (b) support vector machine, (c) multinomial naive Bayes.]

The evaluation results for all three models are summarized in Table 2.

Table 2: Experimental results of the three classification models on the test dataset

Dataset              Model                     Precision   Recall   F1-score   Accuracy (%)
Seed data            Logistic regression       0.8024      0.7878   0.7943     82.92
                     Support vector machine    0.8223      0.7994   0.8090     83.92
                     Naive Bayes               0.5785      0.5680   0.5035     50.75
Seed data +          Logistic regression       0.8057      0.7879   0.7955     82.92
LLM generated data   Support vector machine    0.8226      0.8017   0.8106     84.12
                     Naive Bayes               0.5812      0.5721   0.4981     50.05

Table 2 shows that training on the augmented dataset leaves logistic regression essentially unchanged, improves the support vector machine marginally, and slightly degrades the naive Bayes model compared to the seed dataset. The absence of any clear gain, and the degradation for naive Bayes, may be attributed to noise introduced into the training data by the large language model. This noise mainly stems from the imperfection of the large language model, ChatGPT-4 in our case. Still, we can argue that the augmented dataset remains suitable for machine learning model training, as it yields accuracy similar to that of the initial seed dataset. The reported metrics follow directly from the model predictions, as sketched below.
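The precision, recall, F1, and confusion-matrix values reported above can be computed from predictions such as y_pred from the earlier listing using scikit-learn's metrics module. The averaging scheme is not stated in the paper, so macro averaging is assumed in this sketch.

# Reproducing the Table 2 metrics and the Figure 1/2 confusion matrices (sketch).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1-score :", f1_score(y_test, y_pred, average="macro"))
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted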
6. Conclusion

This paper proposes a framework for source code comment classification that classifies a comment based on its usefulness within the source code. Three machine learning models, logistic regression, support vector machine, and multinomial naive Bayes, are implemented and trained on a seed dataset. These classifiers assign each comment to one of two categories, useful and not useful, and exhibit accuracies of 82.92%, 83.92%, and 50.75%, respectively. Subsequently, the seed dataset is augmented with a newly generated dataset gathered from online sources, with class labels generated by the ChatGPT large language model (LLM). The augmented dataset is then used to retrain all the models. It is observed that the augmented dataset does not improve the accuracy of any model and degrades the naive Bayes model, owing to noise and bias in the LLM-generated labels.

References

[1] M. Berón, P. R. Henriques, M. J. Varanda Pereira, R. Uzal, G. A. Montejano, A language processing tool for program comprehension, in: XII Congreso Argentino de Ciencias de la Computación, 2006.
[2] S. C. B. de Souza, N. Anquetil, K. M. de Oliveira, A study of the documentation essential to software maintenance, in: Conference on Design of Communication, ACM, 2005, pp. 68–75.
[3] L. Tan, D. Yuan, Y. Zhou, Hotcomments: how to make program comments more useful?, in: Conference on Programming Language Design and Implementation (SIGPLAN), ACM, 2007, pp. 20–27.
[4] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, SmartKT: a search framework to assist program comprehension using smart knowledge transfer, in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2019, pp. 97–108.
[5] N. Chatterjee, S. Majumdar, S. R. Sahoo, P. P. Das, Debugging multi-threaded applications using pin-augmented gdb (pgdb), in: International Conference on Software Engineering Research and Practice (SERP), Springer, 2015, pp. 109–115.
[6] S. Majumdar, N. Chatterjee, S. R. Sahoo, P. P. Das, D-Cube: tool for dynamic design discovery from multi-threaded applications using pin, in: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2016, pp. 25–32.
[7] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers, Innovations in Systems and Software Engineering 17 (2021) 289–307.
[8] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, DCube_NN: Tool for dynamic design discovery from multi-threaded applications using neural sequence models, Advanced Computing and Systems for Security: Volume 14 (2021) 75–92.
[9] J. Siegmund, N. Peitek, C. Parnin, S. Apel, J. Hofmeister, C. Kästner, A. Begel, A. Bethmann, A. Brechmann, Measuring neural efficiency of program comprehension, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 140–150.
[10] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, CodeT5+: Open code large language models for code understanding and generation, arXiv preprint arXiv:2305.07922 (2023).
[11] J. L. Freitas, D. da Cruz, P. R. Henriques, A comment analysis approach for program comprehension, in: Annual Software Engineering Workshop (SEW), IEEE, 2012, pp. 11–20.
[12] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.
[13] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual features and developer experience, in: International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226.
[14] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at Microsoft, in: Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156.
[15] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463.
[16] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-Mine: a semantic search approach to program comprehension from code comments, in: Advanced Computing and Systems for Security, Springer, 2020, pp. 29–42.
[17] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Overview of the IRSE track at FIRE 2022: Information retrieval in software engineering, in: Forum for Information Retrieval Evaluation, ACM, 2022.
[18] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can we predict useful comments in source codes? Analysis of findings from Information Retrieval in Software Engineering track @ FIRE 2022, in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.
[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[20] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Generative AI for software metadata: Overview of the Information Retrieval in Software Engineering track at FIRE 2023, in: Forum for Information Retrieval Evaluation, ACM, 2023.
[21] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional software code representation using BERT and ELMo, in: 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.