=Paper=
{{Paper
|id=Vol-3681/T7-12
|storemode=property
|title=Source Code Comment Classification using Naive Bayes and Support Vector Machine
|pdfUrl=https://ceur-ws.org/Vol-3681/T7-12.pdf
|volume=Vol-3681
|authors=Raj Jitendra Shah
|dblpUrl=https://dblp.org/rec/conf/fire/Shah23
}}
==Source Code Comment Classification using Naive Bayes and Support Vector Machine==
Raj Shah, Indian Institute of Technology Goa, Goa-403401 (raj2100789@gmail.com)

Forum for Information Retrieval Evaluation, December 15-18, 2023, India

Abstract

This paper proposes a framework for source code comment classification, which classifies a comment based on its usefulness within the source code. This qualitative classification assists new developers in correctly comprehending the source code. We implement two binary classification mechanisms for source code comments based on two machine learning models: Naive Bayes and Support Vector Machine. Each classifier assigns a comment to one of two categories: useful or not useful. Before training both models, we extract comment features such as comment length, the position of the comment within the source code, and the significant word ratio. We use a dataset of over 9000 instances of source code written in the C language. The two models achieve F1-scores of 0.632 and 0.765, respectively.

Keywords: Naive Bayes, Support Vector Machine, Comment classification, Qualitative analysis

1. Introduction

In software development, code comments play a pivotal role in enhancing code understanding and reusability. Our research centers on the critical evaluation of comment quality, emphasizing clarity and the avoidance of redundancy. Through our work, we endeavor to elevate overall code quality and boost developer productivity by filtering and prioritizing useful comments, ultimately decluttering the codebase and streamlining the development process.

In our paper, we introduce a binary classification algorithm tailored to C language source code comments, categorizing them as either "Useful" or "Not Useful." Leveraging Naive Bayes and Support Vector Machine (SVM) techniques, we analyze over 8,000 training samples and 1,000 test samples. We extract structural features, including comment length, position within the source code, and significant word ratio [1], to train our models. For SVM, we employ the hinge loss function with a linear kernel. In contrast, Multinomial Naive Bayes uses a probabilistic model that calculates class probabilities based on the training data, selecting the most probable class as the prediction. The two models achieve an average F1-score of 69.85%, showcasing their effectiveness.

Furthermore, to bolster our model's capabilities, we employed ChatGPT to augment our dataset. This approach involved generating data labels for an additional 500 instances of code-comment pairs, which we extracted from open-source GitHub C code repositories. The extraction process was carried out systematically using regular expressions, allowing us to obtain these code-comment pairs. To evaluate the enhanced performance of our model, we leveraged a Large Language Model (LLM), specifically the gpt-3.5-turbo model, which played a pivotal role in generating labels for our data. We classified each code-comment pair as either "Useful" or "Not Useful" with the help of this LLM. This categorization was essential for assessing the effectiveness of our model in understanding and contextualizing code comments.
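As an illustration of this pipeline, the sketch below extracts code-comment pairs from a C source file with a regular expression and asks gpt-3.5-turbo for a label. This is a minimal reconstruction, not the exact scripts used in this work: the regular expression, the prompt wording, and the helper names are all assumptions, and the pre-1.0 openai Python package is assumed for the API call.

```python
import re
import openai  # assumes the pre-1.0 openai package and OPENAI_API_KEY set

# Illustrative pattern: a C block comment followed by up to five lines of code.
PAIR_PATTERN = re.compile(r"(/\*.*?\*/)\s*((?:[^\n]*\n){1,5})", re.DOTALL)

def extract_pairs(c_source: str):
    """Return (comment, code snippet) pairs found in a C source string."""
    return PAIR_PATTERN.findall(c_source)

def label_pair(comment: str, code: str) -> str:
    """Ask the LLM whether the comment is Useful or Not Useful for the code."""
    prompt = (
        "Classify the following C code comment as 'Useful' or 'Not Useful' "
        "for understanding the code. Answer with exactly one label.\n"
        f"Comment: {comment}\nCode:\n{code}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels for dataset generation
    )
    return response["choices"][0]["message"]["content"].strip()
```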
The integration of ChatGPT not only expanded our dataset but also contributed to an increase in accuracy, to 70.84%, further validating the robustness of our comment classification model. This augmentation process underscores the potential of combining state-of-the-art language models with machine learning techniques to enhance the performance and versatility of software development tools. While alternative LLMs are available, such as Bard, LLaMA, and several others, we chose the OpenAI LLM for two main reasons: cost efficiency, as it offers a cost-effective solution for our research needs, and the accessibility of its API keys, which facilitated our research efforts compared to the alternatives.

The rest of the paper is organized as follows. Section 2 discusses the background work done in the domain of comment classification. The task and dataset are described in Section 3. We discuss the proposed method in Section 4. Results are addressed in Section 5. Section 6 concludes the paper.

2. Related Work

Software metadata is integral to code maintenance and subsequent comprehension. A significant number of tools [2, 3, 4, 5, 6, 7] have been proposed to aid in extracting knowledge from software metadata [8] such as runtime traces or structural attributes of code. In terms of mining code comments and assessing their quality, the authors of [9, 10, 11, 12, 13, 14] compare the similarity of words in code-comment pairs using the Levenshtein distance and the length of comments to filter out trivial and non-informative comments. Rahman et al. [15] detect useful and non-useful code review comments (logged in review portals) based on attributes identified from a survey conducted with developers at Microsoft [16]. Majumdar et al. [17, 18] proposed a framework to evaluate comments based on concepts that are relevant for code comprehension. They developed textual and code correlation features using a knowledge graph for semantic interpretation of the information contained in comments. These approaches use semantic and structural features to set up a prediction problem for useful and not useful comments that can subsequently be integrated into the process of decluttering codebases.

With the advent of large language models [19], it is important to compare the quality assessment of code comments by standard models like GPT-3.5 or LLaMA with human interpretation. The IRSE track at FIRE 2023 [1] extends the approach proposed in [17] to explore various vector space models [20] and features for binary classification and evaluation of comments in the context of their use in understanding the code. The track also compares the performance of the prediction model with the inclusion of GPT-generated labels for the quality of code and comment snippets extracted from open-source software.

Our previous experience with LLMs involved their application in generating data related to voting processes and project selections. Beyond label generation, we have also employed machine learning models for various classification tasks, encompassing binary, multiclass, and multilabel classification, utilizing the extensive dataset at our disposal.
3. Task and Dataset Description

In this section, we describe the task addressed in this paper. We aim to implement a binary classification system that classifies source code comments as useful or not useful. The procedure takes a code comment with its associated lines of code as input. The output is a label, useful or not useful, for the corresponding comment, which helps developers comprehend the associated code. Classical machine learning algorithms, Naive Bayes and SVM, are used to develop the classification system. The two classes of source code comments can be described as follows:

• Useful - The given comment is relevant to the corresponding source code.
• Not Useful - The given comment is not relevant to the corresponding source code.

A dataset consisting of over 9000 code-comment pairs written in the C language is used in our work. Each data instance consists of the comment text, a surrounding code snippet, and a label that specifies whether the comment is useful or not. The whole dataset was collected from GitHub and annotated by a team of 14 annotators. A sample of the seed data is illustrated in Table 1. The development dataset consists of 8000 instances, and the test dataset consists of 1000 instances. The models are trained again using an additional dataset of 500 instances generated using ChatGPT. A sample of the LLM-generated data is illustrated in Table 2.

Table 1: Sample seed data instances. Code excerpts are truncated as in the source; in the original layout each code line is numbered by its position relative to the comment.

# | Comment | Code (excerpt) | Label
1 | /*test 529*/ | int res = 0; CURL *curl = NULL; FILE *hd_src = NULL; int hd; struct_stat file_info; CURLM *m = NULL; int running; start_test_timing(); if(!libtest_arg2) { #ifdef LIB529 fprin | Not Useful
2 | /*cr to cr,nul*/ | else newline = 0; } else { if(test->rcount) { c = test->rptr[0]; test->rptr++; test->rcount--; } else break; | Not Useful
3 | /*convert minor status code (underlying routine error) to text*/ | break; } gss_release_buffer(&min_stat, &status_string); } if(sizeof(buf) > len + 3) { strcpy(buf + len, ".\n"); len += 2; } msg_ctx = 0; while(!msg_ctx) { | Useful

Table 2: Sample LLM-generated (ChatGPT) data instances.

# | Comment | Code (excerpt) | Label
1 | // Turns on test of xor function | { "nosombrero", NULL, NULL, &nosombrero, PL_OPT_BOOL, "nosombrero", "No sombrero plot" }, | Not Useful
2 | // flip image up-down | free( img ); *width = w; *height = h; *img_f = imf; return 0; } | Not Useful
3 | // save the plot | if ( f_name ) save_plot( f_name ); } plFree2dGrid( z, XDIM, YDIM ); | Useful

4. Working Principle

We try two machine learning models, Multinomial Naive Bayes and Support Vector Machine (SVM), to implement the binary classification functionality. The system takes comments as well as the surrounding code snippets as input. We extract features such as comment length, the position of the comment within the source code, and the significant word ratio [17] from the given input. The output of the feature extraction process is used to train both machine learning models. The training dataset consists of 8591 data instances (including the additional 500 data instances generated using ChatGPT) along with their labels. Among them, 3791 data instances are labeled as not useful and 4796 data instances are marked as useful. Each model is described in the following subsections; the feature extraction step is sketched below.
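The following is a minimal sketch of how the three structural features might be computed. The exact feature definitions (for example, whether length is counted in words or characters, and the word list behind the significant word ratio [17]) are not spelled out here, so the choices below are assumptions for illustration.

```python
import re

# Illustrative stop-word list; the actual list behind the significant
# word ratio in [17] is an assumption here.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on"}

def extract_features(comment: str, comment_line: int, total_lines: int):
    """Return [length, position, significant word ratio] for one comment."""
    words = re.findall(r"[A-Za-z]+", comment.lower())

    # Feature 1: comment length, measured here as a word count.
    length = len(words)

    # Feature 2: relative position of the comment within the source file.
    position = comment_line / total_lines if total_lines else 0.0

    # Feature 3: fraction of words that are not stop words.
    ratio = sum(w not in STOP_WORDS for w in words) / length if length else 0.0

    return [length, position, ratio]

# Example: a short comment on line 42 of a 100-line file.
print(extract_features("/* convert minor status code to text */", 42, 100))
```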
4.1. Multinomial Naive Bayes

Multinomial Naive Bayes is a classification algorithm used for text data. It assumes that features (e.g., word counts) are conditionally independent given the class label. In the probability model, we want to find the class label $C_i$ that maximizes the posterior probability $P(C_i|D)$, where $D$ is a comment:

$$P(C_i|D) = \frac{P(C_i) \cdot P(D|C_i)}{P(D)}$$

In text classification, we use a multinomial distribution for the likelihood model:

$$P(D|C_i) = \prod_{j=1}^{n} P(X_j|C_i)$$

where $X_j$ is the count of each term in the comment. To classify a comment, we calculate

$$\operatorname{argmax}_{C_i} \left( \log P(C_i) + \sum_{j=1}^{n} \log P(X_j|C_i) \right)$$

for each class and choose the class with the highest score. The smoothing parameter ($\alpha$) and the class prior probabilities ($P(C_i)$) are tuned during training. For binary classification, we set a threshold on the posterior probabilities. Multinomial Naive Bayes is effective for text classification tasks, especially when features represent word counts.

4.2. Support Vector Machine

We have incorporated a Support Vector Machine (SVM) model into our binary classification task. The decision boundary is a linear function of the input; instances whose output exceeds 1 are assigned to one class, while those with an output below -1 are categorized into the other class. The SVM model is trained with the hinge loss function shown in Equation 1:

$$H(x, y, Z) = \begin{cases} 0 & \text{if } y \cdot Z \geq 1 \\ 1 - y \cdot Z & \text{otherwise} \end{cases} \quad (1)$$

The loss is 0 if the predicted and actual values have the same sign and the margin is at least 1; otherwise, the loss grows with the margin violation. The hinge loss function is used during hyper-parameter tuning of the SVM model. A combined training sketch for both models is shown below.
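The following is a minimal training sketch using scikit-learn, under the assumption that the three structural features have already been extracted into a feature matrix. The hinge-loss linear SVM is realized here with SGDClassifier(loss="hinge"), and MultinomialNB's alpha corresponds to the smoothing parameter above; this is an illustrative reconstruction, not the authors' exact setup.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Rows of [comment_length, relative_position, significant_word_ratio];
# toy values below, whereas the paper uses 8591 labeled instances.
X = [[12, 0.10, 0.75], [3, 0.90, 0.33], [25, 0.45, 0.80], [2, 0.05, 0.50],
     [18, 0.30, 0.70], [4, 0.80, 0.25], [20, 0.60, 0.85], [1, 0.95, 0.00]]
y = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = Useful, 0 = Not Useful

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Multinomial Naive Bayes; alpha is the smoothing parameter tuned in training.
# (Features must be non-negative, which holds for all three features here.)
nb = MultinomialNB(alpha=1.0).fit(X_train, y_train)

# Linear SVM trained with the hinge loss of Equation 1.
svm = SGDClassifier(loss="hinge", random_state=0).fit(X_train, y_train)

for name, model in (("Naive Bayes", nb), ("SVM", svm)):
    print(name, "F1-score:", f1_score(y_test, model.predict(X_test)))
```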
5. Results

We train both models on a system with an Intel i5 processor and 16 GB RAM, and we test both models using the test dataset. The test dataset consists of 1001 data instances, among which 719 are labeled as not useful and 282 are marked as useful. Our Multinomial Naive Bayes model achieves an F1-score of 0.632 without the ChatGPT-generated data and 0.640 with the augmented model. Similarly, the SVM model achieves F1-scores of 0.765 and 0.776, respectively. Both models achieve high recall values, 0.652 and 0.763 without the LLM-generated data and 0.659 and 0.774 with the augmented model, which shows that both models are good at correctly identifying useful comments. Both models achieve lower precision, 0.608 and 0.763 respectively, compared to their recall. Apart from this, our models do not use any qualitative features, which may be important for understanding the usefulness of a comment within source code. Using such qualitative features may increase the overall accuracy of the binary classification.

Figure 1: F1-score vs. model.

The reported metrics can be computed from a model's predictions as sketched below.
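Precision, recall, and F1-score as reported above can be obtained with standard scikit-learn calls; the labels below are toy values for illustration only.

```python
from sklearn.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

# Toy ground truth and predictions (1 = Useful, 0 = Not Useful); in the paper
# these come from the 1001-instance test set (719 not useful, 282 useful).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=["Not Useful", "Useful"]))
```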
6. Conclusion

This paper has addressed a binary classification problem in the domain of source code comment classification. The classification has been done based on the usefulness of the comment present within source code written in the C language. We have used two machine learning models, Multinomial Naive Bayes and Support Vector Machine, to implement the binary classification task. We extracted three structural features from each data instance: the length of the comment, the position of the comment within the source code, and the significant word ratio. Both models have been trained using a training dataset with more than 8,000 data instances; hinge loss has been used during hyper-parameter tuning for the SVM model. The models are tested on a test dataset of 1000 data instances. The Multinomial Naive Bayes model achieved an F1-score of 0.632 (without ChatGPT-generated data) and 0.640 (with the augmented model), and the SVM model achieves F1-scores of 0.765 and 0.776, respectively. Currently, we use only structural features for the classification task, which may not be sufficient for a qualitative analysis of source code comments. In the future, we will use qualitative features of the comments, which may increase the accuracy of the comment classification task.

References

[1] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Generative AI for software metadata: Overview of the Information Retrieval in Software Engineering track at FIRE 2023, in: Forum for Information Retrieval Evaluation, ACM, 2023.
[2] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, SmartKT: A search framework to assist program comprehension using smart knowledge transfer, in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2019, pp. 97–108.
[3] N. Chatterjee, S. Majumdar, S. R. Sahoo, P. P. Das, Debugging multi-threaded applications using Pin-augmented GDB (PGDB), in: International Conference on Software Engineering Research and Practice (SERP), Springer, 2015, pp. 109–115.
[4] S. Majumdar, N. Chatterjee, S. R. Sahoo, P. P. Das, D-Cube: Tool for dynamic design discovery from multi-threaded applications using Pin, in: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2016, pp. 25–32.
[5] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers, Innovations in Systems and Software Engineering 17 (2021) 289–307.
[6] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, Dcube_nn: Tool for dynamic design discovery from multi-threaded applications using neural sequence models, Advanced Computing and Systems for Security: Volume 14 (2021) 75–92.
[7] J. Siegmund, N. Peitek, C. Parnin, S. Apel, J. Hofmeister, C. Kästner, A. Begel, A. Bethmann, A. Brechmann, Measuring neural efficiency of program comprehension, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 140–150.
[8] S. C. B. de Souza, N. Anquetil, K. M. de Oliveira, A study of the documentation essential to software maintenance, in: Conference on Design of Communication, ACM, 2005, pp. 68–75.
[9] L. Tan, D. Yuan, Y. Zhou, HotComments: How to make program comments more useful?, in: Conference on Programming Language Design and Implementation (SIGPLAN), ACM, 2007, pp. 20–27.
[10] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, CodeT5+: Open code large language models for code understanding and generation, arXiv preprint arXiv:2305.07922 (2023).
[11] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.
[12] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can we predict useful comments in source codes? Analysis of findings from Information Retrieval in Software Engineering track @ FIRE 2022, in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.
[13] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Overview of the IRSE track at FIRE 2022: Information Retrieval in Software Engineering, in: Forum for Information Retrieval Evaluation, ACM, 2022.
[14] J. L. Freitas, D. da Cruz, P. R. Henriques, A comment analysis approach for program comprehension, in: Annual Software Engineering Workshop (SEW), IEEE, 2012, pp. 11–20.
[15] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual features and developer experience, in: International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226.
[16] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at Microsoft, in: Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156.
[17] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463.
[18] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-Mine: A semantic search approach to program comprehension from code comments, in: Advanced Computing and Systems for Security, Springer, 2020, pp. 29–42.
[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[20] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional software code representation using BERT and ELMo, in: 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.