=Paper=
{{Paper
|id=Vol-3681/T7-11
|storemode=property
|title=Assessing the Utility of C Comments with SVM and Naïve Bayes Classifier
|pdfUrl=https://ceur-ws.org/Vol-3681/T7-11.pdf
|volume=Vol-3681
|authors=Aritra Mitra
|dblpUrl=https://dblp.org/rec/conf/fire/Mitra23
}}
==Assessing the Utility of C Comments with SVM and Naïve Bayes Classifier==
Aritra Mitra (aritramitra2002@gmail.com, https://cse.iitkgp.ac.in/~aritra.mitra/), Indian Institute of Technology, Kharagpur (IIT-KGP), West Bengal 721302, India

Abstract

Comments are very useful to the flow of code development. As writing code becomes increasingly commonplace, commenting is often a hassle for rookie coders, who frequently do not even regard commenting as part of the development process. This generally causes comment quality to degrade, and a considerable number of useless comments are found in such code. In these experiments, the usefulness of C comments is evaluated using a Support Vector Machine (SVM) and a Naïve Bayes classifier. The results establish a baseline that future research can improve upon; building on these findings, more complex and intricate machine learning models can be created to improve the accuracy achieved on this task.

Keywords: Machine Learning, Natural Language Processing, SVM, Naïve Bayes Classifier

1. Introduction

Comments play an essential role in code development [1], and a significant amount of development time is spent on them to enhance code readability [2]. However, not all comments contribute to this objective. As coding becomes more commonplace, novice programmers tend to overlook the art of commenting, leading to a deterioration in both the quality and quantity of comments [3]. Many comments turn out to be unproductive, and sifting through lengthy comments only to discover their futility can be frustrating and time-consuming.

Various deep learning-based automatic commenting models can boost the quantity of comments [4]. Regrettably, insufficient research has been dedicated to the issue of comment quality. Nevertheless, recent efforts address these challenges by developing machine learning models capable of identifying and categorizing comments based on their usefulness. The author has explored a range of Machine Learning (ML) models in pursuit of solutions to this problem. This paper aims to answer critical questions as part of the Information Retrieval in Software Engineering (IRSE) shared task at the Forum for Information Retrieval Evaluation (FIRE) 2023 [5], conducted under the team name FaultySegment:

• What level of complexity is necessary for a Machine Learning model to reliably distinguish useful comments from useless ones?
• How do well-known general-purpose models such as SVM and the Naïve Bayes classifier perform in this context, even though they are not built for this particular scenario?

This paper aims to demonstrate that models such as SVM or the Naïve Bayes classifier can serve as effective starting points for tackling this problem. More complex models can be built upon these foundations, while also considering the risk of overfitting. In addition to the aforementioned objectives, this paper also addresses the question of whether the dataset for the learning task can be augmented with data generated by large language models such as GPT-3.
This consideration plays a pivotal role in exploring the potential of leveraging artificial intelligence to improve comment quality assessment. By evaluating the impact of augmenting the dataset with AI-generated data, the study aims to shed light on the benefits and challenges of integrating advanced language models into the machine learning pipeline. This inquiry represents a significant aspect of the research, emphasizing the intersection of human-written and AI-generated content in the context of comment quality evaluation.

2. Related Work

Software metadata [6] plays a crucial role in the maintenance of code and its subsequent understanding. Numerous tools have been developed to assist in extracting knowledge from software metadata, which includes runtime traces and structural attributes of code [7, 8, 9, 10, 11, 12, 13, 14, 15].

In the realm of mining code comments and assessing their quality, several authors have conducted research. Steidl et al. [16] employ techniques such as Levenshtein distance and comment length to gauge the similarity of words in code-comment pairs, effectively filtering out trivial and non-informative comments. Rahman et al. [17] focus on distinguishing useful from non-useful code review comments within review portals, drawing insights from attributes identified in a survey conducted with Microsoft developers [18]. Majumdar et al. [19, 20, 21, 22] have introduced a framework for evaluating comments based on concepts crucial for code comprehension. Their approach involves the development of textual and code correlation features, utilizing a knowledge graph to semantically interpret the information within comments. These approaches employ both semantic and structural features to address the prediction problem of distinguishing useful from non-useful comments, ultimately contributing to the decluttering of codebases.

In light of the emergence of large language models such as GPT-3.5 or Llama [23], it becomes crucial to assess the quality of code comments and compare them to human interpretation. The IRSE track at FIRE 2023 [5] expands upon the approach presented in a prior work [19]. It explores various vector space models [24] and features for the binary classification and evaluation of comments, specifically in the context of their role in comprehending code. Furthermore, the track conducts a comparative analysis of the prediction model's performance when GPT-generated labels for code and comment quality, extracted from open-source software, are included.

3. Task and Dataset Description

This section describes the task at hand and the dataset provided. The task at IRSE, FIRE 2023 was as follows: a binary code comment quality classification model needs to be augmented with generated code and comment pairs that can improve the accuracy of the model. The corresponding dataset was split into two:

• the training dataset with 8048 entries, and
• the testing dataset with 1000 entries.

The training dataset was shuffled and split into 70% for training the models and 30% for validation. The data was labelled as follows:

• Useful: comments that are useful for code comprehension
• Not Useful: comments that are not useful for code comprehension

Table 1: Description of the Dataset for the Task

Label | Example
Useful | /*not interested in the downloaded bytes, return the size*/
Useful | /*Fill in the file upload part*/
Not Useful | /*The following works both in 1.5.4 and earlier versions:*/
Not Useful | /*lock_time*/
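For illustration, the shuffled 70/30 split described above could be reproduced with scikit-learn roughly as in the sketch below; the file name and column names used here (irse_train.csv, comment, label) are placeholders assumed for this sketch, not the actual layout of the IRSE dataset.

```python
# Rough sketch of the shuffled 70/30 train/validation split described in Section 3.
# File name and column names ("irse_train.csv", "comment", "label") are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("irse_train.csv")   # training dataset with 8048 labelled entries

X = df["comment"]                    # raw C comment text
y = df["label"]                      # "Useful" / "Not Useful"

# Shuffle and hold out 30% of the labelled data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=42
)
```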
4. Augmentation

The dataset augmentation process involved integrating data generated by GPT-3, a powerful language model, to enrich the existing dataset. By leveraging GPT-3's natural language generation capabilities, additional comment data was created to expand the diversity and scale of the dataset. This augmentation strategy aimed to introduce a broader spectrum of comments, encompassing a wide range of writing styles, structures, and content. The GPT-3-generated data was incorporated to assess its potential in enhancing the training of machine learning models for comment quality evaluation. This approach allowed for the exploration of how AI-generated content could complement human-written data, contributing to a more comprehensive and robust dataset for improved model performance.

5. System Description

5.1. Text Preprocessing

All links, punctuation, numbers, and stop words are removed. Then all words with a POS tag other than noun, verb, adverb, or adjective are removed. Lemmatization is used to group the different forms of a word into a single word; NLTK WordNet [25] is used for lemmatization. The same preprocessing steps are applied to both the training and testing datasets.

5.2. Feature Extraction

TfidfVectorizer [26] from the scikit-learn library is used to convert the text into numerical features, together with the tokenizer from the Keras [27] library.

5.3. Machine Learning Models

Two runs have been submitted for the task: one using a Support Vector Machine (SVM) model, and another using a Naïve Bayes classifier model. The scikit-learn library was used for both models, with the parameters for the SVM model as follows (rough sketches of the preprocessing and classification pipeline are given below):

• C (regularization parameter) = 1
• kernel (kernel type) = 'linear'
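The following is a minimal sketch of the preprocessing described in Section 5.1, assuming the comments arrive as plain strings and that the required NLTK resources (punkt, stopwords, averaged_perceptron_tagger, wordnet) have been downloaded; the exact regular expressions and tag mapping are illustrative choices, not the author's original code.

```python
# Minimal sketch of the preprocessing in Section 5.1, assuming the NLTK resources
# "punkt", "stopwords", "averaged_perceptron_tagger", and "wordnet" are downloaded.
import re

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

# Penn Treebank tag prefixes mapped to WordNet POS constants; only nouns, verbs,
# adverbs, and adjectives are kept, as described in Section 5.1.
TAG_MAP = {"NN": wordnet.NOUN, "VB": wordnet.VERB, "RB": wordnet.ADV, "JJ": wordnet.ADJ}

def preprocess(comment: str) -> str:
    text = re.sub(r"https?://\S+", " ", comment)   # strip links
    text = re.sub(r"[^A-Za-z\s]", " ", text)        # strip punctuation and numbers
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t not in STOP_WORDS]
    kept = []
    for word, tag in nltk.pos_tag(tokens):
        wn_pos = TAG_MAP.get(tag[:2])
        if wn_pos is not None:                       # drop all other POS categories
            kept.append(LEMMATIZER.lemmatize(word, pos=wn_pos))
    return " ".join(kept)

print(preprocess("/* not interested in the downloaded bytes, return the size */"))
```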
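The feature extraction and classification steps of Sections 5.2 and 5.3 could then look roughly as follows, continuing from the variables of the two sketches above (X_train, X_val, y_train, y_val, preprocess). The choice of MultinomialNB is an assumption, since the paper does not name the exact Naïve Bayes variant, and the Keras tokenizer step is omitted here.

```python
# Rough sketch of the feature extraction and classifiers in Sections 5.2-5.3,
# reusing X_train, X_val, y_train, y_val, and preprocess from the sketches above.
# MultinomialNB is an assumption; the exact Naive Bayes variant is not specified.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Apply the Section 5.1 preprocessing to both splits, then vectorize with TF-IDF.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train.apply(preprocess))
X_val_vec = vectorizer.transform(X_val.apply(preprocess))

models = {
    "SVM": SVC(kernel="linear", C=1),   # parameters from Section 5.3
    "Naive Bayes": MultinomialNB(),
}

for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_val_vec)
    print(
        f"{name}: "
        f"accuracy={accuracy_score(y_val, preds):.4f}, "
        f"macro F1={f1_score(y_val, preds, average='macro'):.4f}, "
        f"macro precision={precision_score(y_val, preds, average='macro'):.4f}, "
        f"macro recall={recall_score(y_val, preds, average='macro'):.4f}"
    )
```

A linear kernel keeps training tractable on the high-dimensional sparse TF-IDF features and matches the C = 1, kernel = 'linear' setting reported above.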
6. Findings

6.1. Without Augmentation

With these parameters set for the SVM model, the validation set gives an accuracy of 77.2671% along with a macro F1 score of 0.786464. With the Naïve Bayes classifier, the validation set gives an accuracy of 60.9938% along with a macro F1 score of 0.699234.

Table 2: Results of Classifier Runs (without augmentation)

Run | Macro F1 Score | Macro Precision | Macro Recall | Accuracy (%)
SVM | 0.786464 | 0.772345 | 0.771381 | 77.2671
Naïve Bayes | 0.699234 | 0.609193 | 0.644289 | 60.9938

6.2. With Augmentation

With the same parameters for the SVM model, the validation set gives an accuracy of 77.6484% along with a macro F1 score of 0.783094. With the Naïve Bayes classifier, the validation set gives an accuracy of 64.0279% along with a macro F1 score of 0.695432.

Table 3: Results of Classifier Runs (with augmentation)

Run | Macro F1 Score | Macro Precision | Macro Recall | Accuracy (%)
SVM | 0.783094 | 0.772302 | 0.774198 | 77.6484
Naïve Bayes | 0.695432 | 0.611399 | 0.659775 | 64.0279

7. Conclusion

The tasks were accomplished using basic machine learning models, namely SVM and the Naïve Bayes classifier. The outcomes obtained from the SVM classifier indicate that there is room for improvement, enabling the development of more intricate models that align better with the problem statement and yield superior results. Notably, Srijoni Majumdar et al. [28] have already achieved better results using neural networks, and the author anticipates continuous improvement in these results over time.

Acknowledgments

Thanks to the organizers of the IRSE track at FIRE for this wonderful opportunity to work on such a project, and for their constant technical support throughout.

References

[1] B. Fluri, M. Würsch, H. Gall, Do code and comments co-evolve? On the relation between source code and comment changes, 2007, pp. 70–79. doi:10.1109/WCRE.2007.21.
[2] M. Kajko-Mattsson, A survey of documentation practice within corrective maintenance, Empirical Software Engineering 10 (2005) 31–55. URL: https://doi.org/10.1023/B:LIDA.0000048322.42751.ca. doi:10.1023/B:LIDA.0000048322.42751.ca.
[3] J. Raskin, Comments are more important than code, ACM Queue 3 (2005) 64–. doi:10.1145/1053331.1053354.
[4] E. Wong, J. Yang, L. Tan, AutoComment: Mining question and answer sites for automatic comment generation, in: 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2013, pp. 562–567. doi:10.1109/ASE.2013.6693113.
[5] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Generative AI for software metadata: Overview of the information retrieval in software engineering track at FIRE 2023, in: Forum for Information Retrieval Evaluation, ACM, 2023.
[6] S. C. B. de Souza, N. Anquetil, K. M. de Oliveira, A study of the documentation essential to software maintenance, in: Conference on Design of Communication, ACM, 2005, pp. 68–75.
[7] L. Tan, D. Yuan, Y. Zhou, Hotcomments: how to make program comments more useful?, in: Conference on Programming Language Design and Implementation (SIGPLAN), ACM, 2007, pp. 20–27.
[8] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, SmartKT: a search framework to assist program comprehension using smart knowledge transfer, in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2019, pp. 97–108.
[9] N. Chatterjee, S. Majumdar, S. R. Sahoo, P. P. Das, Debugging multi-threaded applications using Pin-augmented GDB (PGDB), in: International Conference on Software Engineering Research and Practice (SERP), Springer, 2015, pp. 109–115.
[10] S. Majumdar, N. Chatterjee, S. R. Sahoo, P. P. Das, D-Cube: tool for dynamic design discovery from multi-threaded applications using Pin, in: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2016, pp. 25–32.
[11] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers, Innovations in Systems and Software Engineering 17 (2021) 289–307.
[12] S. Majumdar, N. Chatterjee, P. Pratim Das, A. Chakrabarti, DCube_NN: tool for dynamic design discovery from multi-threaded applications using neural sequence models, Advanced Computing and Systems for Security: Volume 14 (2021) 75–92.
[13] J. Siegmund, N. Peitek, C. Parnin, S. Apel, J. Hofmeister, C. Kästner, A. Begel, A. Bethmann, A. Brechmann, Measuring neural efficiency of program comprehension, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 140–150.
[14] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, CodeT5+: Open code large language models for code understanding and generation, arXiv preprint arXiv:2305.07922 (2023).
[15] J. L. Freitas, D. da Cruz, P. R. Henriques, A comment analysis approach for program comprehension, in: Annual Software Engineering Workshop (SEW), IEEE, 2012, pp. 11–20.
[16] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.
[17] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual features and developer experience, in: International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226.
[18] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at Microsoft, in: Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156.
[19] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463.
[20] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-Mine—a semantic search approach to program comprehension from code comments, in: Advanced Computing and Systems for Security, Springer, 2020, pp. 29–42.
[21] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Overview of the IRSE track at FIRE 2022: Information retrieval in software engineering, in: Forum for Information Retrieval Evaluation, ACM, 2022.
[22] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can we predict useful comments in source codes? Analysis of findings from the Information Retrieval in Software Engineering track @ FIRE 2022, in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.
[23] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[24] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional software code representation using BERT and ELMo, in: 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.
[25] E. Loper, S. Bird, NLTK: The Natural Language Toolkit, 2002. URL: https://arxiv.org/abs/cs/0205028. doi:10.48550/ARXIV.CS/0205028.
[26] V. Kumar, B. Subba, A TfidfVectorizer and SVM based sentiment analysis framework for text data corpus, in: 2020 National Conference on Communications (NCC), 2020, pp. 1–6. doi:10.1109/NCC48643.2020.9056085.
[27] N. Ketkar, Introduction to Keras, 2017, pp. 95–109. doi:10.1007/978-1-4842-2766-4_7.
[28] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/smr.2463. doi:10.1002/smr.2463.