An ML-LLM pairing for better code comment classification

Hanna Abi Akl 1,2
1 Data ScienceTech Institute (DSTI), 4 Rue de la Collégiale, 75005 Paris, France
2 Université Côte d'Azur, Inria, CNRS, I3S

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
hanna.abi-akl@dsti.institute (H. A. Akl)
ORCID: 0000-0001-9829-7401 (H. A. Akl)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
The Information Retrieval in Software Engineering (IRSE) shared task at FIRE 2023 (https://sites.google.com/view/irse2023/home) introduces code comment classification, a challenging task that pairs a code snippet with a comment that must be evaluated as either useful or not useful for understanding the relevant code. We address the code comment classification challenge with a two-fold evaluation: from an algorithmic perspective, we compare the performance of classical machine learning systems, and from a data-driven perspective, we complement our evaluation by generating additional data with large language model (LLM) prompting to measure the potential increase in performance. Our best model, which took second place in the shared task, is a Neural Network with a Macro-F1 score of 88.401% on the provided seed data and a 1.5% overall increase in performance on the data generated by the LLM.

Keywords
Natural Language Processing, Machine Learning, Information Retrieval, Large Language Models, Code Comprehension, Comment Quality

1. Introduction

In software development, code and documentation go hand in hand. Writing code is crucial to maintaining existing code bases, developing new features and fixing bugs. Documentation helps developers make sense of the logic behind written code and provides a steady set of guidelines to iterate over it [1]. Code commenting is a form of documentation whereby comments written in natural language are inserted in the code [1]. The advantage of this method is that it helps clarify parts of the code without affecting performance, since comments are ignored by compilers [1]. It also provides an easy way to reflect updates on code changes without having to modify the entire documentation [1].

From the practice of writing code comments comes the challenge of identifying useful comments [2]. Writing comments is not an exact science, and some comments can be outdated or ambiguous [2]. This can be problematic for developers who rely on these comments to understand and alter the code. There is therefore a real need for code comment checking. This need has framed the task of collecting code comments from real projects in a code-comment database to aid in the task of classifying useful versus not useful comments [3].

Code comment classification is still a relatively new task that explores the possibility of accurately discriminating between comments that bring added value to the corresponding code and comments that are not pertinent with respect to the surrounding code [4]. Recent research has aimed to answer this challenge by compiling a semantic code-comment base by scraping and collecting code and surrounding comments from real projects in C [3].
Researchers have also explored applying machine and deep learning techniques to solve this binary classification problem by considering useful comments (i.e., informative of the surrounding code) as one class and non-useful comments (i.e., redundant, uninformative or ambiguous) as another [5, 6]. On the other hand, the rise of large language models (LLMs) [7] and their ability to act as a jack-of-all-trades by solving a wide range of machine learning and deep learning problems, coupled with their wealth of training data, make them an interesting entry point for the code comment classification task [8]. Based on the Transformer architecture [9], they are able to create robust embeddings from text, which helps them tackle problems based on natural language [10]. Another recent breakthrough in LLMs is generative artificial intelligence, where users combine pre-trained models with different prompting techniques to generate output data (e.g., text) [11]. This prompting ability is at the heart of prompt engineering, a method that can steer an LLM toward focusing its generation on a specific need. This need can take the form of answering specific questions, solving certain tasks (e.g., a classification problem) or producing data in a pre-defined format [12]. The latter use case plays an instrumental role in data augmentation, whereby users can couple the power of LLMs with a pre-existing dataset to enrich it and overcome data scarcity [13, 14].

The IRSE shared task at FIRE 2023 proposes to measure the effects of leveraging LLMs in the context of solving the code comment classification problem [15]. Specifically, participants are asked to use the generative capabilities of LLMs to enrich an existing dataset of code comments and compare the performance of classical machine learning models on the classification task before and after data augmentation [15]. In this paper, we show how prompting LLMs effectively can increase model performance on the code comment classification problem.

The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we present the experimental setup. In Section 4, we discuss the results. Finally, we present our conclusions in Section 5.

2. Related Work

This section discusses some of the strategies proposed in the literature to classify code comments by quality.

2.1. Baseline models for code comment classification

Paul [16] leveraged classical machine learning models to solve the code comment classification task on a C language dataset of code and comments. They extracted text-level features like comment length and comment position within the source code and found comparable performance between a logistic regression and a support vector machine binary classifier [16]. Das and Chatterjee [17] studied the performance of deep learning models by proposing a fusion transformer system based on BERT and CodeBERT. Their system combined text-based features with dense embeddings and outperformed all other baseline models on the code comment classification task [17].

2.2. Embedding techniques for code comment classification

Basu et al. [18] compared both classical machine learning models and transformer-based models with different embedding techniques and found that a bag-of-words representation can outperform transformer-based embeddings on the code comment classification problem. Their findings could not be generalized and were limited by the size of the dataset used for their runs [18].
Majumdar et al. [19] examined the effects of using embeddings to tackle the code-comment pair classification challenge by developing and training a low-dimensional contextualized word embedding model based on masked language models. The resulting model captured semantic code concepts better and boosted their binary classification systems when compared to vanilla word embedding models [19]. Other areas of research suggest an inclination toward specializing software engineering terms and building a domain vocabulary to produce more representative word models. Mishra and Sharma [20] proposed a methodology for crawling and scraping Wikipedia as a base for collecting software engineering terms. Gonzalez-Perez and Henderson-Sellers [21] laid the groundwork for the construction of such an ontology in terms of completeness, clarity, generalizability and extensibility. Simmons and Dillon [22] proposed an open-source architecture designed to act as both an ontology and a knowledge base meta-model for software development semantics.

3. Experiments

This section describes the framework of our experiments in terms of data, models and training process.

3.1. Dataset description

The dataset considered for this shared task is divided into two parts: seed data provided by the task organizers and an LLM-generated dataset that complements it. We introduce both datasets in the following subsections.

3.1.1. Seed Data

The data provided by the task organizers consists of 11452 pairs of code and comments written in C, labeled as either Useful or Not Useful. The data contains 7063 Useful rows and 4389 Not Useful rows. The comments and surrounding code snippets are extracted from GitHub. For every comment, the label (Useful or Not Useful) was assigned by a team of 14 annotators, with each comment annotated by 2 annotators. Cohen's kappa was used to measure inter-annotator agreement, yielding a score of 0.734. The annotation process was supervised through weekly meetings and peer review sessions. Sample data is shown in Figure 1.

Figure 1: Example of Seed Data

3.1.2. Data Augmentation

Participants are required to generate an additional dataset to complement the provided seed data. The generated dataset consists of code-comment pairs with labels produced by an LLM of choice. In our experiments, we chose ChatGPT as our LLM and prompted it to generate data that aligns with the criteria of the given dataset, i.e., the generated code snippets should be written in the C programming language and the corresponding comments should be a mixture of useful and not useful. Additionally, we asked ChatGPT to label each code-comment pair with the corresponding class (Useful or Not Useful); the full prompt session is available online (https://chat.openai.com/share/6538a7f4-0a19-4e54-b5d8-d246dac3781a). The ablation study performed on the LLM-generated dataset can be found in the Appendix. Using this method, we were able to generate 421 new code-comment pairs, with 411 labeled as Useful and 10 labeled as Not Useful by ChatGPT. Figure 2 shows an example output from ChatGPT, and a sketch of how such a prompt could be issued programmatically is given below.

Figure 2: Example of ChatGPT Generated Data
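For illustration, the snippet below is a minimal sketch of how a comparable generation prompt could be issued through the OpenAI API rather than the ChatGPT web interface. The model name, prompt wording, output format and parsing logic are assumptions for illustration only and do not reproduce the exact session linked above.

```python
# Hypothetical sketch: prompting an OpenAI chat model to produce labeled
# C code-comment pairs in the spirit of the augmentation described above.
# Assumes the `openai` Python package (>= 1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Generate 10 pairs of a single line of C code and an accompanying comment. "
    "Make roughly half of the comments useful (informative of the code) and half "
    "not useful (redundant, uninformative or ambiguous). "
    "Return one pair per line in the format: <code> ||| <comment> ||| <Useful|Not Useful>"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model; the paper's data came from the ChatGPT web interface
    messages=[{"role": "user", "content": PROMPT}],
)

# Parse the generated lines into (code, comment, label) triples, skipping malformed rows.
pairs = []
for line in response.choices[0].message.content.splitlines():
    parts = [p.strip() for p in line.split("|||")]
    if len(parts) == 3 and parts[2] in {"Useful", "Not Useful"}:
        pairs.append(tuple(parts))

print(f"Collected {len(pairs)} candidate code-comment pairs")
```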
3.2. System description

This section introduces the methodology used in our experimental runs. It describes the machine learning models as well as the features employed in our experiments.

3.2.1. Model Choice

Since the challenge explicitly limits participants to classical machine learning models (including neural networks but excluding recurrent neural networks and more modern architectures like large language models), we based our experiments on 3 systems: Random Forest (RF), Voting Classifier (VC) and Neural Network (NN).

For the Random Forest model, we configured the following parameters: number of estimators = 100, criterion = gini, minimum samples split = 2, minimum samples leaf = 1, maximum features = sqrt, and bootstrap = True.

The Voting Classifier is based on 3 estimator models: a Random Forest, a Neural Network and a Linear SVC. The Random Forest has the following parameters: number of estimators = 100, criterion = gini, minimum samples split = 2, minimum samples leaf = 1, maximum features = sqrt, bootstrap = True. The Neural Network has the following configuration: number of hidden layers = 2, hidden layer sizes = (20, 10), activation = relu, solver = adam, alpha = 0.0001, learning rate = constant, initial learning rate = 0.001, maximum iterations = 200, shuffle = True, tolerance = 0.0001, momentum = 0.9, Nesterov's momentum = True, beta 1 = 0.9, beta 2 = 0.999, epsilon = 0.00000001. The Linear SVC is configured as follows: penalty = L2, loss = squared hinge, dual = True, tolerance = 0.0001, C = 1.0, fit intercept = True, maximum iterations = 1000. The voting strategy is set to hard.

The last model is a standalone Neural Network with the same configuration as the Neural Network estimator of the Voting Classifier.

All models are implemented using the scikit-learn package (https://scikit-learn.org/stable/index.h) in Python.

3.2.2. Features

For the feature engineering phase, we concatenate code-comment pairs and embed the resulting input strings. We use the flax-sentence-embeddings/st-codesearch-distilroberta-base model (https://huggingface.co/flax-sentence-embeddings/st-codesearch-distilroberta-base), trained with the Hugging Face sentence-transformers library (https://huggingface.co/sentence-transformers) on the CodeSearchNet dataset (https://huggingface.co/datasets/code_search_net) compiled from code and documentation strings in the Go, Java, JavaScript, PHP, Python and Ruby programming languages [23]. The result is one 768-dimensional embedding vector for every code-comment input string. These embeddings constitute our final feature set and are fed to the different models.

3.2.3. Experimental Setup

We divide our experiment into two phases: a seed data run and a seed + LLM data run. The setup is identical for both phases; the only difference is the input data used. In the seed data run, only the seed data provided by the task organizers is used to assess model performance. In the seed + LLM data run, the data generated by ChatGPT is added to the seed data and the resulting augmented dataset is used as the input for our models. In both phases, analyzing the data at our disposal shows a class imbalance where the Useful class is over-represented at 61.6% in the seed data and 97.6% in the LLM-generated data. We use the SMOTE [24] technique to balance the datasets and restore class parity by synthetically generating rows of Not Useful data to achieve a 50-50 class distribution. Next, we split our data using the scikit-learn Repeated Stratified K-Fold cross-validator (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html) with 10 folds and 3 repetitions. We use the Accuracy, Precision, Recall and F1 scores as metrics for evaluating our models. All experiments are performed on a Dell G15 Special Edition 5521 machine with 14 CPU cores, 32 GB RAM and an NVIDIA GeForce RTX 3070 Ti GPU. A sketch of this pipeline is shown below.
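The following snippet is a minimal sketch of the pipeline just described, assuming the dataset has been loaded into parallel lists of code snippets, comments and labels (the placeholder data below is illustrative only). It uses sentence-transformers for the embeddings, imbalanced-learn for SMOTE (scikit-learn itself does not ship SMOTE) and scikit-learn for the models and cross-validation; variable names and the data-loading step are assumptions rather than the exact code used in our runs.

```python
# Illustrative pipeline: embed concatenated code-comment pairs, balance classes
# with SMOTE, then evaluate the three classifiers with repeated stratified k-fold CV.
import numpy as np
from sentence_transformers import SentenceTransformer
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Placeholder data so the sketch runs end-to-end; replace with the real seed dataset.
codes = [f"int v{i} = {i};" for i in range(40)]
comments = [f"// set v{i} to {i}" if i % 3 else "// variable" for i in range(40)]
labels = ["Useful" if i % 3 else "Not Useful" for i in range(40)]

# 1. Features: one 768-dimensional embedding per concatenated code-comment string.
encoder = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")
X = encoder.encode([c + " " + m for c, m in zip(codes, comments)])
y = np.array([1 if label == "Useful" else 0 for label in labels])

# 2. Class balancing: SMOTE synthesizes minority-class rows to reach a 50-50 split.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)

# 3. Models with the hyperparameters reported above (scikit-learn defaults otherwise).
rf = RandomForestClassifier(n_estimators=100, criterion="gini", min_samples_split=2,
                            min_samples_leaf=1, max_features="sqrt", bootstrap=True)
nn = MLPClassifier(hidden_layer_sizes=(20, 10), activation="relu", solver="adam",
                   alpha=1e-4, learning_rate="constant", learning_rate_init=1e-3,
                   max_iter=200, shuffle=True, tol=1e-4, momentum=0.9,
                   nesterovs_momentum=True, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
svc = LinearSVC(penalty="l2", loss="squared_hinge", dual=True, tol=1e-4,
                C=1.0, fit_intercept=True, max_iter=1000)
vc = VotingClassifier(estimators=[("rf", rf), ("nn", nn), ("svc", svc)], voting="hard")

# 4. Evaluation: repeated stratified 10-fold cross-validation with 3 repetitions.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
for name, model in [("RF", rf), ("VC", vc), ("NN", nn)]:
    scores = cross_validate(model, X_bal, y_bal, cv=cv, scoring=scoring)
    print(name, {k: round(v.mean() * 100, 3)
                 for k, v in scores.items() if k.startswith("test_")})
```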
4. Results

Table 1 shows the performance of each model on the seed data. For each scoring metric, the best score is marked with an asterisk for both the Useful (U) and Not Useful (NU) classes. On the majority of the scoring metrics, the Neural Network outclasses the Random Forest and the Voting Classifier models. The Voting Classifier outperforms the Neural Network on the Recall of the Useful class and the Precision of the Not Useful class, which shows that combining different non-linear models can compensate for one model's blind spots and classify more instances of Useful and Not Useful data correctly.

The results of Table 2 are consistent with these findings. The Neural Network is the overall best model since it outperforms the other systems on 5 scoring metrics out of 8 over both classes, while the Voting Classifier retains the best scores in F1 (U), Recall (U) and Precision (NU). We also note that the scores are consistently high for both classes, which is in large part helped by the SMOTE data augmentation technique. Having balanced both classes in our experiments allows us to have a better baseline when measuring the impact of the additional data generated by ChatGPT. By comparing the scores of Tables 1 and 2, we see that fixing the models and augmenting the data yields a 1.5% increase in scores overall. In particular, this supports the claim that the data generated by the LLM aligns with the data expected for this challenge and can further aid in solving it.

Table 1: Model performance comparison on seed data
Model | Macro-F1 (U) | Precision (U) | Recall (U) | Accuracy (U) | Macro-F1 (NU) | Precision (NU) | Recall (NU) | Accuracy (NU)
RF | 84.727 | 83.263 | 86.257 | 84.454 | 84.168 | 85.758 | 82.651 | 84.454
VC | 88.133 | 88.071 | 88.215* | 88.123 | 88.111 | 88.211* | 88.031 | 88.123
NN | 88.401* | 89.484* | 87.380 | 88.536* | 88.664* | 87.692 | 89.693* | 88.536*

Table 2: Model performance comparison on seed + LLM data
Model | Macro-F1 (U) | Precision (U) | Recall (U) | Accuracy (U) | Macro-F1 (NU) | Precision (NU) | Recall (NU) | Accuracy (NU)
RF | 85.587 | 84.397 | 86.818 | 85.381 | 85.168 | 86.438 | 83.943 | 85.381
VC | 88.539* | 88.705 | 88.395* | 88.560 | 88.578 | 88.454* | 88.725 | 88.560
NN | 88.489 | 90.007* | 87.066 | 88.678* | 88.856* | 87.506 | 90.290* | 88.678*

5. Conclusion

In this shared task, we evaluate the impact of generating LLM data to improve model performance. We explore the effects of this data generation by augmenting the existing code comment dataset and measuring the increase in the model classification scores. In the future, we plan to incorporate other data generation mechanisms, such as ontology or knowledge graph integration, into our LLM prompting technique to further our study of the impact of a refined data augmentation pipeline on classification performance.

References

[1] R. C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship, 1st ed., Prentice Hall PTR, USA, 2008.
[2] Y. Shinyama, Y. Arahori, K. Gondow, Analyzing code comments to boost program comprehension, in: 2018 25th Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2018, pp. 325–334.
[3] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine—a semantic search approach to program comprehension from code comments, Advanced Computing and Systems for Security: Volume Twelve (2020) 29–42.
[4] FIRE '22: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, Association for Computing Machinery, New York, NY, USA, 2022.
[5] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can we predict useful comments in source codes? - Analysis of findings from Information Retrieval in Software Engineering Track @ FIRE 2022, in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.
[6] K. Ghosh, A. Senapati, Information retrieval in software engineering utilizing a pre-trained BERT model (2022).
[7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, J.-R. Wen, A survey of large language models, 2023. arXiv:2303.18223.
[8] J. Kocoń, I. Cichecki, O. Kaszyca, M. Kochanek, D. Szydło, J. Baran, J. Bielaniewicz, M. Gruza, A. Janz, K. Kanclerz, et al., ChatGPT: Jack of all trades, master of none, Information Fusion (2023) 101861.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.
[10] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, HuggingFace's Transformers: State-of-the-art natural language processing, 2020. arXiv:1910.03771.
[11] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021. arXiv:2107.13586.
[12] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, J. Ba, Large language models are human-level prompt engineers, 2023. arXiv:2211.01910.
[13] C. Whitehouse, M. Choudhury, A. F. Aji, LLM-powered data augmentation for enhanced crosslingual performance, 2023. arXiv:2305.14288.
[14] J. Yuan, R. Tang, X. Jiang, X. Hu, Large language models for healthcare data augmentation: An example on patient-trial matching, 2023. arXiv:2303.16756.
[15] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Generative AI for software metadata: Overview of the Information Retrieval in Software Engineering track at FIRE 2023, in: Forum for Information Retrieval Evaluation, ACM, 2023.
[16] S. Paul, Source code comment classification using logistic regression and support vector machine (2022).
[17] M. Das, S. Chatterjee, Exploring transformer-based models for automatic useful code comments detection (2022).
[18] T. Basu, et al., Identification of the relevance of comments in codes using bag of words and transformer based models, arXiv preprint arXiv:2308.06144 (2023).
[19] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional software code representation using BERT and ELMo, in: 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.
[20] S. Mishra, A. Sharma, Crawling Wikipedia pages to train word embeddings model for software engineering domain, in: 14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference), 2021, pp. 1–5.
[21] C. Gonzalez-Perez, B. Henderson-Sellers, An ontology for software development methodologies and endeavours, in: Ontologies for Software Engineering and Software Technology, Springer, 2006, pp. 123–151.
[22] G. L. Simmons, T. S. Dillon, Towards an ontology for open source software development, in: IFIP International Conference on Open Source Systems, Springer, 2006, pp. 65–75.
[23] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, M. Brockschmidt, CodeSearchNet challenge: Evaluating the state of semantic code search, 2020. arXiv:1909.09436.
[24] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357. URL: https://doi.org/10.1613/jair.953. doi:10.1613/jair.953.

A. Appendix

This section presents the ablation study done on the ChatGPT prompt to understand its inherent data generation mechanisms. Overall, we asked ChatGPT to provide 3000 additional lines of code-comment data split equally between Useful and Not Useful. From the generated dataset, we had to trim rows for duplicates (i.e., the same code-comment pairs generated multiple times), incompleteness (i.e., the generation was missing parts of the code snippet or comment) and ambiguity (i.e., the code was not written in proper C or the comment was not in proper English). These rules ensure the generated dataset mimics the original seed data as faithfully as possible under human supervision; a minimal filtering sketch is given at the end of this appendix. Additionally, we attempted to reverse-engineer ChatGPT's generation process by asking it to explain how it produced the data. The following extract transcribes the code-comment generation mechanism described by the LLM.

User: can you explain how you came up with these examples?

ChatGPT: Certainly! The examples were generated programmatically using a combination of randomization and predefined patterns. Here's a breakdown of the process:

1. Randomization: To create random lines of C code, I utilized random number generators available in programming languages. For this task, I used the built-in random number generator provided by the programming environment. The generator is initialized with a seed (which may be the system time), and it generates random numbers within a specified range. This allowed me to create random integers for variables, loop limits, and other numerical values.

2. Predefined Patterns: I used predefined patterns to structure the C code snippets. These patterns include simple variable declarations, conditional statements, loops, arithmetic operations, function calls, and string operations. By combining these patterns with the randomly generated numbers, I formed meaningful C code lines.

3. Comments: For the comments, I used a mixture of random phrases and predefined descriptions. Half of the comments were matched with the corresponding lines of code to provide meaningful explanations of the code's purpose or functionality. The other half of the comments were randomly generated phrases, ensuring a mix of relevant and unrelated comments.

4. Iteration: To generate 3000 examples, I repeated the randomization and predefined patterns generation process 3000 times. Each iteration produced a unique line of C code along with a comment, resulting in a diverse set of examples.

It's important to note that these examples are randomly generated and may not necessarily represent valid or complete programs.
The goal was to provide a mix of code snippets and comments to showcase various C language constructs and coding patterns, as well as to demonstrate the use of comments for code documentation. The examples are intended for illustrative purposes and should not be used in a production environment without proper validation and testing.
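As an illustration of the trimming rules above, the snippet below is a minimal filtering sketch, assuming the raw ChatGPT output has been collected into a CSV with code, comment and label columns. The file name, column names and completeness heuristics are assumptions for illustration; the ambiguity check (proper C and proper English) was handled by manual review rather than code.

```python
# Hypothetical cleaning pass over raw LLM-generated rows, mirroring the trimming
# rules described in this appendix. Assumes columns: code, comment, label.
import pandas as pd

df = pd.read_csv("chatgpt_generated_pairs.csv")  # assumed file name

# Rule 1 - duplicates: drop repeated code-comment pairs.
df = df.drop_duplicates(subset=["code", "comment"])

# Rule 2 - incompleteness: drop rows with missing or empty code/comment fields.
df = df.dropna(subset=["code", "comment", "label"])
df = df[df["code"].str.strip().astype(bool) & df["comment"].str.strip().astype(bool)]

# Rule 3 - ambiguity: keep only rows with a valid label; whether the code is proper C
# and the comment proper English was verified by manual inspection, not automated here.
df = df[df["label"].isin(["Useful", "Not Useful"])]

df.to_csv("chatgpt_generated_pairs_clean.csv", index=False)
print(f"{len(df)} rows kept after filtering")
```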