On the Impact of Synthetic Data on Code Comment Usefulness Prediction

Vishesh Agarwal
Microsoft Corporation, Redmond, WA 98052, USA
visagarwal@microsoft.com (V. Agarwal), ORCID 0000-0002-4551-748X

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
In the domain of software development, the utility of code comments varies, necessitating methodologies capable of distinguishing their substantive value. This study investigates enhancing code comment usefulness classification through a hybrid approach that combines manually tagged datasets with synthetic data augmentation. For augmentation, we employed GPT-3.5-turbo, a state-of-the-art language model, to label additional comment examples. A baseline model was established using random forests for classification. Interestingly, despite the data augmentation, model performance remained consistent, with an F1 score of approximately 0.79 both before and after the synthetic data integration. This research offers insights into the potential and limitations of synthetic data augmentation for code comment usefulness classification.

Keywords
Random Forests, Data Augmentation, Comment Classification, Qualitative Analysis

1. Introduction
Developers often need to fix bugs, develop new source code, or upgrade already deployed applications on a reduced time frame. This can lead to improper coding practices. As the software changes dynamically, documentation such as requirement specifications and high-level designs becomes outdated and incomplete, and knowledge transfer or help from the earlier developers is often unobtainable. Such situations demand a systematic, quality-controlled development process. Automated program comprehension is one such method of better maintaining existing source code [1]. Since the software design of a codebase is a moving target, the real sources of truth are the traces of test execution, static analysis of the programs and, to a large extent, code comments. This paper focuses on code comments as information about the program design, both for developers and for automated program comprehension.

Code comments offer deep insights into the logic, decisions, and intentions behind the code, thereby aiding code comprehension, maintenance, and debugging. However, not all comments are equally informative or useful, creating a compelling need to develop automated methods that classify the usefulness of code comments effectively.

A common hurdle across studies of code comment usefulness is the scarcity of extensive, well-annotated datasets that encompass the diverse nature of comments in various programming contexts. This necessitates innovative strategies for enhancing the available data so that models generalize better to unseen, real-world comments. Recognizing this gap, we integrate manual annotation with synthetic data augmentation: we employ GPT-3.5-turbo, a state-of-the-art language model, to label code comment samples scraped from open-source codebases.
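As a concrete illustration, the sketch below shows one way such labeling could be scripted. It is a minimal sketch assuming the legacy openai Python client (0.x); the prompt wording and the label_comment helper are our own illustrative assumptions rather than the exact setup used in this study.

```python
# Hypothetical sketch: labeling scraped code-comment pairs with GPT-3.5-turbo.
# The prompt text and helper below are illustrative assumptions, not the
# study's actual prompt; they assume the legacy openai (0.x) client.
import openai

SYSTEM_PROMPT = (
    "You will be given a C code snippet and an associated comment. "
    "Reply with exactly one label: 'Useful' if the comment is relevant "
    "to the code, or 'Not Useful' otherwise."
)

def label_comment(comment: str, code: str) -> str:
    """Ask GPT-3.5-turbo whether a comment is Useful or Not Useful."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output makes labels reproducible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Comment:\n{comment}\n\nCode:\n{code}"},
        ],
    )
    return response.choices[0].message["content"].strip()
```

Setting the temperature to 0 is one reasonable choice here, since labeling benefits from reproducible rather than creative output.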
In this paper, we propose a binary classification task to understand the source code comments present in programs written in the C language. We classify each comment into one of two classes, Useful and Not Useful. We start with a training dataset of over 11,000 manually annotated samples and use random forests to create a baseline for comment classification. We then augment the data with over 200 GPT-labelled samples to examine any improvement in performance. We observed that model performance remained consistent, with an F1 score of 0.79 both for the baseline and for the model trained on the augmented data.

By exploring the interplay between manual annotation and synthetic data augmentation, this study aims to contribute a novel perspective to the existing body of knowledge on code comment usefulness classification. It strives to offer a practical response to the prevailing challenges in the field and to encourage further development of robust, scalable models that can adapt to the dynamic, evolving landscape of software development.

The rest of the paper is organized as follows. Section 2 discusses background work in the domain of comment classification. The task and dataset are described in Section 3. Our methodology is discussed in Section 4. Results are addressed in Section 5. Section 6 concludes the paper.

2. Related Work
Software metadata is integral to code maintenance and subsequent comprehension. A significant number of tools [2, 3, 4, 5, 6, 7] have been proposed to aid in extracting knowledge from software metadata [8] such as runtime traces or structural attributes of code. In terms of mining code comments and assessing their quality, authors [9, 10, 11, 12, 13, 14] compare the similarity of words in code-comment pairs using the Levenshtein distance and the length of comments to filter out trivial and non-informative comments. Rahman et al. [15] detect useful and non-useful code review comments (logged in review portals) based on attributes identified from a survey conducted with developers at Microsoft [16]. Majumdar et al. [17, 18] proposed a framework to evaluate comments based on concepts that are relevant to code comprehension. They developed textual and code-correlation features using a knowledge graph for semantic interpretation of the information contained in comments. These approaches use semantic and structural features to set up a prediction problem for useful and not useful comments, whose output can subsequently be integrated into the process of decluttering codebases.

With the advent of large language models [19], it is important to compare the quality assessment of code comments by standard models such as GPT-3.5 or LLaMA with human interpretation. The IRSE track at FIRE 2023 [20] extends the approach proposed in [17] to explore various vector space models [21] and features for binary classification and evaluation of comments in the context of their use in understanding the code. The track also compares the performance of the prediction model with the inclusion of GPT-generated labels for the quality of code and comment snippets extracted from open-source software.

3. Task and Dataset Description
In this section, we describe the task addressed in this paper. We aim to implement a binary classification system that classifies source code comments as useful or not useful. The system takes a code comment with its associated lines of code as input, and outputs a label, useful or not useful, for the corresponding comment, which helps developers comprehend the associated code.
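To make the task interface concrete, the following sketch spells out the input and output types described above; the CommentSample type and the classify signature are hypothetical names introduced purely for illustration, not artifacts of the system described here.

```python
# Hypothetical interface for the comment-usefulness task (illustrative only).
from dataclasses import dataclass
from typing import Literal, Optional

Label = Literal["Useful", "Not Useful"]

@dataclass
class CommentSample:
    comment: str                   # the comment text
    code: str                      # the surrounding C code snippet
    label: Optional[Label] = None  # gold annotation, when available

def classify(sample: CommentSample) -> Label:
    """Predict whether the comment is relevant to its code (see Section 4)."""
    raise NotImplementedError
```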
Classical machine learning algorithms such as random forests can be used to develop the classification system. The two classes of source code comments can be described as follows:
• Useful - the given comment is relevant to the corresponding source code.
• Not Useful - the given comment is not relevant to the corresponding source code.

A dataset consisting of over 11,000 code-comment pairs written in the C language is used in our work. Each data instance consists of the comment text, a surrounding code snippet, and a label that specifies whether the comment is useful or not. The whole dataset was collected from GitHub and annotated by a team of 14 annotators. Sample data instances are shown in Table 1.

Table 1
Sample data instances. Code is shown with line offsets relative to the comment; snippets are truncated as in the source.

Sample 1
  Comment: /*test 529*/
  Label: Not Useful
  Code:
    -10. int res = 0;
     -9. CURL *curl = NULL;
     -8. FILE *hd_src = NULL;
     -7. int hd;
     -6. struct_stat file_info;
     -5. CURLM *m = NULL;
     -4. int running;
     -3. start_test_timing();
     -2. if(!libtest_arg2) {
     -1. #ifdef LIB529
      1. fprin

Sample 2
  Comment: /*cr to cr,nul*/
  Label: Not Useful
  Code:
     -1. else
      1. newline = 0;
      2. }
      3. else {
      4. if(test->rcount) {
      5. c = test->rptr[0];
      6. test->rptr++;
      7. test->rcount--;
      8. }
      9. else
     10. break;

Sample 3
  Comment: /*convert minor status code (underlying routine error) to text*/
  Label: Useful
  Code:
    -10. break;
     -9. }
     -8. gss_release_buffer(&min_stat, &status_string);
     -7. }
     -6. if(sizeof(buf) > len + 3) {
     -5. strcpy(buf + len, ".\n");
     -4. len += 2;
     -3. }
     -2. msg_ctx = 0;
     -1. while(!msg_ctx) {
      1. /*con

A second, similar dataset is also used in this work: code-comment pairs collected from GitHub whose useful or not useful labels were assigned by GPT. It has the same structure as the original dataset and is later used to augment it.

4. Working Principle
We use random forests to implement the binary classification functionality. The system takes comments as well as their surrounding code snippets as input. We create embeddings of each piece of code and its associated comment using a pre-trained Universal Sentence Encoder; the resulting embeddings are used to train the machine learning model. In both experiments, 80% of the data instances, along with their labels, form the training set, and the remaining 20% are used for testing. The model is described in the following section.

4.1. Random Forest
Random Forest (RF) is employed for binary comment classification in our study, leveraging an ensemble of decision trees to improve predictive accuracy and control overfitting. The basic premise of Random Forest is to grow numerous decision trees during training and, at prediction time, to output the class that is the mode of the classes output by the individual trees. Each tree in the Random Forest is constructed as follows:
1. A subset of the training data is selected with replacement (a bootstrap sample).
2. A subset of features is randomly chosen at each node.
3. The best split according to a criterion (such as Gini impurity or entropy) is chosen to partition the data.
4. Steps 2 and 3 are repeated at each node until the tree is fully grown.

The classification decision is obtained by aggregating the predictions made by all trees in the forest through majority voting:

$RF(x) = \mathrm{majority}\left(\{T_i(x)\}_{i=1}^{n}\right)$  (1)

where $T_i(x)$ denotes the prediction of the $i$-th tree for the input vector $x$, and $n$ is the number of trees in the forest.
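As a concrete sketch of this pipeline, the code below embeds comments and code with the TF-Hub release of the Universal Sentence Encoder, concatenates the two vectors, and trains a scikit-learn random forest on an 80/20 split. The file name, the column names, and the choice to concatenate the two embeddings are our illustrative assumptions; the paper's exact feature construction may differ.

```python
# Minimal sketch: Universal Sentence Encoder embeddings + random forest.
# comments.csv with columns "comment", "code", "label" is an assumed layout.
import numpy as np
import pandas as pd
import tensorflow_hub as hub
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

df = pd.read_csv("comments.csv")
# Embed comments and code snippets separately (512-d each), then concatenate.
comment_vecs = embed(df["comment"].tolist()).numpy()
code_vecs = embed(df["code"].tolist()).numpy()
X = np.hstack([comment_vecs, code_vecs])
y = (df["label"] == "Useful").astype(int)

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
clf.fit(X_train, y_train)
print("OOB score:", clf.oob_score_)  # out-of-bag generalization estimate
```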
For the final classification decision, the fraction of trees voting for a class is compared against a threshold, conventionally 0.5 for binary classification, although this threshold can be adjusted, for instance to favor the useful comment class. Random Forest inherently handles high-dimensional feature spaces and does not require feature scaling. It can handle missing values by choosing the split that minimizes impurity among the non-missing values, which in effect imputes the missing ones based on the majority class or the mean/mode value. During training, the out-of-bag (OOB) error, computed on the data not used in each tree's bootstrap sample, serves as an unbiased estimate of the generalization error and can be employed for hyper-parameter tuning.

5. Results
We train our random forest model in two experiments. The original dataset has 11,452 samples, and the GPT-generated dataset has 233 samples. The first experiment uses only the original data; in the second, the original dataset is augmented with the GPT-generated data. The results of both experiments are shown in Table 2.

Table 2
Results for binary classification on both datasets

Dataset              Accuracy (%)   Precision   Recall   F1 Score
Original dataset     81.06          0.7902      0.8016   0.7949
Augmented dataset    81.00          0.7908      0.8014   0.7952

The very slight change in scores across all metrics suggests that the newly generated data was practically indistinguishable from the original dataset, supporting the validity of GPT-generated labels for data augmentation.
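For completeness, here is a sketch of how the reported metrics can be computed with scikit-learn; it assumes the clf, X_test, and y_test objects from the training sketch in Section 4.1.

```python
# Evaluation sketch; assumes clf, X_test, y_test from the earlier sketch.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_pred = clf.predict(X_test)
print(f"Accuracy : {100 * accuracy_score(y_test, y_pred):.2f}%")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall   : {recall_score(y_test, y_pred):.4f}")
print(f"F1 score : {f1_score(y_test, y_pred):.4f}")
```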
6. Conclusion
This paper has addressed a binary classification problem in the domain of source code comment classification: comments in C source code are classified according to their usefulness. We used random forests as our base classification method and conducted two experiments, one with the original dataset and one with the original dataset augmented with synthetic GPT-generated data. The similar results in both cases indicate that the synthetic data is consistent with the original dataset and suggest that synthetic data generation can effectively increase the volume of data available for training such models. The agreement between the two sets of results supports the quality of the synthetic labels relative to the original annotations. Synthetic data generation can therefore serve as a useful data augmentation tool in many pipelines.

References
[1] M. Berón, P. R. Henriques, M. J. Varanda Pereira, R. Uzal, G. A. Montejano, A language processing tool for program comprehension, in: XII Congreso Argentino de Ciencias de la Computación, 2006.
[2] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, SmartKT: a search framework to assist program comprehension using smart knowledge transfer, in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2019, pp. 97–108.
[3] N. Chatterjee, S. Majumdar, S. R. Sahoo, P. P. Das, Debugging multi-threaded applications using Pin-augmented GDB (PGDB), in: International Conference on Software Engineering Research and Practice (SERP), Springer, 2015, pp. 109–115.
[4] S. Majumdar, N. Chatterjee, S. R. Sahoo, P. P. Das, D-Cube: tool for dynamic design discovery from multi-threaded applications using Pin, in: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2016, pp. 25–32.
[5] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers, Innovations in Systems and Software Engineering 17 (2021) 289–307.
[6] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, DCube_NN: tool for dynamic design discovery from multi-threaded applications using neural sequence models, in: Advanced Computing and Systems for Security: Volume 14, Springer, 2021, pp. 75–92.
[7] J. Siegmund, N. Peitek, C. Parnin, S. Apel, J. Hofmeister, C. Kästner, A. Begel, A. Bethmann, A. Brechmann, Measuring neural efficiency of program comprehension, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 140–150.
[8] S. C. B. de Souza, N. Anquetil, K. M. de Oliveira, A study of the documentation essential to software maintenance, in: Conference on Design of Communication, ACM, 2005, pp. 68–75.
[9] L. Tan, D. Yuan, Y. Zhou, HotComments: how to make program comments more useful?, in: Conference on Programming Language Design and Implementation (SIGPLAN), ACM, 2007, pp. 20–27.
[10] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, CodeT5+: open code large language models for code understanding and generation, arXiv preprint arXiv:2305.07922 (2023).
[11] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.
[12] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can we predict useful comments in source codes? Analysis of findings from Information Retrieval in Software Engineering track @ FIRE 2022, in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.
[13] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Overview of the IRSE track at FIRE 2022: Information Retrieval in Software Engineering, in: Forum for Information Retrieval Evaluation, ACM, 2022.
[14] J. L. Freitas, D. da Cruz, P. R. Henriques, A comment analysis approach for program comprehension, in: Annual Software Engineering Workshop (SEW), IEEE, 2012, pp. 11–20.
[15] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual features and developer experience, in: International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226.
[16] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: an empirical study at Microsoft, in: Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156.
[17] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463.
[18] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-Mine: a semantic search approach to program comprehension from code comments, in: Advanced Computing and Systems for Security, Springer, 2020, pp. 29–42.
[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[20] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Generative AI for software metadata: overview of the Information Retrieval in Software Engineering track at FIRE 2023, in: Forum for Information Retrieval Evaluation, ACM, 2023.
[21] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional software code representation using BERT and ELMo, in: 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.