A study of the impact of generative AI-based data augmentation on software metadata
classification

Tripti Kumari1,∗, Chakali Sai Charan1 and Ayan Das1
1 Department of Computer Science and Engineering, Indian Institute of Technology (ISM) Dhanbad,
Jharkhand, 826004, India

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
∗ Corresponding author.
22dr0264@iitism.ac.in (T. Kumari); 22mt0348@iitism.ac.in (C. S. Charan); ayandas@iitism.ac.in (A. Das)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract
This paper presents the system submitted by the team from IIT (ISM) Dhanbad to the FIRE IRSE 2023
shared task 1 on automatic usefulness prediction of code-comment pairs, and studies the impact of
augmenting the original base data with Large Language Model (LLM)-generated data. We have
developed a framework in which we train a machine learning-based model on neural contextual
representations of comments and their corresponding code to predict the usefulness of code-comment
pairs, and we analyze its performance when LLM-generated data is added to the base data. In the
official assessment, our system achieves a 4% increase in F1-score over the baseline, along with a
positive assessment of the quality of the generated data.

Keywords
Comment-code pairs, LLM-generated data, Support vector machine, ELMo




1. Introduction

In the rapidly developing world of software development, comments play a crucial role in
enhancing the readability and maintainability of the code in a code base [1]. Before executing
any software maintenance-related task, or making any kind of modification or enhancement,
developers usually spend a significant amount of time reading and understanding the code.
This process is very time-consuming, particularly in the case of source code that implements
complex functionalities. So, it is common practice among developers to write comments for
code snippets to enhance the comprehensibility of the code. The comments are expected to
be helpful in capturing the complete structure and functionality of the code. This makes
commenting one of the most commonly employed documentation methods for software
maintenance tasks [2], on condition that the comments are elaborate and expressive enough to
capture the functionality of the programs and that the quality of the comments is maintained
throughout the code base.
   However, sometimes the comments themselves may be incomplete, inconsistent, or difficult
to relate to the source code [3]. Such comments may result in wasted effort in the interpretation
of the corresponding code, and may even result in a complete misinterpretation of the purpose
of the program.
   Thus, understanding the relevance of a comment to a piece of code is crucial before actually
using it to understand the purpose of the program. However, given the volume of source
code in a standard software project, it is a laborious task to manually verify the usefulness of
each comment to its corresponding code. Thus, a system that can automatically predict the
usefulness of a comment to its related code snippet may significantly speed up the process of
source code analysis. Furthermore, if the system predicts a comment to be not useful, the
comment may be rewritten to make it more relevant and informative.

Table 1
Samples of the original base data
    Comment                   Surrounding code context        Label        Explanation
    /*deal with it later*/    -1. /*deal with it later*/1.    Not Useful   The code does not exist for this
                                                                           comment, so the comment is
                                                                           Not Useful.
    /*switch on*/             -1.f(toggle)                    Useful       The comment correctly describes
                              /*switch on*/1.else                          the code and is hence Useful.
   Recently, artificial intelligence-based interactive systems, such as ChatGPT [4], are being
widely used to generate text for different real-time purposes. These systems are also being used
by programmers to generate comments for their programs to save time and effort. However,
no work has been reported in the literature on the quality of the comments generated by such
systems. Given a code-comment pair, these systems may also be used to predict the usefulness
of the comment to the corresponding code. However, the accuracy of the predictions of such
AI-based systems has not been reported in the literature. Thus, it is an open and interesting
research area to explore the efficacy of such AI-based systems in automatic comment generation
or in predicting the usefulness of a comment for a given code snippet.
   Task 1 of the FIRE 2023 IRSE shared task [5] mainly focuses on two subtasks. The first
subtask is comment classification. It involves automatically predicting the usefulness of a given
comment to the corresponding source code snippet. It is a binary classification task that requires
us to develop a system which takes a source code snippet and its associated comment as input
and automatically classifies the comment corresponding to the source code as "Useful" or
"Not Useful". The overview paper of IRSE 2022 describes the shared task in detail [6]. We have
proposed a system which takes a code-comment pair as input, generates representations of the
code and the comment using a pre-trained neural encoder, and uses these representations to
predict whether the comment is relevant to the code.
   The second subtask is to study the impact of large language models on comment classification.
In this subtask, the participants are required to augment the base data provided for the first
subtask with additional data and to carry out a comparative study of the performance of the
models trained on the base data and on the augmented data. The additional data for augmentation
is expected to comprise code-comment pairs obtained from different sources, with their usefulness
labels predicted using large language models (LLMs) [7]. For this purpose, we manually collected
code-comment pairs from different data resources such as GitHub, Stack Overflow, computer
vision, Curl, etc., and then queried ChatGPT [4] with each code-comment pair to get the
usefulness label. We augmented the original seed data with this data and trained several models
using different combinations of the additional dataset. We carried out a set of experiments to
study the effect of data augmentation on system performance.
   This paper reports a comprehensive description of the proposed system submitted to FIRE
IRSE 2023 for Task 1 [5]. We have conducted a set of experiments and trained machine learning
models that take the representations of a source code snippet and its corresponding comment
as input. These trained models predict the relevance of the comment to the associated source
code on the original base data and on the augmented datasets.
   The remaining sections are arranged as follows. Section 2 presents the related work, where
we survey previous studies. Section 3 presents a description of the different types of LLM-
generated datasets and the data made available for the shared task, along with a brief description
of the data representation and the specification of the system submitted for the shared task.
Section 4 presents a comprehensive analysis of the results of different runs on the datasets. In
Section 5, we conclude our work on the shared task.


2. Related work
We have surveyed work on the usefulness of code-comment pairs as well as on the impact of
ChatGPT-generated [4] comments, and identified some important studies.
   Majumdar et al. [8] presented a survey based on the IRSE track (FIRE 2022) and developed
solutions for the automated evaluation of code comments, classifying comments as Useful or
Not Useful. Rahman et al. [9] carried out a comparative study on usefulness and developed
RevHelper for automatic usefulness prediction. Soni et al. [10] developed an automatic text
classifier to identify ChatGPT-generated summaries. Shinyama et al. [1] proposed a C4.5-based
model for code-comment analysis. Naili et al. [11] presented a comparative study of word
embedding methods for topic segmentation. Majumdar et al. [12] proposed COMMENT-MINE,
a semantic search architecture that is mainly used to extract knowledge about the design,
implementation, and development of software in the form of a knowledge graph. Majumdar
et al. [13] developed features to semantically relate comments to concepts based on categories
of usefulness, and used neural networks (NN) to predict the usefulness of code-comment pairs.
Majumdar et al. [14] investigated contextualized embeddings for code search and classification
and developed a system for generating contextualized representations of code and comments
by training ELMo from scratch.


3. Experiment design
This section presents a detailed discussion of the proposed system developed for automatically
predicting the relevance of code-comment pairs and the experiments carried out on the different
combinations of the data sets. In Subsection 3.1 we present the details of the prediction system.
The details for the datasets used for the experiments are reported in Subsection 3.2.
3.1. System description

Our prediction system is a supervised machine-learning-based system that consists of a support
vector machine (SVM) [15] trained on distributed representations of the code-comment pairs.
It takes the representation of a code-comment pair as input and predicts whether the comment
is relevant to the corresponding code snippet.
   The distributed representations of code-comment pairs are obtained from a pre-trained
ELMo-based model [16]. We have used the ELMo code1 provided by the Information Retrieval
in Software Engineering (IRSE) team. For a given code-comment pair, we separately pass the
code and the comment, each as a sequence of tokens, to the ELMo model [16]. For an input
sequence, the ELMo model [16] generates a 200-dimensional contextual embedding for each
space-separated token in the input sequence. The representation of an input sequence is then
obtained by taking the mean of all the token representations, so the representation of a given
input sequence is a 200-dimensional embedding. The 200-dimensional representations of the
code and the comment sequences are then concatenated into a joint 400-dimensional
representation.
   During training, we generate the representations of all the code-comment pairs in the training
data and use them to train the support vector machine with the radial basis function (RBF)
kernel [17]. During testing, the model saved during the training phase takes the representation
of a code-comment pair as input and predicts the usefulness of the comment.
   The details of the working of our prediction system are presented in Figure 1.




Figure 1: Block diagram of proposed model
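
To make the pipeline concrete, below is a minimal, self-contained sketch of the representation
and classification steps of Subsection 3.1. It assumes the per-token 200-dimensional embeddings
come from the IRSE-provided ELMo model; here they are simulated with random vectors so the
sketch runs standalone, and all function and variable names are our own.

```python
import numpy as np
from sklearn.svm import SVC

EMB_DIM = 200  # per-token embedding size produced by the ELMo model

def pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool an (n_tokens, 200) matrix into one 200-d sequence vector."""
    return token_embeddings.mean(axis=0)

def pair_features(code_tokens: np.ndarray, comment_tokens: np.ndarray) -> np.ndarray:
    """Concatenate the pooled code and comment vectors into one 400-d feature."""
    return np.concatenate([pool(code_tokens), pool(comment_tokens)])

rng = np.random.default_rng(0)
# Simulated corpus: 100 code-comment pairs with varying token counts
# (stand-ins for the real ELMo token embeddings).
X = np.stack([
    pair_features(rng.normal(size=(rng.integers(5, 40), EMB_DIM)),
                  rng.normal(size=(rng.integers(3, 15), EMB_DIM)))
    for _ in range(100)
])
y = rng.integers(0, 2, size=100)  # 1 = "Useful", 0 = "Not Useful"

clf = SVC(kernel="rbf")       # RBF-kernel SVM, as in Subsection 3.1 [15, 17]
clf.fit(X[:80], y[:80])       # train on 80% of the pairs
print(clf.predict(X[80:]))    # predict usefulness for the held-out 20%
```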



3.2. Data description

Here we present a description of the different combinations of datasets used in our experiments.

3.2.1. Original data

The original data for Task 1 [5] was shared by the FIRE IRSE 2023 organizers. It contains 11,452
rows of comments, written in text format, with their surrounding source code snippets and their
class labels, i.e., if a comment is relevant to the corresponding source code, the pair is labeled as
"Useful"; otherwise, it is labeled as "Not Useful". A total of 4,389 code-comment pairs are labeled
as "Not Useful" and 7,063 code-comment pairs are labeled as "Useful". Sample examples are
shown in Table 1 and Table 2.

1 ELMo code link to generate word embeddings: https://github.com/SMARTKT/WordEmbeddings

Table 2
Samples of the original base dataset
  Comment                    Surrounding code context                   Label
  /*upper 8 bit CLASS*/      -7.if(dot) -6.host p++;                    Useful
  /*need expand*/            -1.png set background fixed(png ptr,c;     Not Useful


3.2.2. LLM-generated data

For the second subtask, we manually collected a total of 510 code-comment pairs from different
data resources such as GitHub, Stack Overflow, computer vision, Curl, etc., and then queried
ChatGPT [4] with each code-comment pair to get the usefulness label. We then augmented the
original base data with this data, as seen in Table 3, and re-trained the model on the augmented
data.
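
The paper does not automate this step in any stated way; purely as an illustration, a labeling
query could be scripted with the OpenAI Python client (openai>=1.0) as sketched below. The
prompt wording, model choice, and function name are our own assumptions, not the exact
query we used.

```python
# Hypothetical sketch of querying ChatGPT for a usefulness label.
# Assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY
# environment variable; prompt wording and model choice are illustrative.
from openai import OpenAI

client = OpenAI()

def label_pair(code: str, comment: str) -> str:
    prompt = (
        "Given the following code snippet and comment, answer with exactly "
        "'Useful' or 'Not Useful' depending on whether the comment correctly "
        f"describes the code.\n\nCode:\n{code}\n\nComment:\n{comment}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels
    )
    return response.choices[0].message.content.strip()

print(label_pair("if (dot) host_p++;", "/*upper 8 bit CLASS*/"))
```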

3.2.3. Extra-generated data

We have experimented with another set of data, where we randomly extracted a subset of
250 "Useful" and 250 "Not Useful" code-comment pairs from the original seed data and altered
their labels using the following strategy. We converted the "Useful" pairs into "Not Useful" pairs
by randomly shuffling the comments, ensuring that at the end of the shuffling none of the code
snippets retained its original comment; such pairs were labeled as "Not Useful". To convert the
"Not Useful" pairs into "Useful" pairs, we queried ChatGPT [4] with the code snippets and got
comments synthetically generated; this set of code and synthetically generated comment pairs
was labeled as "Useful". Table 3 summarizes the datasets.
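
The comment-shuffling step can be made precise; the following is a minimal sketch (function
and variable names are our own) that permutes the comments among the sampled "Useful"
pairs until no code keeps its original comment, then relabels every resulting pair.

```python
import random

def shuffle_to_not_useful(pairs, seed=0):
    """pairs: list of (code, comment) tuples labeled "Useful".
    Returns the same codes with mismatched comments, relabeled "Not Useful".
    Assumes at least two pairs with distinct comments."""
    rng = random.Random(seed)
    comments = [comment for _, comment in pairs]
    while True:
        rng.shuffle(comments)
        # Re-shuffle until no comment lands back on its original code.
        if all(new != old for new, (_, old) in zip(comments, pairs)):
            return [(code, new, "Not Useful")
                    for (code, _), new in zip(pairs, comments)]

sample = [("f(toggle) ... else", "/*switch on*/"),
          ("if (dot) host_p++;", "/*upper 8 bit CLASS*/")]
print(shuffle_to_not_useful(sample))
```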
   For the sake of convenience, we refer to the original data, the LLM-generated data, and the
extra-generated data as Data1, Data2, and Data3 respectively, as shown in Table 3.

Table 3
Different types of data with their sizes after the train-test split
             Dataset description               Train dataset size   Test dataset size
             Original data: Data1              9162                 2290
             LLM-generated data: Data2         408                  102
             Extra-generated data: Data3       400                  100

   We followed these steps to split the original and the LLM-generated [7] data: (i) we separated
the "Useful" and the "Not Useful" code-comment pairs in the data into two groups; (ii) we then
split each group in an 80:20 ratio.
   Thus, the training data comprised a combination of 80% of the code-comment pairs with
"Useful" labels and 80% of the code-comment pairs with "Not Useful" labels. The selection of the
80% of the samples in both cases was done randomly. The test data consisted of the remaining
20% of the samples from both groups.
   We followed the same splitting procedure for the "extra-generated data" as well. The train
and test split sizes of the datasets are shown in Table 3.
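
This class-wise 80:20 split is equivalent to a stratified split on the usefulness label; below is a
minimal sketch using scikit-learn, where the data and sizes are toy stand-ins for the real
code-comment pairs.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the code-comment pairs and their usefulness labels.
pairs = [f"pair_{i}" for i in range(100)]
labels = ["Useful"] * 60 + ["Not Useful"] * 40

train_pairs, test_pairs, train_labels, test_labels = train_test_split(
    pairs, labels,
    test_size=0.20,    # 20% of each class held out for testing
    stratify=labels,   # keep the Useful / Not Useful proportions in both splits
    random_state=42,   # make the random 80% selection reproducible
)
print(len(train_pairs), len(test_pairs))  # 80 20
```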

3.3. Combination of all datasets

We have created different combinations of the datasets described in Subsections 3.2.1, 3.2.2, and
3.2.3. The combinations, with their train and test split sizes, are shown in Table 4. The purpose
is to create new datasets in order to understand the impact of the original data, the LLM-generated
data [7], the extra-generated data, and different combinations thereof on system performance.
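
As an illustration, a minimal sketch of building these combinations, assuming each source split
is held in a pandas DataFrame (the frames and names below are toy stand-ins):

```python
import pandas as pd

train_data1 = pd.DataFrame({"code": ["c1"], "comment": ["m1"], "label": ["Useful"]})
train_data2 = pd.DataFrame({"code": ["c2"], "comment": ["m2"], "label": ["Not Useful"]})

def combine(*splits: pd.DataFrame) -> pd.DataFrame:
    """Concatenate per-source splits. Each source is split before combining,
    so no test example can leak into any training set."""
    return pd.concat(splits, ignore_index=True)

# Dataset2 (Data1 + Data2): train and test portions are combined separately,
# e.g. 9162 + 408 = 9570 training rows and 2290 + 102 = 2392 test rows.
train_dataset2 = combine(train_data1, train_data2)
print(len(train_dataset2))
```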

Table 4
Different combinations of datasets
         Datasets                            Train dataset size     Test dataset size
         Dataset1: Data1                     9162                   2290
         Dataset2: Data1+Data2               9570                   2392
         Dataset3: Data1+Data3               9562                   2390
         Dataset4: Data1+Data2+Data3         9970                   2492




4. Result analysis

In this section, we present a discussion of our results. We performed four different experiments
with different combinations of test datasets, as shown in Table 5 and Table 6.

4.1. Run1: Original data (Dataset1)

We performed experiments with the original base data of size 11,452. The original data is split
in the ratio of 80:20 into train and test sets of sizes 9162 and 2290 respectively, as shown in
Table 4. We used only the test portion of the original base data (Dataset1) for usefulness
prediction.

4.2. Run2: Combination of original data and LLM-generated data (Dataset2)

Our second experiment was carried out with the original data and the LLM-generated data [7].
A total of 510 LLM-generated pairs are split in the ratio of 80:20 into train and test sets of sizes
408 and 102 respectively. The total training and test sizes of Dataset2 are thus 9570 and 2392.
To analyze the impact of the LLM-generated data [7] on the proposed system's performance,
we combined the test portions of the original data and the LLM-generated data [7] (Dataset2),
as shown in Table 4.
Table 5
Experiment analysis part-1
    Experiments              Datasets              Algorithm              Accuracy (%)
    Run1                     Dataset1              ELMo, SVM              92.18
    Run2                     Dataset2              ELMo, SVM              92.76
    Run3                     Dataset3              ELMo, SVM              90.696
    Run4                     Dataset4              ELMo, SVM              92.47


4.3. Run3: Original data and extra-generated data (Dataset3)

Our third experiment is with the combination of the original base data and the extra-generated
data. A total of 500 extra-generated pairs are split in the ratio of 80:20 into train and test sets of
sizes 400 and 100 respectively. The total training and test sizes of Dataset3 are thus 9562 and
2390.

4.4. Run4: Original data, LLM-generated data, and extra-generated data
     (Dataset4)

We performed one more experiment with the combination of the original data, the LLM-generated
data [7], and the extra-generated data. The total training and test sizes of Dataset4 are 9970 and
2492.

4.5. Result summary

The overall accuracies corresponding to the experiments carried out for Run1 (Subsection 4.1),
Run2 (Subsection 4.2), Run3 (Subsection 4.3), and Run4 (Subsection 4.4) are 92.18%, 92.76%,
90.696%, and 92.47% respectively. The results are summarized in Table 5. In Run3 (Dataset3),
the accuracy decreases, while in the other runs (with Dataset1, Dataset2, and Dataset4) we obtain
almost the same accuracies, with slight variations in the decimal fractions.
   To evaluate the performance of the system with respect to the "Useful" class, we have used
precision, recall, and F1-score as evaluation metrics. The results are summarized in Table 6. We
carried out the different runs using their corresponding "Useful"-class test sets and evaluated the
Useful precision, recall, and F1-score. In Run1 (Useful dataset size 1465) and Run4 (Useful dataset
size 1578), we obtain the same precision, recall, and F1-score. In Run2 (Useful dataset size 1542),
the recall is slightly higher than in the other runs while the other metrics remain the same, and
in Run3 (Useful dataset size 1470), all metrics are slightly lower than in the other runs.
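
For reference, the "Useful"-class metrics reported in Table 6 can be computed with scikit-learn;
a minimal sketch, where the label lists are toy stand-ins for the real predictions:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = ["Useful", "Useful", "Not Useful", "Useful", "Not Useful"]
y_pred = ["Useful", "Not Useful", "Not Useful", "Useful", "Not Useful"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred,
    pos_label="Useful",   # score the "Useful" class only
    average="binary",
)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```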


Table 6
Experiment analysis part-2 with "Useful" class
  Experiments       Useful dataset size    Useful precision   Useful recall   Useful F1-score
  Run1              1465                   0.92               0.96            0.94
  Run2              1542                   0.92               0.97            0.94
  Run3              1470                   0.89               0.96            0.93
  Run4              1578                   0.92               0.96            0.94


5. Conclusion

In this paper, we presented the system we submitted for Task 1 of the IRSE FIRE 2023 shared
task. The first subtask is to build a system that takes a code-comment pair as input to an encoder,
which generates an embedding that is passed to a classifier; the classifier determines whether the
comment corresponding to the code is "Useful" or "Not Useful". The second subtask is to make
predictions after augmenting the original seed data with LLM-generated data. We have also
analyzed the impact of the augmented dataset (original base data plus LLM-generated data) on
model performance. All performance evaluation metrics are reported in Table 5 and Table 6.
According to the declared results, our system achieves a 4% increase in F1-score over the
baseline, along with a positive assessment of the quality of the generated data.


References
 [1] Y. Shinyama, Y. Arahori, K. Gondow, Analyzing code comments to boost program compre-
     hension, in: 2018 25th Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2018,
     pp. 325–334.
 [2] S. C. B. de Souza, N. Anquetil, K. M. de Oliveira, A study of the documentation essential
     to software maintenance, Association for Computing Machinery (2005) 68–75. URL:
     https://doi.org/10.1145/1085313.1085331. doi:10.1145/1085313.1085331.
 [3] L. Tan, D. Yuan, G. Krishna, Y. Zhou, /*icomment: Bugs or bad comments?*/, SIGOPS
     Oper. Syst. Rev. 41 (2007) 145–158. URL: https://doi.org/10.1145/1323293.1294276.
     doi:10.1145/1323293.1294276.
 [4] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
 [5] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D.
     Clough, P. Majumder, Generative ai for software metadata: Overview of the information
     retrieval in software engineering track at fire 2023, in: Forum for Information Retrieval
     Evaluation, ACM, 2023.
 [6] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. D. Clough, S. Chattopadhyay, P. Majumder,
     Overview of the IRSE track at FIRE 2022: Information Retrieval in Software Engineering,
     in: Forum for Information Retrieval Evaluation, ACM, 2022.
 [7] T. Gao, H. Yen, J. Yu, D. Chen, Enabling large language models to generate text with
     citations, arXiv preprint arXiv:2305.14627 (2023).
 [8] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder,
     Can we predict useful comments in source codes?-analysis of findings from information
     retrieval in software engineering track@ fire 2022, in: Proceedings of the 14th Annual
     Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.
 [9] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments
     using textual features and developer experience, in: 2017 IEEE/ACM 14th International
     Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226.
[10] M. Soni, V. Wade, Comparing abstractive summaries generated by ChatGPT to
     real summaries through blinded reviewers and text classification algorithms, 2023.
     arXiv:2303.17650.
[11] M. Naili, A. H. Chaibi, H. H. B. Ghezala, Comparative study of word embedding methods
     in topic segmentation, Procedia computer science 112 (2017) 340–349.
[12] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine—a semantic search
     approach to program comprehension from code comments, Advanced Computing and
     Systems for Security: Volume Twelve (2020) 29–42.
[13] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation
     of comments to aid software maintenance, Journal of Software: Evolution and Process 34
     (2022) e2463.
[14] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-
     dimensional software code representation using bert and elmo, in: 2022 IEEE 22nd
     International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022,
     pp. 763–774.
[15] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (1995) 273–297.
[16] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep
     contextualized word representations, Association for Computational Linguistics (2018)
     2227–2237. URL: https://aclanthology.org/N18-1202. doi:10.18653/v1/N18-1202.
[17] K. Thurnhofer-Hemsi, E. López-Rubio, M. A. Molina-Cabello, K. Najarian, Radial ba-
     sis function kernel optimization for support vector machine classifiers, arXiv preprint
     arXiv:2007.08233 (2020).