Generative AI for Software Metadata: Overview of the Information Retrieval in Software Engineering Track at FIRE 2023

Srijoni Majumdar1,2,*,†, Soumen Paul1,*,†, Bhargav Dave6,*,†, Debjyoti Paul8, Ayan Bandyopadhyay4, Samiran Chattopadhyay7, Partha Pratim Das1, Paul D. Clough4,5 and Prasenjit Majumder3,6

1 IIT Kharagpur, West Bengal, India
2 University of Leeds, UK
3 TCG CREST, West Bengal, India
4 TPXimpact, London, UK
5 Sheffield University, Sheffield, UK
6 DA-IICT Gandhinagar, Gujarat, India
7 Jadavpur University, West Bengal, India
8 Indian Statistical Institute, Kolkata, India

Abstract
The Information Retrieval in Software Engineering (IRSE) track aims to develop solutions for the automated evaluation of code comments in a machine learning framework, based on labels produced by humans and by large language models. The track poses a binary classification task: classify a comment as useful or not useful. The dataset consists of 9048 code comments paired with their surrounding code snippets, extracted from open-source C projects on GitHub, together with additional datasets generated individually by the teams using large language models. Overall, 56 experiments were submitted by 17 teams from various universities and software companies. The submissions were evaluated quantitatively using the F1-score and qualitatively based on the types of features developed, the supervised learning models used, and their corresponding hyper-parameters. The labels generated by large language models increase the bias in the prediction model but lead to less over-fitted results.

Keywords
BERT, GPT-2, Stanford POS tagging, neural networks, abstract syntax tree

Forum for Information Retrieval Evaluation (FIRE) 2023, Indian Statistical Institute, Kolkata, India, 15th-18th December 2023
* Corresponding author.
† These authors contributed equally.
majumdar.srijoni@gmail.com (S. Majumdar); soumenpaul165@gmail.com (S. Paul); bhargavdave1@gmail.com (B. Dave); debjyoti93.paul@gmail.com (D. Paul); bandyopadhyay.ayan@gmail.com (A. Bandyopadhyay); samiran.chattopadhyay@jadavpuruniversity.in (S. Chattopadhyay); ppd@cse.iitkgp.ac.in (P. P. Das); p.d.clough@sheffield.ac.uk (P. D. Clough); prasenjit.majumder@gmail.com (P. Majumder)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Assessing comment quality can help to de-clutter code bases and thereby improve code maintainability. Comments can significantly help developers read and comprehend code if they are consistent and informative.

The perception of quality, in terms of the 'usefulness' of the information contained in comments, is relative and is therefore perceived differently depending on the context. Bosu et al. [1] assessed code review comments (logged in a separate tool) in terms of their utility in helping developers write better code, through a detailed survey at Microsoft. A similar quality assessment model is needed to analyse the types of source code comments that help with standard maintenance tasks, but such a model is largely missing. Majumdar et al. [2] proposed a comment quality evaluation framework in which comments are assessed as 'useful', 'partially useful', or 'not useful' based on whether they increase the readability of the surrounding code snippets.
The authors analyse comments for concepts that aid code comprehension, as well as redundancies or inconsistencies between these concepts and the related code constructs, in a machine learning framework for an overall assessment. The concepts were derived through exploratory studies with developers across 7 companies and from a larger community through crowd-sourcing.

The IRSE track at FIRE 2023 extends the work in [2] and empirically investigates comment quality with a larger set of machine learning solvers and features. The task is the classification of comments into two classes: 'useful' and 'not useful'. A 'useful' comment (see Table 1) contains relevant concepts that are not evident from the surrounding code design, and thus increases the comprehensibility of the code. The track evaluates the suitability of various vector space representations of code and comment pairs, together with standard textual features and code-comment correlation links, for analysing comment quality.

Table 1
Useful and Not Useful comments in the context of code comprehension

1. Comment: /* uses png_calloc defined in pngriv.h */
   Code: /* uses png_calloc defined in pngriv.h */ PNG_FUNCTION(png_const_structrp png_ptr) { if (png_ptr == NULL || info_ptr == NULL) return; png_calloc(png_ptr); ...}
   Label: Useful

2. Comment: /* serial bus is locked before use */
   Code: static int bus_reset ( . . . ) /* serial bus is locked before use */ { .. update_serial_bus_lock (bus * busR); }
   Label: Not Useful

3. Comment: // integer variable
   Code: int Delete_Vendor; // integer variable
   Label: Not Useful

The 2023 IRSE track extends this challenge to study the feasibility of using silver-standard quality labels generated by Large Language Models (LLMs) and to understand how such labels augment the classification model in terms of prediction. Developing gold, industry-standard labels for the usefulness of comments relevant to code comprehension in legacy systems is challenging and time-consuming. However, to scale the model and apply it to different languages, more data needs to be generated, which we attempt to do with large language models. The performance of these models in understanding the relations between code and comments provides an approximation of the quality of the generated data and of how it can be used to scale the existing classification model. This approach can also be generalised to any classification model based on software metadata.
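As an illustration of how a participating team might obtain such silver-standard labels, the sketch below queries a chat-style LLM for each code-comment pair. The model name, prompt wording, and client library are assumptions made for illustration; the track did not prescribe a particular LLM or prompting strategy.

```python
# Illustrative sketch only: eliciting a silver-standard label from an LLM
# for a (comment, code) pair. Model name, prompt, and API usage are
# assumptions and not prescribed by the track.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def silver_label(comment: str, code: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the LLM whether the comment helps in comprehending the code."""
    prompt = (
        "You are assessing C code comments for software maintenance.\n"
        "Answer with exactly one word: Useful or NotUseful.\n"
        "A comment is Useful only if it adds information that is relevant to\n"
        "comprehending the code and is not already evident from the code itself.\n\n"
        f"Comment:\n{comment}\n\nSurrounding code:\n{code}\n\nLabel:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labelling
    )
    answer = response.choices[0].message.content.strip()
    return "Useful" if answer.lower().startswith("useful") else "Not Useful"

# Example usage on a row similar to Table 1:
# silver_label("// integer variable", "int Delete_Vendor;")  # -> "Not Useful"
```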
2. Related Work

Software metadata is integral to code maintenance and subsequent comprehension. A significant number of tools [3, 4, 5, 6, 7, 8] have been proposed to aid in extracting knowledge from software metadata such as runtime traces or structural attributes of code.

In terms of mining code comments and assessing their quality, several authors [9, 10, 11] compare the similarity of words in code-comment pairs using the Levenshtein distance and the length of comments to filter out trivial and non-informative comments. Rahman et al. [12] detect useful and non-useful code review comments (logged in review portals) based on attributes identified from a survey conducted with developers at Microsoft [1]. Majumdar et al. [2, 13] proposed a framework to evaluate comments based on concepts that are relevant for code comprehension. They developed textual and code correlation features using a knowledge graph for the semantic interpretation of the information contained in comments. These approaches use semantic and structural features to set up a prediction problem for useful and not useful comments, which can subsequently be integrated into the process of decluttering codebases.

With the advent of large language models [14], it is important to compare the quality assessment of code comments by standard models such as GPT-3.5 or LLaMA with human interpretation. The IRSE track at FIRE 2023 extends the approach proposed in [2] to explore various vector space models [15] and features for the binary classification and evaluation of comments in the context of their use in understanding code. The track also compares the performance of the prediction model with and without the inclusion of GPT-generated labels for the quality of code and comment snippets extracted from open-source software.

3. IRSE Track Overview and Data Set

The following section outlines the task description and the characteristics of the dataset.

3.1. Task Description

Comment Classification: a binary classification task to label a source code comment as Useful or Not Useful, given the comment and its associated code as input.

Input: a code comment with its surrounding code snippet (written in C).
Output: a label (Useful or Not Useful) that characterizes whether the comment helps developers comprehend the associated code.

The output is therefore based on whether the information contained in the comment is relevant and would help to comprehend the surrounding code:

Useful: the comment contains sufficient software development concepts (the comment is Relevant), and these concepts are mostly not present in the surrounding code (the comment is not Redundant); hence the comment is Useful.

Not Useful: the comment contains sufficient software development concepts (the comment is Relevant), but these concepts are mostly present in the surrounding code (the comment is Redundant); hence the comment is Not Useful. It may also be the case that the comment does not contain sufficient software development concepts (the comment is Not Relevant); hence the comment is Not Useful.

It is left to the participants to decide the threshold on how many retrieved concepts make a comment relevant, and how many matches with the surrounding code make a comment redundant; a minimal illustrative sketch of such a rule is given at the end of this section. Relevant comments are those that developers perceive as important for comprehending the associated or surrounding lines of code. The concepts relate to the outline of the algorithm, data structure descriptions, mapping to user interface details, possible exceptions, version details, etc. In the examples of Table 1, the useful comment highlights details about the function that are not evident from the associated code itself.

Dataset: For the IRSE track, we use a set of 9048 comments (from GitHub) with the comment text, the surrounding code snippet, and a label that specifies whether the comment is useful or not. Sample data are shown in Table 1.

• The development dataset contains 8048 rows of comment text, surrounding code snippets, and labels (Useful and Not Useful).
• The test dataset contains 1000 rows of comment text, surrounding code snippets, and labels (Useful and Not Useful).
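To make the relevance and redundancy rule of the task description concrete, the sketch below implements one possible instantiation. The concept vocabulary, the tokenisation, and both thresholds are illustrative assumptions; as noted above, these choices are left to the participants.

```python
# Illustrative sketch of the Useful / Not Useful decision rule described above.
# The concept vocabulary, tokenisation, and thresholds are assumptions.
import re

# Hypothetical set of software-development concepts (algorithm outline,
# data structures, exceptions, versioning, memory allocation, ...).
CONCEPTS = {"algorithm", "calloc", "lock", "exception", "version",
            "thread", "queue", "callback", "buffer", "null"}

def tokens(text: str) -> set:
    # Split on anything that is not a letter, so identifiers such as
    # png_calloc yield the tokens {"png", "calloc"}.
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(comment: str, code: str,
             relevance_min: int = 1, redundancy_max: float = 0.5) -> str:
    comment_toks = tokens(comment)
    concepts = comment_toks & CONCEPTS
    if len(concepts) < relevance_min:
        return "Not Useful"      # not relevant: too few development concepts
    overlap = len(comment_toks & tokens(code)) / max(len(comment_toks), 1)
    if overlap > redundancy_max:
        return "Not Useful"      # relevant but redundant with the code
    return "Useful"              # relevant and not redundant

# Rows 1 and 3 of Table 1:
print(classify("/* uses png_calloc defined in pngriv.h */",
               "PNG_FUNCTION(png_const_structrp png_ptr) { png_calloc(png_ptr); }"))  # Useful
print(classify("// integer variable", "int Delete_Vendor;"))  # Not Useful
```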
4. Participation and Evaluation

IRSE 2023 received a total of 56 experiments from 17 teams for the two tasks. As the track is related to software maintenance, we received participation from companies such as Microsoft, Amazon, American Express, and Bosch Research, along with several research labs of educational institutes. The teams and the details of their submissions are characterised in Table 2.

Evaluation Procedure: Participants submitted the prediction metrics (precision, recall, F1-score) of their classification models on the gold-label dataset (referred to as the Seed dataset) and on the combined dataset (Seed plus the LLM-generated silver-label dataset). We then evaluated the difference in F1-score between the two settings.

Features: Apart from evaluating the prediction metrics, we analysed the types of features the teams used in their machine learning pipelines. The teams performed routine pre-processing and retained only the significant words or characters for both the code and the comment. Some teams also used morphological features of a comment, such as its length, the ratio of significant words, parts-of-speech characteristics, or the occurrence of words from an enumerated set, as textual features. To correlate code and comments and detect redundancies, the teams mostly used grep-like string matching to find similar words.

Table 2
Characterization of the submissions: test data predictions (Precision / Recall / F1-score); teams listed more than once submitted multiple experiments

Affiliation                  | Seed data               | Seed data + LLM-generated data
DSTI, France                 | 0.8326 / 0.8626 / 0.8473 | 0.844 / 0.8682 / 0.8559
DSTI, France                 | 0.8948 / 0.8738 / 0.884  | 0.9 / 0.8707 / 0.885
DSTI, France                 | 0.8807 / 0.8822 / 0.8813 | 0.8871 / 0.8839 / 0.8854
SSN-1 (RAM)                  | 0.8 / 0.8 / 0.8          | 0.8021 / 0.81 / 0.73
SSN-1 (RAM)                  | 0.72 / 0.71 / 0.74       | 0.7 / 0.73 / 0.74
SSN-1 (RAM)                  | 0.788 / 0.7363 / 0.7613  | 0.89 / 0.8802 / 0.8846
SSN-2 (Aloy)                 | 0.7994 / 0.7994 / 0.7994 | 0.89 / 0.8795 / 0.8841
SSN-2 (Aloy)                 | 0.7993 / 0.9352 / 0.8619 | 0.839 / 0.9199 / 0.8776
SSN-2 (Aloy)                 | 0.7842 / 0.8453 / 0.8136 | 0.8154 / 0.8823 / 0.8475
IIT (ISM) Dhanbad            | 0.7572 / 0.8637 / 0.807  | 0.7785 / 0.9003 / 0.835
IIT (ISM) Dhanbad            | 0.92 / 0.96 / 0.94       | 0.92 / 0.97 / 0.97
IIT (ISM) Dhanbad            | 0.7916 / 0.8446 / 0.8172 | 0.7886 / 0.847 / 0.8167
SSN-3 (Black)                | 0.763 / 0.8696 / 0.813   | 0.7655 / 0.8724 / 0.8154
SSN-3 (Black)                | 0.705 / 0.9387 / 0.8052  | 0.6994 / 0.9041 / 0.7887
SSN-3 (Black)                | 0.7292 / 0.856 / 0.7875  | 0.7374 / 0.8533 / 0.7911
Microsoft - American Express | 0.7902 / 0.8016 / 0.7949 | 0.7908 / 0.8014 / 0.7952
DDU-1                        | 0.895 / 0.891 / 0.893    | 0.890 / 0.894 / 0.892
DDU-2                        | 0.875 / 0.872 / 0.874    | 0.870 / 0.875 / 0.880
IIT KGP-1                    | 0.8283 / 0.804 / 0.8141  | 0.8322 / 0.8086 / 0.8185
SRM                          | 0.8283 / 0.804 / 0.8141  | 0.8178 / 0.7906 / 0.8013
IIT KGP-2                    | 0.78 / 0.85 / 0.8        | 0.77 / 0.85 / 0.8
DA-IICT                      | 0.81 / 0.8 / 0.8         | 0.58 / 0.58 / 0.58
IIT Goa                      | 0.6087 / 0.6526 / 0.6321 | 0.6114 / 0.6598 / 0.6403
TCS                          | 0.778 / 0.753 / 0.74     | 0.645 / 0.6598 / 0.650
IIT KGP-3                    | 0.631 / 0.645 / 0.639    | 0.6114 / 0.6598 / 0.631
Amazon                       | 0.659 / 0.672 / 0.666    | 0.656 / 0.635 / 0.645

Vector Space Representations: Code and comments belong to different semantic granularities, which are unified through a vector space representation. The participants used various pre-trained embeddings to generate word vectors, such as one-hot encodings, tf-idf based vectors, word2vec, or context-aware embeddings like ELMo and BERT. Each of the employed embedding models was trained or fine-tuned on software development corpora.
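The sketch below illustrates the kind of pipeline the submissions describe: a pre-trained encoder produces a joint vector for each code-comment pair, a standard classifier is trained on the Seed data and on the augmented data, and the resulting F1-scores are compared. The encoder name, the classifier, and the concatenation scheme are assumptions for illustration; the teams used a variety of embeddings and models.

```python
# Illustrative pipeline sketch: embed code-comment pairs with a pre-trained
# encoder and compare F1-scores on Seed data vs. Seed + LLM-generated data.
# Encoder, classifier, and data layout are assumptions, not the track's method.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed generic encoder

def embed(pairs):
    # One vector per sample: comment and surrounding code concatenated.
    return encoder.encode([f"{comment} [SEP] {code}" for comment, code in pairs])

def evaluate(train_pairs, train_labels, test_pairs, test_labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_pairs), train_labels)
    pred = clf.predict(embed(test_pairs))
    p, r, f1, _ = precision_recall_fscore_support(
        test_labels, pred, average="binary", pos_label="Useful")
    return p, r, f1

# seed_*: gold-labelled data; llm_*: silver labels generated by an LLM.
# f1_seed     = evaluate(seed_pairs, seed_labels, test_pairs, test_labels)[2]
# f1_combined = evaluate(seed_pairs + llm_pairs, seed_labels + llm_labels,
#                        test_pairs, test_labels)[2]
# print(f"F1 difference with LLM augmentation: {f1_combined - f1_seed:+.4f}")
```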
Results: The participants were able to achieve a slight increase (in the range of 2%-4%) in the test prediction metrics, while in many cases the performance decreased. The statistics of the LLM-generated data submitted by each team are shown in Table 3. However, the increase in bias due to the incorporation of the silver-standard data reduces the over-fitting of the models.

Table 3
Characterization of the LLM-generated datasets

Team name                    | Total entries | Useful entries | Not useful entries
DSTI, France                 | 421   | 412  | 9
SSN-1 (RAM)                  | 1238  | 740  | 497
SSN-2 (Alloy)                | 1510  | 24   | 1486
IIT (ISM) Dhanbad            | 199   | 182  | 17
SSN-3 (Black)                | 738   | 80   | 658
Microsoft - American Express | 233   | 92   | 141
DDU-1                        | 8588  | 4649 | 3939
DDU-2                        | 332   | 311  | 21
IIT KGP-1                    | 334   | 309  | 25
SRM                          | 217   | 196  | 21
IIT KGP-2                    | 263   | 130  | 133
DA-IICT                      | 150   | 65   | 85
IIT Goa                      | 543   | 460  | 83
TCS                          | 282   | 61   | 221
IIT KGP-3                    | 570   | 450  | 120
IIT KGP-3                    | 412   | 345  | 67

5. Conclusions

The IRSE 2023 track empirically investigates the feasibility of augmenting existing classification models with datasets whose labels are generated by LLMs. A total of 17 teams participated and submitted 56 experiments that used various types of machine learning models, embedding spaces, features, and different LLMs to generate data. The LLM-generated labels reduce the over-fitting of the overall classification model and also improve the F1-score when the combined data from all participants is used to augment the existing data carrying gold-standard labels from industry practitioners.

References

[1] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at Microsoft, in: Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156.
[2] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463.
[3] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, SmartKT: a search framework to assist program comprehension using smart knowledge transfer, in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2019, pp. 97–108.
[4] N. Chatterjee, S. Majumdar, S. R. Sahoo, P. P. Das, Debugging multi-threaded applications using Pin-augmented GDB (PGDB), in: International Conference on Software Engineering Research and Practice (SERP), Springer, 2015, pp. 109–115.
[5] S. Majumdar, N. Chatterjee, S. R. Sahoo, P. P. Das, D-Cube: tool for dynamic design discovery from multi-threaded applications using Pin, in: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2016, pp. 25–32.
[6] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers, Innovations in Systems and Software Engineering 17 (2021) 289–307.
[7] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, DCube_NN: tool for dynamic design discovery from multi-threaded applications using neural sequence models, Advanced Computing and Systems for Security: Volume 14 (2021) 75–92.
[8] M. P. O'Brien, Software comprehension: a review and research direction, Technical Report, Department of Computer Science & Information Systems, University of Limerick, Ireland, 2003.
[9] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.
[10] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can we predict useful comments in source codes? Analysis of findings from Information Retrieval in Software Engineering track @ FIRE 2022, in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.
[11] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder,
Overview of the IRSE track at FIRE 2022: Information retrieval in software engineering, in: Forum for Information Retrieval Evaluation, ACM, 2022.
[12] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual features and developer experience, in: International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226.
[13] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-Mine: a semantic search approach to program comprehension from code comments, in: Advanced Computing and Systems for Security, Springer, 2020, pp. 29–42.
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[15] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional software code representation using BERT and ELMo, in: 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.