<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applications of LLMs for Code Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adwita Deshpande</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Goa</institution>
          ,
          <addr-line>India - 403401</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>The Information Retrieval in Software Engineering (IRSE) track focuses on building automated, machine-learning-based methods to assess code comments. This year, the track consisted of two main tasks: (i) estimating code comment usefulness, and (ii) estimating code quality. The first task asks participants to classify code comments as useful or not useful. We provide a dataset of 9,048 code–comment pairs drawn from open-source, C-based projects hosted on GitHub, together with a second dataset created with human assistance using large language models (LLMs). Twelve university teams completed this task; several designed and ran experiments with both quantitative and qualitative measurements. Comments labeled with LLMs introduce some bias into the prediction model but also help reduce overfitting, yielding more generalizable results. The second sub-task, code quality estimation, was introduced this year. Given a problem description and a set of solutions generated by large language models, the goal is to automatically estimate the functional correctness of each solution. For evaluation, the solutions for each problem are ranked by these estimated probabilities of functional correctness, and the ranking quality is reported with standard ranking performance measures.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Comment Usefulness Prediction</kwd>
        <kwd>Code Quality Estimation</kwd>
        <kwd>BERT</kwd>
        <kwd>GPT-2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Effective software maintenance relies heavily on code comprehensibility, where high-quality comments
play a pivotal role. Well-written, informative comments can significantly improve code readability
and ease the maintenance burden. However, the "usefulness" of a comment is often subjective and
context-dependent. Previous research, such as Bosu et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], has focused on evaluating code review
comments from external tools. Yet, a clear need remains for models that can assess the quality of inline
source code comments, which are fundamental to day-to-day maintenance activities.
      </p>
      <p>
        Building on the work of Majumdar et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], who proposed a framework for classifying comments
based on their contribution to code comprehension, the IRSE track has evolved. The inaugural track at
FIRE 2023 broadened the investigation into comment quality with diverse machine learning methods.
In 2024, the focus shifted to incorporating "silver standard" quality labels generated by Large Language
Models (LLMs).
      </p>
      <p>
        This year, the IRSE track continues this exploration with two distinct sub-tasks. The first task
challenges participants to build a binary classifier for comment usefulness, with an emphasis on using
LLMs to augment the training data. The second, a new pilot sub-task, addresses the emerging challenge
of automatically estimating the quality of LLM-generated code. Given a programming problem, the
goal is to predict the functional correctness of multiple code solutions generated by an LLM. This task
is analogous to query performance prediction (QPP) in traditional IR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where "relevance" is replaced
by "functional correctness."
      </p>
      <p>This paper summarizes the design, participation, and outcomes of both sub-tasks, offering insights
into the current state of automated analysis for software engineering artifacts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The automatic analysis of software metadata is a well-established research area, with numerous tools
developed to extract knowledge from artifacts like runtime traces and code structure [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8 ref9">4, 5, 6, 7, 8, 9,
10, 11, 12</xref>
        ]. Within this domain, assessing comment quality has been a persistent challenge. Early
approaches by Steidl et al. [13] used lexical similarity and comment length to filter out trivial comments.
Later work, such as Rahman et al. [14], focused on predicting the utility of code review comments in
external portals.
      </p>
      <p>
        Majumdar et al. [
        <xref ref-type="bibr" rid="ref2">2, 15</xref>
        ] introduced a more nuanced framework by evaluating comments based on
concepts central to code comprehension, employing both textual and code correlation features. These
foundational efforts paved the way for building predictive models to declutter codebases by identifying
and flagging unhelpful comments. Similarly, [16, 17] tackle this task with the help of LLMs.
      </p>
      <p>The advent of large language models [18] has opened new avenues for this research. It is now
imperative to benchmark automated assessments from models like GPT-3.5 against human judgment.
The IRSE track at FIRE builds directly on these advancements, investigating modern vector space models
[19] and the impact of including LLM-generated labels to enhance classification performance.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Datasets</title>
      <p>We now describe the task and the dataset details of the two sub-tracks (ST) for IRSE.</p>
      <sec id="sec-3-1">
        <title>3.1. ST-1: Comment Usefulness Prediction</title>
        <p>This task requires participants to build a binary classification model to determine if a source code
comment is Useful or Not Useful. Given a comment and its related code snippet, the model must
assess whether the comment’s information would genuinely help a developer understand the code.</p>
        <p>The core of the classification logic rests on avoiding redundancy. A Useful comment must be
relevant and provide insights that are not immediately obvious from the code itself. In contrast, a Not
Useful comment, while potentially relevant, is redundant because it simply restates what the code
already clearly communicates. To support this task, we provide the IRSE track dataset, which contains
9,048 annotated code-comment pairs from GitHub.
</p>
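To make the redundancy criterion concrete, the following toy check (our illustration, not the track's annotation procedure) flags a comment as Not Useful when it contributes no content words beyond the code's own tokens:

```python
import re

STOPWORDS = {"the", "a", "an", "to", "of", "is", "this", "and"}

def is_useful(comment: str, code: str) -> bool:
    # Tokenize both texts into identifier-like words.
    code_tokens = set(re.findall(r"[a-z_]\w*", code.lower()))
    comment_tokens = set(re.findall(r"[a-z_]\w*", comment.lower()))
    # A comment is (crudely) useful only if it contributes at least one
    # content word that the code itself does not already contain.
    informative = comment_tokens - code_tokens - STOPWORDS
    return len(informative) > 0

# Redundant: merely restates the call it annotates.
print(is_useful("free the buffer", "free(buffer);"))               # False
# Useful: states a constraint invisible in this single line of code.
print(is_useful("must run before reconnect to avoid a double free",
                "free(buffer);"))                                   # True
```

Real submissions, of course, learn this distinction from the annotated pairs rather than from a hand-written overlap rule.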
      </sec>
      <sec id="sec-3-2">
        <title>3.2. ST-2: Code Quality Estimation</title>
        <p>The objective of the code quality estimation task is to predict the
functional correctness of code snippets that have been generated by Large Language Models (LLMs)
in response to a programming prompt. For this purpose, we utilize the HumanEval dataset, which
contains 161 programming problems.</p>
        <p>Formally, given a programming task description P and a list of n generated solutions S =
{s_1, . . . , s_n}, a prediction model θ is expected to produce a vector of likelihood scores. Each score
represents the estimated functional correctness for a corresponding solution, such that θ : (P, s_i) ↦ ℝ.</p>
        <p>To evaluate the model, we frame this as a ranking problem, drawing an analogy from Information
Retrieval. The problem description P acts as a query, the solutions S are analogous to a set of retrieved
documents, and the notion of ‘functional correctness’ replaces ‘relevance’. This analogy allows us to
use standard ranking metrics. We primarily report the Normalized Discounted Cumulative Gain at k
(nDCG@k), where we use k = 10 solutions per problem.</p>
        <p>We employ two distinct nDCG measures to capture different aspects of ranking performance:
1. Local nDCG (l-nDCG): This metric assesses the average per-problem ranking quality. For each
problem P_i in the set of all problems 𝒫, we rank its n solutions based on the predicted scores and
compute nDCG@k. The final l-nDCG is the average of these scores over all problems, defined as:
l-nDCG = (1 / |𝒫|) Σ_{i=1}^{|𝒫|} nDCG@k(P_i)
2. Global nDCG (g-nDCG): This metric evaluates the model’s overall ranking ability across the
entire benchmark. We pool all n × |𝒫| solutions from all problems into a single list, rank them
using the predicted scores, and calculate a single nDCG score for this global list, denoted as
nDCG@(n · |𝒫|).</p>
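These two measures can be sketched in a few lines of Python (a minimal illustration assuming binary correctness labels as gains; the helper names are ours, not the track's official scorer):

```python
import math

def dcg(gains):
    # Discounted cumulative gain of a ranked list of gain values.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(scores, gains, k):
    # Rank items by predicted score (descending), truncate at k,
    # and normalize by the DCG of the ideal (gain-sorted) ranking.
    ranked = [g for _, g in sorted(zip(scores, gains), key=lambda t: -t[0])]
    ideal = sorted(gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked[:k]) / denom if denom > 0 else 0.0

def local_ndcg(problems, k=10):
    # l-nDCG: average nDCG@k over problems, where each problem contributes
    # (predicted scores, correctness gains) for its n solutions.
    return sum(ndcg_at_k(s, g, k) for s, g in problems) / len(problems)

def global_ndcg(problems):
    # g-nDCG: pool all n * |P| solutions into one list and compute a
    # single nDCG over the whole pooled ranking.
    scores = [x for s, _ in problems for x in s]
    gains = [x for _, g in problems for x in g]
    return ndcg_at_k(scores, gains, len(scores))
```

A model that ranks the correct solution first within every problem scores 1.0 on l-nDCG; g-nDCG additionally penalizes scores that are not comparable across problems.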
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Participation and Evaluation</title>
      <p>In this section, we detail the participation and methodologies for the two sub-tasks.</p>
      <sec id="sec-4-1">
        <title>ST-1: Comment Usefulness Prediction</title>
        <p>The IRSE track attracted significant interest from the academic community, receiving 12 submissions
from various research labs, a level of participation that reflects the importance of software maintenance.</p>
        <p>Approaches and Methodologies. Participants employed a diverse range of techniques to tackle the
classification problem. The provided training dataset was well-balanced, containing 4015 useful and
4033 not useful comments. For feature representation, teams utilized everything from traditional
methods like TF-IDF and word2vec to context-aware embeddings from models like ELMo and BERT.
The classification models were similarly varied, including support vector machines, logistic regression,
and deep-learning architectures such as RNNs and BERT-based classifiers.</p>
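To give a flavor of the simpler end of this spectrum, here is a self-contained toy pipeline pairing TF-IDF vectors with a nearest-centroid classifier; it is our own stand-in sketch, not any participating team's system, and the training comments in the usage test are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Term frequency weighted by inverse document frequency,
    # represented as sparse {token: weight} dictionaries.
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    return [{t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vecs):
    c = Counter()
    for v in vecs:
        for t, w in v.items():
            c[t] += w / len(vecs)
    return dict(c)

def classify(train_docs, train_labels, test_doc):
    # Nearest-centroid in TF-IDF space: assign the label whose class
    # centroid is most cosine-similar to the test comment.
    vecs = tfidf_vectors(train_docs + [test_doc])
    test_vec = vecs[-1]
    by_label = {}
    for v, y in zip(vecs[:-1], train_labels):
        by_label.setdefault(y, []).append(v)
    return max(by_label, key=lambda y: cosine(centroid(by_label[y]), test_vec))
```

The stronger submissions replace these sparse vectors with contextual embeddings (ELMo, BERT) and the centroid rule with a trained classifier, but the overall representation-then-classification shape is the same.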
        <p>Impact of LLM-Generated Data. A key aspect of this track was evaluating the effect of data
augmentation using LLM-generated "silver standard" labels. The results were nuanced: while some
teams saw a slight improvement in test accuracy, many observed a minor decrease (2-4%). This behavior
is interpreted as a positive outcome, suggesting that the synthetic data acts as a regularizer, reducing the
model’s tendency to overfit to the original training set and potentially leading to better generalization.
Table 2 characterizes the LLM-generated datasets contributed by each team.</p>
      </sec>
      <sec id="sec-4-2">
        <title>ST-2: Code Quality Estimation</title>
        <p>For the pilot task on code quality estimation, we received a submission from a team at IIT-KGP.
Submitted System. The team’s approach was based on the methodology proposed by [20], employing
zero-shot inference with GPT-3.5 Turbo. For a given problem-solution pair, the model was prompted
to generate a likelihood score indicating the solution’s functional correctness. They submitted three
distinct runs by varying the GPT decoder’s temperature setting (t ∈ {0.7, 0.8, 0.9}). The simple and
effective prompt used is shown in Figure 1.</p>
        <p>Figure 1 prompt: “Given the problem and the solution, generate a likelihood score between 0 and 1
indicating how relevant the solution is to the problem. Only state the score.”</p>
        <p>Baseline for Comparison. To provide a reference point, we developed a heuristic-based baseline.
This baseline estimates quality by measuring the variance of semantic similarities across all solution
pairs for a given problem. The core assumption is that for a well-defined problem with a fixed function
signature (as in the HumanEval dataset), correct solutions should be semantically similar. A high
variance, therefore, may indicate a lack of consensus in the generated solutions, correlating with a
higher probability of incorrectness. Semantic similarity was calculated using CLS embeddings from
CodeBERT [21].</p>
        <p>Table 3 shows that the GPT-based zero-shot inference produced better results than the in-house
heuristic-based baseline, which estimates code quality as a measure of the topical diversity among the
LLM-generated solutions.</p>
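The variance heuristic can be sketched as follows; for brevity, a bag-of-tokens cosine stands in for the CodeBERT CLS embeddings used in the actual baseline:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    # Cosine similarity between two token-count vectors (a crude stand-in
    # for similarity between CodeBERT CLS embeddings).
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def consensus_score(solutions):
    # Compute pairwise similarities among all generated solutions for one
    # problem. High variance means low consensus, which the baseline takes
    # as a signal of likely incorrectness; return 1 minus the variance as
    # a problem-level quality estimate.
    sims = [cosine(a, b) for a, b in combinations(solutions, 2)]
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return 1.0 - var
```

A set of identical (fully agreeing) solutions yields the maximum score of 1.0, while a set containing a divergent outlier scores lower.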
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The first sub-task of the IRSE track centered on the automated evaluation of code comment quality.
Across the 12 participating teams, a central finding was the significant benefit of data augmentation
using Large Language Models. The top-performing system’s F1-score increased from 0.853 to 0.892
when its training data was supplemented with LLM-generated labels. This improvement highlights that
synthetic annotations can effectively mitigate model overfitting and enhance generalization. The value
of this approach was further reinforced when a combined dataset, incorporating submissions from all
teams and gold-standard labels, demonstrated even greater performance.</p>
      <p>In the second sub-task, which focused on predicting the functional correctness of LLM-generated code,
the results were equally decisive. Evaluation methods that themselves leveraged LLMs substantially
outperformed embedding-based baseline models. This outcome underscores the advanced capabilities of
LLMs to analyze the deep contextual and functional semantics of code, a task where simpler vector-space
models are less effective.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking. After using this tool, the author(s) reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
      <p>[10] S. Paul, S. Majumdar, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. Das, P. D. Clough, P.
Majumder, Efficiency of large language models to scale up ground truth: Overview of the IRSE track
at Forum for Information Retrieval 2023, in: Proceedings of the 15th Annual Meeting of the Forum
for Information Retrieval Evaluation, 2023, pp. 16–18.</p>
      <p>[11] N. Chatterjee, S. Majumdar, P. P. Das, A. Chakrabarti, Tool assisted agile approach for legacy
application migration, International Journal of System Assurance Engineering and Management
(2025) 1–16.</p>
      <p>[12] S. Majumdar, P. P. Das, Smart knowledge transfer using google-like search, arXiv preprint
arXiv:2308.06653 (2023).</p>
      <p>[13] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: International
Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.</p>
      <p>[14] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual
features and developer experience, in: International Conference on Mining Software Repositories
(MSR), IEEE, 2017, pp. 215–226.</p>
      <p>[15] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine - a semantic search approach to
program comprehension from code comments, in: Advanced Computing and Systems for Security,
Springer, 2020, pp. 29–42.</p>
      <p>[16] A. Deshpande, A. Maji, D. Mondol, P. P. Das, P. D. Clough, S. Majumdar, The code–LLM handshake:
Smarter maintenance through AI, in: Proceedings of the 17th Annual Meeting of the Forum for
Information Retrieval Evaluation, 2025, pp. 9–12.</p>
      <p>[17] A. Mitra, S. Majumdar, A. Mukhopadhyay, P. P. Das, P. D. Clough, P. P. Chakrabarti,
Operationalizing large language models with design-aware contexts for code comment generation, arXiv
preprint arXiv:2510.22338 (2025).</p>
      <p>[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information
Processing Systems 33 (2020) 1877–1901.</p>
      <p>[19] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional
software code representation using BERT and ELMo, in: 2022 IEEE 22nd International Conference
on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.</p>
      <p>[20] T. Y. Zhuo, ICE-Score: Instructing large language models to evaluate code, 2024.
arXiv:2304.14317.</p>
      <p>[21] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou,
CodeBERT: A pre-trained model for programming and natural languages, CoRR abs/2002.08155
(2020). URL: https://arxiv.org/abs/2002.08155. arXiv:2002.08155.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <article-title>Characteristics of useful code reviews: An empirical study at microsoft</article-title>
          ,
          <source>Working Conference on Mining Software Repositories, IEEE</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Automated evaluation of comments to aid software maintenance</article-title>
          ,
          <source>Journal of Software: Evolution and Process</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <article-title>e2463</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>Deep-qpp: A pairwise interaction-based deep learning model for supervised query performance prediction</article-title>
          , in: WSDM, ACM,
          <year>2022</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papdeja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Smartkt: a search framework to assist program comprehension using smart knowledge transfer</article-title>
          ,
          <source>in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Debugging multi-threaded applications using pin-augmented gdb (pgdb)</article-title>
          ,
          <source>in: International conference on software engineering research and practice (SERP)</source>
          . Springer,
          <year>2015</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers</article-title>
          ,
          <source>Innovations in Systems and Software Engineering</source>
          <volume>17</volume>
          (
          <year>2021</year>
          )
          <fpage>289</fpage>
          -
          <lpage>307</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Comprehending c codes with llms: Effective comment generation through retrieval and reasoning</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Calikli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sanyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the “Information Retrieval in Software Engineering” (IRSE) track at Forum for Information Retrieval 2024</article-title>
          ,
          <source>in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Parallelc-assist: Productivity accelerator suite based on dynamic instrumentation</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>73599</fpage>
          -
          <lpage>73612</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>