<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applications of LLMs for Code Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adwita Deshpande</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Goa</institution>
          ,
          <addr-line>India - 403401</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>The Information Retrieval in Software Engineering (IRSE) track focuses on building automated, machine-learning-based methods to assess code comments. This year, the track consisted of two main tasks: (i) estimating code comment usefulness, and (ii) estimating code quality. The first task asks participants to classify code comments as useful or not useful. We provide a dataset of 9,048 code–comment pairs drawn from open-source, C-based projects hosted on GitHub, together with a second dataset created with human assistance using large language models (LLMs). Twelve university teams completed this task; several designed and ran experiments with both quantitative and qualitative measurements. Comments labeled with LLMs introduce some bias into the prediction model but also help reduce overfitting, yielding more generalizable results. The second sub-task, code quality estimation, was introduced this year. Given a problem description and a set of solutions generated by large language models, the goal is to automatically estimate the functional correctness of each solution. For evaluation, the solutions for each problem are ranked by these estimated probabilities of functional correctness, and the ranking quality is reported with standard ranking performance measures.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Comment Usefulness Prediction</kwd>
        <kwd>Code Quality Estimation</kwd>
        <kwd>BERT</kwd>
        <kwd>GPT-2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Effective software maintenance relies heavily on code comprehensibility, where high-quality comments
play a pivotal role. Well-written, informative comments can significantly improve code readability
and ease the maintenance burden. However, the "usefulness" of a comment is often subjective and
context-dependent. Previous research, such as Bosu et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], has focused on evaluating code review
comments from external tools. Yet, a clear need remains for models that can assess the quality of inline
source code comments, which are fundamental to day-to-day maintenance activities.
      </p>
      <p>
        Building on the work of Majumdar et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], who proposed a framework for classifying comments
based on their contribution to code comprehension, the IRSE track has evolved. The inaugural track at
FIRE 2023 broadened the investigation into comment quality with diverse machine learning methods.
In 2024, the focus shifted to incorporating "silver standard" quality labels generated by Large Language
Models (LLMs).
      </p>
      <p>
        This year, the IRSE track continues this exploration with two distinct sub-tasks. The first task
challenges participants to build a binary classifier for comment usefulness, with an emphasis on using
LLMs to augment the training data. The second, a new pilot sub-task, addresses the emerging challenge
of automatically estimating the quality of LLM-generated code. Given a programming problem, the
goal is to predict the functional correctness of multiple code solutions generated by an LLM. This task
is analogous to query performance prediction (QPP) in traditional IR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where "relevance" is replaced
by "functional correctness."
      </p>
      <p>This paper summarizes the design, participation, and outcomes of both sub-tasks, offering insights
into the current state of automated analysis for software engineering artifacts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The automatic analysis of software metadata is a well-established research area, with numerous tools
developed to extract knowledge from artifacts like runtime traces and code structure [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8 ref9">4, 5, 6, 7, 8, 9,
10, 11, 12</xref>
        ]. Within this domain, assessing comment quality has been a persistent challenge. Early
approaches by Steidl et al. [13] used lexical similarity and comment length to filter out trivial comments.
Later work, such as Rahman et al. [14], focused on predicting the utility of code review comments in
external portals.
      </p>
      <p>
        Majumdar et al. [
        <xref ref-type="bibr" rid="ref2">2, 15</xref>
        ] introduced a more nuanced framework by evaluating comments based on
concepts central to code comprehension, employing both textual and code correlation features. These
foundational efforts paved the way for building predictive models to declutter codebases by identifying
and flagging unhelpful comments. Similarly, [16, 17] tackle this task with the help of LLMs.
      </p>
      <p>The advent of large language models [18] has opened new avenues for this research. It is now
imperative to benchmark automated assessments from models like GPT-3.5 against human judgment.
The IRSE track at FIRE builds directly on these advancements, investigating modern vector space models
[19] and the impact of including LLM-generated labels to enhance classification performance.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Datasets</title>
      <p>We now describe the task and the dataset details of the two sub-tracks (ST) for IRSE.</p>
      <sec id="sec-3-1">
        <title>3.1. ST-1: Comment Usefulness Prediction</title>
        <p>This task requires participants to build a binary classification model to determine if a source code
comment is Useful or Not Useful. Given a comment and its related code snippet, the model must
assess whether the comment’s information would genuinely help a developer understand the code.</p>
        <p>The core of the classification logic rests on avoiding redundancy. A Useful comment must be
relevant and provide insights that are not immediately obvious from the code itself. In contrast, a Not
Useful comment, while potentially relevant, is redundant because it simply restates what the code
already clearly communicates. To support this task, we provide the IRSE track dataset, which contains
9,048 annotated code-comment pairs from GitHub.
</p>
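To make the redundancy criterion concrete, the following toy check (our illustration, not the track's annotation procedure) flags a comment as Not Useful when it contributes no content words beyond the code's own tokens:

```python
import re

STOPWORDS = {"the", "a", "an", "to", "of", "is", "this", "and"}

def is_useful(comment: str, code: str) -> bool:
    # Tokenize both texts into identifier-like words.
    code_tokens = set(re.findall(r"[a-z_]\w*", code.lower()))
    comment_tokens = set(re.findall(r"[a-z_]\w*", comment.lower()))
    # A comment is (crudely) useful only if it contributes at least one
    # content word that the code itself does not already contain.
    informative = comment_tokens - code_tokens - STOPWORDS
    return len(informative) > 0

# Redundant: merely restates the call it annotates.
print(is_useful("free the buffer", "free(buffer);"))               # False
# Useful: states a constraint invisible in this single line of code.
print(is_useful("must run before reconnect to avoid a double free",
                "free(buffer);"))                                   # True
```

Real submissions, of course, learn this distinction from the annotated pairs rather than from a hand-written overlap rule.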
      </sec>
      <sec id="sec-3-2">
        <title>3.2. ST-2: Code Quality Estimation</title>
        <p>The objective of the code quality estimation task is to predict the
functional correctness of code snippets that have been generated by Large Language Models (LLMs)
in response to a programming prompt. For this purpose, we utilize the HumanEval dataset, which
contains 161 programming problems.</p>
        <p>Formally, given a programming task description P and a list of n generated solutions S =
{s_1, . . . , s_n}, a prediction model θ is expected to produce a vector of likelihood scores. Each score
represents the estimated functional correctness for a corresponding solution, such that θ : (P, s_i) ↦ ℝ.</p>
        <p>To evaluate the model, we frame this as a ranking problem, drawing an analogy from Information
Retrieval. The problem description P acts as a query, the solutions S are analogous to a set of retrieved
documents, and the notion of ‘functional correctness’ replaces ‘relevance’. This analogy allows us to
use standard ranking metrics. We primarily report the Normalized Discounted Cumulative Gain at k
(nDCG@k), where we use k = 10 solutions per problem.</p>
        <p>We employ two distinct nDCG measures to capture different aspects of ranking performance:
1. Local nDCG (l-nDCG): This metric assesses the average per-problem ranking quality. For each
problem P_i in the set of all problems 𝒫, we rank its n solutions based on the predicted scores and
compute nDCG@k. The final l-nDCG is the average of these scores over all problems, defined as:
l-nDCG = (1 / |𝒫|) Σ_{i=1}^{|𝒫|} nDCG@k(P_i)
2. Global nDCG (g-nDCG): This metric evaluates the model’s overall ranking ability across the
entire benchmark. We pool all n × |𝒫| solutions from all problems into a single list, rank them
using the predicted scores, and calculate a single nDCG score for this global list, denoted as
nDCG@(n · |𝒫|).</p>
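These two measures can be sketched in a few lines of Python (a minimal illustration assuming binary correctness labels as gains; the helper names are ours, not the track's official scorer):

```python
import math

def dcg(gains):
    # Discounted cumulative gain of a ranked list of gain values.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(scores, gains, k):
    # Rank items by predicted score (descending), truncate at k,
    # and normalize by the DCG of the ideal (gain-sorted) ranking.
    ranked = [g for _, g in sorted(zip(scores, gains), key=lambda t: -t[0])]
    ideal = sorted(gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked[:k]) / denom if denom > 0 else 0.0

def local_ndcg(problems, k=10):
    # l-nDCG: average nDCG@k over problems, where each problem contributes
    # (predicted scores, correctness gains) for its n solutions.
    return sum(ndcg_at_k(s, g, k) for s, g in problems) / len(problems)

def global_ndcg(problems):
    # g-nDCG: pool all n * |P| solutions into one list and compute a
    # single nDCG over the whole pooled ranking.
    scores = [x for s, _ in problems for x in s]
    gains = [x for _, g in problems for x in g]
    return ndcg_at_k(scores, gains, len(scores))
```

A model that ranks the correct solution first within every problem scores 1.0 on l-nDCG; g-nDCG additionally penalizes scores that are not comparable across problems.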
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Participation and Evaluation</title>
      <p>In this section, we detail the participation and methodologies for the two sub-tasks.</p>
      <sec id="sec-4-1">
        <title>ST-1: Comment Usefulness Prediction</title>
        <p>The IRSE track attracted significant interest from the academic community, receiving 12 submissions
from various research labs, a level of participation that reflects the importance of software maintenance.</p>
        <p>Approaches and Methodologies. Participants employed a diverse range of techniques to tackle the
classification problem. The provided training dataset was well-balanced, containing 4015 useful and
4033 not useful comments. For feature representation, teams utilized everything from traditional
methods like TF-IDF and word2vec to context-aware embeddings from models like ELMo and BERT.
The classification models were similarly varied, including support vector machines, logistic regression,
and deep-learning architectures such as RNNs and BERT-based classifiers.</p>
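To give a flavor of the simpler end of this spectrum, here is a self-contained toy pipeline pairing TF-IDF vectors with a nearest-centroid classifier; it is our own stand-in sketch, not any participating team's system, and the training comments in the usage test are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Term frequency weighted by inverse document frequency,
    # represented as sparse {token: weight} dictionaries.
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    return [{t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vecs):
    c = Counter()
    for v in vecs:
        for t, w in v.items():
            c[t] += w / len(vecs)
    return dict(c)

def classify(train_docs, train_labels, test_doc):
    # Nearest-centroid in TF-IDF space: assign the label whose class
    # centroid is most cosine-similar to the test comment.
    vecs = tfidf_vectors(train_docs + [test_doc])
    test_vec = vecs[-1]
    by_label = {}
    for v, y in zip(vecs[:-1], train_labels):
        by_label.setdefault(y, []).append(v)
    return max(by_label, key=lambda y: cosine(centroid(by_label[y]), test_vec))
```

The stronger submissions replace these sparse vectors with contextual embeddings (ELMo, BERT) and the centroid rule with a trained classifier, but the overall representation-then-classification shape is the same.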
        <p>Impact of LLM-Generated Data. A key aspect of this track was evaluating the effect of data
augmentation using LLM-generated "silver standard" labels. The results were nuanced: while some
teams saw a slight improvement in test accuracy, many observed a minor decrease (2-4%). This behavior
is interpreted as a positive outcome, suggesting that the synthetic data acts as a regularizer, reducing the
model’s tendency to overfit to the original training set and potentially leading to better generalization.
Table 2 characterizes the LLM-generated datasets contributed by each team.</p>
      </sec>
      <sec id="sec-4-2">
        <title>ST-2: Code Quality Estimation</title>
        <p>For the pilot task on code quality estimation, we received a submission from a team at IIT-KGP.
Submitted System. The team’s approach was based on the methodology proposed by [20], employing
zero-shot inference with GPT-3.5 Turbo. For a given problem-solution pair, the model was prompted
to generate a likelihood score indicating the solution’s functional correctness. They submitted three
distinct runs by varying the GPT decoder’s temperature setting (t ∈ {0.7, 0.8, 0.9}). The simple and
effective prompt used is shown in Figure 1.</p>
        <p>Figure 1 prompt: “Given the problem and the solution, generate a likelihood score between 0 and 1
indicating how relevant the solution is to the problem. Only state the score.”</p>
        <p>Baseline for Comparison. To provide a reference point, we developed a heuristic-based baseline.
This baseline estimates quality by measuring the variance of semantic similarities across all solution
pairs for a given problem. The core assumption is that for a well-defined problem with a fixed function
signature (as in the HumanEval dataset), correct solutions should be semantically similar. A high
variance, therefore, may indicate a lack of consensus in the generated solutions, correlating with a
higher probability of incorrectness. Semantic similarity was calculated using CLS embeddings from
CodeBERT [21].</p>
        <p>Table 3 shows that the GPT-based zero-shot inference produced better results than the in-house
heuristic-based baseline, which estimates code quality as a measure of the topical diversity among the
LLM-generated solutions.</p>
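The variance heuristic can be sketched as follows; for brevity, a bag-of-tokens cosine stands in for the CodeBERT CLS embeddings used in the actual baseline:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    # Cosine similarity between two token-count vectors (a crude stand-in
    # for similarity between CodeBERT CLS embeddings).
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def consensus_score(solutions):
    # Compute pairwise similarities among all generated solutions for one
    # problem. High variance means low consensus, which the baseline takes
    # as a signal of likely incorrectness; return 1 minus the variance as
    # a problem-level quality estimate.
    sims = [cosine(a, b) for a, b in combinations(solutions, 2)]
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return 1.0 - var
```

A set of identical (fully agreeing) solutions yields the maximum score of 1.0, while a set containing a divergent outlier scores lower.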
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The first sub-task of the IRSE track centered on the automated evaluation of code comment quality.
Across the 12 participating teams, a central finding was the significant benefit of data augmentation
using Large Language Models. The top-performing system’s F1-score increased from 0.853 to 0.892
when its training data was supplemented with LLM-generated labels. This improvement highlights that
synthetic annotations can effectively mitigate model overfitting and enhance generalization. The value
of this approach was further reinforced when a combined dataset, incorporating submissions from all
teams and gold-standard labels, demonstrated even greater performance.</p>
      <p>In the second sub-task, which focused on predicting the functional correctness of LLM-generated code,
the results were equally decisive. Evaluation methods that themselves leveraged LLMs substantially
outperformed embedding-based baseline models. This outcome underscores the advanced capabilities of
LLMs to analyze the deep contextual and functional semantics of code, a task where simpler vector-space
models are less effective.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking. After using this tool, the author(s) reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
      <p>[10] S. Paul, S. Majumdar, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. Das, P. D. Clough, P.
Majumder, Efficiency of large language models to scale up ground truth: Overview of the IRSE track
at Forum for Information Retrieval 2023, in: Proceedings of the 15th Annual Meeting of the Forum
for Information Retrieval Evaluation, 2023, pp. 16–18.</p>
      <p>[11] N. Chatterjee, S. Majumdar, P. P. Das, A. Chakrabarti, Tool assisted agile approach for legacy
application migration, International Journal of System Assurance Engineering and Management
(2025) 1–16.</p>
      <p>[12] S. Majumdar, P. P. Das, Smart knowledge transfer using google-like search, arXiv preprint
arXiv:2308.06653 (2023).</p>
      <p>[13] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: International
Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.</p>
      <p>[14] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual
features and developer experience, in: International Conference on Mining Software Repositories
(MSR), IEEE, 2017, pp. 215–226.</p>
      <p>[15] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine - a semantic search approach to
program comprehension from code comments, in: Advanced Computing and Systems for Security,
Springer, 2020, pp. 29–42.</p>
      <p>[16] A. Deshpande, A. Maji, D. Mondol, P. P. Das, P. D. Clough, S. Majumdar, The code–LLM handshake:
Smarter maintenance through AI, in: Proceedings of the 17th Annual Meeting of the Forum for
Information Retrieval Evaluation, 2025, pp. 9–12.</p>
      <p>[17] A. Mitra, S. Majumdar, A. Mukhopadhyay, P. P. Das, P. D. Clough, P. P. Chakrabarti,
Operationalizing large language models with design-aware contexts for code comment generation, arXiv
preprint arXiv:2510.22338 (2025).</p>
      <p>[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information
Processing Systems 33 (2020) 1877–1901.</p>
      <p>[19] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-dimensional
software code representation using BERT and ELMo, in: 2022 IEEE 22nd International Conference
on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.</p>
      <p>[20] T. Y. Zhuo, ICE-Score: Instructing large language models to evaluate code, 2024.
arXiv:2304.14317.</p>
      <p>[21] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou,
CodeBERT: A pre-trained model for programming and natural languages, CoRR abs/2002.08155
(2020). URL: https://arxiv.org/abs/2002.08155. arXiv:2002.08155.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <article-title>Characteristics of useful code reviews: An empirical study at microsoft</article-title>
          ,
          <source>Working Conference on Mining Software Repositories, IEEE</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Automated evaluation of comments to aid software maintenance</article-title>
          ,
          <source>Journal of Software: Evolution and Process</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <article-title>e2463</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>Deep-qpp: A pairwise interaction-based deep learning model for supervised query performance prediction</article-title>
          , in: WSDM, ACM,
          <year>2022</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papdeja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Smartkt: a search framework to assist program comprehension using smart knowledge transfer</article-title>
          ,
          <source>in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Debugging multi-threaded applications using pin-augmented gdb (pgdb)</article-title>
          ,
          <source>in: International conference on software engineering research and practice (SERP)</source>
          . Springer,
          <year>2015</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers</article-title>
          ,
          <source>Innovations in Systems and Software Engineering</source>
          <volume>17</volume>
          (
          <year>2021</year>
          )
          <fpage>289</fpage>
          -
          <lpage>307</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Comprehending c codes with llms: Effective comment generation through retrieval and reasoning</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Calikli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sanyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the “Information Retrieval in Software Engineering” (IRSE) track at Forum for Information Retrieval 2024</article-title>
          ,
          <source>in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Parallelc-assist: Productivity accelerator suite based on dynamic instrumentation</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>73599</fpage>
          -
          <lpage>73612</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>