<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the IRSE track at FIRE 2025: Information Retrieval in Software Engineering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adwita Deshpande</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aritra Maji</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diganta Mondal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Partha Pratim Das</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul D. Clough</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Srijoni Majumdar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Goa</institution>
          ,
          <addr-line>Goa</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology Kharagpur</institution>
          ,
          <addr-line>Kharagpur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Leeds</institution>
          ,
          <addr-line>Leeds</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Sheffield</institution>
          ,
          <addr-line>Sheffield</addr-line>
          ,
          <country country="UK">UK;</country>
          <institution>TPXimpact</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The Information Retrieval in Software Engineering (IRSE) track focuses on the automated assessment of software artifacts using machine learning techniques. The 2025 edition emphasizes two tasks: (i) binary classification of source code comments into useful and not useful categories, and (ii) a pilot task on estimating the functional correctness of code generated by Large Language Models (LLMs). The primary dataset comprises 9,048 comment-code pairs extracted from open-source C projects on GitHub. Fourteen teams from academia and industry submitted a total of 45 experimental runs. Results indicate that transformer-based models with software-specific embeddings outperform traditional classifiers, while LLM-based inference proves highly effective for code quality estimation.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval in Software Engineering</kwd>
        <kwd>Comment Usefulness Prediction</kwd>
        <kwd>Code Quality Estimation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Steidl et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced a method for detecting comment quality by measuring the similarity between
code and comment text using the Levenshtein distance, along with comment length, to eliminate
trivial or non-informative comments. Rahman et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] identified useful and non-useful code review
comments (logged in review portals) using features derived from a developer survey conducted at
Microsoft [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They applied textual features and trained classifiers—specifically decision tree and naive
Bayes models—on a dataset of 1,200 review comments for automated quality evaluation. In more recent
work, Liu et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] addressed the Declutter Challenge of DocGen2, where they identified ‘not useful’
comments using textual and structural features within a machine learning framework. Majumdar et
al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a framework to assess comment quality by focusing on concepts that enhance code
comprehension. Their approach used textual and code-correlation features, leveraging a knowledge
graph for semantic interpretation of the information embedded in comments. Related ideas appear in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] follows the same approach as this paper.
      </p>
      <p>
        Existing methods largely focus on evaluating comment quality by identifying irrelevant or repetitive
words and phrases in relation to nearby code constructs. However, the definition of quality is inherently
context-dependent. For instance, Rahman et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] restricted their assessment to code review comments,
while Majumdar et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] analyzed source code comments by extracting concepts that facilitate
comprehension and aid in maintenance. The IRSE track extends the work of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] by exploring diverse
vector space models and feature representations to evaluate comments in the context of their contribution
to code understanding.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Dataset</title>
      <p>The task is defined as a binary classification problem aimed at categorizing source code comments as
either Useful or Not Useful based on a given comment and its corresponding code snippet.</p>
      <p>A Useful comment contains meaningful software development concepts that are not explicitly
reflected in the surrounding code, thereby enhancing its comprehensibility. Conversely, a Not Useful
comment either contains very few relevant concepts or includes sufficient software development
information that is already evident in the surrounding code, making it redundant.</p>
      <p>Example of a Useful comment:</p>
      <preformat>/* Initializes the buffer only once to avoid redundant
   allocations */
void init_buffer() {
    static int initialized = 0;
    if (!initialized) {
        buffer = malloc(BUFFER_SIZE);
        initialized = 1;
    }
}</preformat>
      <p>Example of a Not Useful (redundant) comment:</p>
      <preformat>/* Function to initialize buffer */
void init_buffer() {
    static int initialized = 0;
    if (!initialized) {
        buffer = malloc(BUFFER_SIZE);
        initialized = 1;
    }
}</preformat>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>The IRSE track uses a dataset of 9,048 comment-code pairs extracted from open-source
C projects on GitHub. Each data instance is a tuple consisting of the comment text, the
associated surrounding code snippet (function or block), and a binary ground-truth label indicating
whether the comment is useful or not useful. The data collection process was designed to ensure that the
comments evaluated are pertinent to real-world software maintenance scenarios.</p>
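        <p>As a rough illustration of the tuple structure described above, one instance could be represented as follows. This is a minimal sketch: the class name <monospace>CommentCodePair</monospace>, its field names, and the helper <monospace>label_counts</monospace> are assumptions made for this example and do not describe the released dataset format.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommentCodePair:
    """One dataset instance (hypothetical field names)."""
    comment: str   # comment text
    code: str      # surrounding function or block
    useful: bool   # ground-truth label: True = useful, False = not useful


def label_counts(pairs):
    """Return (useful, not_useful) counts for an iterable of pairs."""
    pairs = list(pairs)
    useful = sum(1 for pair in pairs if pair.useful)
    return useful, len(pairs) - useful


# The code snippet from the task description, shared by both example comments.
CODE = (
    "void init_buffer() {\n"
    "    static int initialized = 0;\n"
    "    if (!initialized) {\n"
    "        buffer = malloc(BUFFER_SIZE);\n"
    "        initialized = 1;\n"
    "    }\n"
    "}"
)

sample = [
    CommentCodePair(
        "/* Initializes the buffer only once to avoid redundant allocations */",
        CODE,
        True,
    ),
    CommentCodePair("/* Function to initialize buffer */", CODE, False),
]

print(label_counts(sample))
```
        </preformat>
        <p>Keeping the comment and its code together in one record makes it straightforward to compute features that depend on both, such as comment-code overlap.</p>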
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Code Quality Estimation</title>
        <p>The pilot task focuses on predicting functional correctness of LLM-generated solutions using the
HumanEval benchmark. Given a problem description and multiple candidate solutions, systems rank
solutions by estimated correctness using IR-style ranking metrics such as nDCG.</p>
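        <p>To make the ranking evaluation concrete, the sketch below computes nDCG for one problem, assuming binary relevance (1 if a candidate solution is functionally correct, 0 otherwise) and the common logarithmic discount. The function names and the exact nDCG variant are assumptions for illustration; the track's official scoring script may differ.</p>
        <preformat>
```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(scores, correct):
    """nDCG of the ranking obtained by sorting candidates by descending score.

    scores  -- estimated correctness per candidate (higher = more likely correct)
    correct -- 1 if the candidate is functionally correct, else 0
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [correct[i] for i in order]
    ideal_dcg = dcg(sorted(correct, reverse=True))
    # If no candidate is correct, the ideal DCG is zero; report 0.0 by convention.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Three candidate solutions for one problem; the first and third are correct.
print(ndcg([0.9, 0.2, 0.7], [1, 0, 1]))
```
        </preformat>
        <p>A system that places every correct solution above every incorrect one obtains an nDCG of 1 for that problem; placing an incorrect solution first lowers the score.</p>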
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Participation and Evaluation</title>
      <p>The 2025 edition of the IRSE track witnessed active engagement from the software engineering research
community, receiving experimental submissions from 14 diverse teams. The participation featured a
strong mix of academia and industry, reflecting the universal relevance of the problem statement. The
cohort comprised 7 teams from the Indian Institute of Technology (IIT) Kharagpur, 4 from IIT Goa, 2
from SSRN College, and 1 from Salesforce. The track focuses on software maintenance, a critical phase
in the software lifecycle, and participant details are outlined in Table 1.</p>
      <p>The provided dataset was meticulously balanced to prevent class bias during training, containing
4,208 instances labeled as “useful” and 4,235 instances labeled as “not useful.” This balance allowed for a
fairer evaluation of model performance using metrics like the F1-score. In terms of feature extraction,
participants primarily employed feature engineering techniques centered on text mining to isolate
significant keywords and phrases. Additionally, almost all teams utilized string matching and overlap
coefficients to quantify the redundancy between the comment text and the source code tokens.</p>
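      <p>The two quantities mentioned above can be sketched as follows. The tokenization (alphabetic runs, lowercased) and the function names are assumptions chosen for illustration; participating teams may have tokenized and scored differently.</p>
      <preformat>
```python
import re


def tokens(text):
    """Set of lowercased alphabetic tokens; snake_case names split at underscores."""
    return {word.lower() for word in re.findall(r"[A-Za-z]+", text)}


def overlap_coefficient(comment, code):
    """Shared tokens divided by the size of the smaller token set."""
    a, b = tokens(comment), tokens(code)
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / min(len(a), len(b))


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


comment = "/* Function to initialize buffer */"
code = "void init_buffer() { buffer = malloc(BUFFER_SIZE); }"
print(overlap_coefficient(comment, code))
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```
      </preformat>
      <p>A higher overlap coefficient indicates that more of the comment's vocabulary already appears in the code, which is one signal of redundancy; on its own it does not decide the label.</p>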
      <sec id="sec-4-1">
        <title>4.1. Machine Learning Architectures</title>
        <p>A significant number of teams utilized state-of-the-art transformer-based models such as BERT and
GPT to learn the comment quality labels. These models were typically fine-tuned over 50 to 75
epochs, utilizing binary cross-entropy loss functions and the Adam optimizer to minimize classification
error. While the GPT architecture achieved the highest F1 score—demonstrating superior capability
in understanding context—its deployment requires substantial computational resources, which are
typically more accessible to large software companies like Salesforce.</p>
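        <p>For reference, the binary cross-entropy objective mentioned above can be written in a few lines. This is a plain-Python sketch of the loss only, with a hypothetical function name and clipping constant; it is not the participants' training code, which would compute the loss inside a deep learning framework and minimize it with Adam.</p>
        <preformat>
```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 so the logarithm stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Labels: 1 = useful, 0 = not useful.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
```
        </preformat>
        <p>The loss is small when the predicted probability of "useful" is close to the true label and grows as the prediction becomes confidently wrong.</p>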
        <p>In contrast, other teams employed more traditional architectures including Recurrent Neural Networks
(RNNs), Support Vector Machines (SVMs), Random Forests, and Logistic Regression. These models often
relied on a hybrid of textual features (e.g., length, keyword frequency) and code-correlation features.
Interestingly, the F1 scores obtained from RNNs and SVMs were comparable to those obtained from the
more complex BERT variants. This competitiveness is attributed to the balanced nature of the dataset
and the high discriminative power of explicit textual features when combined with numerical vectors,
suggesting that lighter models can still be effective for this specific task.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Pre-Trained Embeddings</title>
        <p>
          The experiments explored a range of embedding spaces, using both context-aware and
context-independent embeddings. These were either trained from scratch on specific corpora
or fine-tuned with software development concepts. The best results were consistently achieved using
the recently released contextualized CodeBERT embeddings [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which are pre-trained on bimodal
data (natural language and programming language), allowing them to capture the semantic relationship
between code and comments effectively.
        </p>
        <p>
          Comparable results were also obtained by using CodeELMo [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which trains ELMo embeddings from
scratch using large-scale software development corpora from books, journals, and code repositories.
Conversely, while TF-IDF vectorization was used by several teams to generate word vectors,
it generally did not yield high scores. This underperformance highlights the limitation of statistical
counting methods in capturing the deep semantic context required to distinguish between a useful
explanation and a redundant restatement of code.
        </p>
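        <p>The count-based nature of TF-IDF can be seen in a small sketch. It uses whitespace tokenization and the unsmoothed weight tf * log(N / df); both choices, and the function name, are assumptions for illustration, and library implementations typically add smoothing and normalization.</p>
        <preformat>
```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


docs = [
    "initialize buffer",
    "initialize buffer once to avoid reallocation",
]
for vector in tfidf_vectors(docs):
    print(vector)
```
        </preformat>
        <p>Terms that occur in every document receive zero weight, and the remaining weights depend only on counts, not on word order or meaning, which is consistent with the limitation noted above.</p>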
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The Information Retrieval in Software Engineering (IRSE) track represents a significant step forward
in empirically investigating a diverse range of approaches within a machine-learning framework to
automate the evaluation of comment quality. Central to this evaluation is the definition of quality,
which is rigorously assessed based on whether a comment provides semantic information that aids in
the comprehension of the surrounding code, rather than mere syntactic correctness.</p>
      <p>This iteration of the track saw robust engagement from the research community, with 14 teams
participating and submitting a total of 45 distinct experiments. This diverse participation facilitated a
comprehensive comparative analysis of various machine learning models, embedding spaces, and feature
engineering strategies, ranging from traditional statistical classifiers to advanced neural networks.</p>
      <p>The results highlight a clear trend toward the efficacy of large language models in software
maintenance tasks. The highest F1-Score of 0.9102 was achieved by the Salesforce team using the GPT-2
architecture, combined with hybrid textual and numerical features derived from CodeBERT vector space
embeddings. This superior performance in identifying ‘useful’ versus ‘not useful’ comments underscores
the potential of combining generative pre-trained transformers with domain-specific embeddings to
build the next generation of intelligent automated software maintenance tools.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>In the course of preparing this manuscript, the author(s) employed the generative AI tool ChatGPT. Its
use was limited to performing checks for grammar and spelling. Following this, the author(s) conducted
a thorough review and revision of the text and assume full responsibility for the final published content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <article-title>Characteristics of useful code reviews: An empirical study at microsoft</article-title>
          ,
          <source>in: Proceedings of the 12th Working Conference on Mining Software Repositories (MSR)</source>
          , IEEE,
          <year>2015</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Automated evaluation of comments to aid software maintenance</article-title>
          ,
          <source>Journal of Software: Evolution and Process</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <elocation-id>e2463</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Tool assisted agile approach for legacy application migration</article-title>
          ,
          <source>International Journal of System Assurance Engineering and Management</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Calikli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sanyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the “information retrieval in software engineering”(irse) track at forum for information retrieval 2024</article-title>
          ,
          <source>in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Comprehending c codes with llms: Effective comment generation through retrieval and reasoning</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Parallelc-assist: Productivity accelerator suite based on dynamic instrumentation</article-title>
          ,
          <source>IEEE Access</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>73599</fpage>
          -
          <lpage>73612</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Steidl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hummel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Juergens</surname>
          </string-name>
          ,
          <article-title>Quality analysis of source code comments</article-title>
          ,
          <source>in: Proceedings of the International Conference on Program Comprehension (ICPC)</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
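          <string-name>
            <given-names>M. M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Kula</surname>
          </string-name>
          ,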
          <article-title>Predicting usefulness of code review comments using textual features and developer experience</article-title>
          ,
          <source>in: Proceedings of the International Conference on Mining Software Repositories (MSR)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Learning based and context aware non-informative comment detection</article-title>
          ,
          <source>in: Proceedings of the International Conference on Software Maintenance and Evolution (ICSME)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>866</fpage>
          -
          <lpage>867</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukhopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <article-title>Operationalizing large language models with design-aware contexts for code comment generation</article-title>
          ,
          <source>arXiv preprint arXiv:2510.22338</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Smart knowledge transfer using google-like search</article-title>
          ,
          <source>arXiv preprint arXiv:2308.06653</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <article-title>The code-llm handshake: Smarter maintenance through ai</article-title>
          ,
          <source>in: Proceedings of the 17th annual meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>