<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scoring with Intelligence: Prompting GPT-3.5 Turbo for Better Code Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ritabrata Bharati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>4th year Dual Degree Student, Department of Computer Science &amp; Engineering, Indian Institute of Technology Kharagpur (IIT Kharagpur)</institution>
          ,
          <addr-line>Kharagpur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Accessing pertinent code fragments from large repositories remains a persistent difficulty in modern development environments, where teams navigate through diverse documentation artifacts, implementation files, revision histories, and project management systems without sufficient support for contextual understanding. This research addresses an information retrieval task wherein problem descriptions paired with partial code implementations require evaluation of candidate solutions, with each candidate receiving a relevance score relative to the problem statement. Our methodology employs temperature-controlled prompting with GPT-3.5 Turbo to compute such assessment metrics and evaluates the influence of parameter variations on result consistency. Across multiple experimental configurations, the second variant demonstrates superior effectiveness, achieving a local nDCG value of 0.6615 coupled with a global nDCG metric of 0.9109, thereby validating that precisely engineered prompt specifications can facilitate accurate differentiation among solution candidates across extensive test collections.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Information Retrieval in Software Development</kwd>
        <kwd>Solution Assessment and Ranking</kwd>
        <kwd>Source Code Understanding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Contemporary software development involves navigating rapidly growing collections of
artifacts (implementations, specifications, release notes, bug tracking systems), which creates substantial
information access challenges [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Practitioners regularly struggle with codebases of considerable
magnitude, shifting architectural documentation, and comprehensive issue metadata (such as feature
requests and community communications), which collectively obstruct timely discovery of
dependable problem-solving approaches [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Strengthening how teams extract, organize, and leverage this
knowledge base is vital for maintaining engineering velocity and fostering technological progress [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Conventional retrieval mechanisms struggle to grasp implicit problem requirements or code-specific
constraints, frequently identifying matches through superficial textual correspondence rather than
actual utility [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This deficiency creates opportunity for methodologies that integrate deeper semantic
awareness and furnish assessments exceeding basic lexical similarity [5].
      </p>
      <p>This investigation examines GPT-3.5 Turbo [6] as a relevance assessment mechanism
within software-centric information retrieval pipelines. Given a textual problem specification and a skeletal
code structure, we instruct the model to emit a numerical likelihood score for each solution
candidate, quantifying its fit to the stated requirement. We further investigate
temperature settings and their implications for evaluation stability. Across three
submission runs, the second run achieves the best results, obtaining a local nDCG score of
0.6615 alongside a global nDCG score of 0.9109.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Software-focused information retrieval has transitioned away from syntactic text matching toward
mechanisms that capture developer objectives and code organization with greater precision [7]. Seminal
investigations reveal the obstacles practitioners encounter during code navigation and documentation
exploration, establishing the imperative for IR methodology specifically engineered for programming
contexts [8, 9, 10, 11].</p>
      <p>Foundational search techniques. Established approaches replicate standard text search
functionality and depend substantially on expression-based queries and logical filtering [9, 12]. Despite
implementation simplicity, these methods frequently omit situational specifics, creating tedious and
cognitively demanding lookup experiences [13]. To overcome these shortcomings, the field has advanced
toward situation-sensitive discovery incorporating semantic foundations and vocabulary schemas
customized for development artifacts [14].</p>
      <p>Context-aware and formal representations. Contemporary investigation enhances discovery
mechanisms by integrating contextual relationships and formal structure descriptions of source
implementations, encompassing parse trees and execution dependency mappings, to strengthen alignment
among user requirements and retrievable items [15, 16].</p>
      <p>Language models applied to code. Advances in natural language processing have accelerated numerous
SE tasks, including automated documentation synthesis, source code summarization, and automated
program repair [17, 18, 19, 20, 21, 22, 23]. Attention-based sequential models, particularly those following
BERT and GPT paradigms, have established their capability for capturing code structure semantics
beneficial to ranking and filtering tasks [24].</p>
      <p>Intelligent systems for developers. Concurrent innovation in leveraging developer behavior
patterns has enabled tools that forecast information requirements and customize solution suggestions
[25, 26]. Technologies including Codex and programming co-pilots demonstrate the prospect of real-time,
circumstance-sensitive assistance diminishing information overload during implementation [27, 28].
This contribution participates in this progression by utilizing GPT-3.5 Turbo to compute relevance
assessments for solution candidates.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The evaluation corpus comprises 164 individual queries, each with exactly 10 competing solution
proposals. The task is to estimate a numerical relevance score for every (query,
solution) pair that reflects how well the proposal addresses the query
requirements.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Task Definition</title>
      <p>Given a specification consisting of a requirement outline and fragmentary code,
together with ten alternative approaches per requirement, the goal is to generate a numerical
relevance weight for each (specification, approach) pair expressing the likelihood that the
approach would successfully fulfill the specification’s objective.</p>
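      <p>The input and output of this task can be sketched as a small data model. This is an illustrative sketch only: the class names, field names, and the word-overlap scorer <monospace>toy_score</monospace> are assumptions introduced here and are not part of the shared-task data format or of the method evaluated in this paper.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Candidate:
    """One proposed solution for a query (field names are illustrative)."""
    candidate_id: str
    code: str


@dataclass(frozen=True)
class Query:
    """A requirement outline, its fragmentary code, and the candidate solutions."""
    query_id: str
    problem: str
    partial_code: str
    candidates: Tuple[Candidate, ...]


def rank_candidates(query, score):
    """Return (candidate_id, score) pairs sorted from most to least relevant.

    `score` is any callable mapping (Query, Candidate) to a float in [0, 1].
    """
    scored = [(c.candidate_id, float(score(query, c))) for c in query.candidates]
    # sorted() is stable, so tied candidates keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(query, candidate):
    """Hypothetical stand-in scorer: share of problem words that occur in the code."""
    words = set(query.problem.lower().split())
    hits = sum(1 for w in words if w in candidate.code.lower())
    return hits / len(words) if words else 0.0
```
]]></preformat>
      <p>Any scorer with the same call signature, including a language-model-based one, can be passed to <monospace>rank_candidates</monospace> in place of <monospace>toy_score</monospace>.</p>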
    </sec>
    <sec id="sec-5">
      <title>5. Method</title>
      <sec id="sec-5-1">
        <title>5.1. Advantages of the prompting technique</title>
        <p>Instruction-based inference [29] supplies an economical approach for transferring task context and
evaluation specifications into system behavior:</p>
        <list list-type="bullet">
          <list-item><p>Well-defined inputs. An instruction that couples the requirement statement with a partial implementation restricts the system’s computational focus to the essential aspects of the assignment [30].</p></list-item>
          <list-item><p>Directed generation. Specific command language in instructions stimulates narrow, goal-oriented outputs and diminishes unintended deviations from the assignment target [31].</p></list-item>
          <list-item><p>Computational expedience. Straightforward instructions minimize the solution space and hasten convergence to pertinent determinations, economizing specialist effort [32].</p></list-item>
          <list-item><p>Evaluation framework. A single numerical directive yields harmonized relevance weights among candidates, allowing effortless ranking and analysis [33].</p></list-item>
          <list-item><p>Problem decomposition. Instructions may guide the system toward aspects of the specification exhibiting maximal separability in intricate development circumstances [34].</p></list-item>
          <list-item><p>Broad applicability. Identical prompt patterns generalize across numerous contexts and code repositories, enabling effective large-scale evaluation procedures [35].</p></list-item>
          <list-item><p>Extraction of learned patterns. The system draws on expertise acquired from comparable tasks, generating more considered conclusions with limited direction [36].</p></list-item>
          <list-item><p>Rapid customization. Instruction patterns allow swift modification for novel code environments or issue configurations with no requirement for model retraining [37].</p></list-item>
        </list>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Prompting implementation strategy</title>
        <p>We use GPT-3.5 Turbo in a single-shot setting, with no additional training, to compute relevance
scores for candidate solutions. The system proceeds through the following steps:</p>
        <list list-type="simple">
          <list-item><label>(i)</label><p>Input consumption: the specification text is passed to the system.</p></list-item>
          <list-item><label>(ii)</label><p>Tokenization: the text is split into token sequences suitable for the model.</p></list-item>
          <list-item><label>(iii)</label><p>Representation formation: the token sequence is mapped to numerical vectors that capture local and long-range semantic associations.</p></list-item>
          <list-item><label>(iv)</label><p>Attention: attention weights emphasize the parts of the input most relevant to the assessment goal.</p></list-item>
          <list-item><label>(v)</label><p>Sequential production: the system generates output tokens one at a time to express the assessment outcome.</p></list-item>
          <list-item><label>(vi)</label><p>Detokenization: the output tokens are converted back into text.</p></list-item>
          <list-item><label>(vii)</label><p>Result transmission: the assessment score is returned to the calling program.</p></list-item>
        </list>
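        <p>Step (v) is where the temperature parameter studied in this paper takes effect. The internals of GPT-3.5 Turbo are not public, so the following is only a sketch of the standard temperature-scaled softmax commonly used when sampling from language models; the function name and the example logits are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw next-token scores into probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
example_logits = [2.0, 1.0, 0.5]
for t in (0.7, 0.8, 0.9):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```
]]></preformat>
        <p>Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution, which is one reason repeated scoring runs can differ more at higher settings.</p>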
        <p>A schematic representation of the methodology appears in Figure 1.</p>
        <p>Three runs were tested, with temperature settings of 0.7, 0.8, and 0.9,
each using the same prompt: “Analyze the requirement &lt;Problem&gt; and the proposed solution &lt;Solution&gt; then
output a relevance metric from 0 to 1 indicating solution adequacy relative to the requirement. Supply only
the numerical result.”</p>
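        <p>A minimal sketch of how such a prompt might be filled in and its reply turned into a score is given below. It is not the authors' implementation: the helper names, the number-extraction rule, the clamping to [0, 1], and the <monospace>complete</monospace> callable (standing in for whatever chat-completion client is used) are all assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import re

# Wording follows the prompt quoted above; {problem} and {solution}
# stand in for its <Problem> and <Solution> placeholders.
PROMPT_TEMPLATE = (
    "Analyze the requirement {problem} and the proposed solution {solution} "
    "then output a relevance metric from 0 to 1 indicating solution adequacy "
    "relative to the requirement. Supply only the numerical result."
)

# Matches an unsigned integer or decimal such as "1", "0.85" or ".9".
_NUMBER = re.compile(r"\d*\.?\d+")


def build_prompt(problem, solution):
    """Fill the template with one problem description and one candidate solution."""
    return PROMPT_TEMPLATE.format(problem=problem, solution=solution)


def parse_score(reply, default=0.0):
    """Extract the first number in the reply and clamp it to [0, 1]."""
    match = _NUMBER.search(reply)
    if match is None:
        return default  # the model replied without any number
    return min(1.0, max(0.0, float(match.group())))


def score_solution(problem, solution, complete, temperature=0.8):
    """Score one (problem, solution) pair.

    `complete` is a hypothetical callable (prompt, temperature) -> str that
    wraps the chat-completion client in use.
    """
    reply = complete(build_prompt(problem, solution), temperature)
    return parse_score(reply)
```
]]></preformat>
        <p>Passing the client in as a callable keeps the parsing logic testable without network access; in practice <monospace>complete</monospace> could be a thin wrapper around a chat-completion request to the GPT-3.5 Turbo model.</p>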
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
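      <p>The scores reported in this paper are nDCG values. The text does not spell out how the local and global variants are computed, so the sketch below shows only the standard per-query nDCG with linear gain and a logarithmic rank discount; the function names and the choice of linear gain are assumptions, and the official evaluation may differ.</p>
      <preformat><![CDATA[
```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(scores, relevances):
    """nDCG of the ranking induced by `scores` against the true `relevances`."""
    # Rank candidate indices by predicted score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    # The ideal ranking sorts candidates by their true relevance.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0
```
]]></preformat>
      <p>A ranking that places every relevant candidate first yields an nDCG of 1, and the value decreases as relevant candidates move further down the list.</p>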
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>As presented in Table 1, the second run shows the strongest ranking quality, with
improvements in both local and global nDCG relative to the other configurations.
The investigation frames solution assessment as a retrieval problem and illustrates that
prompt-based scoring with GPT-3.5 Turbo furnishes dependable, quantitative evaluation across
solution candidates. By embedding task context and assessment directives into the prompt, we
obtain uniform relevance scores suitable for subsequent ranking and filtering. Among the configurations
examined, Run 2 performs best, with a local nDCG of 0.6615 and a global
nDCG of 0.9109. These findings suggest practical ways to incorporate
neural-network-based assessment mechanisms into engineering team workflows to reduce search overhead and
accelerate identification of superior implementations.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this manuscript, assistance from ChatGPT was used for tasks including
drafting, editing, and language refinement. The author reviewed and revised all content and accepts
full responsibility for the final text.</p>
      <p>[5] A. Sadeghi, H. Bagheri, J. Garcia, S. Malek, A taxonomy and qualitative comparison of program analysis techniques for security assessment of android software, IEEE Transactions on Software Engineering 43 (2016) 492–530.</p>
      <p>[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).</p>
      <p>[7] V. Garousi, M. Borg, M. Oivo, Practical relevance of software engineering research: synthesizing the community’s voice, Empirical Software Engineering 25 (2020) 1687–1754.</p>
      <p>[8] W. Scacchi, Understanding the requirements for developing open source software systems, IEE Proceedings-Software 149 (2002) 24–39.</p>
      <p>[9] Z. Sharafi, Z. Soh, Y.-G. Guéhéneuc, A systematic literature review on the usage of eye-tracking in software engineering, Information and Software Technology 67 (2015) 79–107.</p>
      <p>[10] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, ACM Transactions on Software Engineering and Methodology (2023).</p>
      <p>[11] S. Majumdar, P. P. Das, Smart knowledge transfer using google-like search, arXiv preprint arXiv:2308.06653 (2023).</p>
      <p>[12] D. Binkley, D. Lawrie, P. Laplante, Applications of information retrieval to software development, Encyclopedia of Software Engineering (P. Laplante, ed.), (to appear) (2010).</p>
      <p>[13] A. Abogdera, Exploring Information-Seeking Strategies College Students Use to Improve the Relevance of Retrieval from Online Information Retrieval Systems, Ph.D. thesis, Colorado Technical University, 2022.</p>
      <p>[14] A. D. Dave, N. P. Desai, A comprehensive study of classification techniques for sarcasm detection on textual data, in: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), IEEE, 2016, pp. 1985–1991.</p>
      <p>[15] M. Fernández, I. Cantador, V. López, D. Vallet, P. Castells, E. Motta, Semantically enhanced information retrieval: An ontology-based approach, Journal of Web Semantics 9 (2011) 434–452.</p>
      <p>[16] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, X. Liu, A novel neural source code representation based on abstract syntax tree, in: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, 2019, pp. 783–794.</p>
      <p>[17] C. Watson, N. Cooper, D. N. Palacio, K. Moran, D. Poshyvanyk, A systematic literature review on the use of deep learning in software engineering research, ACM Transactions on Software Engineering and Methodology (TOSEM) 31 (2022) 1–58.</p>
      <p>[18] S. Panichella, A. Panichella, M. Beller, A. Zaidman, H. C. Gall, The impact of test case summaries on bug fixing performance: An empirical investigation, in: Proceedings of the 38th international conference on software engineering, 2016, pp. 547–558.</p>
      <p>[19] S. Gupta, S. Gupta, Natural language processing in mining unstructured data from software repositories: a review, Sādhanā 44 (2019) 244.</p>
      <p>[20] Y. Zhu, M. Pan, Automatic code summarization: A systematic literature review, arXiv preprint arXiv:1909.04352 (2019).</p>
      <p>[21] E. Dehaerne, B. Dey, S. Halder, S. De Gendt, W. Meert, Code generation using machine learning: A systematic review, IEEE Access 10 (2022) 82434–82455.</p>
      <p>[22] A. Deshpande, A. Maji, D. Mondol, P. P. Das, P. D. Clough, S. Majumdar, The code–llm handshake: Smarter maintenance through ai, in: Proceedings of the 17th annual meeting of the Forum for Information Retrieval Evaluation, 2025, pp. 9–12.</p>
      <p>[23] A. Mitra, S. Majumdar, A. Mukhopadhyay, P. P. Das, P. D. Clough, P. P. Chakrabarti, Operationalizing large language models with design-aware contexts for code comment generation, arXiv preprint arXiv:2510.22338 (2025).</p>
      <p>[24] D. Drain, C. Wu, A. Svyatkovskiy, N. Sundaresan, Generating bug-fixes using pretrained transformers, in: Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming, 2021, pp. 1–8.</p>
      <p>[25] M. Borg, Advancing trace recovery evaluation-applied information retrieval in a software engineering context, arXiv preprint arXiv:1602.07633 (2016).</p>
      <p>[26] Z. Batmaz, A. Yurekli, A. Bilge, C. Kaleli, A review on deep learning for recommender systems: challenges and remedies, Artificial Intelligence Review 52 (2019) 1–37.</p>
      <p>[27] S. Tatineni, K. Allam, Ai-driven continuous feedback mechanisms in devops for proactive performance optimization and user experience enhancement in software development, Journal of AI in Healthcare and Medicine 4 (2024) 114–151.</p>
      <p>[28] M.-F. Wong, S. Guo, C.-N. Hang, S.-W. Ho, C.-W. Tan, Natural language generation and understanding of big code for ai-assisted programming: A review, Entropy 25 (2023) 888.</p>
      <p>[29] L. Wang, X. Chen, X. Deng, H. Wen, M. You, W. Liu, Q. Li, J. Li, Prompt engineering in consistency and reliability with the evidence-based guideline for llms, npj Digital Medicine 7 (2024) 41.</p>
      <p>[30] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, ACM Transactions on Software Engineering and Methodology (2023).</p>
      <p>[31] E. A. Siverling, T. J. Moore, E. Suazo-Flores, C. A. Mathis, S. S. Guzey, What initiates evidence-based reasoning?: Situations that prompt students to support their design ideas and decisions, Journal of Engineering Education 110 (2021) 294–317.</p>
      <p>[32] L. Belzner, T. Gabor, M. Wirsing, Large language model assisted software engineering: prospects, challenges, and a case study, in: International Conference on Bridging the Gap between AI and Reality, Springer, 2023, pp. 355–374.</p>
      <p>[33] H. A. Diefes-Dux, J. S. Zawojewski, M. A. Hjalmarson, M. E. Cardella, A framework for analyzing feedback in a formative assessment system for mathematical modeling problems, Journal of Engineering Education 101 (2012) 375–406.</p>
      <p>[34] B. Mirel, Interaction design for complex problem solving: Developing useful and usable software, Morgan Kaufmann, 2004.</p>
      <p>[35] T. Stober, U. Hansmann, Best practices for large software development projects, Springer, 2010.</p>
      <p>[36] L. Reynolds, K. McDonell, Prompt programming for large language models: Beyond the few-shot paradigm, in: Extended abstracts of the 2021 CHI conference on human factors in computing systems, 2021, pp. 1–7.</p>
      <p>[37] A. Kleppe, Software language engineering: creating domain-specific languages using metamodels, Pearson Education, 2008.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cleland-Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. C.</given-names>
            <surname>Gotel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Huffman</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mäder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisman</surname>
          </string-name>
          ,
          <article-title>Software traceability: trends and future directions</article-title>
          ,
          <source>in: Future of software engineering proceedings</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>David</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Famiglietti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Habets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Maidment</surname>
          </string-name>
          ,
          <article-title>A decade of rapid-reflections on the development of an open source geoscience code</article-title>
          ,
          <source>Earth and Space Science</source>
          <volume>3</volume>
          (
          <year>2016</year>
          )
          <fpage>226</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nambisan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tanniru</surname>
          </string-name>
          ,
          <article-title>Organizational mechanisms for enhancing user innovation in information technology</article-title>
          ,
          <source>MIS quarterly</source>
          (
          <year>1999</year>
          )
          <fpage>365</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Drury-Grogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Conboy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Acton</surname>
          </string-name>
          ,
          <article-title>Examining decision characteristics &amp; challenges for agile software development</article-title>
          ,
          <source>Journal of Systems and Software</source>
          <volume>131</volume>
          (
          <year>2017</year>
          )
          <fpage>248</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>