<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Esakkivel Esakkiraja</string-name>
          <email>esakkivel.esakkiraja@servicenow.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Akhiyarov</string-name>
          <email>denis.akhiyarov@servicenow.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aditya Shanmugham</string-name>
          <email>aditya.shanmugham@servicenow.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chitra Ganapathy</string-name>
          <email>chitra.ganapathy@servicenow.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ServiceNow, Inc</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Current search techniques are limited to standard RAG query-document applications. In this paper, we propose a novel technique to expand the code and index for predicting the required APIs, directly enabling high-quality, end-to-end code generation for auto-completion and agentic AI applications. We address the problem of API leaks in current code-to-code benchmark datasets by introducing a new dataset built from real-world ServiceNow Script Includes that captures the challenge of unclear API usage intent in the code. Our evaluation shows that this method achieves 87.86% top-40 retrieval accuracy, providing the critical API context needed for successful downstream code generation. To enable real-time predictions, we develop a comprehensive post-training pipeline that optimizes a compact 0.6B reranker through synthetic dataset generation, supervised fine-tuning, and reinforcement learning. This approach enables our compact reranker to outperform a much larger 8B model while maintaining 2.5x reduced latency, effectively addressing the nuances of enterprise-specific code without the computational overhead of larger models.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>API Prediction</kwd>
        <kwd>Context-Aware Code Generation</kwd>
        <kwd>Enterprise Code Completion</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>ServiceNow</kwd>
        <kwd>Real-Time Code Search</kwd>
        <kwd>Query Enhancement</kwd>
        <kwd>Fine-Tuning</kwd>
        <kwd>Embedding</kwd>
        <kwd>Reranker</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) have become integral to modern developer workflows through
AI-assisted code completion. In specialized enterprise environments like ServiceNow, model effectiveness
depends heavily on context quality, particularly for custom APIs called Script Includes. Script Includes
in ServiceNow are reusable JavaScript components that serve as a centralized repository for storing
functions and classes, enabling developers to encapsulate complex business logic [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>This paper addresses the critical challenge of context retrieval for LLM-powered code generation in
ServiceNow’s code completion and Build Agent tasks.</p>
      <p>The core problem is accurately retrieving relevant Script Includes from partial developer code without
explicit queries. Traditional methods like keyword search or basic vector search fail to capture nuanced
developer intent and lack awareness of complex hierarchical relationships across the ServiceNow
platform. General-purpose LLMs also lack domain-specific knowledge, making high-quality retrieval
essential for reusing instance-specific Script Includes.</p>
      <p>We propose DeepCodeSeek, a multi-stage retrieval pipeline that maximizes context relevance for
LLMs. Our main contributions are: (1) a search pipeline using platform metadata and advanced IR
techniques to significantly improve retrieval accuracy over baselines; (2) a comprehensive post-training
pipeline optimizing compact reranker models through synthetic dataset generation, supervised
fine-tuning, and reinforcement learning; and (3) empirical validation showing our optimized 0.6B reranker
surpasses 8B models while maintaining significantly reduced latency for real-time applications.</p>
      <p>The rest of this paper is organized as follows: Section 3 details our multi-stage retrieval method,
Section 4 describes dataset construction and indexing, Section 5 presents experimental setup and
evaluation methodology, Section 6 shows main results and ablation studies, and Section 7 details our
post-training pipeline for optimizing compact reranker models.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Our work builds on recent advances in neural code retrieval, retrieval-augmented generation (RAG),
structural code analysis, and search refinement techniques, adapting them to a large-scale enterprise
environment.</p>
      <sec id="sec-2-1">
        <title>2.1. Neural Code Retrieval</title>
        <p>
          Code search has evolved from keyword-based methods like BM25 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to dense retrieval models such as
CodeBERT [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which embed queries and code into a shared semantic space. Yet recent evaluations
(e.g., CoIR [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]) show that general-purpose dense retrievers degrade in domains different from what
they were trained on, mainly because their pretraining rarely covers such niche knowledge. To control
for this effect, we adopt BM25 as a strong, domain-agnostic baseline that remains competitive under
out-of-domain conditions. Our work then targets the missing piece: a domain-aware retrieval pipeline
tailored to ServiceNow Script Includes.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. RAG with Filtering and Query Enhancement</title>
        <p>
          Retrieval-Augmented Generation (RAG) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] improves LLM outputs by dynamically providing relevant
context, now common in coding assistants [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. A key challenge is ensuring retrieved context relevance,
which can be addressed by leveraging code structure [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and query enhancement techniques. Inspired
by Hypothetical Document Embeddings [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], LLMs can generate complete hypothetical code snippets
from partial code, creating richer queries.
        </p>
        <p>
          While many systems build Code Knowledge Graphs from source code [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ] to enable filtering,
our RAG pipeline constructs a Knowledge Graph from ServiceNow platform metadata for scope-level
filtering. This constrains the search space and enables efficient retrieval of relevant Script Includes
within our enterprise-specific context.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Reranking</title>
        <p>
          Following retrieval, a cross-encoder or long-context LLM reranker [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] can re-order top candidates,
ensuring the most relevant results are prioritized for the final generation step. Recent
reinforcement-learning approaches explicitly inject reasoning steps to boost reranking quality: REARANK [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
introduces a list-wise reasoning reranking agent, while SWE-RL [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] shows that RL on large-scale
software-evolution data substantially improves LLM reasoning for code-centric tasks. Our reranker
adopts a similar RL fine-tuning strategy but is trained on enterprise-specific Script Include pairs, enabling
higher precision in our domain.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>Our approach is a multi-stage retrieval pipeline designed to provide highly relevant Script Includes
for code completion. The pipeline begins with setting a baseline and incorporates several techniques
to progressively refine the search space and improve accuracy. The overall architecture is depicted in
Figure 1.</p>
      <p>The core components of our method are as follows:
• Knowledge Graph for Search Space Reduction: We leverage a Knowledge Graph (KG)
constructed from platform metadata to prune the search space. This pre-filtering step significantly
narrows the field of potential candidates before the main retrieval stage.</p>
      <p>• Enriched Indexing: Rather than indexing the raw code, we create a structured index. All
methods belonging to a single Script Include are grouped under their parent namespace. This index
is further enriched with SI code metadata and the corresponding structured JSDoc, including
API usage examples. This organization helps the embedding model better distinguish between
different functionalities and reduces ambiguity during retrieval.
• LLM-Powered Code Expansion: A developer's partial code often lacks sufficient context for
effective retrieval. To address this, we experimented with using a Large Language Model (LLM) at
runtime to generate more descriptive and effective queries. By analyzing the partial code, the
LLM can infer the developer's intent and produce a more complete code expansion, which in turn
leads to more accurate results from the embedding model.
• Reranking: The initial retrieval stage may return the correct Script Include but not necessarily
at the top of the list (e.g., within the top-5 results). For effective code generation, the downstream
LLM needs a small, highly relevant set of options. Therefore, we employ a reranking stage using
a cross-encoder or LLM reranker to improve the position of the most relevant candidates, aiming
to move them from higher K values to lower K values (e.g., top-40 into the top-5). This ensures
better performance, as it is easier for the code generation model to process fewer, higher-quality
context options.
• Post-training Optimization: We develop a comprehensive training pipeline that optimizes
compact reranker models through synthetic dataset generation, supervised fine-tuning, and
reinforcement learning, enabling smaller models to achieve performance comparable to much
larger models while maintaining significantly reduced latency. A minimal sketch of the full pipeline follows this list.</p>
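      <p>To make this concrete, the following is a minimal, self-contained Python sketch of the four-stage flow. The toy index, the scope field, and the token-overlap scorer are illustrative stand-ins for the production embedding and reranker models, not the actual ServiceNow implementation.</p>
      <p>def kg_filter(candidates, scope):
    # Knowledge Graph pre-filtering: keep global Script Includes plus
    # those registered in the caller's scope.
    return [c for c in candidates if c["scope"] in ("global", scope)]

def overlap_score(query, doc):
    # Naive token overlap standing in for dense embedding similarity.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q.intersection(d)) / (len(q) or 1)

def retrieve(partial_code, scope, index, top_k=40, final_k=5):
    candidates = kg_filter(index, scope)
    # In the real pipeline an LLM expands partial_code into a richer
    # hypothetical completion; here the raw prefix is used as-is.
    query = partial_code
    ranked = sorted(candidates,
                    key=lambda c: overlap_score(query, c["jsdoc"]),
                    reverse=True)[:top_k]
    # A cross-encoder reranker would re-order the candidates here; we
    # simply truncate to the small set handed to the generation model.
    return ranked[:final_k]

index = [
    {"namespace": "ArrayUtil", "scope": "global",
     "jsdoc": "intersect: returns elements common to two arrays"},
    {"namespace": "OCGroup", "scope": "oncall",
     "jsdoc": "on-call group rotation helpers"},
]
print(retrieve("find common elements between two arrays", "global", index))</p>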
      <p>
        This multi-stage process, combining a knowledge-informed search space, enriched indexing, advanced
query generation, and reranking, forms a robust pipeline that significantly outperforms vanilla retrieval
methods ([
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) for code generation tasks.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset and Index Construction</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset Construction</title>
        <p>We constructed a custom evaluation dataset from real-world ServiceNow development scenarios to
capture the challenge of API retrieval from partial code. Our dataset consists of 850 code completion
scenarios, each containing a partial JavaScript code snippet and the corresponding ground truth Script
Include that should be used to complete the code.</p>
        <p>To explain the terms used in our dataset:
• code_middle: Autocompletion span where the target API is invoked.
• code_before / code_after: Prefix and suffix around code_middle; the prefix omits the target
Script Include so retrieval must rely on context, while the suffix adds extra lines without exposing
the API.</p>
        <p>This setup makes the retrieval task realistic and challenging, because the model must understand the
code context without direct hints. The “incomplete code” provided as the input for completion can be
FIM (fill-in-the-middle) or non-FIM format depending on whether code_after is available in the input.</p>
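        <p>As an illustration, one sample can be pictured as the following Python dictionary. The field names code_before, code_middle, and code_after come from the description above; the snippet content, the ground_truth field, and the [FIM] sentinel are invented for this example.</p>
        <p>sample = {
    "code_before": "var a = ['x', 'y'];\nvar b = ['y', 'z'];\n",
    "code_middle": "var util = new ArrayUtil();\nvar common = util.intersect(a, b);",
    "code_after": "gs.info(common.join(','));",
    "ground_truth": "ArrayUtil",  # namespace the retriever must recover
}

# FIM-format input when code_after is available; non-FIM uses the prefix only.
fim_query = sample["code_before"] + "[FIM]" + sample["code_after"]
non_fim_query = sample["code_before"]</p>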
        <p>To ensure data quality, we employed an LLM judge (Gemini 2.5 Flash) to evaluate the clarity of
developer intent in each sample. The judge identified 705 samples (83%) as having clear intent, which
we use for our primary evaluation. The remaining 145 samples with ambiguous intent are excluded
from our main results but analyzed separately to understand failure cases.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Index Construction</title>
        <p>Our search index covers 277 distinct Script Include namespaces containing 3,337 individual APIs. The
index documents vary significantly in length: full scripts range from 66 to 10,407 tokens (mean: 2,280),
while corresponding JSDoc summaries are more concise, ranging from 157 to 5,368 tokens (mean: 807).</p>
        <p>Through extensive experimentation, we found that JSDoc summaries provide superior retrieval
performance compared to raw code. This is attributed to their structured nature and focused representation
of API functionality. Consequently, we use JSDoc summaries for our final optimized index, which
provides a cleaner signal for retrieval while maintaining comprehensive coverage of API capabilities.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Knowledge Graph Construction</title>
        <p>We constructed a hierarchical Knowledge Graph from ServiceNow platform metadata to enable efficient
search space pruning. The graph captures the relationship between packages, scopes, and Script Includes,
allowing us to filter candidates based on contextual relevance before expensive retrieval operations.
Appendix B details how this metadata-based graph helps reduce search space in our use case.</p>
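        <p>A hedged sketch of how such a graph supports pruning is shown below; the toy package-scope edges are invented, and the real graph is built from platform metadata as described above.</p>
        <p>kg = {
    ("pkg_hr", "hr_core"): ["HRUtils"],              # 1:1 mapping
    ("pkg_itsm", "incident"): ["IncidentUtils", "SLAHelper"],
}
GLOBAL_SIS = ["ArrayUtil", "GlideRecordUtil", "JSUtil"]

def candidate_script_includes(package, scope):
    # Since ~97% of non-global SIs map to a single package-scope pair
    # (Appendix B), this lookup often resolves to one candidate; global
    # SIs remain in the candidate pool for dense retrieval.
    return kg.get((package, scope), []) + GLOBAL_SIS

print(candidate_script_includes("pkg_hr", "hr_core"))</p>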
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Evaluation Metrics</title>
        <p>We evaluate our retrieval pipeline using two primary metrics:
• Top-K Accuracy: The percentage of queries where the correct Script Include appears in the
top-K retrieved results. We report results at K = 5, 10, 20, and 40.
• Mean Reciprocal Rank (MRR): The average of the reciprocal ranks of the correct Script Include
across all queries, providing a more nuanced view of ranking quality.</p>
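        <p>Both metrics reduce to a few lines of Python; a minimal sketch over toy rank data follows.</p>
        <p>def top_k_accuracy(ranks, k):
    # ranks holds the 1-based rank of the correct Script Include per
    # query, or None when it was not retrieved at all.
    hits = sum(1 for r in ranks if r is not None and k >= r)
    return 100.0 * hits / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 3, None, 42, 7]  # toy ranking outcomes
for k in (5, 10, 20, 40):
    print(f"top-{k}: {top_k_accuracy(ranks, k):.2f}%")
print(f"MRR: {mean_reciprocal_rank(ranks):.2f}")</p>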
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Baselines and Implementation</title>
        <p>
          We compare our proposed pipeline against BM25 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] as the primary baseline, which achieved 53.02%
top-40 accuracy on our dataset. Our implementation uses the following components:
• Embedding Model: Linq-AI-Research/Linq-Embed-Mistral (7B parameters, 32K context length)
• Reranker Models: Qwen-8B (baseline) and our optimized Qwen-0.6B models
• Judge Model: Gemini 2.5 Flash (1M context length) for intent clarity evaluation
        </p>
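        <p>For reference, a minimal version of the BM25 baseline can be reproduced with the open-source rank_bm25 package; the two toy documents below are illustrative, while our production index covers the 3,337 APIs described in Section 4.2.</p>
        <p>from rank_bm25 import BM25Okapi

docs = {
    "ArrayUtil": "intersect returns elements common to two arrays",
    "XMLDocument": "parse and traverse xml documents",
}
names = list(docs)
bm25 = BM25Okapi([docs[n].split() for n in names])

query = "find common elements between two arrays".split()
scores = bm25.get_scores(query)
ranked = sorted(zip(names, scores), key=lambda x: x[1], reverse=True)
print(ranked)  # ArrayUtil should rank first</p>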
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Pretraining Knowledge Check</title>
        <p>To test whether Script Include knowledge was already present in model pretraining corpora, we
prompted LLMs to autocomplete our evaluation samples without any retrieval context (non-FIM, no
KG, no index). The model produced the correct Script Include namespace in only 5% of cases, indicating
limited memorization/coverage and motivating retrieval for this domain.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments and Results</title>
      <sec id="sec-6-1">
        <title>6.1. Main Results</title>
        <p>We evaluate three primary retrieval methods against our BM25 baseline: (1) Prefix Code Embed
(Non-FIM), which uses embeddings of the code preceding the cursor; (2) LLM Description, which
generates a natural language description of user intent; and (3) Hypothetical Code Generation,
which generates hypothetical code completions for retrieval. For concise method prompts and working
examples, see Appendix J and Appendix I, respectively.</p>
        <p>Table 1 shows the performance of these methods on our clear-intent evaluation subset (705 samples).
The Hypothetical Code Generation method consistently outperforms all other approaches, achieving
87.86% top-40 accuracy, far surpassing the 53.02% BM25 baseline. Table 2 presents the Mean
Reciprocal Rank (MRR) results, confirming the superior ranking quality of our approach. Appendices
A, C, and D present ablation studies showing how each design choice (e.g., FIM vs. non-FIM
formatting, context length) impacts accuracy on our dataset.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Latency Analysis</title>
        <p>Relative to the 8B reranker, our optimized 0.6B model delivers a 2.5x latency reduction while matching or exceeding its ranking quality (Section 7), which makes per-keystroke, real-time retrieval practical in production.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Post Training and Optimization</title>
      <p>To bridge the performance gap between the 0.6B and 8B Qwen reranker for our use case while enabling
real-time predictions, we developed a comprehensive post-training pipeline that optimizes compact
reranker models to achieve performance comparable to much larger models.</p>
      <p>The goal is to match or surpass an 8B reranker’s ranking quality at much lower latency and cost.
This matters in production: a 0.6B model fits tighter memory budgets, runs with higher concurrency,
and reduces tail latency. Our SFT+RL results now exceed the 8B baseline while keeping the 2.5x latency
gain, which makes the approach deployment-ready.</p>
      <sec id="sec-7-1">
        <title>7.1. Training Dataset</title>
        <sec id="sec-7-1-1">
          <title>7.1.1. Dataset for SFT</title>
          <p>A critical challenge in training our reranker was ensuring no training contamination from our evaluation
datasets. To address this, we constructed completely fresh training datasets using previously unused
Script Include namespaces.</p>
          <p>We constructed a comprehensive dataset for supervised fine-tuning using subsets of the CodeR-Pile
dataset, focusing on JavaScript and TypeScript samples to establish a robust foundation for code
understanding. This dataset provides the necessary diversity and scale for effective SFT training.</p>
        </sec>
        <sec id="sec-7-1-2">
          <title>7.1.2. Synthetic Dataset for RL</title>
          <p>For reinforcement learning we reused Script Include namespaces that never appeared in our training or
evaluation data. First, we pulled 892 such namespaces (about 5.9K methods) that had been left out of
earlier extraction runs. We then generated fresh JSDoc signatures with an LLM so every namespace
had structured documentation in the index. Claude 3.7 analyzed each script and produced synthetic
triplets—code_before, code_middle, code_after—with the target usage placed in code_middle.</p>
          <p>We cleaned the pool in three passes. We removed 30 samples where the ground-truth namespace
leaked into code_before or code_after, dropped another 10–15 samples that still mentioned the
namespace nearby, and finally used fuzzy matching to cut 894 close variants. This left 285 strong
examples. From these we kept 204 samples that offered enough hard negatives via the sentence-transformers
mine_hard_negatives helper; the remaining 81 lacked suitable negatives, so we discarded them.</p>
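          <p>A hedged sketch of this mining step is shown below; mine_hard_negatives is the sentence-transformers helper named above, though the embedder, the toy pair, and the exact arguments are illustrative and may differ across library versions.</p>
          <p>from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedder
pairs = Dataset.from_dict({
    "anchor": ["var util = new ArrayUtil();"],             # query-side code
    "positive": ["intersect: elements common to arrays"],  # matching JSDoc
})
# Mines negatives that score high for the anchor but are not the positive;
# samples without enough suitable negatives are dropped, mirroring the
# 204-of-285 filtering described above.
mined = mine_hard_negatives(pairs, model, num_negatives=4)
print(mined)</p>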
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Training Pipeline</title>
        <p>Our training approach addressed the challenge of training a small model on limited data without
overfitting or catastrophic forgetting. The pipeline consisted of three main stages:</p>
        <sec id="sec-7-2-1">
          <title>7.2.1. Supervised Fine-Tuning (SFT)</title>
          <p>We began with supervised fine-tuning using the dataset described in Section 7:
• Parameter-Efficient Training: Experiments with full fine-tuning and PEFT + LoRA revealed
that LoRA adapters provided the best performance improvements for the 0.6B reranker while
maintaining efficiency.
• Balanced Training: We ensured balanced training by randomly selecting either positive or
negative samples during training, preventing class imbalance bias.
• Loss Function: We employed negative log-likelihood loss (nll_loss) to optimize for the true
document label (1 for positive, 0 for negative).</p>
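          <p>The objective can be sketched as follows; the vocabulary ids, batch, and logits are illustrative stand-ins, and only the nll_loss over the two decision tokens mirrors the setup described above.</p>
          <p>import random
import torch
import torch.nn.functional as F

YES_ID, NO_ID = 9693, 2152  # hypothetical vocabulary ids for "yes"/"no"

def reranker_loss(last_token_logits, labels):
    # Score the pair by the two decision tokens only, then apply
    # nll_loss against the true document label (1 positive, 0 negative).
    pair_logits = last_token_logits[:, [NO_ID, YES_ID]]  # [batch, 2]
    log_probs = F.log_softmax(pair_logits, dim=-1)
    return F.nll_loss(log_probs, labels)

# Balanced sampling: draw a positive or negative label uniformly.
labels = torch.tensor([random.randint(0, 1)])
logits = torch.randn(1, 32000)  # stand-in for final-position logits
print(reranker_loss(logits, labels))</p>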
          <p>Training exclusively on synthetic samples led to rapid overfitting. The CodeR-Pile dataset alone
provided better generalization and superior performance compared to mixed training approaches,
indicating that combining synthetic and open-source data did not improve results.</p>
        </sec>
        <sec id="sec-7-2-2">
          <title>7.2.2. SFT + RL</title>
          <p>To further improve ranking performance, we implemented a GRPO-based reinforcement learning
pipeline:
• Reward Function: Training uses a completion-aware binary reward that reads the first token
produced in each GRPO rollout; matching the supervision label with “yes” yields +1, while
an incorrect “no” response receives −1. Sampling eight completions per prompt injects the
variance GRPO needs to shift the policy instead of collapsing toward the reference distribution.
See Appendix G for the full setup and diagnostics.
• Training Strategy: We observed that RL training on the small synthetic dataset alone led to
catastrophic forgetting. The optimal approach involved applying RL to a checkpoint from the
SFT model trained on the CodeR-Pile dataset, then fine-tuning with our synthetic dataset.
• Performance Achievement: This two-stage approach enabled the 0.6B model to achieve
performance very close to the 8B reranker on our evaluation benchmark dataset. A minimal
sketch of the trainer wiring follows this list.</p>
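          <p>A hedged sketch of this wiring with TRL's GRPOTrainer is shown below; the checkpoint path, the one-row dataset, and the compact reward body are placeholders under the setup described above, with the full reward in Appendix G.</p>
          <p>from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def yes_no_reward(completions, label, **kwargs):
    # +1 when the first generated token matches the supervision label,
    # -1 otherwise (see Appendix G for the complete reward).
    return [1.0 if c.strip().casefold().startswith(g) else -1.0
            for c, g in zip(completions, label)]

config = GRPOConfig(
    output_dir="reranker-grpo",
    num_generations=8,  # eight rollouts per prompt for GRPO variance
)
train_ds = Dataset.from_dict({
    "prompt": ["Judge whether the Document meets the Query. Answer yes or no."],
    "label": ["yes"],
})
trainer = GRPOTrainer(
    model="path/to/coder-pile-sft-checkpoint",  # placeholder checkpoint path
    reward_funcs=yes_no_reward,
    args=config,
    train_dataset=train_ds,
)
# trainer.train() would then run the RL stage on the SFT checkpoint.</p>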
        </sec>
        <sec id="sec-7-2-2">
          <title>7.2.3. Extended SFT</title>
          <p>As an alternative to RL, we experimented with extended supervised fine-tuning where we took the SFT
checkpoint trained on the CodeR-Pile dataset and further fine-tuned it separately with our synthetic
dataset. While this approach provided decent results, it did not show noticeable improvements over the
previous SFT checkpoint.</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Training Results and Validation</title>
        <p>Our comprehensive evaluation demonstrates the effectiveness of the post-training pipeline. Figure 3
shows the complete post-training results across all stages, while Table 4 presents the detailed
performance comparison:</p>
        <p>The post-training pipeline successfully bridges the performance gap between the 0.6B and 8B models,
with the SFT + RL optimized 0.6B model achieving 68.58% top-5 accuracy compared to 66.10% for the
8B model—outperforming it by 2.48 percentage points. Detailed out-of-distribution evaluation metrics
appear in Table 7 in Appendix F.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>We present DeepCodeSeek, a comprehensive solution for real-time API retrieval in enterprise code
completion scenarios. Our multi-stage retrieval pipeline achieves 87.86% top-40 accuracy, far
surpassing the 53.02% BM25 baseline while addressing the critical challenge of inferring developer
intent from partial code.</p>
      <p>Our key contributions are: (1) a novel retrieval pipeline combining knowledge graph filtering, enriched
indexing with JSDoc documentation, and advanced query enhancement techniques; (2) a comprehensive
post-training pipeline optimizing compact reranker models through synthetic dataset generation,
supervised fine-tuning, and reinforcement learning; and (3) demonstration that our optimized 0.6B
reranker now outperforms the 8B model (68.58% vs 66.10% top-5 accuracy) while maintaining 2.5x
reduced latency.</p>
      <p>Ablation studies show significant component contributions: knowledge graph filtering reduces
search space by 59%, enhanced indexing improves accuracy by 31 percentage points, and LLM reranking
provides an additional 7 percentage point boost, enabling real-time code completion in production
environments.</p>
      <sec id="sec-8-1">
        <title>8.1. Limitations and Future Work</title>
        <p>Our evaluation has limitations: dataset focus on Script Includes limits generalization to other code
completion contexts, synthetic data generation may not capture real-world complexity, and small
synthetic dataset size (204 samples) constrains training experiments.</p>
        <p>Future work will focus on expanding data coverage through larger synthetic dataset generation and
real-world data collection, refining the knowledge graph for enhanced filtering, and specializing the
reranker for specific code completion tasks.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We thank the ServiceNow AI team for their support and feedback throughout this project. We also
acknowledge the contributions of the open-source community for providing the foundational models
and tools that made this research possible.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any generative AI tools in the preparation of this work.</p>
    </sec>
    <sec id="sec-11">
      <title>A. Ablation Studies</title>
      <p>We conduct comprehensive ablation studies to understand the contribution of each pipeline component.</p>
      <sec id="sec-11-1">
        <title>A.1. Knowledge Graph Filtering</title>
        <p>Our Knowledge Graph, constructed from ServiceNow platform metadata, captures hierarchical
relationships across 17,701 Script Includes. Analysis reveals that 84% of new SI usages conform to existing
patterns, enabling effective search space reduction. By prioritizing globally scoped Script Includes, we
reduce candidate sets by approximately 59% before expensive retrieval operations.</p>
      </sec>
      <sec id="sec-11-2">
        <title>A.2. Indexing Strategy Impact</title>
      </sec>
      <sec id="sec-11-3">
        <title>A.3. Reranking Analysis</title>
        <p>On the reranking stage, the LLM reranker (Gemini 2.5 Flash) scores 72.60 versus 65.84 for the Qwen Reranker (8B), roughly the 7-percentage-point boost cited in Section 8.</p>
      </sec>
      <sec id="sec-11-4">
        <title>A.4. Code Trimming and Context Length</title>
        <p>We analyzed the impact of code trimming and context length on retrieval performance. To avoid
bloating the embedding model with excessive or noisy context, we experimented with various lengths
of prefix code. Our experiments show that a context of 8-10 lines before the cursor yields the best
performance, gaining a 1.82% relative increase over using a larger context in our Prefix Code Embed
(Non-FIM) Search. This optimal context length balances the need for suficient information to infer
developer intent while avoiding noise from distant code that may not be relevant to the current retrieval
task. Note that the downstream code generation task may require more context. Figure 4 illustrates the
relationship between context length and retrieval accuracy.</p>
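        <p>The trimming itself is a one-liner; a small sketch follows, with the 10-line default reflecting the 8-10 line sweet spot reported above.</p>
        <p>def trim_prefix(code_before: str, max_lines: int = 10) -> str:
    # Keep only the last max_lines lines before the cursor, discarding
    # distant code that adds noise rather than retrieval signal.
    lines = code_before.rstrip("\n").split("\n")
    return "\n".join(lines[-max_lines:])

print(trim_prefix("a\nb\nc", max_lines=2))  # prints "b" and "c"</p>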
      </sec>
      <sec id="sec-11-5">
        <title>A.5. Code Before vs. After Analysis</title>
        <p>We investigated whether code that comes before the cursor (prefix) or after the cursor (suffix)
provides better retrieval performance for user code in the middle. Our analysis revealed that prefix
code consistently outperforms suffix code for Script Include retrieval. This is likely because prefix code
better captures the developer's intent and the context in which they are working, while suffix code
often contains implementation details that are less useful for API retrieval. Suffix code still provides
an additional boost to retrieval. Figure 5 shows the performance comparison between using code
before and after the cursor position.</p>
      </sec>
      <sec id="sec-11-6">
        <title>A.6. Code Proximity and Relevance</title>
        <p>We conducted an ablation study to examine how the proximity of code elements affects retrieval
accuracy. Our experiments revealed a critical finding: the maximum information for API retrieval is
contained in the last 1-2 lines immediately preceding the API invocation.</p>
        <p>Our analysis compared diferent context trimming strategies, including using all lines before the
cursor, excluding the last line, excluding the last two lines, and limiting context to ten lines with various
exclusions. The results consistently showed that removing the last 1-2 lines before the API invocation
leads to significant performance degradation in both Top-K accuracy and Mean Reciprocal Rank (MRR),
regardless of the overall context length.</p>
        <p>This finding, as illustrated in Figure 6, suggests that while broader context provides some benefit, the
immediate preceding lines contain disproportionately valuable information for predicting the appropriate
API. This aligns with the intuitive understanding that developers typically write code in a sequential
manner, where the most recent lines provide the strongest signals about the intended functionality and
API requirements.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>B. Knowledge Graph Analysis</title>
      <p>An analysis of the Script Include (SI) Knowledge Graph on our developer instance reveals several key
insights. The instance contains 2,516 global SIs and 1,744 non-global SIs. By focusing the search on
non-global packages and scopes, the search space is reduced by approximately 59%. This is a significant
improvement, and the search can be narrowed even further. Approximately 97% of non-global SIs follow
a one-to-one mapping, meaning they are used in only a single package and scope. As a result, many
package-scope pairs map to a single SI, often eliminating the need for a deeper search within those
contexts.</p>
    </sec>
    <sec id="sec-13">
      <title>C. Ablation Study on Model Selection</title>
      <p>As a preliminary step, we conducted an ablation study on an older version of our dataset to select
a foundational embedding model. This initial evaluation compared several models, most notably
linq-embed-mistral against Jina, which is a widely used embedding model for code retrieval. The
study was performed without any advanced indexing or retrieval techniques to purely assess the baseline
performance of the models. The results, shown in Figure 7, demonstrated that linq-embed-mistral
performed significantly better than Jina. Based on these preliminary findings, we chose it for all
subsequent experiments in our pipeline.</p>
    </sec>
    <sec id="sec-14">
      <title>D. Code Summarization Strategy</title>
      <p>Chunking the raw Script Include (SI) code proved to be ineffective and, in some cases, degraded retrieval
performance. Given the availability of large-context models, we explored alternative summarization
techniques. This led to the development of automated JSDoc signature generation, which creates JSDoc
signatures from SI code. Using these JSDoc signatures as a concise summary of the script’s functionality
proved to be a more effective strategy, improving the relevance of our retrieval results.</p>
      <p>Figure 8 demonstrates the performance improvement achieved by using JSDoc documentation
compared to raw code descriptions. The comparison shows that JSDoc-based indexing consistently
outperforms raw code indexing across different retrieval methods, with particularly significant
improvements in top-5 and top-10 accuracy metrics. This improvement is attributed to JSDoc’s structured
nature, which provides cleaner, more focused representations of API functionality while eliminating
noise from implementation details.</p>
    </sec>
    <sec id="sec-15">
      <title>E. Training Loss Analysis</title>
      <sec id="sec-15-1">
        <title>E.1. Supervised Fine-Tuning (SFT) Loss</title>
      </sec>
      <sec id="sec-15-2">
        <title>E.2. Reinforcement Learning Reward Progression</title>
        <p>Figure 11 shows the reward progression during reinforcement learning training. The completion-aware
yes/no reward captures how frequently sampled responses align with the supervision label, so rising
curves indicate the model is pushing more of its rollouts toward the correct decision token.</p>
      </sec>
      <sec id="sec-15-3">
        <title>E.3. Reinforcement Learning Training Loss</title>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>F. Out-of-Distribution Generalization</title>
      <p>To validate that our training pipeline preserves the base model’s generalization capabilities, we evaluated
all trained models on an out-of-distribution dataset that none of the models had seen during training.
Table 7 reports the results.</p>
      <p>These results demonstrate that our trained models maintain strong generalization capabilities without
suffering from catastrophic forgetting. The Qwen 0.6B models neither significantly outperform nor
degrade compared to the base model performance, indicating successful specialization for our specific
use case while preserving general code understanding abilities.</p>
    </sec>
    <sec id="sec-17">
      <title>G. Reward Function for Reinforcement Learning</title>
      <p>Our reinforcement learning stage uses completion-aware supervision that inspects the first decision
token produced by the reranker. Multiple samples per prompt provide the variance GRPO needs, while
auxiliary diagnostics monitor the log-odds gap between “yes” and “no” generations.</p>
      <sec id="sec-17-1">
        <title>G.1. Completion-Based Training Reward</title>
        <p>The default reward passed to GRPOTrainer does the following:
• For every sampled completion, we read the first generated token, case-fold it, and test whether it
starts with "yes" or "no", since that is what the Qwen 0.6B reranker outputs.
• A correct match against the supervision label yields +1.0; an incorrect answer yields −1.0.</p>
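        <p>A minimal Python version of this reward, in the list-in/list-out form TRL reward functions use, is sketched below; the label column name is an assumption.</p>
        <p>def completion_reward(completions, label, **kwargs):
    # Completion-aware binary reward: read the first generated token,
    # case-fold it, and compare against the "yes"/"no" supervision label.
    rewards = []
    for completion, gold in zip(completions, label):
        tokens = completion.strip().split()
        first = tokens[0].casefold() if tokens else ""
        if first.startswith("yes"):
            rewards.append(1.0 if gold == "yes" else -1.0)
        elif first.startswith("no"):
            rewards.append(1.0 if gold == "no" else -1.0)
        else:
            rewards.append(-1.0)  # malformed outputs are penalized
    return rewards</p>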
      </sec>
      <sec id="sec-17-2">
        <title>G.2. Logit-Based Diagnostics</title>
        <p>We track the mean log-probability of answering "yes" on positive and negative labels, respectively.</p>
        <p>These diagnostics run alongside the completion-based reward and should trend upward for positive
documents once the policy improves.</p>
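        <p>A sketch of this diagnostic is given below; the vocabulary id and tensors are illustrative stand-ins for the reranker's final-position logits.</p>
        <p>import torch
import torch.nn.functional as F

YES_ID = 9693  # hypothetical vocabulary id for the "yes" token

def mean_yes_logprob(last_token_logits, labels, positive=True):
    # Mean log P("yes") over positive (label 1) or negative (label 0)
    # documents; the gap between the two should widen during training.
    log_probs = F.log_softmax(last_token_logits, dim=-1)[:, YES_ID]
    mask = labels == (1 if positive else 0)
    return log_probs[mask].mean().item()

logits = torch.randn(8, 32000)  # stand-in final-position logits
labels = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])
print(mean_yes_logprob(logits, labels, True),
      mean_yes_logprob(logits, labels, False))</p>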
      </sec>
    </sec>
    <sec id="sec-18">
      <title>H. Analysis of Query Enhancement Techniques</title>
      <p>Our results show that while generating a description with an LLM can help, it does not always work as
well as using the code itself for the search. The main reason is that when the LLM creates a natural
language summary of the user’s code, it can sometimes miss important details or keywords. In contrast,
the other two methods (Prefix Code Embed (Non-FIM) and Hypothetical Code Generation)
use the actual code for retrieval. This provides the search model with more specific information, which
likely explains why they perform better in some situations.</p>
    </sec>
    <sec id="sec-19">
      <title>I. Qualitative Examples of Retrieval Techniques</title>
      <p>To illustrate the differences between our retrieval methods, this section provides a concrete example of
how each technique processes the same partial code snippet.</p>
      <sec id="sec-19-1">
        <title>I.1. Shared Context: User’s Partial Code</title>
        <p>The following JavaScript code snippet is used as the input for all three retrieval techniques discussed
below. The developer’s intent is to find common elements between two arrays, a task for which the
ArrayUtil Script Include is the correct tool.
var prevGrp = [];
var currentGrp = [];
var commonGrp = [];
var manager;
var backupmgr;
currentGrp.push(event.parm1);
var currentGrpList = currentGrp.toString().split(",");
var grp = new GlideRecord('sys_user_group');
grp.addQuery('sys_id', event.parm2);
grp.query();
if (grp.next()) {
    manager = grp.manager;
    backupmgr = grp.u_backup_manager;
}
var grp1 = new GlideRecord('sys_user_grmember');
grp1.addQuery('group', event.parm2);
grp1.query();
while (grp1.next()) {
    prevGrp.push(grp1.user + '');</p>
      </sec>
      <sec id="sec-19-2">
        <title>I.2. Technique 1: Prefix Code Embed (Non-FIM)</title>
        <p>This method uses a trimmed portion of the user’s code directly as the search query.
Results The model successfully retrieves the correct ArrayUtil API as the top result.
1. ArrayUtil (Correct)
2. Differ
3. LiveFeedCommon
4. XMLDocument
5. OCGroup</p>
      </sec>
      <sec id="sec-19-3">
        <title>I.3. Technique 2: LLM Description</title>
        <p>This method uses an LLM to generate a natural language description of the user’s intent, which is then
used as the search query.</p>
        <p>Results The abstraction to natural language causes the correct API to be ranked second.
1. GlideRecordUtil
2. ArrayUtil (Correct)
3. LiveFeedCommon
4. OCGroup
5. LabelUpdate</p>
      </sec>
      <sec id="sec-19-4">
        <title>I.4. Technique 3: Hypothetical Code Generation</title>
        <p>This method uses an LLM to generate a hypothetical completion for the user’s code. The original code
context combined with this hypothetical code forms the search query.
/**
 * @description Finds elements that are in both currentGrpList and
 * prevGrp arrays and adds them to the commonGrp array.
 * Completes the nested loop comparison to identify common group
 * members between previous and current groups.
 */
if (currentGrpList[i] === prevGrp[j]) {
    commonGrp.push(currentGrpList[i] + '');</p>
        <p>Results This method also retrieves the correct ArrayUtil API as the top result, demonstrating the
effectiveness of using code-based context for our search index.</p>
        <p>1. ArrayUtil (Correct)
2. LiveFeedCommon
3. Differ
4. GlideRecordUtil
5. IdentificationLookUpTables</p>
      </sec>
    </sec>
    <sec id="sec-20">
      <title>J. Prompts for AI Models</title>
      <p>This appendix details the prompts used for the various models in our pipeline.</p>
      <sec id="sec-20-1">
        <title>J.1. Instructor-based Embedding Model</title>
        <p>The following prompt is used to instruct the embedding model to find relevant APIs based on JSDoc for
a given code snippet.</p>
        <p>Instruct: Given the code, find APIs based on their JSDoc that this
code might need to complete its intended purpose.</p>
        <p>Code:</p>
      </sec>
      <sec id="sec-20-2">
        <title>J.2. Reranker Model</title>
        <p>The reranker model uses a prefix, suffix, and an instruction to judge whether a document meets the
query requirements.</p>
        <sec id="sec-20-2-1">
          <title>J.2.1. Prefix</title>
          <p>Judge whether the Document meets the requirements based on the Query
and the Instruct provided. Answer "yes" or "no".</p>
        </sec>
        <sec id="sec-20-2-2">
          <title>J.2.2. Suffix</title>
          <p>&lt;|im_end|&gt;\n&lt;|im_start|&gt;assistant\n&lt;think&gt;\n\n&lt;/think&gt;\n\n</p>
        </sec>
        <sec id="sec-20-2-3">
          <title>J.2.3. Instruction</title>
          <p>Using the API's JSDoc, decide whether this API is directly useful
for the caller code to complete its intended task.</p>
        </sec>
      </sec>
      <sec id="sec-20-3">
        <title>J.3. LLM Judge for Dataset Intent</title>
        <p>To judge the intent of each dataset sample, we use a system and user prompt pair.</p>
        <sec id="sec-20-3-1">
          <title>J.3.1. User Prompt</title>
          <p>The user prompt is a formatted string:
user_prompt = (
    f"### CODE:\n{code}\n\n"
    f"### NAMESPACE:\n{namespace}\n\n"
    f"### API DESCRIPTIONS (Context):\n{api_description}\n\n"
    f"Does this namespace fit the code's intent?"
)</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] ServiceNow, Script includes, https://www.servicenow.com/docs/bundle/zurich-api-reference/page/script/server-scripting/concept/c_ScriptIncludes.html,
          <year>2025</year>
          .
          <article-title>ServiceNow Zurich API Reference Documentation</article-title>
          .
          <source>Released: July</source>
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , T. Liu,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Codebert: A pre-trained model for programming and natural languages</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>1536</fpage>
          -
          <lpage>1547</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Q.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Coir: A comprehensive benchmark for code information retrieval models</article-title>
          ,
          <source>arXiv preprint arXiv:2407.02883</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          , G. Yu,
          <article-title>Building a coding assistant via the retrieval-augmented language model</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>43</volume>
          (
          <year>2025</year>
          ). doi:10.1145/3695868.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Allamanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Barr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Devanbu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <article-title>A survey of machine learning for big code and naturalness</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 51</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Precise zero-shot dense retrieval without relevance labels</article-title>
          ,
          <source>arXiv preprint arXiv:2212.10496</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wen</surname>
          </string-name>
          , Y. Liu,
          <article-title>Kg4py: A toolkit for generating python knowledge graph and code semantic search</article-title>
          ,
          <source>Connection Science</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>1384</fpage>
          -
          <lpage>1400</lpage>
          . doi:10.1080/09540091.2022.2072471.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Codexgraph:
          <article-title>Bridging large language models and code repositories via code graph databases</article-title>
          ,
          <source>arXiv preprint arXiv:2405.13531</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          , S. Wang,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          , Is ChatGPT Good at Search?
          <article-title>Investigating Large Language Models as Re-Ranking Agents</article-title>
          ,
          <source>arXiv preprint arXiv:2304.09542</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , REARANK:
          <article-title>Reasoning re-ranking agent via reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:2505.20046</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Duchenne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Copet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Carbonneaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fried</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. I. Wang</surname>
          </string-name>
          , SWE-RL:
          <article-title>Advancing llm reasoning via reinforcement learning on open software evolution</article-title>
          ,
          <source>arXiv preprint arXiv:2502.18449</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>Baseline 36.71 54.12 Namespace Grouping + JSDoc 58.21 85.36</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>