<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large Language Models for Issue Report Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Colavito</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Lanubile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicole Novielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Quaranta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bari "Aldo Moro"</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Effective issue classification is crucial for efficient software project management. However, labels assigned to issues are often inconsistent, which can negatively impact the performance of supervised classification models. In this work, we investigate how label consistency and training data size affect automatic issue classification. We first evaluate a few-shot learning approach on a manually validated dataset and compare it to fine-tuning on a larger crowd-sourced set. The results show that our approach achieves higher accuracy when trained and tested on consistent labels. We then examine zero-shot classification using GPT-3.5, finding that its performance is comparable to that of supervised models despite having no fine-tuning. This suggests that generative models can help classify issues when annotated data is limited. Overall, our findings provide insights into balancing data quantity and quality for issue classification.</p>
      </abstract>
      <kwd-group>
        <kwd>Issue classification</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Generative AI</kwd>
        <kwd>Software Maintenance and Evolution</kwd>
        <kwd>Few-Shot Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Ital-IA 2024: 4th National Conference on Artificial Intelligence,
organized by CINI, May 29-30, 2024, Naples, Italy.
giuseppe.colavito@uniba.it (G. Colavito);
filippo.lanubile@uniba.it (F. Lanubile); nicole.novielli@uniba.it
(N. Novielli); luigi.quaranta@uniba.it (L. Quaranta)</p>
      <p>0000-0003-3871-401X (G. Colavito); 0000-0003-3373-7589
(F. Lanubile); 0000-0003-1160-2608 (N. Novielli);
0000-0002-9221-0739 (L. Quaranta)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)</p>
      <p>mains. With the advent of recent GPT-like Large Language Models (LLMs), researchers have started investigating their potential in solving software engineering challenges [16, 17]. To better understand how GPT-like LLMs can be leveraged in automated issue labeling in the absence of training data, we formulate and investigate our second research question as follows:</p>
      <p>RQ2: To what extent can we leverage GPT-like LLMs to classify issue reports?</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Distribution of labels in the extracted samples.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Label</th><th colspan="2">Train set</th><th colspan="2">Test set</th></tr>
          </thead>
          <tbody>
            <tr><td>Bug</td><td>47</td><td>24%</td><td>53</td><td>27%</td></tr>
            <tr><td>Documentation</td><td>33</td><td>17%</td><td>32</td><td>16%</td></tr>
            <tr><td>Feature</td><td>60</td><td>30%</td><td>55</td><td>28%</td></tr>
            <tr><td>Question</td><td>44</td><td>22%</td><td>47</td><td>24%</td></tr>
            <tr><td>Discarded</td><td>16</td><td>8%</td><td>13</td><td>7%</td></tr>
            <tr><td>Total</td><td colspan="2">200</td><td colspan="2">200</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>To address it, we evaluate GPT3.5-turbo [18] in a zero-shot learning scenario, where the model is prompted by only providing the task and label descriptions. We compare the performance of classifiers based on GPT-like LLMs with fine-tuned BERT-like LLMs [19].</p>
      <p>In this paper, we discuss our ongoing work on using LLMs to address software engineering challenges, with a particular focus on the automatic classification of issue reports in a low-resource setting. Specifically, we summarize the findings of two recent studies in which we addressed the research questions formulated above [15, 19]. The remainder of the paper is organized as follows. In Sections 2 and 3, we describe the datasets and methodology adopted in our empirical studies, respectively. Then, we report and discuss the study results in Section 4. The paper is concluded in Section 5, where we also outline directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>For our SETFIT model, we preprocess our dataset as follows. First, non-textual items, such as links, code snippets, and images, are identified and replaced with tokens (e.g., &lt;link&gt; for links) in the dataset. Next, we use the ekphrasis Text Pre-Processor (https://github.com/cbaziotis/ekphrasis) to normalize the text by detecting and replacing items such as URLs, email addresses, symbols, phone numbers, mentions, time, date, and numbers with specific tokens.</p>
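<p>A minimal sketch of the token-replacement step described above, using plain regular expressions with illustrative patterns (the actual pipeline delegates most of this normalization to ekphrasis):</p>

```python
import re

def preprocess_issue(text: str) -> str:
    """Replace non-textual items with placeholder tokens before
    feeding issues to the classifier (illustrative sketch)."""
    # Fenced code blocks and inline code become a <code> token.
    text = re.sub(r"```.*?```", " <code> ", text, flags=re.DOTALL)
    text = re.sub(r"`[^`]+`", " <code> ", text)
    # Markdown images become an <image> token (before URLs, so the
    # image URL is consumed together with the image markup).
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " <image> ", text)
    # Remaining URLs become a <link> token, as in the paper.
    text = re.sub(r"https?://\S+", " <link> ", text)
    # Collapse whitespace left over from the replacements.
    return re.sub(r"\s+", " ", text).strip()

issue = "Crash on save ![trace](https://example.com/t.png) see `save()` and https://example.com/docs"
print(preprocess_issue(issue))  # Crash on save <image> see <code> and <link>
```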
      </sec>
      <sec id="sec-2-2">
        <title>Gold standard dataset</title>
        <p>To address our research questions, we use a dataset of 400 GitHub issues labeled as bug, feature, question, and documentation. The dataset is split into two subsets of 200 issues which we use as train and test sets, respectively. Both subsets are equally distributed and include 50 issues per class.</p>
        <p>
          Our dataset is obtained by manually labeling the 400 randomly selected items from the dataset of 1.4M GitHub issues distributed by the NLBSE’23 tool competition organizers [
          <xref ref-type="bibr" rid="ref40">20</xref>
          ]. To manually ensure the consistency of labels in our dataset, three annotators individually categorized each issue report based on the information in its title and body. Each issue report was assigned to two of the annotators. We observed a Cohen’s κ of 0.74, which indicates a substantial level of inter-rater agreement [21]. The annotators had a joint plenary meeting to discuss and resolve the cases of disagreement. Through this procedure, we ensured the reliability and consistency of the annotations. Table 1 presents the dataset’s distribution before and after the manual labeling. The manually annotated sample is publicly available [
          <xref ref-type="bibr" rid="ref35 ref36">22</xref>
          ].
        </p>
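<p>The agreement statistic reported above can be computed directly from the two annotators' label assignments; a minimal sketch of Cohen's κ with toy data (not the study's annotations):</p>

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative toy data only.
a = ["bug", "bug", "feature", "question", "bug", "feature"]
b = ["bug", "feature", "feature", "question", "bug", "feature"]
print(round(cohens_kappa(a, b), 3))  # 0.739
```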
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        To address our first research question, we investigate the efficacy of few-shot learning for training robust classifiers using the small manually validated training dataset described in Section 2. In particular, we train and evaluate a model based on SETFIT [14] using the manually labeled train and test sets. Then we compare its performance with the one obtained by fine-tuning RoBERTa [15] using the full dataset of 1.4M crowd-annotated issues [
        <xref ref-type="bibr" rid="ref40">20</xref>
        ]. To address our second research question, we compare the performance of the SETFIT classifier with the performance achieved by GPT 3.5 in a zero-shot learning scenario. We highlight that prompting is only used for GPT, while the SETFIT model is trained on the manually labeled data. Both models are evaluated on the test set partition of manually labeled issues.
      </p>
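<p>The core idea behind SETFIT is to fine-tune a sentence-embedding model on contrastive pairs built from the few labeled examples, and then fit a classification head on the resulting embeddings. A minimal sketch of the pair-generation step, with toy data and without the actual setfit library API:</p>

```python
import itertools
import random

def contrastive_pairs(examples, seed=0):
    """Build (text_1, text_2, similarity) training pairs from a few
    labeled examples: same-label pairs are positives (1.0),
    cross-label pairs are negatives (0.0)."""
    rng = random.Random(seed)
    pairs = [
        (t1, t2, 1.0 if l1 == l2 else 0.0)
        for (t1, l1), (t2, l2) in itertools.combinations(examples, 2)
    ]
    rng.shuffle(pairs)
    return pairs

examples = [
    ("App crashes on startup", "bug"),
    ("Null pointer when saving", "bug"),
    ("Add dark mode", "feature"),
    ("How do I configure proxies?", "question"),
]
pairs = contrastive_pairs(examples)
print(len(pairs))  # 6 pairs from 4 examples
```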
      <sec id="sec-3-1">
        <title>Choice of GPT-like models</title>
        <p>Several LLMs have been proposed in the last few years, with GPT-3 [23] being one of the most popular. There is a significant prevalence of studies leveraging GPT3.5-turbo [24], an instruction-tuned version of GPT-3, which is able to interact as a chatbot. For this reason, we select GPT3.5-turbo [18] as representative of GPT-like LLMs. We experiment with several versions of GPT3.5-turbo, with varying context length and date of training. Here we only report the results of the model with the best performance. More details can be found in our original work describing this study [19].</p>
      </sec>
      <sec id="sec-3-2">
        <title>Prompting</title>
        <p>To instruct the model to perform the classification task, we create a prompt that includes the following items:</p>
        <list list-type="bullet">
          <list-item><p>Input Format: the format of the input issues, which includes a title and a body;</p></list-item>
          <list-item><p>Task Description: a description of the classification task to be performed, including the possible labels that can be assigned to the issues;</p></list-item>
          <list-item><p>Label Descriptions: a brief description of each label. Label descriptions are generated by ChatGPT and then manually reviewed to ensure they are clear and informative;</p></list-item>
          <list-item><p>Input Issue: the issue to be classified;</p></list-item>
          <list-item>
            <p>
              Output Format Instructions: the desired output format. We ask the model for a JSON object containing a reasoning and the predicted label. This is done to inject some Chain-of-Thought reasoning into the model, as suggested in previous studies about prompting LLMs [
              <xref ref-type="bibr" rid="ref19">25, 26</xref>
              ]. However, the reasoning serves as a prompt-engineering strategy and is not used to evaluate the model.
            </p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation</title>
        <p>
          In line with previous work [
          <xref ref-type="bibr" rid="ref1 ref28">6, 7, 11, 10</xref>
          ], the evaluation of the classifiers on the test set is provided in terms of precision, recall, and F1-measure [15]. For GPT-like LLMs, we parse the JSON response and extract the predicted label. In cases in which the label is not valid or the model did not follow the instructions appropriately, we discard the prediction. This process is done with the use of regular expressions. Both models are tested on the manually verified test set [19].
        </p>
      </sec>
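<p>Putting the prompt items and the response-parsing step together, a minimal sketch of the zero-shot setup. The prompt wording, the label descriptions, and the extraction regex here are illustrative placeholders; the exact prompt is given in the original study [19]:</p>

```python
import json
import re

LABELS = ["bug", "documentation", "feature", "question"]

# Illustrative prompt covering the items listed above: task description,
# label descriptions, input issue, and output-format instructions.
PROMPT_TEMPLATE = """You are an issue report classifier.
Task: assign exactly one label among {labels} to the issue below.
Label descriptions:
- bug: a defect or unexpected behavior of the software
- documentation: a request to add or fix documentation
- feature: a request for new functionality or an enhancement
- question: a request for information or support
Issue title: {title}
Issue body: {body}
Answer with a JSON object: {{"reasoning": "...", "label": "..."}}"""

def build_prompt(title: str, body: str) -> str:
    return PROMPT_TEMPLATE.format(labels=", ".join(LABELS), title=title, body=body)

def extract_label(response: str):
    """Parse the model's JSON answer; return None (discard the
    prediction) when the output is invalid or the label is unknown."""
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        label = json.loads(match.group(0)).get("label", "").lower()
    except json.JSONDecodeError:
        return None
    return label if label in LABELS else None

reply = 'Sure! {"reasoning": "The report describes a crash.", "label": "bug"}'
print(extract_label(reply))  # bug
```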
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Impact of label consistency on the classifier performance (RQ1)</title>
        <p>In Table 2, we present the results obtained by training the SETFIT classifier on the hand-labeled gold standard and evaluating it on both the hand-labeled test set (a) and the full test set distributed for the challenge (c). To ensure a fair comparison, we compared the SETFIT model’s performance with the performance obtained by RoBERTa on the same test set, when trained on the hand-labeled gold standard set (b1). Furthermore, we also include the performance obtained by training the RoBERTa classifier on the full train set distributed by the organizers (b2).</p>
        <p>
          To assess the ability of the models to generalize on a broader dataset, we also include a comparison with the NLBSE ’23 challenge baseline [
          <xref ref-type="bibr" rid="ref40">20</xref>
          ] (see row (d) of the table) and the SETFIT model’s performance on the challenge full test set (see model (c) in the table). It is worth noting that the SETFIT model is designed to learn from a few examples. As such, it was not possible to train it on the raw dataset, since it is not optimized for such a setting and it would have been heavily time expensive. Instead, the RoBERTa baseline is trained on the full set.
        </p>
        <p>The SETFIT model achieved an F1-micro score of .7767 (see model (c) in Table 2) when trained on the manually labeled gold standard and tested on the raw test set. When trained and evaluated on the manually labeled dataset (a), SETFIT performs better than RoBERTa (b1 and b2), regardless of whether the training set used for RoBERTa is raw or manually labeled. However, when trained on the manually labeled dataset (b1), RoBERTa struggles to deliver good performance due to a shortage of training data. On the other hand, when trained on the raw dataset (b2), RoBERTa achieves competitive performances, but it is unable to outperform SETFIT (b).</p>
        <p>As the manually labeled dataset embodies the ideal labeling criteria for classifiers, comparing SETFIT (a) and RoBERTa (b2) provides a practical scenario in which we must choose either training a classifier on a large volume of data with disregard for data quality or concentrating on a smaller portion of data and manually improving label quality. This comparison suggests that data quality might be crucial for ensuring classification accuracy. A potential approach could be to start with a few-shot classifier and gradually switch to a more powerful model like RoBERTa when a fair amount of manually verified data becomes available. By doing so, we can strike a balance between data quantity and quality, ensuring that the classifier performs effectively while minimizing the possibility of inaccurate results caused by inconsistency in the labeling.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Leveraging GPT for automatic issue report classification (RQ2)</title>
        <p>In Table 3, we report the classification performance of GPT compared to the SETFIT model. As already explained in the previous section, we experimented with several versions of GPT 3.5 that were available at the time of the study. For a full report of the results, see Colavito et al. [19]. In this paper, we include consideration of the 16k-0613 model only, as this achieves the best performance in terms of a combination of F1 and percentage of discarded items due to nonsensical model output. Specifically, none of the predictions from this model were discarded. We observe that the Feature class achieves the best F1, while the Documentation class is the most problematic to identify, showing a lower recall than the other classes.</p>
        <p>While the zero-shot GPT model achieves a slightly lower performance (F1 = .8155) than SETFIT (F1 = .8321), the models are still comparable. It is worth noting that SETFIT was fine-tuned on a portion of the issue report gold standard dataset, while GPT was evaluated in a zero-shot setting without any task-specific fine-tuning. This implies that GPT is capable of classifying issue reports with only a minor decrease in accuracy compared to fine-tuned BERT-like models. This presents a major benefit of GPT for this application, since it can perform the classification in the absence of labeled data, i.e., without the need for fine-tuning. This evidence could help maintainers of new projects, for which historical data is not available or is scarce. In such cases, API calls to GPT could be used to classify issue reports, providing a valuable tool for project management. Once the project has accumulated enough labeled data, the maintainer could switch to a fine-tuned model to improve the classification accuracy.</p>
        <p>Although this could be a viable solution for open-source projects, it is worth noting that the cost of API calls and the privacy of data could limit its practical feasibility in commercial projects. In such cases, project maintainers might consider using open-source models or building and deploying a classifier on-premise. Nonetheless, the construction and maintenance of LLMs is expensive both in terms of resources and time, and this constitutes a barrier to their adoption in most cases.</p>
      </sec>
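<p>The F1 figures discussed above follow from the evaluation protocol of Section 3. A minimal micro-averaged sketch, with toy predictions rather than the study's data, and with one possible convention for discarded (None) predictions: precision is computed over answered items only, recall over all items:</p>

```python
def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over a test set.
    None marks a discarded prediction (invalid model output)."""
    tp = sum(t == p for t, p in zip(y_true, y_pred) if p is not None)
    precision = tp / sum(p is not None for p in y_pred)  # over answered items
    recall = tp / len(y_true)                            # over all items
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative toy predictions only.
y_true = ["bug", "feature", "question", "bug", "documentation"]
y_pred = ["bug", "feature", "feature", "bug", None]
p, r, f1 = micro_metrics(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.75 0.6 0.667
```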
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Works</title>
      <p>
        In this paper, we summarized the outcomes of our recently published studies on the use of large language models for automated issue classification. Specifically, we investigated the impact of improving data quality on issue classification performance. We trained and evaluated a model based on few-shot learning using SETFIT with a subset of manually verified data. The model achieves better performance when trained and tested on data for which label consistency was manually verified [
        <xref ref-type="bibr" rid="ref35 ref36">22</xref>
        ], compared to the RoBERTa baseline. However, RoBERTa generalizes better on the full test dataset when fine-tuned on the full crowd-sourced dataset.
      </p>
      <p>Furthermore, we explored the performance of GPT-like models for automatic issue classification [19] to understand if we can leverage GPT-like LLMs to achieve state-of-the-art performance in the absence of manually annotated issues, i.e., when a gold standard is not available for fine-tuning state-of-the-art approaches based on BERT-like models. Our empirical results show that GPT-like models can achieve a performance comparable to the state-of-the-art without the need for fine-tuning. This suggests that when manual annotation is not feasible or a gold standard for training is not available (i.e., on a new project), maintainers could rely on generative AI to successfully address the issue classification task.</p>
      <p>However, using LLMs to build issue classifiers might pose important challenges due to licensing and computational limitations. As such, we plan to extend this benchmark with open-source LLMs, also including issue-report datasets. This will enable evaluating the generalizability of our findings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was co-funded by the NRRP Initiative, Mission 4, Component 2, Investment 1.3 - Partnerships extended to universities, research centres, companies and research D.D. MUR n. 341 del 15.03.2022 – Next Generation EU (“FAIR - Future Artificial Intelligence Research”, code PE00000013, CUP H97G22000210007) and by the European Union - NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 (“QualAI: Continuous Quality Improvement of AI-based Systems”, grant n. 2022B3BP5S, CUP: H53D23003510006).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Izadi</surname>
          </string-name>
          ,
          <article-title>CatIss: An Intelligent Tool for Categoriz-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>ing Issues</surname>
          </string-name>
          <article-title>Reports using Transformers</article-title>
          , in: (NLBSE [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Antoniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ayari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Di</given-names>
            <surname>Penta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khomh</surname>
          </string-name>
          , Y.-
          <year>2022</year>
          ),
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .1145/3528588.3528662.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Guéhéneuc</surname>
          </string-name>
          ,
          <article-title>Is it a bug or an enhancement? a</article-title>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , D. Chen,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>in: Proc. of the 2008 Conf</source>
          .
          <article-title>of the Center for Ad- Roberta: A robustly optimized bert pretraining ap-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>vanced Studies on Collaborative Research: Meeting proach</source>
          ,
          <year>2019</year>
          . arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>of Minds, CASCON '08</source>
          ,
          <string-name>
            <surname>ACM</surname>
            , New York, NY, USA, [13]
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Lo</surname>
          </string-name>
          , Data quality mat-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          2008. doi:
          <volume>10</volume>
          .1145/1463788.1463819.
          <article-title>ters: A case study on data label correctness for [2</article-title>
          ]
          <string-name>
            <given-names>K.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeller</surname>
          </string-name>
          ,
          <article-title>It's not a bug, it's a fea- security bug report prediction</article-title>
          , IEEE Transactions
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>ture: How misclassification impacts bug prediction</article-title>
          ,
          <source>on Software Engineering</source>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .1109/TSE.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>in: 2013 35th Int'l Conf.on Software Engineering</source>
          <year>2021</year>
          .
          <volume>3063727</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>(ICSE)</source>
          ,
          <year>2013</year>
          . doi:
          <volume>10</volume>
          .1109/ICSE.
          <year>2013</year>
          .
          <volume>6606585</volume>
          . [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. E. S.</given-names>
            <surname>Jo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korat</surname>
          </string-name>
          , [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sanyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hudait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <string-name>
            <surname>Auto- M. Wasserblat</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Pereg</surname>
          </string-name>
          , Eficient
          <string-name>
            <surname>Few-Shot</surname>
          </string-name>
          Learn-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>mated classification of software issue reports using ing Without Prompts</article-title>
          ,
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>machine learning techniques: an empirical study</article-title>
          ,
          <volume>2209</volume>
          .
          <fpage>11055</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>Innovations in Systems and Software Engineering</source>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Colavito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lanubile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Novielli</surname>
          </string-name>
          , Few-shot
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          (
          <year>2017</year>
          ). doi:
          <volume>10</volume>
          .1007/s11334-017
          <article-title>-0294-1. learning for issue report classification</article-title>
          , in:
          <year>2023</year>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <source>Neural word embedding as IEEE/ACM 2nd Int'l Work. on Natural Language-</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>implicit matrix factorization</article-title>
          ,
          <source>in: Z. Ghahramani, Based Software Eng. (NLBSE)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Wein-</surname>
          </string-name>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>cessing Systems</source>
          , Curran Assoc., Inc.,
          <year>2014</year>
          .
          <article-title>models for software engineering: A systematic lit</article-title>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, erature review,
          <year>2023</year>
          . arXiv:
          <volume>2308</volume>
          .
          <fpage>10620</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words</article-title>
          and [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gokkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harman</surname>
          </string-name>
          , M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang,
          <article-title>Large language models for software engineering: Survey and open problems</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.03533.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>26th Int'l Conf. on Neural Inf. Proc. Systems - Volume 2</source>
          , NIPS'13, Curran Associates Inc., Red Hook, NY, USA,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella,
          <article-title>Predicting issue types on GitHub</article-title>
          ,
          <source>Science of Computer Programming</source>
          (
          <year>2021</year>
          ). doi:10.1016/j.scico.2020.102598.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella,
          <article-title>Ticket Tagger: Machine learning driven issue classification</article-title>
          , in:
          <source>2019 IEEE Int'l. Conf. on Software Maintenance and Evolution (ICSME)</source>
          , IEEE,
          <year>2019</year>
          . doi:10.1109/ICSME.2019.00070.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in:
          <source>Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , ACL,
          <year>2019</year>
          . doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          R. Kallis, O. Chaparro, A. Di Sorbo, S. Panichella,
          <article-title>NLBSE'22 tool competition</article-title>
          , in:
          <source>Proc. of The 1st Int'l Workshop on Natural Language-based Software Engineering (NLBSE'22)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          G. Colavito, F. Lanubile, N. Novielli,
          <article-title>Issue report classification using pre-trained language models</article-title>
          , in:
          <source>2022 IEEE/ACM 1st Int'l Workshop on Natural Language-based Software Engineering (NLBSE'22)</source>
          , IEEE Computer Society, USA,
          <year>2022</year>
          . doi:10.1145/3528588.3528659.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          OpenAI,
          <source>ChatGPT: Optimizing Language Models for Dialogue</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          G. Colavito, F. Lanubile, N. Novielli, L. Quaranta,
          <article-title>Leveraging GPT-like LLMs to automate issue labeling</article-title>
          , in:
          <source>2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR) (to appear)</source>
          ,
          <year>2024</year>
          . doi:10.1145/3643991.3644903.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          R. Kallis, M. Izadi, L. Pascarella, O. Chaparro, P. Rani,
          <article-title>The NLBSE'23 tool competition</article-title>
          , in:
          <source>Proc. of The 2nd Int'l. Work. on Natural Language-based Software Engineering (NLBSE'23)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          A. J. Viera, J. M. Garrett,
          <article-title>Understanding interobserver agreement: the kappa statistic</article-title>
          ,
          <source>Family Medicine</source>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          G. Colavito, F. Lanubile, N. Novielli,
          <article-title>Few-shot learning for issue report classification</article-title>
          ,
          <year>2023</year>
          . doi:10.5281/zenodo.7628150.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei,
          <article-title>Language models are few-shot learners</article-title>
          , in:
          <source>Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS'20, Curran Associates Inc., Red Hook, NY, USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          S. Ouyang, J. M. Zhang, M. Harman, M. Wang,
          <article-title>LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation</article-title>
          ,
          <year>2023</year>
          . arXiv:2308.02828.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          , in:
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          , Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa,
          <article-title>Large language models are zero-shot reasoners</article-title>
          , in:
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          , Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>