=Paper=
{{Paper
|id=Vol-3762/500
|storemode=property
|title=Large Language Models for Issue Report Classification
|pdfUrl=https://ceur-ws.org/Vol-3762/500.pdf
|volume=Vol-3762
|authors=Giuseppe Colavito,Filippo Lanubile,Nicole Novielli,Luigi Quaranta
|dblpUrl=https://dblp.org/rec/conf/ital-ia/ColavitoLNQ24
}}
==Large Language Models for Issue Report Classification==
Giuseppe Colavito, Filippo Lanubile, Nicole Novielli and Luigi Quaranta
University of Bari "Aldo Moro", Italy
Abstract
Effective issue classification is crucial for efficient software project management. However, labels assigned to issues are often
inconsistent, which can negatively impact the performance of supervised classification models. In this work, we investigate
how label consistency and training data size affect automatic issue classification. We first evaluate a few-shot learning
approach on a manually validated dataset and compare it to fine-tuning on a larger crowd-sourced set. The results show that
our approach achieves higher accuracy when trained and tested on consistent labels. We then examine zero-shot classification
using GPT-3.5, finding that its performance is comparable to supervised models despite having no fine-tuning. This suggests
that generative models can help classify issues when annotated data is limited. Overall, our findings provide insights into
balancing data quantity and quality for issue classification.
Keywords
Issue classification, Large Language Models, Generative AI, Software Maintenance and Evolution, Few-Shot Learning
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
giuseppe.colavito@uniba.it (G. Colavito); filippo.lanubile@uniba.it (F. Lanubile); nicole.novielli@uniba.it (N. Novielli); luigi.quaranta@uniba.it (L. Quaranta)
ORCID: 0000-0003-3871-401X (G. Colavito); 0000-0003-3373-7589 (F. Lanubile); 0000-0003-1160-2608 (N. Novielli); 0000-0002-9221-0739 (L. Quaranta)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Collaborative software development involves complex processes and activities to effectively support software development and maintenance. In this context, issue-tracking systems are widely adopted to manage requests for changes – such as bug fixes or product enhancements, as well as requests for support from users – and are regarded as essential tools for maintainers to efficiently manage software evolution activities.

Issue reports organized in such systems typically contain information such as an identifier, a description, the author, the issue status (e.g., open, assigned, closed), a comment thread, and a label indicating the type of issue, such as bug, enhancement, or support. Effective labeling of issue reports is of paramount importance to support prioritization and decision-making. Unfortunately, however, label misuse is a common problem, as submitters often confuse improvement requests with bugs and vice versa [1]. For example, Herzig et al. [2] reported that approximately 33.8% of all issue reports are incorrectly labeled. To avoid dealing with incorrect labels, automated classification methods have been proposed. Automatic issue classification can enable effective issue management and prioritization [3], without the need to instruct developers on how to assign labels correctly.

Early research on this topic proposed exploiting supervised methods that leverage text-based features for the task of automatic issue report classification [1]. More recently, approaches leveraging word embeddings have emerged [4, 5, 6, 7]. In particular, approaches based on BERT [8] and its variants achieved state-of-the-art performance [9, 10, 11].

In our previous work, we conducted an empirical study to investigate to what extent we can leverage pre-trained language models for automatic issue labeling [10]. We experimented with a dataset of more than 800K issue reports from GitHub open-source software projects, labeled by project contributors as bug, enhancement, or question [9]. We fine-tuned the BERT [8] variant RoBERTa [12], achieving state-of-the-art performance (F1 = 0.8591).

Our manual error analysis revealed that the main cause of the misclassification of issues is label inconsistency across different projects. Also, several issue reports in the dataset were tagged with more than one label, which is indeed a source of noise. This evidence is in line with previous studies reporting the impact of data quality on the performance of machine learning models [13]. Informed by the results of our error analysis and by the findings of previous research, we formulate the following research question:

RQ1: To what extent does label consistency impact the performance of supervised issue classification models?

To address it, we investigate the efficacy of few-shot learning for training robust classifiers using a small training dataset with manually validated labels. Specifically, we experiment with SETFIT, an effective methodology for the fine-tuning of transformer-based models using few-shot learning [14], achieving promising results [15].

Still, manual annotation can be a costly task, both in terms of time and resources, even if done on a small set of manually curated examples. Hence, the need for minimizing the effort associated with data labeling remains.
With the advent of recent GPT-like Large Language Models (LLMs), researchers have started investigating their potential in solving software engineering challenges [16, 17]. To better understand how GPT-like LLMs can be leveraged for automated issue labeling in the absence of training data, we formulate and investigate our second research question as follows:

RQ2: To what extent can we leverage GPT-like LLMs to classify issue reports?

To address it, we evaluate GPT3.5-turbo [18] in a zero-shot learning scenario, where the model is prompted by only providing the task and label descriptions. We compare the performance of classifiers based on GPT-like LLMs with fine-tuned BERT-like LLMs [19].

In this paper, we discuss our ongoing work on using LLMs to address software engineering challenges, with a particular focus on the automatic classification of issue reports in a low-resource setting. Specifically, we summarize the findings of two recent studies in which we addressed the research questions formulated above [15, 19].

The remainder of the paper is organized as follows. In Sections 2 and 3, we describe the datasets and methodology adopted in our empirical studies, respectively. Then, we report and discuss the study results in Section 4. The paper is concluded in Section 5, where we also outline directions for future work.

2. Dataset

To address our research questions, we use a dataset of 400 GitHub issues labeled as bug, feature, question, and documentation. The dataset is split into two subsets of 200 issues, which we use as train and test sets, respectively. Both subsets are equally distributed and include 50 issues per class. Our dataset is obtained by manually labeling the 400 randomly selected items from the dataset of 1.4M GitHub issues distributed by the NLBSE'23 tool competition organizers [20]. To manually ensure the consistency of labels in our dataset, three annotators individually categorized each issue report based on the information in its title and body. Each issue report was assigned to two of the annotators. We observed a Cohen's κ of 0.74, which indicates a substantial level of interrater agreement [21]. The annotators then held a joint plenary meeting to discuss and resolve the cases of disagreement. Through this procedure, we ensured the reliability and consistency of the annotations. Table 1 presents the dataset's distribution after the manual labeling. The manually annotated sample is publicly available [22].

Table 1
Distribution of labels in the extracted samples.

Label          | Train set | Test set
Bug            | 47 (24%)  | 53 (27%)
Documentation  | 33 (17%)  | 32 (16%)
Feature        | 60 (30%)  | 55 (28%)
Question       | 44 (22%)  | 47 (24%)
Discarded      | 16 (8%)   | 13 (7%)
Total          | 200       | 200
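For illustration, inter-rater agreement of this kind can be computed with scikit-learn; a minimal sketch follows, where the label lists are hypothetical and not taken from our dataset.

```python
# Minimal sketch of the inter-rater agreement computation
# (hypothetical label lists, not taken from our dataset).
from sklearn.metrics import cohen_kappa_score

# Labels independently assigned by the two annotators of each issue.
annotator_a = ["bug", "feature", "question", "documentation", "bug"]
annotator_b = ["bug", "feature", "bug", "documentation", "bug"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.61-0.80 reads as substantial agreement [21]
```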
3. Methodology

To address our first research question, we investigate the efficacy of few-shot learning for training robust classifiers using the small manually validated training dataset described in Section 2. In particular, we train and evaluate a model based on SETFIT [14] using the manually labeled train and test sets. Then, we compare its performance with the one obtained by fine-tuning RoBERTa [15] using the full dataset of 1.4M crowd-annotated issues [20].

To address our second research question, we compare the performance of the SETFIT classifier with the performance achieved by GPT 3.5 in a zero-shot learning scenario. We highlight that prompting is only used for GPT, while the SETFIT model is trained on the manually labeled data. Both models are evaluated on the test set partition of manually labeled issues.
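To make the training setup concrete, the following is a minimal sketch of SETFIT fine-tuning with the setfit library (0.x API); the checkpoint, toy examples, and hyperparameters are illustrative assumptions, not the exact configuration used in our study (see [15] for details).

```python
# Minimal sketch of SETFIT few-shot fine-tuning with the setfit library
# (version 0.x). Checkpoint, toy data, and hyperparameters are illustrative.
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# A handful of manually validated examples stand in for the train split;
# labels are encoded as integers (e.g., 0=bug, 1=feature, 2=question).
train_ds = Dataset.from_dict({
    "text": ["App crashes when saving a file", "Please add dark mode"],
    "label": [0, 1],
})
test_ds = Dataset.from_dict({
    "text": ["How do I configure the proxy?"],
    "label": [2],
})

# SETFIT contrastively fine-tunes a sentence-transformers checkpoint on
# pairs generated from the few labeled examples, then fits a classifier head.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,  # number of contrastive pairs generated per example
)
trainer.train()
print(trainer.evaluate())  # accuracy on the evaluation split
```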
Preprocessing. For our SETFIT model, we preprocess our dataset as follows. First, non-textual items, such as links, code snippets, and images, are identified and replaced with placeholder tokens (e.g., a dedicated token for links). Next, we use the ekphrasis Text Pre-Processor (https://github.com/cbaziotis/ekphrasis) to normalize the text by detecting and replacing items such as URLs, email addresses, symbols, phone numbers, mentions, times, dates, and numbers with specific tokens.
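A minimal sketch of such a normalization step with ekphrasis is shown below; the options are an illustrative configuration, not necessarily the exact one used in our pipeline.

```python
# Minimal sketch of the ekphrasis-based normalization (illustrative config).
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

text_processor = TextPreProcessor(
    # items to detect and replace with placeholder tokens (<url>, <email>, ...)
    normalize=['url', 'email', 'phone', 'user', 'time', 'date', 'number'],
    fix_html=True,        # unescape HTML entities left over from scraping
    segmenter="english",  # word statistics used for word segmentation
    corrector="english",  # word statistics used for spell correction
    unpack_hashtags=False,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

issue_body = "Broken link: https://example.com/docs, please mail dev@example.com"
print(" ".join(text_processor.pre_process_doc(issue_body)))
# -> "broken link : <url> , please mail <email>"
```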
Choice of GPT-like models. Several LLMs have been proposed in the last few years, with GPT-3 [23] being one of the most popular. There is a significant prevalence of studies leveraging GPT3.5-turbo [24], an instruction-tuned version of GPT-3 that is able to interact as a chatbot. For this reason, we select GPT3.5-turbo [18] as representative of GPT-like LLMs. We experiment with several versions of GPT3.5-turbo, with varying context length and training date. Here we only report the results of the model with the best performance. More details can be found in our original work describing this study [19].

Prompting. To instruct the model to perform the classification task, we create a prompt that includes the following items (a sketch of how such a prompt can be assembled is shown after the list):
• Input Format: the format of the input issues, which includes a title and a body;
• Task Description: a description of the classification task to be performed, including the possible labels that can be assigned to the issues;
• Label Descriptions: a brief description of each label. Label descriptions are generated by ChatGPT and then manually reviewed to ensure they are clear and informative;
• Input Issue: the issue to be classified;
• Output format instructions: the desired output format. We ask the model for a JSON object containing a reasoning and the predicted label. This is done to inject some Chain-of-Thought reasoning into the model, as suggested in previous studies about prompting LLMs [25, 26]. However, the reasoning serves as a prompt-engineering strategy and is not used to evaluate the model.
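The sketch below illustrates how a prompt with these items can be assembled and submitted to GPT3.5-turbo using the pre-1.0 openai Python client available at the time of the study; the wording of the task and label descriptions is hypothetical, not the exact prompt used in [19].

```python
# Illustrative sketch of the prompt structure and a zero-shot call with the
# pre-1.0 openai client. Task/label wording is hypothetical, not the prompt of [19].
import openai  # reads the OPENAI_API_KEY environment variable

LABEL_DESCRIPTIONS = {  # hypothetical; the paper's descriptions were drafted by ChatGPT
    "bug": "the issue reports unexpected or broken behavior",
    "documentation": "the issue concerns the project documentation",
    "feature": "the issue requests new functionality or an enhancement",
    "question": "the issue asks for support or clarification",
}

def build_prompt(title: str, body: str) -> str:
    labels = "\n".join(f"- {name}: {desc}" for name, desc in LABEL_DESCRIPTIONS.items())
    return (
        "You will receive a GitHub issue consisting of a title and a body.\n"  # input format
        "Classify the issue into exactly one of the following labels:\n"       # task description
        f"{labels}\n\n"                                                        # label descriptions
        f"Title: {title}\nBody: {body}\n\n"                                    # input issue
        'Answer with a JSON object of the form '                               # output format
        '{"reasoning": "<your reasoning>", "label": "<label>"}.'
    )

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k-0613",
    temperature=0,  # reduce output variability for classification
    messages=[{"role": "user",
               "content": build_prompt("App crashes on save", "Steps to reproduce: ...")}],
)
print(response.choices[0].message.content)
```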
Evaluation. In line with previous work [6, 7, 10, 11], the evaluation of the classifiers on the test set is provided in terms of precision, recall, and F1-measure [15]. For GPT-like LLMs, we parse the JSON response and extract the predicted label. In cases in which the label is not valid or the model did not follow the instructions appropriately, we discard the prediction. This process is done with the use of regular expressions. Both models are tested on the manually verified test set [19].
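A minimal sketch of this parsing-and-scoring step follows; the regular expression, helper name, and toy data are illustrative, not our exact implementation.

```python
# Minimal sketch of parsing the model's JSON output and scoring predictions
# (illustrative regex, helper name, and toy data).
import json
import re
from sklearn.metrics import precision_recall_fscore_support

VALID_LABELS = {"bug", "documentation", "feature", "question"}

def extract_label(response_text: str):
    """Return the predicted label, or None when the output must be discarded."""
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)  # grab the JSON object
    if match is None:
        return None
    try:
        label = str(json.loads(match.group(0)).get("label", "")).lower()
    except json.JSONDecodeError:
        return None
    return label if label in VALID_LABELS else None  # discard invalid labels

y_true = ["bug", "feature"]
y_pred = [extract_label('{"reasoning": "crash report", "label": "bug"}'),
          extract_label('{"reasoning": "asks for dark mode", "label": "Feature"}')]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```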
4. Results and Discussion

4.1. Impact of label consistency on the classifier performance (RQ1)

In Table 2, we present the results obtained by training the SETFIT classifier on the hand-labeled gold standard and evaluating it on both the hand-labeled test set (a) and the full test set distributed for the challenge (c). To ensure a fair comparison, we compared the SETFIT model's performance with the performance obtained by RoBERTa on the same test set, when trained on the hand-labeled gold standard set (b1). Furthermore, we also include the performance obtained by training the RoBERTa classifier on the full train set distributed by the organizers (b2).

To assess the ability of the models to generalize on a broader dataset, we also include a comparison with the NLBSE'23 challenge baseline [20] (see row (d) of the table) and the SETFIT model's performance on the full challenge test set (see row (c) in the table). It is worth noting that the SETFIT model is designed to learn from a few examples. As such, it was not possible to train it on the raw dataset, since it is not optimized for such a setting and training would have been heavily time-expensive. Instead, the RoBERTa baseline is trained on the full set.

Table 2
Performance of the SETFIT model and comparison with the RoBERTa baseline approach. Row (c) corresponds to the model submitted to the challenge; row (a) is the best performance obtained with SETFIT.

Model                  | Train                  | Test                   | F1
(a) SETFIT             | Sampled, manual labels | Sampled, manual labels | 0.8321
(b1) RoBERTa           | Sampled, manual labels | Sampled, manual labels | 0.4348
(b2) RoBERTa           | Full, GitHub labels    | Sampled, manual labels | 0.8182
(c) SETFIT             | Sampled, manual labels | Full, GitHub labels    | 0.7767
(d) RoBERTa (baseline) | Full, GitHub labels    | Full, GitHub labels    | 0.8890

The SETFIT model achieved an F1-micro score of 0.7767 (see row (c) in Table 2) when trained on the manually labeled gold standard and tested on the raw test set. When trained and evaluated on the manually labeled dataset (a), SETFIT performs better than RoBERTa (b1 and b2), regardless of whether the training set used for RoBERTa is raw or manually labeled. However, when trained on the manually labeled dataset (b1), RoBERTa struggles to deliver good performance due to a shortage of training data. On the other hand, when trained on the raw dataset (b2), RoBERTa achieves competitive performance, but it is unable to outperform SETFIT (a).

As the manually labeled dataset embodies the ideal labeling criteria for classifiers, comparing SETFIT (a) and RoBERTa (b2) provides a practical scenario in which we must choose between training a classifier on a large volume of data with disregard for data quality and concentrating on a smaller portion of data while manually improving label quality. This comparison suggests that data quality might be crucial for ensuring classification accuracy. A potential approach could be to start with a few-shot classifier and gradually switch to a more powerful model like RoBERTa when a fair amount of manually verified data becomes available. By doing so, we can strike a balance between data quantity and quality, ensuring that the classifier performs effectively while minimizing the possibility of inaccurate results caused by inconsistency in the labeling.
4.2. Leveraging GPT for automatic issue report classification (RQ2)

In Table 3, we report the classification performance of GPT compared to the SETFIT model. As already explained in the previous section, we experimented with several versions of GPT 3.5 that were available at the time of the study. For a full report of the results, see Colavito et al. [19]. In this paper, we consider the 16k-0613 model only, as it achieves the best performance in terms of a combination of F1 and percentage of items discarded due to nonsensical model output. Specifically, none of the predictions from this model were discarded. We observe that the Feature class achieves the best F1, while the Documentation class is the most problematic to identify, showing a lower recall than the other classes.

Table 3
Comparison between SETFIT and GPT-3.5 (16k-0613), zero-shot.

               |           SETFIT            |  GPT-3.5 (16k-0613), zero-shot
Label          | Precision  Recall  F1-Score | Precision  Recall  F1-Score
Bug            | 0.8723     0.8472  0.8590   | 0.7133     0.9811  0.8261
Documentation  | 0.9039     0.6594  0.7616   | 0.8853     0.6191  0.7285
Feature        | 0.7494     0.9182  0.8251   | 0.8861     0.8491  0.8672
Question       | 0.8754     0.8319  0.8528   | 0.8668     0.7719  0.8164
Overall        | 0.8321     0.8321  0.8321   | 0.8155     0.8155  0.8155
While the zero-shot GPT model achieves a slightly lower performance (F1 = 0.8155) than SETFIT (F1 = 0.8321), the models are still comparable. It is worth noting that SETFIT was fine-tuned on a portion of the issue report gold standard dataset, while GPT was evaluated in a zero-shot setting without any task-specific fine-tuning. This implies that GPT is capable of classifying issue reports with only a minor decrease in accuracy compared to fine-tuned BERT-like models. This is a major benefit of GPT for this application, since it can perform the classification in the absence of labeled data, i.e., without the need for fine-tuning. This evidence could help maintainers of new projects, for which historical data is not available or is scarce. In such cases, API calls to GPT could be used to classify issue reports, providing a valuable tool for project management. Once the project has accumulated enough labeled data, the maintainer could switch to a fine-tuned model to improve the classification accuracy. Although this could be a viable solution for open-source projects, it is worth noting that the cost of API calls and the privacy of data could limit its practical feasibility in commercial projects. In such cases, project maintainers might consider using open-source models or building and deploying a classifier on-premise. Nonetheless, the construction and maintenance of LLMs is expensive both in terms of resources and time, and this constitutes a barrier to their adoption in most cases.

5. Conclusion and Future Works

In this paper, we summarized the outcomes of our recently published studies on the use of large language models for automated issue classification. Specifically, we investigated the impact of improving data quality on issue classification performance. We trained and evaluated a model based on few-shot learning using SETFIT with a subset of manually verified data. The model achieves better performance than the RoBERTa baseline when trained and tested on data for which label consistency was manually verified [22]. However, RoBERTa generalizes better on the full test dataset when fine-tuned on the full crowd-sourced dataset.

Furthermore, we explored the performance of GPT-like models for automatic issue classification [19] to understand if we can leverage GPT-like LLMs to achieve state-of-the-art performance in the absence of manually annotated issues, i.e., when a gold standard is not available for fine-tuning state-of-the-art approaches based on BERT-like models. Our empirical results show that GPT-like models can achieve a performance comparable to the state-of-the-art without the need for fine-tuning. This suggests that when manual annotation is not feasible or a gold standard for training is not available (e.g., in a new project), maintainers could rely on generative AI to successfully address the issue classification task.

However, using LLMs to build issue classifiers might pose important challenges due to licensing and computational limitations. As such, we plan to extend this benchmark with open-source LLMs and with additional issue-report datasets. This will enable evaluating the generalizability of our findings.

Acknowledgments

This research was co-funded by the NRRP Initiative, Mission 4, Component 2, Investment 1.3 - Partnerships extended to universities, research centres, companies and research D.D. MUR n. 341 del 15.03.2022 – Next Generation EU ("FAIR - Future Artificial Intelligence Research", code PE00000013, CUP H97G22000210007) and by the European Union - NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 ("QualAI: Continuous Quality Improvement of AI-based Systems", grant n. 2022B3BP5S, CUP: H53D23003510006).
References

[1] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, Y.-G. Guéhéneuc, Is it a bug or an enhancement? A text-based approach to classify change requests, in: Proc. of the 2008 Conf. of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, ACM, New York, NY, USA, 2008. doi:10.1145/1463788.1463819.
[2] K. Herzig, S. Just, A. Zeller, It's not a bug, it's a feature: How misclassification impacts bug prediction, in: 2013 35th Int'l Conf. on Software Engineering (ICSE), 2013. doi:10.1109/ICSE.2013.6606585.
[3] N. Pandey, D. Sanyal, A. Hudait, A. Sen, Automated classification of software issue reports using machine learning techniques: an empirical study, Innovations in Systems and Software Engineering (2017). doi:10.1007/s11334-017-0294-1.
[4] O. Levy, Y. Goldberg, Neural word embedding as implicit matrix factorization, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc., 2014.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proc. of the 26th Int'l Conf. on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013.
[6] R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella, Predicting issue types on GitHub, Science of Computer Programming (2021). doi:10.1016/j.scico.2020.102598.
[7] R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella, Ticket Tagger: Machine learning driven issue classification, in: 2019 IEEE Int'l Conf. on Software Maintenance and Evolution (ICSME), IEEE, 2019. doi:10.1109/ICSME.2019.00070.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2019. doi:10.18653/v1/N19-1423.
[9] R. Kallis, O. Chaparro, A. Di Sorbo, S. Panichella, NLBSE'22 tool competition, in: Proc. of The 1st Int'l Work. on Natural Language-based Software Eng. (NLBSE'22), 2022.
[10] G. Colavito, F. Lanubile, N. Novielli, Issue report classification using pre-trained language models, in: 2022 IEEE/ACM 1st Int'l Workshop on Natural Language-Based Software Eng. (NLBSE), IEEE Computer Society, USA, 2022. doi:10.1145/3528588.3528659.
[11] M. Izadi, CatIss: An intelligent tool for categorizing issues reports using transformers, in: (NLBSE 2022), 2022. doi:10.1145/3528588.3528662.
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[13] X. Wu, W. Zheng, X. Xia, D. Lo, Data quality matters: A case study on data label correctness for security bug report prediction, IEEE Transactions on Software Engineering (2022). doi:10.1109/TSE.2021.3063727.
[14] L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, O. Pereg, Efficient few-shot learning without prompts, 2022. doi:10.48550/arXiv.2209.11055.
[15] G. Colavito, F. Lanubile, N. Novielli, Few-shot learning for issue report classification, in: 2023 IEEE/ACM 2nd Int'l Work. on Natural Language-Based Software Eng. (NLBSE), 2023.
[16] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, 2023. arXiv:2308.10620.
[17] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, 2023. arXiv:2310.03533.
[18] OpenAI, ChatGPT: Optimizing language models for dialogue, 2022.
[19] G. Colavito, F. Lanubile, N. Novielli, L. Quaranta, Leveraging GPT-like LLMs to automate issue labeling, in: 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR) (to appear), 2024. doi:10.1145/3643991.3644903.
[20] R. Kallis, M. Izadi, L. Pascarella, O. Chaparro, P. Rani, The NLBSE'23 tool competition, in: Proc. of The 2nd Int'l Work. on Natural Language-based Software Engineering (NLBSE'23), 2023.
[21] A. J. Viera, J. M. Garrett, Understanding interobserver agreement: the kappa statistic, Family Medicine (2005).
[22] G. Colavito, F. Lanubile, N. Novielli, Few-shot learning for issue report classification, 2023. doi:10.5281/zenodo.7628150.
[23] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[24] S. Ouyang, J. M. Zhang, M. Harman, M. Wang, LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation, 2023. arXiv:2308.02828.
[25] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837.
[26] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 22199–22213.