<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large Language Models for Issue Report Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Colavito</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Lanubile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicole Novielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Quaranta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bari "Aldo Moro"</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Effective issue classification is crucial for efficient software project management. However, labels assigned to issues are often inconsistent, which can negatively impact the performance of supervised classification models. In this work, we investigate how label consistency and training data size affect automatic issue classification. We first evaluate a few-shot learning approach on a manually validated dataset and compare it to fine-tuning on a larger crowd-sourced set. The results show that our approach achieves higher accuracy when trained and tested on consistent labels. We then examine zero-shot classification using GPT-3.5, finding that its performance is comparable to that of supervised models despite having no fine-tuning. This suggests that generative models can help classify issues when annotated data is limited. Overall, our findings provide insights into balancing data quantity and quality for issue classification.</p>
      </abstract>
      <kwd-group>
        <kwd>Issue classification</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Generative AI</kwd>
        <kwd>Software Maintenance and Evolution</kwd>
        <kwd>Few-Shot Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Ital-IA 2024: 4th National Conference on Artificial Intelligence,
organized by CINI, May 29-30, 2024, Naples, Italy.
giuseppe.colavito@uniba.it (G. Colavito);
filippo.lanubile@uniba.it (F. Lanubile); nicole.novielli@uniba.it
(N. Novielli); luigi.quaranta@uniba.it (L. Quaranta)</p>
      <p>0000-0003-3871-401X (G. Colavito); 0000-0003-3373-7589
(F. Lanubile); 0000-0003-1160-2608 (N. Novielli);
0000-0002-9221-0739 (L. Quaranta)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)</p>
      <p>mains. With the advent of recent GPT-like Large Language Models (LLMs), researchers have started investigating their potential in solving software engineering challenges [16, 17]. To better understand how GPT-like LLMs can be leveraged in automated issue labeling in the absence of training data, we formulate and investigate our second research question as follows:</p>
      <p>RQ2: To what extent can we leverage GPT-like LLMs to classify issue reports?</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Distribution of labels in the extracted samples.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Label</th><th colspan="2">Train set</th><th colspan="2">Test set</th></tr>
          </thead>
          <tbody>
            <tr><td>Bug</td><td>47</td><td>24%</td><td>53</td><td>27%</td></tr>
            <tr><td>Documentation</td><td>33</td><td>17%</td><td>32</td><td>16%</td></tr>
            <tr><td>Feature</td><td>60</td><td>30%</td><td>55</td><td>28%</td></tr>
            <tr><td>Question</td><td>44</td><td>22%</td><td>47</td><td>24%</td></tr>
            <tr><td>Discarded</td><td>16</td><td>8%</td><td>13</td><td>7%</td></tr>
            <tr><td>Total</td><td colspan="2">200</td><td colspan="2">200</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>To address it, we evaluate GPT3.5-turbo [18] in a zero-shot learning scenario, where the model is prompted by only providing the task and label descriptions. We compare the performance of classifiers based on GPT-like LLMs with fine-tuned BERT-like LLMs [19].</p>
      <p>In this paper, we discuss our ongoing work on using LLMs to address software engineering challenges, with a particular focus on the automatic classification of issue reports in a low-resource setting. Specifically, we summarize the findings of two recent studies in which we addressed the research questions formulated above [15, 19]. The remainder of the paper is organized as follows. In Sections 2 and 3, we describe the datasets and methodology adopted in our empirical studies, respectively. Then, we report and discuss the study results in Section 4. The paper is concluded in Section 5, where we also outline directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>For our SETFIT model, we preprocess our dataset as follows. First, non-textual items, such as links, code snippets, and images, are identified and replaced with tokens (e.g., &lt;link&gt; for links) in the dataset. Next, we use the ekphrasis Text Pre-Processor (https://github.com/cbaziotis/ekphrasis) to normalize the text by detecting and replacing items such as URLs, email addresses, symbols, phone numbers, mentions, time, date, and numbers with specific tokens.</p>
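<p>A minimal sketch of the token-replacement step described above, using plain regular expressions with illustrative patterns (the actual pipeline delegates most of this normalization to ekphrasis):</p>

```python
import re

def preprocess_issue(text: str) -> str:
    """Replace non-textual items with placeholder tokens before
    feeding issues to the classifier (illustrative sketch)."""
    # Fenced code blocks and inline code become a <code> token.
    text = re.sub(r"```.*?```", " <code> ", text, flags=re.DOTALL)
    text = re.sub(r"`[^`]+`", " <code> ", text)
    # Markdown images become an <image> token (before URLs, so the
    # image URL is consumed together with the image markup).
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " <image> ", text)
    # Remaining URLs become a <link> token, as in the paper.
    text = re.sub(r"https?://\S+", " <link> ", text)
    # Collapse whitespace left over from the replacements.
    return re.sub(r"\s+", " ", text).strip()

issue = "Crash on save ![trace](https://example.com/t.png) see `save()` and https://example.com/docs"
print(preprocess_issue(issue))  # Crash on save <image> see <code> and <link>
```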
      </sec>
      <sec id="sec-2-2">
        <title>Gold standard dataset</title>
        <p>To address our research questions, we use a dataset of 400 GitHub issues labeled as bug, feature, question, and documentation. The dataset is split into two subsets of 200 issues which we use as train and test sets, respectively. Both subsets are equally distributed and include 50 issues per class.</p>
        <p>
          Our dataset is obtained by manually labeling the 400 randomly selected items from the dataset of 1.4M GitHub issues distributed by the NLBSE’23 tool competition organizers [
          <xref ref-type="bibr" rid="ref40">20</xref>
          ]. To manually ensure the consistency of labels in our dataset, three annotators individually categorized each issue report based on the information in its title and body. Each issue report was assigned to two of the annotators. We observed a Cohen’s κ of 0.74, which indicates a substantial level of inter-rater agreement [21]. The annotators had a joint plenary meeting to discuss and resolve the cases of disagreement. Through this procedure, we ensured the reliability and consistency of the annotations. Table 1 presents the dataset’s distribution before and after the manual labeling. The manually annotated sample is publicly available [
          <xref ref-type="bibr" rid="ref35 ref36">22</xref>
          ].
        </p>
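<p>The agreement statistic reported above can be computed directly from the two annotators' label assignments; a minimal sketch of Cohen's κ with toy data (not the study's annotations):</p>

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative toy data only.
a = ["bug", "bug", "feature", "question", "bug", "feature"]
b = ["bug", "feature", "feature", "question", "bug", "feature"]
print(round(cohens_kappa(a, b), 3))  # 0.739
```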
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        To address our first research question, we investigate the efficacy of few-shot learning for training robust classifiers using the small manually validated training dataset described in Section 2. In particular, we train and evaluate a model based on SETFIT [14] using the manually labeled train and test sets. Then we compare its performance with the one obtained by fine-tuning RoBERTa [15] using the full dataset of 1.4M crowd-annotated issues [
        <xref ref-type="bibr" rid="ref40">20</xref>
        ]. To address our second research question, we compare the performance of the SETFIT classifier with the performance achieved by GPT 3.5 in a zero-shot learning scenario. We highlight that prompting is only used for GPT, while the SETFIT model is trained on the manually labeled data. Both models are evaluated on the test set partition of manually labeled issues.
      </p>
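<p>The core idea behind SETFIT is to fine-tune a sentence-embedding model on contrastive pairs built from the few labeled examples, and then fit a classification head on the resulting embeddings. A minimal sketch of the pair-generation step, with toy data and without the actual setfit library API:</p>

```python
import itertools
import random

def contrastive_pairs(examples, seed=0):
    """Build (text_1, text_2, similarity) training pairs from a few
    labeled examples: same-label pairs are positives (1.0),
    cross-label pairs are negatives (0.0)."""
    rng = random.Random(seed)
    pairs = [
        (t1, t2, 1.0 if l1 == l2 else 0.0)
        for (t1, l1), (t2, l2) in itertools.combinations(examples, 2)
    ]
    rng.shuffle(pairs)
    return pairs

examples = [
    ("App crashes on startup", "bug"),
    ("Null pointer when saving", "bug"),
    ("Add dark mode", "feature"),
    ("How do I configure proxies?", "question"),
]
pairs = contrastive_pairs(examples)
print(len(pairs))  # 6 pairs from 4 examples
```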
      <sec id="sec-3-1">
        <title>Choice of GPT-like models</title>
        <p>Several LLMs have been proposed in the last few years, with GPT-3 [23] being one of the most popular. There is a significant prevalence of studies leveraging GPT3.5-turbo [24], an instruction-tuned version of GPT-3, which is able to interact as a chatbot. For this reason, we select GPT3.5-turbo [18] as representative of GPT-like LLMs. We experiment with several versions of GPT3.5-turbo, with varying context length and date of training. Here we only report the results of the model with the best performance. More details can be found in our original work describing this study [19].</p>
      </sec>
      <sec id="sec-3-2">
        <title>Prompting</title>
        <p>To instruct the model to perform the classification task, we create a prompt that includes the following items:</p>
        <list list-type="bullet">
          <list-item><p>Input Format: the format of the input issues, which includes a title and a body;</p></list-item>
          <list-item><p>Task Description: a description of the classification task to be performed, including the possible labels that can be assigned to the issues;</p></list-item>
          <list-item><p>Label Descriptions: a brief description of each label. Label descriptions are generated by ChatGPT and then manually reviewed to ensure they are clear and informative;</p></list-item>
          <list-item><p>Input Issue: the issue to be classified;</p></list-item>
          <list-item>
            <p>
              Output Format Instructions: the desired output format. We ask the model for a JSON object containing a reasoning and the predicted label. This is done to inject some Chain-of-Thought reasoning into the model, as suggested in previous studies about prompting LLMs [
              <xref ref-type="bibr" rid="ref19">25, 26</xref>
              ]. However, the reasoning serves as a prompt-engineering strategy and is not used to evaluate the model.
            </p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation</title>
        <p>
          In line with previous work [
          <xref ref-type="bibr" rid="ref1 ref28">6, 7, 11, 10</xref>
          ], the evaluation of the classifiers on the test set is provided in terms of precision, recall, and F1-measure [15]. For GPT-like LLMs, we parse the JSON response and extract the predicted label. In cases in which the label is not valid or the model did not follow the instructions appropriately, we discard the prediction. This process is done with the use of regular expressions. Both models are tested on the manually verified test set [19].
        </p>
      </sec>
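<p>Putting the prompt items and the response-parsing step together, a minimal sketch of the zero-shot setup. The prompt wording, the label descriptions, and the extraction regex here are illustrative placeholders; the exact prompt is given in the original study [19]:</p>

```python
import json
import re

LABELS = ["bug", "documentation", "feature", "question"]

# Illustrative prompt covering the items listed above: task description,
# label descriptions, input issue, and output-format instructions.
PROMPT_TEMPLATE = """You are an issue report classifier.
Task: assign exactly one label among {labels} to the issue below.
Label descriptions:
- bug: a defect or unexpected behavior of the software
- documentation: a request to add or fix documentation
- feature: a request for new functionality or an enhancement
- question: a request for information or support
Issue title: {title}
Issue body: {body}
Answer with a JSON object: {{"reasoning": "...", "label": "..."}}"""

def build_prompt(title: str, body: str) -> str:
    return PROMPT_TEMPLATE.format(labels=", ".join(LABELS), title=title, body=body)

def extract_label(response: str):
    """Parse the model's JSON answer; return None (discard the
    prediction) when the output is invalid or the label is unknown."""
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        label = json.loads(match.group(0)).get("label", "").lower()
    except json.JSONDecodeError:
        return None
    return label if label in LABELS else None

reply = 'Sure! {"reasoning": "The report describes a crash.", "label": "bug"}'
print(extract_label(reply))  # bug
```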
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Impact of label consistency on the classifier performance (RQ1)</title>
        <p>In Table 2, we present the results obtained by training the SETFIT classifier on the hand-labeled gold standard and evaluating it on both the hand-labeled test set (a) and the full test set distributed for the challenge (c). To ensure a fair comparison, we compared the SETFIT model’s performance with the performance obtained by RoBERTa on the same test set, when trained on the hand-labeled gold standard set (b1). Furthermore, we also include the performance obtained by training the RoBERTa classifier on the full train set distributed by the organizers (b2).</p>
        <p>
          To assess the ability of the models to generalize on a broader dataset, we also include a comparison with the NLBSE ’23 challenge baseline [
          <xref ref-type="bibr" rid="ref40">20</xref>
          ] (see row (d) of the table) and the SETFIT model’s performance on the challenge full test set (see model (c) in the table). It is worth noting that the SETFIT model is designed to learn from a few examples. As such, it was not possible to train it on the raw dataset, since it is not optimized for such a setting and it would have been heavily time expensive. Instead, the RoBERTa baseline is trained on the full set.
        </p>
        <p>The SETFIT model achieved an F1-micro score of .7767 (see model (c) in Table 2) when trained on the manually labeled gold standard and tested on the raw test set. When trained and evaluated on the manually labeled dataset (a), SETFIT performs better than RoBERTa (b1 and b2), regardless of whether the training set used for RoBERTa is raw or manually labeled. However, when trained on the manually labeled dataset (b1), RoBERTa struggles to deliver good performance due to a shortage of training data. On the other hand, when trained on the raw dataset (b2), RoBERTa achieves competitive performances, but it is unable to outperform SETFIT (b).</p>
        <p>As the manually labeled dataset embodies the ideal labeling criteria for classifiers, comparing SETFIT (a) and RoBERTa (b2) provides a practical scenario in which we must choose either training a classifier on a large volume of data with disregard for data quality or concentrating on a smaller portion of data and manually improving label quality. This comparison suggests that data quality might be crucial for ensuring classification accuracy. A potential approach could be to start with a few-shot classifier and gradually switch to a more powerful model like RoBERTa when a fair amount of manually verified data becomes available. By doing so, we can strike a balance between data quantity and quality, ensuring that the classifier performs effectively while minimizing the possibility of inaccurate results caused by inconsistency in the labeling.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Leveraging GPT for automatic issue report classification (RQ2)</title>
        <p>In Table 3, we report the classification performance of GPT compared to the SETFIT model. As already explained in the previous section, we experimented with several versions of GPT 3.5 that were available at the time of the study. For a full report of the results, see Colavito et al. [19]. In this paper, we include consideration of the 16k-0613 model only, as this achieves the best performance in terms of a combination of F1 and percentage of discarded items due to nonsensical model output. Specifically, none of the predictions from this model were discarded. We observe that the Feature class achieves the best F1, while the Documentation class is the most problematic to identify, showing a lower recall than the other classes.</p>
        <p>While the zero-shot GPT model achieves a slightly lower performance (F1 = .8155) than SETFIT (F1 = .8321), the models are still comparable. It is worth noting that SETFIT was fine-tuned on a portion of the issue report gold standard dataset, while GPT was evaluated in a zero-shot setting without any task-specific fine-tuning. This implies that GPT is capable of classifying issue reports with only a minor decrease in accuracy compared to fine-tuned BERT-like models. This presents a major benefit of GPT for this application, since it can perform the classification in the absence of labeled data, i.e., without the need for fine-tuning. This evidence could help maintainers of new projects, for which historical data is not available or is scarce. In such cases, API calls to GPT could be used to classify issue reports, providing a valuable tool for project management. Once the project has accumulated enough labeled data, the maintainer could switch to a fine-tuned model to improve the classification accuracy.</p>
        <p>Although this could be a viable solution for open-source projects, it is worth noting that the cost of API calls and the privacy of data could limit its practical feasibility in commercial projects. In such cases, project maintainers might consider using open-source models or building and deploying a classifier on-premise. Nonetheless, the construction and maintenance of LLMs is expensive both in terms of resources and time, and this constitutes a barrier to their adoption in most cases.</p>
      </sec>
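<p>The F1 figures discussed above follow from the evaluation protocol of Section 3. A minimal micro-averaged sketch, with toy predictions rather than the study's data, and with one possible convention for discarded (None) predictions: precision is computed over answered items only, recall over all items:</p>

```python
def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over a test set.
    None marks a discarded prediction (invalid model output)."""
    tp = sum(t == p for t, p in zip(y_true, y_pred) if p is not None)
    precision = tp / sum(p is not None for p in y_pred)  # over answered items
    recall = tp / len(y_true)                            # over all items
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative toy predictions only.
y_true = ["bug", "feature", "question", "bug", "documentation"]
y_pred = ["bug", "feature", "feature", "bug", None]
p, r, f1 = micro_metrics(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.75 0.6 0.667
```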
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Works</title>
      <p>
        In this paper, we summarized the outcomes of our recently published studies on the use of large language models for automated issue classification. Specifically, we investigated the impact of improving data quality on issue classification performance. We trained and evaluated a model based on few-shot learning using SETFIT with a subset of manually verified data. The model achieves better performance when trained and tested on data for which label consistency was manually verified [
        <xref ref-type="bibr" rid="ref35 ref36">22</xref>
        ], compared to the RoBERTa baseline. However, RoBERTa generalizes better on the full test dataset when fine-tuned on the full crowd-sourced dataset.
      </p>
      <p>Furthermore, we explored the performance of GPT-like models for automatic issue classification [19] to understand if we can leverage GPT-like LLMs to achieve state-of-the-art performance in the absence of manually annotated issues, i.e., when a gold standard is not available for fine-tuning state-of-the-art approaches based on BERT-like models. Our empirical results show that GPT-like models can achieve a performance comparable to the state-of-the-art without the need for fine-tuning. This suggests that when manual annotation is not feasible or a gold standard for training is not available (i.e., on a new project), maintainers could rely on generative AI to successfully address the issue classification task.</p>
      <p>However, using LLMs to build issue classifiers might pose important challenges due to licensing and computational limitations. As such, we plan to extend this benchmark with open-source LLMs, also including issue-report datasets. This will enable evaluating the generalizability of our findings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was co-funded by the NRRP Initiative, Mission 4, Component 2, Investment 1.3 - Partnerships extended to universities, research centres, companies and research D.D. MUR n. 341 del 15.03.2022 – Next Generation EU (“FAIR - Future Artificial Intelligence Research”, code PE00000013, CUP H97G22000210007) and by the European Union - NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 (“QualAI: Continuous Quality Improvement of AI-based Systems”, grant n. 2022B3BP5S, CUP: H53D23003510006).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Izadi</surname>
          </string-name>
          ,
          <article-title>CatIss: An Intelligent Tool for Categoriz-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>ing Issues</surname>
          </string-name>
          <article-title>Reports using Transformers</article-title>
          , in: (NLBSE [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Antoniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ayari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Di</given-names>
            <surname>Penta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khomh</surname>
          </string-name>
          , Y.-
          <year>2022</year>
          ),
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .1145/3528588.3528662.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Guéhéneuc</surname>
          </string-name>
          ,
          <article-title>Is it a bug or an enhancement? a</article-title>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , D. Chen,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>in: Proc. of the 2008 Conf</source>
          .
          <article-title>of the Center for Ad- Roberta: A robustly optimized bert pretraining ap-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>vanced Studies on Collaborative Research: Meeting proach</source>
          ,
          <year>2019</year>
          . arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>of Minds, CASCON '08</source>
          ,
          <string-name>
            <surname>ACM</surname>
            , New York, NY, USA, [13]
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Lo</surname>
          </string-name>
          , Data quality mat-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          2008. doi:
          <volume>10</volume>
          .1145/1463788.1463819.
          <article-title>ters: A case study on data label correctness for [2</article-title>
          ]
          <string-name>
            <given-names>K.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeller</surname>
          </string-name>
          ,
          <article-title>It's not a bug, it's a fea- security bug report prediction</article-title>
          , IEEE Transactions
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>ture: How misclassification impacts bug prediction</article-title>
          ,
          <source>on Software Engineering</source>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .1109/TSE.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>in: 2013 35th Int'l Conf.on Software Engineering</source>
          <year>2021</year>
          .
          <volume>3063727</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>(ICSE)</source>
          ,
          <year>2013</year>
          . doi:
          <volume>10</volume>
          .1109/ICSE.
          <year>2013</year>
          .
          <volume>6606585</volume>
          . [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. E. S.</given-names>
            <surname>Jo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korat</surname>
          </string-name>
          , [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sanyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hudait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <string-name>
            <surname>Auto- M. Wasserblat</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Pereg</surname>
          </string-name>
          , Eficient
          <string-name>
            <surname>Few-Shot</surname>
          </string-name>
          Learn-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>mated classification of software issue reports using ing Without Prompts</article-title>
          ,
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>machine learning techniques: an empirical study</article-title>
          ,
          <volume>2209</volume>
          .
          <fpage>11055</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>Innovations in Systems and Software Engineering</source>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Colavito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lanubile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Novielli</surname>
          </string-name>
          , Few-shot
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          (
          <year>2017</year>
          ). doi:
          <volume>10</volume>
          .1007/s11334-017
          <article-title>-0294-1. learning for issue report classification</article-title>
          , in:
          <year>2023</year>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <source>Neural word embedding as IEEE/ACM 2nd Int'l Work. on Natural Language-</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>implicit matrix factorization</article-title>
          ,
          <source>in: Z. Ghahramani, Based Software Eng. (NLBSE)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Wein-</surname>
          </string-name>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>cessing Systems</source>
          , Curran Assoc., Inc.,
          <year>2014</year>
          .
          <article-title>models for software engineering: A systematic lit</article-title>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, erature review,
          <year>2023</year>
          . arXiv:
          <volume>2308</volume>
          .
          <fpage>10620</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words</article-title>
          and [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gokkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harman</surname>
          </string-name>
          , M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang,
          <article-title>Large language models for software engineering: Survey and open problems</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.03533.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>26th Int'l Conf. on Neural Inf. Proc. Systems - Volume 2</source>
          , NIPS'13, Curran Associates Inc., Red Hook, NY, USA,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella,
          <article-title>Predicting issue types on GitHub</article-title>
          ,
          <source>Science of Computer Programming</source>
          (
          <year>2021</year>
          ). doi:10.1016/j.scico.2020.102598.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella,
          <article-title>Ticket Tagger: Machine learning driven issue classification</article-title>
          , in:
          <source>2019 IEEE Int'l. Conf. on Software Maintenance and Evolution (ICSME)</source>
          , IEEE,
          <year>2019</year>
          . doi:10.1109/ICSME.2019.00070.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in:
          <source>Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , ACL,
          <year>2019</year>
          . doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          R. Kallis, O. Chaparro, A. Di Sorbo, S. Panichella,
          <article-title>NLBSE'22 tool competition</article-title>
          , in:
          <source>Proc. of The 1st Int'l Workshop on Natural Language-based Software Engineering (NLBSE'22)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          G. Colavito, F. Lanubile, N. Novielli,
          <article-title>Issue report classification using pre-trained language models</article-title>
          , in:
          <source>2022 IEEE/ACM 1st Int'l Workshop on Natural Language-based Software Engineering (NLBSE'22)</source>
          , IEEE Computer Society, USA,
          <year>2022</year>
          . doi:10.1145/3528588.3528659.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          OpenAI,
          <source>ChatGPT: Optimizing Language Models for Dialogue</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          G. Colavito, F. Lanubile, N. Novielli, L. Quaranta,
          <article-title>Leveraging GPT-like LLMs to automate issue labeling</article-title>
          , in:
          <source>2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR) (to appear)</source>
          ,
          <year>2024</year>
          . doi:10.1145/3643991.3644903.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          R. Kallis, M. Izadi, L. Pascarella, O. Chaparro, P. Rani,
          <article-title>The NLBSE'23 tool competition</article-title>
          , in:
          <source>Proc. of The 2nd Int'l. Work. on Natural Language-based Software Engineering (NLBSE'23)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          A. J. Viera, J. M. Garrett,
          <article-title>Understanding interobserver agreement: the kappa statistic</article-title>
          ,
          <source>Family Medicine</source>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          G. Colavito, F. Lanubile, N. Novielli,
          <article-title>Few-shot learning for issue report classification</article-title>
          ,
          <year>2023</year>
          . doi:10.5281/zenodo.7628150.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei,
          <article-title>Language models are few-shot learners</article-title>
          , in:
          <source>Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS'20, Curran Associates Inc., Red Hook, NY, USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          S. Ouyang, J. M. Zhang, M. Harman, M. Wang,
          <article-title>LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation</article-title>
          ,
          <year>2023</year>
          . arXiv:2308.02828.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          , in:
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          , Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa,
          <article-title>Large language models are zero-shot reasoners</article-title>
          , in:
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          , Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>