Large Language Models for Issue Report Classification

Giuseppe Colavito, Filippo Lanubile, Nicole Novielli and Luigi Quaranta
University of Bari "Aldo Moro", Italy

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
giuseppe.colavito@uniba.it (G. Colavito); filippo.lanubile@uniba.it (F. Lanubile); nicole.novielli@uniba.it (N. Novielli); luigi.quaranta@uniba.it (L. Quaranta)
ORCID: 0000-0003-3871-401X (G. Colavito); 0000-0003-3373-7589 (F. Lanubile); 0000-0003-1160-2608 (N. Novielli); 0000-0002-9221-0739 (L. Quaranta)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Abstract
Effective issue classification is crucial for efficient software project management. However, labels assigned to issues are often inconsistent, which can negatively impact the performance of supervised classification models. In this work, we investigate how label consistency and training data size affect automatic issue classification. We first evaluate a few-shot learning approach on a manually validated dataset and compare it to fine-tuning on a larger crowd-sourced set. The results show that our approach achieves higher accuracy when trained and tested on consistent labels. We then examine zero-shot classification using GPT-3.5, finding that its performance is comparable to supervised models despite having no fine-tuning. This suggests that generative models can help classify issues when annotated data is limited. Overall, our findings provide insights into balancing data quantity and quality for issue classification.

Keywords
Issue classification, Large Language Models, Generative AI, Software Maintenance and Evolution, Few-Shot Learning

1. Introduction

Collaborative software development involves complex processes and activities to effectively support software development and maintenance. In this context, issue-tracking systems are widely adopted to manage requests for changes – such as bug fixes or product enhancements, as well as requests for support from users – and are regarded as essential tools for maintainers to efficiently manage software evolution activities.

Issue reports organized in such systems typically contain information such as an identifier, a description, the author, the issue status (e.g., open, assigned, closed), a comment thread, and a label indicating the type of issue, such as bug, enhancement, or support. Effective labeling of issue reports is of paramount importance to support prioritization and decision-making. Unfortunately, however, label misuse is a common problem, as submitters often confuse improvement requests with bugs and vice versa [1]. For example, Herzig et al. [2] reported that approximately 33.8% of all issue reports are incorrectly labeled. To avoid dealing with incorrect labels, automated classification methods have been proposed. Automatic issue classification can enable effective issue management and prioritization [3], without the need to instruct developers on how to assign labels correctly.

Early research on this topic proposed exploiting supervised methods that leverage text-based features for the task of automatic issue report classification [1]. More recently, approaches leveraging word embeddings have emerged [4, 5, 6, 7]. In particular, approaches based on BERT [8] and its variants achieved state-of-the-art performance [9, 10, 11].

In our previous work, we conducted an empirical study to investigate to what extent we can leverage pre-trained language models for automatic issue labeling [10]. We experimented with a dataset of more than 800K issue reports from GitHub open-source software projects, labeled by project contributors as bug, enhancement, or question [9]. We fine-tuned the BERT [8] variant RoBERTa [12], achieving state-of-the-art performance (F1 = 0.8591).

Our manual error analysis revealed that the main cause of the misclassification of issues is label inconsistency across different projects. Also, several issue reports in the dataset were tagged with more than one label, which is indeed a source of noise. This evidence is in line with previous studies reporting the impact of data quality on the performance of machine learning models [13]. Informed by the results of our error analysis and by the findings of previous research, we formulate the following research question:
RQ1: To what extent does label consistency impact the performance of supervised issue classification models?

To address it, we investigate the efficacy of few-shot learning for training robust classifiers using a small training dataset with manually validated labels. Specifically, we experiment with SETFIT, an effective methodology for fine-tuning transformer-based models using few-shot learning [14], achieving promising results [15].

Still, manual annotation can be a costly task, both in terms of time and resources, even if done on a small set of manually curated examples. Hence, the need for minimizing the effort associated with data labeling remains. With the advent of recent GPT-like Large Language Models (LLMs), researchers have started investigating their potential for solving software engineering challenges [16, 17]. To better understand how GPT-like LLMs can be leveraged for automated issue labeling in the absence of training data, we formulate and investigate our second research question as follows:

RQ2: To what extent can we leverage GPT-like LLMs to classify issue reports?

To address it, we evaluate GPT-3.5-turbo [18] in a zero-shot learning scenario, where the model is prompted by only providing the task and label descriptions. We compare the performance of classifiers based on GPT-like LLMs with that of fine-tuned BERT-like LLMs [19].

In this paper, we discuss our ongoing work on using LLMs to address software engineering challenges, with a particular focus on the automatic classification of issue reports in a low-resource setting. Specifically, we summarize the findings of two recent studies in which we addressed the research questions formulated above [15, 19].

The remainder of the paper is organized as follows. In Sections 2 and 3, we describe the datasets and methodology adopted in our empirical studies, respectively. Then, we report and discuss the study results in Section 4. The paper is concluded in Section 5, where we also outline directions for future work.
2. Dataset

To address our research questions, we use a dataset of 400 GitHub issues labeled as bug, feature, question, or documentation. The dataset is split into two subsets of 200 issues, which we use as train and test sets, respectively. Both subsets are equally distributed according to the original labels, each including 50 issues per class. Our dataset is obtained by manually labeling 400 randomly selected items from the dataset of 1.4M GitHub issues distributed by the NLBSE'23 tool competition organizers [20]. To manually ensure the consistency of labels in our dataset, three annotators individually categorized each issue report based on the information in its title and body. Each issue report was assigned to two of the annotators. We observed a Cohen's κ of 0.74, which indicates a substantial level of interrater agreement [21]. The annotators then had a joint plenary meeting to discuss and resolve the cases of disagreement. Through this procedure, we ensured the reliability and consistency of the annotations. Table 1 presents the dataset's distribution after the manual labeling. The manually annotated sample is publicly available [22].

Table 1
Distribution of labels in the extracted samples.

  Label          Train set    Test set
  Bug            47 (24%)     53 (27%)
  Documentation  33 (17%)     32 (16%)
  Feature        60 (30%)     55 (28%)
  Question       44 (22%)     47 (24%)
  Discarded      16 (8%)      13 (7%)
  Total          200          200
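To give a concrete sense of the agreement measure used above, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance given each annotator's label distribution. The sketch below illustrates the computation on hypothetical annotations, not on our actual study data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations for eight issues (illustrative only):
a = ["bug", "bug", "feature", "question", "doc", "bug", "feature", "doc"]
b = ["bug", "feature", "feature", "question", "doc", "bug", "feature", "bug"]
print(round(cohens_kappa(a, b), 2))
```

Values above 0.61 are conventionally read as substantial agreement [21], which is the range our observed κ of 0.74 falls into.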
3. Methodology

To address our first research question, we investigate the efficacy of few-shot learning for training robust classifiers using the small manually validated training dataset described in Section 2. In particular, we train and evaluate a model based on SETFIT [14] using the manually labeled train and test sets. Then we compare its performance with the one obtained by fine-tuning RoBERTa [15] using the full dataset of 1.4M crowd-annotated issues [20].

To address our second research question, we compare the performance of the SETFIT classifier with the performance achieved by GPT-3.5 in a zero-shot learning scenario. We highlight that prompting is only used for GPT, while the SETFIT model is trained on the manually labeled data. Both models are evaluated on the test set partition of manually labeled issues.

Preprocessing. For our SETFIT model, we preprocess our dataset as follows. First, non-textual items, such as links, code snippets, and images, are identified and replaced with placeholder tokens in the dataset. Next, we use the ekphrasis Text Pre-Processor (https://github.com/cbaziotis/ekphrasis) to normalize the text by detecting and replacing items such as URLs, email addresses, symbols, phone numbers, mentions, time, date, and numbers with specific tokens.
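A minimal sketch of this normalization step is shown below. It uses plain regular expressions rather than the actual ekphrasis configuration, and the token names are illustrative assumptions:

```python
import re

# Illustrative normalization rules; the study uses the ekphrasis
# TextPreProcessor, which covers more item types (symbols, dates, ...).
RULES = [
    (re.compile(r"```.*?```", re.DOTALL), " <code> "),        # fenced code snippets
    (re.compile(r"https?://\S+"), " <url> "),                 # links
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), " <email> "),
    (re.compile(r"@\w+"), " <user> "),                        # mentions
    (re.compile(r"\b\d+(\.\d+)?\b"), " <number> "),
]

def normalize(text: str) -> str:
    """Replace non-textual and variable items with placeholder tokens."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return re.sub(r"\s+", " ", text).strip()

issue = "Crash when opening https://example.com after 15 seconds, cc @maintainer"
print(normalize(issue))
```

Normalizing highly variable surface forms into shared tokens reduces vocabulary sparsity, which matters when fine-tuning on only a few hundred examples.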
Choice of GPT-like models. Several LLMs have been proposed in the last few years, with GPT-3 [23] being one of the most popular. Studies leveraging GPT-3.5-turbo [24], an instruction-tuned version of GPT-3 that is able to interact as a chatbot, are particularly prevalent. For this reason, we select GPT-3.5-turbo [18] as representative of GPT-like LLMs. We experiment with several versions of GPT-3.5-turbo, with varying context lengths and training dates. Here we only report the results of the model with the best performance. More details can be found in our original work describing this study [19].

Prompting. To instruct the model to perform the classification task, we create a prompt that includes the following items:

• Input Format: the format of the input issues, which includes a title and a body;
• Task Description: a description of the classification task to be performed, including the possible labels that can be assigned to the issues;
• Label Descriptions: a brief description of each label. Label descriptions are generated by ChatGPT and then manually reviewed to ensure they are clear and informative;
• Input Issue: the issue to be classified;
• Output Format Instructions: the desired output format. We ask the model for a JSON object containing a reasoning and the predicted label. This is done to inject some Chain-of-Thought reasoning into the model, as suggested in previous studies about prompting LLMs [25, 26]. However, the reasoning serves as a prompt-engineering strategy and is not used to evaluate the model.

Evaluation. In line with previous work [6, 7, 10, 11], the evaluation of the classifiers on the test set is provided in terms of precision, recall, and F1-measure [15]. For GPT-like LLMs, we parse the JSON response and extract the predicted label. In cases in which the label is not valid or the model did not follow the instructions appropriately, we discard the prediction. This parsing is done with the use of regular expressions. Both models are tested on the manually verified test set [19].
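The prompt assembly and the regex-based extraction of the predicted label can be sketched as follows. The wording of the template and the label descriptions below are illustrative assumptions, not the exact prompt used in the study:

```python
import json
import re

LABELS = {"bug", "documentation", "feature", "question"}

# Illustrative prompt skeleton; the actual task and label descriptions
# used in the study are longer and were manually reviewed.
PROMPT_TEMPLATE = """Issues are given as a title and a body.
Classify the issue into one of: bug, documentation, feature, question.
bug: a report of unexpected or broken behavior.
documentation: a request concerning docs or examples.
feature: a request for new or improved functionality.
question: a request for support or clarification.

Title: {title}
Body: {body}

Answer with a JSON object: {{"reasoning": "...", "label": "..."}}"""

def build_prompt(title: str, body: str) -> str:
    return PROMPT_TEMPLATE.format(title=title, body=body)

def extract_label(response: str):
    """Pull the first JSON object out of the model output; None means discard."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None  # model ignored the output format
    try:
        label = json.loads(match.group(0)).get("label", "").lower()
    except json.JSONDecodeError:
        return None
    return label if label in LABELS else None

reply = 'Sure! {"reasoning": "mentions a crash", "label": "bug"}'
print(extract_label(reply))
```

Returning None for malformed or out-of-vocabulary answers mirrors our discard policy: such predictions are excluded rather than silently mapped to a class.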
4. Results and Discussion

4.1. Impact of label consistency on the classifier performance (RQ1)

In Table 2, we present the results obtained by training the SETFIT classifier on the hand-labeled gold standard and evaluating it on both the hand-labeled test set (a) and the full test set distributed for the challenge (c). To ensure a fair comparison, we compare the SETFIT model's performance with the performance obtained by RoBERTa on the same test set, when trained on the hand-labeled gold standard set (b1). Furthermore, we also include the performance obtained by training the RoBERTa classifier on the full train set distributed by the organizers (b2).

To assess the ability of the models to generalize to a broader dataset, we also include a comparison with the NLBSE'23 challenge baseline [20] (see row (d) of the table) and the SETFIT model's performance on the challenge full test set (see model (c) in the table). It is worth noting that the SETFIT model is designed to learn from a few examples. As such, it was not possible to train it on the raw dataset, since it is not optimized for such a setting and training would have been extremely time-consuming. Instead, the RoBERTa baseline is trained on the full set.

The SETFIT model achieved an F1-micro score of .7767 (see model (c) in Table 2) when trained on the manually labeled gold standard and tested on the raw test set. When trained and evaluated on the manually labeled dataset (a), SETFIT performs better than RoBERTa (b1 and b2), regardless of whether the training set used for RoBERTa is raw or manually labeled. However, when trained on the manually labeled dataset (b1), RoBERTa struggles to deliver good performance due to a shortage of training data. On the other hand, when trained on the raw dataset (b2), RoBERTa achieves competitive performance, but it is unable to outperform SETFIT (a).
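The F1-micro score used above pools true positives, false positives, and false negatives across all classes before computing precision and recall. In single-label classification, where every test item receives exactly one prediction, the pooled false positives and false negatives coincide, so micro-averaged precision, recall, and F1 all reduce to plain accuracy. A small self-contained illustration:

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for single-label data."""
    tp = fp = fn = 0
    classes = set(y_true) | set(y_pred)
    # Pool the per-class counts over all classes.
    for c in classes:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical predictions over five issues (illustrative only):
y_true = ["bug", "bug", "feature", "question", "documentation"]
y_pred = ["bug", "feature", "feature", "question", "bug"]
print(micro_scores(y_true, y_pred))
```

This is why the overall precision, recall, and F1 values reported for each model in our result tables are identical.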
As the manually labeled dataset embodies the ideal labeling criteria for classifiers, comparing SETFIT (a) and RoBERTa (b2) provides a practical scenario in which we must choose between training a classifier on a large volume of data with disregard for data quality and concentrating on a smaller portion of data whose label quality is manually improved. This comparison suggests that data quality might be crucial for ensuring classification accuracy. A potential approach could be to start with a few-shot classifier and gradually switch to a more powerful model like RoBERTa once a fair amount of manually verified data becomes available. By doing so, we can strike a balance between data quantity and quality, ensuring that the classifier performs effectively while minimizing the possibility of inaccurate results caused by inconsistency in the labeling.

Table 2
Performance of the SETFIT model and comparison with the RoBERTa baseline approach. The performance of the model submitted to the challenge is reported in italics. In bold, we highlight the best performance obtained with SETFIT.

  Model                    Train                    Test                     F1
  (a)  SETFIT              Sampled, manual labels   Sampled, manual labels   0.8321
  (b1) RoBERTa             Sampled, manual labels   Sampled, manual labels   0.4348
  (b2) RoBERTa             Full, GitHub labels      Sampled, manual labels   0.8182
  (c)  SETFIT              Sampled, manual labels   Full, GitHub labels      0.7767
  (d)  RoBERTa (baseline)  Full, GitHub labels      Full, GitHub labels      0.8890

4.2. Leveraging GPT for automatic issue report classification (RQ2)

In Table 3, we report the classification performance of GPT compared to the SETFIT model. As already explained in the previous section, we experimented with several versions of GPT-3.5 that were available at the time of the study. For a full report of the results, see Colavito et al. [19]. In this paper, we include only the 16k-0613 model, as it achieves the best performance in terms of a combination of F1 and the percentage of items discarded due to nonsensical model output. Specifically, none of the predictions from this model were discarded. We observe that the Feature class achieves the best F1, while the Documentation class is the most problematic to identify, showing a lower recall than the other classes.

While the zero-shot GPT model achieves a slightly lower performance (F1 = .8155) than SETFIT (F1 = .8321), the models are still comparable. It is worth noting that SETFIT was fine-tuned on a portion of the issue report gold standard dataset, while GPT was evaluated in a zero-shot setting without any task-specific fine-tuning. This implies that GPT is capable of classifying issue reports with only a minor decrease in accuracy compared to fine-tuned BERT-like models. This represents a major benefit of GPT for this application, since it can perform the classification in the absence of labeled data, i.e., without the need for fine-tuning. This evidence could help maintainers of new projects, for which historical data is unavailable or scarce. In such cases, API calls to GPT could be used to classify issue reports, providing a valuable tool for project management. Once the project has accumulated enough labeled data, the maintainer could switch to a fine-tuned model to improve classification accuracy. Although this could be a viable solution for open-source projects, it is worth noting that the cost of API calls and data privacy concerns could limit its practical feasibility in commercial projects. In such cases, project maintainers might consider using open-source models or building and deploying a classifier on-premise. Nonetheless, the construction and maintenance of LLMs is expensive in terms of both resources and time, and this constitutes a barrier to their adoption in most cases.

Table 3
Comparison between SETFIT and GPT-3.5.

                 SETFIT                          GPT-3.5 (16k-0613), zero-shot
  Label          Precision  Recall   F1-Score    Precision  Recall   F1-Score
  Bug            0.8723     0.8472   0.8590      0.7133     0.9811   0.8261
  Documentation  0.9039     0.6594   0.7616      0.8853     0.6191   0.7285
  Feature        0.7494     0.9182   0.8251      0.8861     0.8491   0.8672
  Question       0.8754     0.8319   0.8528      0.8668     0.7719   0.8164
  Overall        0.8321     0.8321   0.8321      0.8155     0.8155   0.8155
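The deployment strategy sketched in this discussion, starting with zero-shot API calls on a new project and switching to a fine-tuned model once enough verified labels exist, can be expressed as a simple policy. The threshold and names below are illustrative assumptions rather than values prescribed by the study; in our experiments, roughly two hundred manually verified issues were already enough to train a competitive SETFIT model:

```python
from dataclasses import dataclass

# Illustrative cut-off; tune per project rather than treating it as fixed.
MIN_LABELED_FOR_FINETUNING = 200

@dataclass
class Project:
    name: str
    num_verified_labels: int

def choose_classifier(project: Project) -> str:
    """Pick a classification strategy based on available verified labels."""
    if project.num_verified_labels >= MIN_LABELED_FOR_FINETUNING:
        return "fine-tuned"  # e.g., a SETFIT or RoBERTa model on project data
    return "zero-shot"       # e.g., prompting a GPT-like model with label descriptions

print(choose_classifier(Project("new-repo", 0)))
print(choose_classifier(Project("mature-repo", 500)))
```

The point of making the policy explicit is that the switch is driven by label quality and quantity, not by model preference alone.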
5. Conclusion and Future Works

In this paper, we summarized the outcomes of our recently published studies on the use of large language models for automated issue classification. Specifically, we investigated the impact of improving data quality on issue classification performance. We trained and evaluated a model based on few-shot learning using SETFIT with a subset of manually verified data. The model achieves better performance than the RoBERTa baseline when trained and tested on data for which label consistency was manually verified [22]. However, RoBERTa generalizes better on the full test dataset when fine-tuned on the full crowd-sourced dataset.

Furthermore, we explored the performance of GPT-like models for automatic issue classification [19] to understand if we can leverage GPT-like LLMs to achieve state-of-the-art performance in the absence of manually annotated issues, i.e., when a gold standard is not available for fine-tuning state-of-the-art approaches based on BERT-like models. Our empirical results show that GPT-like models can achieve performance comparable to the state of the art without the need for fine-tuning. This suggests that when manual annotation is not feasible or a gold standard for training is not available (i.e., on a new project), maintainers could rely on generative AI to successfully address the issue classification task.

However, using LLMs to build issue classifiers might pose important challenges due to licensing and computational limitations. As such, we plan to extend this benchmark with open-source LLMs, also including additional issue-report datasets. This will enable evaluating the generalizability of our findings.

Acknowledgments

This research was co-funded by the NRRP Initiative, Mission 4, Component 2, Investment 1.3 - Partnerships extended to universities, research centres, companies and research D.D. MUR n. 341 del 15.03.2022 – Next Generation EU ("FAIR - Future Artificial Intelligence Research", code PE00000013, CUP H97G22000210007) and by the European Union - NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 ("QualAI: Continuous Quality Improvement of AI-based Systems", grant n. 2022B3BP5S, CUP: H53D23003510006).
References

[1] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, Y.-G. Guéhéneuc, Is it a bug or an enhancement? A text-based approach to classify change requests, in: Proc. of the 2008 Conf. of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, ACM, New York, NY, USA, 2008. doi:10.1145/1463788.1463819.
[2] K. Herzig, S. Just, A. Zeller, It's not a bug, it's a feature: How misclassification impacts bug prediction, in: 2013 35th Int'l Conf. on Software Engineering (ICSE), 2013. doi:10.1109/ICSE.2013.6606585.
[3] N. Pandey, D. Sanyal, A. Hudait, A. Sen, Automated classification of software issue reports using machine learning techniques: an empirical study, Innovations in Systems and Software Engineering (2017). doi:10.1007/s11334-017-0294-1.
[4] O. Levy, Y. Goldberg, Neural word embedding as implicit matrix factorization, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc., 2014.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proc. of the 26th Int'l Conf. on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013.
[6] R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella, Predicting issue types on GitHub, Science of Computer Programming (2021). doi:10.1016/j.scico.2020.102598.
[7] R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella, Ticket Tagger: Machine learning driven issue classification, in: 2019 IEEE Int'l Conf. on Software Maintenance and Evolution (ICSME), IEEE, 2019. doi:10.1109/ICSME.2019.00070.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2019. doi:10.18653/v1/N19-1423.
[9] R. Kallis, O. Chaparro, A. Di Sorbo, S. Panichella, NLBSE'22 tool competition, in: Proc. of The 1st Int'l Work. on Natural Language-based Software Eng. (NLBSE'22), 2022.
[10] G. Colavito, F. Lanubile, N. Novielli, Issue report classification using pre-trained language models, in: 2022 IEEE/ACM 1st Int'l Workshop on Natural Language-Based Software Eng. (NLBSE), IEEE Computer Society, USA, 2022. doi:10.1145/3528588.3528659.
[11] M. Izadi, CatIss: An intelligent tool for categorizing issue reports using transformers, in: (NLBSE 2022), 2022. doi:10.1145/3528588.3528662.
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[13] X. Wu, W. Zheng, X. Xia, D. Lo, Data quality matters: A case study on data label correctness for security bug report prediction, IEEE Transactions on Software Engineering (2022). doi:10.1109/TSE.2021.3063727.
[14] L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, O. Pereg, Efficient few-shot learning without prompts, 2022. doi:10.48550/arXiv.2209.11055.
[15] G. Colavito, F. Lanubile, N. Novielli, Few-shot learning for issue report classification, in: 2023 IEEE/ACM 2nd Int'l Workshop on Natural Language-Based Software Eng. (NLBSE), 2023.
[16] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, 2023. arXiv:2308.10620.
[17] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, 2023. arXiv:2310.03533.
[18] OpenAI, ChatGPT: Optimizing language models for dialogue, 2022.
[19] G. Colavito, F. Lanubile, N. Novielli, L. Quaranta, Leveraging GPT-like LLMs to automate issue labeling, in: 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), to appear, 2024. doi:10.1145/3643991.3644903.
[20] R. Kallis, M. Izadi, L. Pascarella, O. Chaparro, P. Rani, The NLBSE'23 tool competition, in: Proc. of The 2nd Int'l Workshop on Natural Language-based Software Engineering (NLBSE'23), 2023.
[21] A. J. Viera, J. M. Garrett, Understanding interobserver agreement: the kappa statistic, Family Medicine (2005).
[22] G. Colavito, F. Lanubile, N. Novielli, Few-shot learning for issue report classification, 2023. doi:10.5281/zenodo.7628150.
[23] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Proc. of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[24] S. Ouyang, J. M. Zhang, M. Harman, M. Wang, LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation, 2023. arXiv:2308.02828.
[25] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837.
[26] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 22199–22213.