Large Language Models for Issue Report Classification

Giuseppe Colavito (giuseppe.colavito@uniba.it), University of Bari "Aldo Moro", Italy

Filippo Lanubile (filippo.lanubile@uniba.it), University of Bari "Aldo Moro", Italy

Nicole Novielli (nicole.novielli@uniba.it), University of Bari "Aldo Moro", Italy

Luigi Quaranta (luigi.quaranta@uniba.it), University of Bari "Aldo Moro", Italy

Keywords: Issue classification, Large Language Models, Generative AI, Software Maintenance and Evolution, Few-Shot Learning

ORCID: 0000-0003-3871-401X (G. Colavito), 0000-0003-3373-7589 (F. Lanubile), 0000-0003-1160-2608 (N. Novielli), 0000-0002-9221-0739 (L. Quaranta)

Effective issue classification is crucial for efficient software project management. However, labels assigned to issues are often inconsistent, which can negatively impact the performance of supervised classification models. In this work, we investigate how label consistency and training data size affect automatic issue classification. We first evaluate a few-shot learning approach on a manually validated dataset and compare it to fine-tuning on a larger crowd-sourced set. The results show that our approach achieves higher accuracy when trained and tested on consistent labels. We then examine zero-shot classification using GPT-3.5, finding that its performance is comparable to supervised models despite having no fine-tuning. This suggests that generative models can help classify issues when annotated data is limited. Overall, our findings provide insights into balancing data quantity and quality for issue classification.

Introduction

Collaborative software development involves complex processes and activities to effectively support software development and maintenance. In this context, issue-tracking systems are widely adopted to manage requests for changes, such as bug fixes or product enhancements, as well as requests for support from users, and are regarded as essential tools for maintainers to efficiently manage software evolution activities.

Issue reports organized in such systems typically contain information such as an identifier, a description, the author, the issue status (e.g., open, assigned, closed), a comment thread, and a label indicating the type of issue, such as bug, enhancement, or support. Effective labeling of issue reports is of paramount importance to support prioritization and decision-making. Unfortunately, label misuse is a common problem, as submitters often confuse improvement requests with bugs and vice versa [1]. For example, Herzig et al. [2] reported that approximately 33.8% of all issue reports are incorrectly labeled. To avoid dealing with incorrect labels, automated classification methods have been proposed. Automatic issue classification can enable effective issue management and prioritization [3], without the need to instruct developers on how to assign labels correctly.

Early research on this topic proposed exploiting supervised methods that leverage text-based features for the task of automatic issue report classification [1]. More recently, approaches leveraging word embeddings have emerged [4,5,6,7]. In particular, approaches based on BERT [8] and its variants achieved state-of-the-art performance [9,10,11].

In our previous work, we conducted an empirical study to investigate to what extent we can leverage pre-trained language models for automatic issue labeling [10]. We experimented with a dataset of more than 800K issue reports from GitHub open-source software projects labeled by project contributors as bug, enhancement, or question [9]. We fine-tuned the BERT [8] variant RoBERTa [12], achieving state-of-the-art performance (F1 = 0.8591).

Our manual error analysis revealed that the main cause of the misclassification of issues is label inconsistency across different projects. Also, several issue reports in the dataset were tagged with more than one label, which is indeed a source of noise. This evidence is in line with previous studies reporting the impact of data quality on the performance of machine learning models [13]. Informed by the results of our error analysis and by findings of previous research, we formulate the following research question:

RQ1: To what extent does label consistency impact the performance of supervised issue classification models?

To address it, we investigate the efficacy of few-shot learning for training robust classifiers using a small training dataset with manually validated labels. Specifically, we experiment with SETFIT, an effective methodology for fine-tuning of transformer-based models using few-shot learning [14], achieving promising results [15].

Still, manual annotation can be a costly task, both in terms of time and resources, even if done on a small set of manually curated examples. Hence, the need for minimizing the effort associated with data labeling remains. With the advent of recent GPT-like Large Language Models (LLMs), researchers have started investigating their potential in solving software engineering challenges [16,17]. To better understand how GPT-like LLMs can be leveraged in automated issue labeling in the absence of training data, we formulate and investigate our second research question as follows:

RQ2: To what extent can we leverage GPT-like LLMs to classify issue reports?

To address it, we evaluate GPT-3.5-turbo [18] in a zero-shot learning scenario, where the model is prompted by only providing the task and label descriptions. We compare the performance of classifiers based on GPT-like LLMs with fine-tuned BERT-like LLMs [19].

In this paper, we discuss our ongoing work on using LLMs to address software engineering challenges, with a particular focus on the automatic classification of issue reports in a low-resource setting. Specifically, we summarize the findings of two recent studies in which we addressed the research questions formulated above [15,19]. The remainder of the paper is organized as follows. In Sections 2 and 3, we describe the datasets and methodology adopted in our empirical studies, respectively. Then, we report and discuss the study results in Section 4. The paper is concluded in Section 5, where we also outline directions for future work.

Dataset

To address our research questions, we use a dataset of 400 GitHub issues labeled as bug, feature, question, and documentation. The dataset is split into two subsets of 200 issues, which we use as train and test sets, respectively. Both subsets are equally distributed and include 50 issues per class. Our dataset is obtained by manually labeling 400 randomly selected items from the dataset of 1.4M GitHub issues distributed by the NLBSE'23 tool competition organizers [20]. To manually ensure the consistency of labels in our dataset, three annotators individually categorized each issue report based on the information in its title and body. Each issue report was assigned to two of the annotators. We observed a Cohen's 𝜅 of 0.74, which indicates a substantial level of inter-rater agreement [21]. The annotators had a joint plenary meeting to discuss and resolve the cases of disagreement. Through this procedure, we ensured the reliability and consistency of the annotations. Table 1 presents the dataset's distribution before and after the manual labeling. The manually annotated sample is publicly available [22].
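As an illustration of the agreement measure mentioned above, the following minimal sketch computes Cohen's 𝜅 for two annotators who labeled the same items. The function name and the toy labels are invented for this example; how the pairwise agreement among the three annotators was aggregated in the study is not specified here, so the sketch covers only the two-annotator case.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: both annotators always used the same single label
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example (invented labels, not taken from the dataset).
annotator_1 = ["bug", "bug", "feature", "question", "documentation", "feature"]
annotator_2 = ["bug", "feature", "feature", "question", "documentation", "bug"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 4 of 6 items, while the agreement expected by chance is 10/36, which gives 𝜅 = 7/13.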

Methodology

To address our first research question, we investigate the efficacy of few-shot learning for training robust classifiers using the small manually validated training dataset described in Section 2. In particular, we train and evaluate a model based on SETFIT [14] using the manually labeled train and test sets. Then we compare its performance with the one obtained by fine-tuning RoBERTa [15] using the full dataset of 1.4M crowd-annotated issues [20].

To address our second research question, we compare the performance of the SETFIT classifier with the performance achieved by GPT 3.5 in a zero-shot learning scenario. We highlight that prompting is only used for GPT while the SETFIT model is trained on the manually labeled data. Both models are evaluated on the test set partition of manually labeled issues.

Preprocessing. For our SETFIT model, we preprocess our dataset as follows. First, non-textual items, such as links, code snippets, and images, are identified and replaced with tokens (e.g., <link> for links) in the dataset. Next, we use the ekphrasis Text Pre-Processor (footnote 1) to normalize the text by detecting and replacing items such as URLs, email addresses, symbols, phone numbers, mentions, time, date, and numbers with specific tokens.
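The first preprocessing step can be sketched with plain regular expressions. This is a simplified illustration under stated assumptions, not the pipeline used in the study: only the <link> token is named in the text, while <code> and <image> are hypothetical token names, the patterns assume GitHub-flavored Markdown, and the subsequent ekphrasis normalization is not reproduced.

```python
import re

# Order matters: code blocks and images are replaced before bare links,
# because both may themselves contain URLs.
_PATTERNS = [
    (re.compile(r"```.*?```", re.DOTALL), "<code>"),   # fenced code blocks
    (re.compile(r"`[^`\n]+`"), "<code>"),               # inline code spans
    (re.compile(r"!\[[^\]]*\]\([^)]*\)"), "<image>"),   # Markdown images
    (re.compile(r"https?://\S+"), "<link>"),            # bare URLs
]


def replace_non_textual(text: str) -> str:
    """Replace non-textual items with placeholder tokens (hypothetical token set)."""
    for pattern, token in _PATTERNS:
        text = pattern.sub(token, text)
    return text


sample = (
    "Crash on start, see https://example.com/log and `main()`:\n"
    "```py\nprint(1)\n```\n"
    "![shot](https://example.com/a.png)"
)
cleaned = replace_non_textual(sample)
```

The URL pattern is deliberately crude: it consumes everything up to the next whitespace, so Markdown links of the form [text](url) would need a dedicated pattern in a real pipeline.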

Choice of GPT-like models. Several LLMs have been proposed in the last few years, with GPT-3 [23] being one of the most popular. There is a significant prevalence of studies leveraging GPT-3.5-turbo [24], an instruction-tuned version of GPT-3, which is able to interact as a chatbot. For this reason, we select GPT-3.5-turbo [18] as representative of GPT-like LLMs. We experiment with several versions of GPT-3.5-turbo, with varying context length and date of training. Here we only report the results of the model with the best performance. More details can be found in our original work describing this study [19].

Prompting. To instruct the model to perform the classification task, we create a prompt that includes the following items:

• Input Format: The format of the input issues, which includes a title and a body;

• Task Description: A description of the classification task to be performed, including the possible labels that can be assigned to the issues;

• Label Descriptions: A brief description of each label. Label descriptions are generated by ChatGPT and then manually reviewed to ensure they are clear and informative;

• Input Issue: The issue to be classified;

• Output Format Instructions: The desired output format. We ask the model for a JSON object containing a reasoning and the predicted label. This is done to inject some Chain-of-Thought reasoning into the model, as suggested in previous studies about prompting LLMs [25,26]. However, the reasoning serves as a prompt-engineering strategy and is not used to evaluate the model.
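The prompt structure listed above can be sketched as a small template function. All wording below is a placeholder: the label descriptions, the instruction text, and the JSON field names "reasoning" and "label" are assumptions made for illustration, not the prompt used in the study (see [19] for the original).

```python
# Placeholder descriptions; the study used ChatGPT-generated, manually reviewed ones.
LABEL_DESCRIPTIONS = {
    "bug": "The issue reports incorrect or unexpected behavior.",
    "feature": "The issue requests new functionality or an enhancement.",
    "question": "The issue asks for help or clarification.",
    "documentation": "The issue concerns missing or incorrect documentation.",
}


def build_prompt(title: str, body: str) -> str:
    """Assemble a zero-shot classification prompt from the five listed items."""
    labels = ", ".join(LABEL_DESCRIPTIONS)
    descriptions = "\n".join(
        f"- {name}: {text}" for name, text in LABEL_DESCRIPTIONS.items()
    )
    return (
        # Input format
        "You will receive a GitHub issue consisting of a title and a body.\n"
        # Task description
        f"Classify the issue with exactly one of these labels: {labels}.\n\n"
        # Label descriptions
        f"Label descriptions:\n{descriptions}\n\n"
        # Input issue
        f"Issue title: {title}\n"
        f"Issue body: {body}\n\n"
        # Output format instructions (reasoning first, then the label)
        'Answer with a JSON object of the form {"reasoning": "...", "label": "..."}.'
    )


prompt = build_prompt("App crashes on start", "Stack trace attached.")
```

Asking for the reasoning before the label mirrors the Chain-of-Thought motivation given above: the model writes its justification before committing to a prediction.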

Evaluation

In line with previous work [6,7,10,11], the evaluation of the classifiers on the test set is provided in terms of precision, recall, and F1-measure [15]. For GPT-like LLMs, we parse the JSON response and extract the predicted label. In cases in which the label is not valid or the model did not follow the instructions appropriately, we discard the prediction. This process is done with the use of regular expressions. Both models are tested on the manually verified test set [19].
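The parsing and discarding step described above might look like the following sketch. It assumes the response contains a single JSON object with a "label" field; the field name, the label set spelling, and the function name are hypothetical, and the actual regular expressions used in the study are not given here.

```python
import json
import re

VALID_LABELS = {"bug", "feature", "question", "documentation"}

# Grab everything from the first "{" to the last "}" so that any text the
# model adds around the JSON object is ignored.
_JSON_OBJECT = re.compile(r"\{.*\}", re.DOTALL)


def extract_label(response: str):
    """Return the predicted label, or None if the prediction must be discarded."""
    match = _JSON_OBJECT.search(response)
    if match is None:
        return None  # no JSON object in the response
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # malformed JSON
    if not isinstance(payload, dict):
        return None
    label = str(payload.get("label", "")).strip().lower()
    return label if label in VALID_LABELS else None  # unknown labels are discarded


ok = extract_label('Sure! {"reasoning": "It crashes.", "label": "Bug"}')
bad_label = extract_label('{"reasoning": "x", "label": "enhancement"}')
no_json = extract_label("I think this is a bug.")
```

Returning None rather than falling back to a default label keeps discarded predictions countable, which matters when the share of discarded items is itself reported.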

Results and Discussion

Impact of label consistency on the classifier performance (RQ1)

In Table 2, we present the results obtained by training the SETFIT classifier on the hand-labeled gold standard and evaluating it on both the hand-labeled test set (a) and the full test set distributed for the challenge (c). To ensure a fair comparison, we compared the SETFIT model's performance with the performance obtained by RoBERTa on the same test set, when trained on the hand-labeled gold standard set (b1). Furthermore, we also include the performance obtained by training the RoBERTa classifier on the full train set distributed by the organizers (b2).

To assess the ability of the models to generalize on a broader dataset, we also include a comparison with the NLBSE '23 challenge baseline [20] (see row (d) of the table) and the SETFIT model's performance on the challenge full test set (see model (c) in the table). It is worth noting that the SETFIT model is designed to learn from a few examples. As such, it was not possible to train it on the raw dataset, since it is not optimized for such a setting and training would have been prohibitively time-consuming. In contrast, the RoBERTa baseline is trained on the full set.

The SETFIT model achieved an F1-micro score of .7767 (see model (c) in Table 2) when trained on the manually labeled gold standard and tested on the raw test set. When trained and evaluated on the manually labeled dataset (a), SETFIT performs better than RoBERTa (b1 and b2), regardless of whether the training set used for RoBERTa is raw or manually labeled. However, when trained on the manually labeled dataset (b1), RoBERTa struggles to deliver good performance due to a shortage of training data. On the other hand, when trained on the raw dataset (b2), RoBERTa achieves competitive performance, but it is unable to outperform SETFIT (a).
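For reference, the sketch below shows how per-class precision, recall, and F1, together with the micro-averaged F1 mentioned above, can be computed for a single-label task. The function names and toy labels are invented for illustration; this is not the evaluation script used in the study. In the single-label multi-class setting, micro-averaged precision, recall, and F1 all coincide with accuracy.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for a single-label task."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # wrongly predicted class
            fn[gold] += 1  # missed gold class
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp = sum(g == p for g, p in zip(y_true, y_pred))
    # Each error counts once as a false positive and once as a false negative.
    fp = fn = len(y_true) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (invented labels).
y_true = ["bug", "bug", "feature", "question"]
y_pred = ["bug", "feature", "feature", "question"]
scores = per_class_scores(y_true, y_pred)
overall = micro_f1(y_true, y_pred)
```

With three of four toy predictions correct, the micro-averaged F1 is 0.75, identical to the accuracy.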

As the manually-labeled dataset embodies the ideal labeling criteria for classifiers, comparing SETFIT (a) and RoBERTa (b2) provides a practical scenario in which we must choose either training a classifier on a large volume of data with disregard for data quality or concentrating on a smaller portion of data and manually improving label quality. This comparison suggests that data quality might be crucial for ensuring classification accuracy. A potential approach could be to start with a few-shot classifier and gradually switch to a more powerful model like RoBERTa when a fair amount of manually verified data becomes available. By doing so, we can strike a balance between data quantity and quality, ensuring that the classifier performs effectively while minimizing the possibility of inaccurate results caused by inconsistency in the labeling.

Leveraging GPT for automatic issue report classification (RQ2)

In Table 3, we report the classification performance of GPT compared to the SETFIT model. As explained in the previous section, we experimented with several versions of GPT-3.5 that were available at the time of the study. For a full report of the results, see Colavito et al. [19]. In this paper, we consider only the 16k-0613 model, as it achieves the best performance in terms of a combination of F1 and percentage of items discarded due to nonsensical model output. Specifically, none of the predictions from this model were discarded. We observe that the Feature class achieves the best F1, while the Documentation class is the most problematic to identify, showing a lower recall than the other classes. While the zero-shot GPT model achieves a slightly lower performance (F1 = .8155) than SETFIT (F1 = .8321), the models are still comparable. It is worth noting that SETFIT was fine-tuned on a portion of the issue report gold standard dataset, while GPT was evaluated in a zero-shot setting without any task-specific fine-tuning. This implies that GPT is capable of classifying issue reports with only a minor decrease in accuracy compared to fine-tuned BERT-like models. This is a major benefit of GPT for this application, since it can perform the classification in the absence of labeled data, i.e., without the need for fine-tuning. Although this could be a viable solution for open-source projects, it is worth noting that the cost of API calls and the privacy of data could limit its practical feasibility in commercial projects. In such cases, project maintainers might consider using open-source models or building and deploying a classifier on-premise. Nonetheless, the construction and maintenance of LLMs is expensive both in terms of resources and time, and this constitutes a barrier to their adoption in most cases.

Conclusion and Future Works

In this paper, we summarized the outcomes of our recently published studies on the use of large language models for automated issue classification. Specifically, we investigated the impact of improving data quality on issue classification performance. We trained and evaluated a model based on few-shot learning using SETFIT with a subset of manually verified data. The model achieves better performance when trained and tested on data for which label consistency was manually verified [22], compared to the RoBERTa baseline. However, RoBERTa generalizes better on the full test dataset when fine-tuned on the full crowd-sourced dataset. Furthermore, we explored the performance of GPT-like models for automatic issue classification [19] to understand if we can leverage GPT-like LLMs to achieve state-of-the-art performance in the absence of manually annotated issues, i.e., when a gold standard is not available for fine-tuning state-of-the-art approaches based on BERT-like models. Our empirical results show that GPT-like models can achieve a performance comparable to the state-of-the-art without the need for fine-tuning. This suggests that when manual annotation is not feasible or a gold standard for training is not available (e.g., on a new project), maintainers could rely on generative AI to successfully address the issue classification task.

However, using LLMs to build issue classifiers might pose important challenges due to licensing and computational limitations. As such, we plan to extend this benchmark with open-source LLMs and with additional issue-report datasets. This will enable evaluating the generalizability of our findings.

CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Table 1: Distribution of labels in the extracted samples.

Label            Train set    Test set
Bug              47 (24%)     53 (27%)
Documentation    33 (17%)     32 (16%)
Feature          60 (30%)     55 (28%)
Question         44 (22%)     47 (24%)
Discarded        16 (8%)      13 (7%)
Total            200          200

Table 2: Performance of the SETFIT model and comparison with the RoBERTa baseline approach. The performance of the model submitted to the challenge is reported in italic. In bold, we highlight the best performance obtained with SETFIT.

ID      Model                 Train                     Test                      F1
(a)     SETFIT                Sampled, manual labels    Sampled, manual labels    0.8321
(b1)    RoBERTa               Sampled, manual labels    Sampled, manual labels    0.4348
(b2)    RoBERTa               Full, GitHub labels       Sampled, manual labels    0.8182
(c)     SETFIT                Sampled, manual labels    Full, GitHub labels       0.7767
(d)     RoBERTa (baseline)    Full, GitHub labels       Full, GitHub labels       0.8890

Table 3: Comparison between SETFIT and GPT-3.5. This evidence could help maintainers of new projects, for which historical data is not available or is scarce. In such cases, API calls to GPT could be used to classify issue reports, providing a valuable tool for project management. Once the project has accumulated enough labeled data, the maintainer could switch to a fine-tuned model to improve the classification accuracy.

                 SETFIT                              GPT-3.5 (16k-0613), zero-shot
Label            Precision    Recall    F1-Score     Precision    Recall    F1-Score
Bug              0.8723       0.8472    0.8590       0.7133       0.9811    0.8261
Documentation    0.9039       0.6594    0.7616       0.8853       0.6191    0.7285
Feature          0.7494       0.9182    0.8251       0.8861       0.8491    0.8672
Question         0.8754       0.8319    0.8528       0.8668       0.7719    0.8164
Overall          0.8321       0.8321    0.8321       0.8155       0.8155    0.8155

Footnote 1: https://github.com/cbaziotis/ekphrasis

Acknowledgments

This research was co-funded by the NRRP Initiative, Mission 4, Component 2, Investment 1.3 - Partnerships extended to universities, research centres, companies and research D.D. MUR n. 341 del 15.03.2022 - Next Generation EU ("FAIR - Future Artificial Intelligence Research", code PE00000013, CUP H97G22000210007) and by the European Union - NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 ("QualAI: Continuous Quality Improvement of AI-based Systems", grant n. 2022B3BP5S, CUP: H53D23003510006).

References

[1] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, Y.-G. Guéhéneuc, Is it a bug or an enhancement? A text-based approach to classify change requests, in: Proc. of the 2008 Conf. of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, ACM, New York, NY, USA, 2008. doi:10.1145/1463788.1463819.
[2] K. Herzig, S. Just, A. Zeller, It's not a bug, it's a feature: How misclassification impacts bug prediction, in: 2013 35th Int'l Conf. on Software Engineering (ICSE), 2013. doi:10.1109/ICSE.2013.6606585.
[3] N. Pandey, D. Sanyal, A. Hudait, A. Sen, Automated classification of software issue reports using machine learning techniques: an empirical study, Innovations in Systems and Software Engineering, 2017. doi:10.1007/s11334-017-0294-1.
[4] O. Levy, Y. Goldberg, Neural word embedding as implicit matrix factorization, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, Curran Assoc., Inc., 2014.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proc. of the 26th Int'l Conf. on Neural Inf. Proc. Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013.
[6] R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella, Predicting issue types on GitHub, Science of Computer Programming, 2021. doi:10.1016/j.scico.2020.102598.
[7] R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella, Ticket Tagger: Machine learning driven issue classification, in: IEEE Int'l Conf. on Software Maintenance and Evolution (ICSME), IEEE, 2019. doi:10.1109/ICSME.2019.00070.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2019. doi:10.18653/v1/N19-1423.
[9] R. Kallis, O. Chaparro, A. Di Sorbo, S. Panichella, NLBSE'22 tool competition, in: Proc. of The 1st Int'l Work. on Natural Language-based Software Eng. (NLBSE'22), 2022.
[10] G. Colavito, F. Lanubile, N. Novielli, Issue report classification using pre-trained language models, in: IEEE/ACM 1st Int'l Workshop on Natural Language-Based Software Eng. (NLBSE), IEEE Computer Society, USA, 2022. doi:10.1145/3528588.3528659.
[11] M. Izadi, CatIss: An intelligent tool for categorizing issues reports using transformers, in: NLBSE, 2022. doi:10.1145/3528588.3528662.
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[13] X. Wu, W. Zheng, X. Xia, D. Lo, Data quality matters: A case study on data label correctness for security bug report prediction, IEEE Transactions on Software Engineering, 2022. doi:10.1109/TSE.2021.3063727.
[14] L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, O. Pereg, Efficient few-shot learning without prompts, 2022. doi:10.48550/arXiv.2209.11055.
[15] G. Colavito, F. Lanubile, N. Novielli, Few-shot learning for issue report classification, in: IEEE/ACM 2nd Int'l Work. on Natural Language-Based Software Eng. (NLBSE), 2023.
[16] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, 2023. arXiv:2308.10620.
[17] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, 2023. arXiv:2310.03533.
[18] OpenAI, ChatGPT: Optimizing language models for dialogue, 2022.
[19] G. Colavito, F. Lanubile, N. Novielli, L. Quaranta, Leveraging GPT-like LLMs to automate issue labeling, in: IEEE/ACM 21st International Conference on Mining Software Repositories (MSR) (to appear), 2024. doi:10.1145/3643991.3644903.
[20] R. Kallis, M. Izadi, L. Pascarella, O. Chaparro, P. Rani, The NLBSE'23 tool competition, in: Proc. of The 2nd Intl. Work. on Natural Language-based Software Engineering (NLBSE'23), 2023.
[21] A. J. Viera, J. M. Garrett, Understanding interobserver agreement: the kappa statistic, 2005.
[22] G. Colavito, F. Lanubile, N. Novielli, Few-shot learning for issue report classification, 2023. doi:10.5281/zenodo.7628150.
[23] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[24] S. Ouyang, J. M. Zhang, M. Harman, M. Wang, LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation, 2023. arXiv:2308.02828.
[25] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022.
[26] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022.