Large Language Models for Issue Report Classification

Giuseppe Colavito, Filippo Lanubile, Nicole Novielli and Luigi Quaranta
University of Bari "Aldo Moro", Italy

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
giuseppe.colavito@uniba.it (G. Colavito); filippo.lanubile@uniba.it (F. Lanubile); nicole.novielli@uniba.it (N. Novielli); luigi.quaranta@uniba.it (L. Quaranta)
ORCID: 0000-0003-3871-401X (G. Colavito); 0000-0003-3373-7589 (F. Lanubile); 0000-0003-1160-2608 (N. Novielli); 0000-0002-9221-0739 (L. Quaranta)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Abstract
Effective issue classification is crucial for efficient software project management. However, labels assigned to issues are often inconsistent, which can negatively impact the performance of supervised classification models. In this work, we investigate how label consistency and training data size affect automatic issue classification. We first evaluate a few-shot learning approach on a manually validated dataset and compare it to fine-tuning on a larger crowd-sourced set. The results show that our approach achieves higher accuracy when trained and tested on consistent labels. We then examine zero-shot classification using GPT-3.5, finding that its performance is comparable to supervised models despite having no fine-tuning. This suggests that generative models can help classify issues when annotated data is limited. Overall, our findings provide insights into balancing data quantity and quality for issue classification.

Keywords
Issue classification, Large Language Models, Generative AI, Software Maintenance and Evolution, Few-Shot Learning

1. Introduction

Collaborative software development involves complex processes and activities to effectively support software development and maintenance. In this context, issue-tracking systems are widely adopted to manage requests for changes – such as bug fixes or product enhancements, as well as requests for support from users – and are regarded as essential tools for maintainers to efficiently manage software evolution activities.

Issue reports organized in such systems typically contain information such as an identifier, a description, the author, the issue status (e.g., open, assigned, closed), a comment thread, and a label indicating the type of issue, such as bug, enhancement, or support. Effective labeling of issue reports is of paramount importance to support prioritization and decision-making. Unfortunately, however, label misuse is a common problem, as submitters often confuse improvement requests with bugs and vice versa [1]. For example, Herzig et al. [2] reported that approximately 33.8% of all issue reports are incorrectly labeled. To avoid dealing with incorrect labels, automated classification methods have been proposed. Automatic issue classification can enable effective issue management and prioritization [3], without the need to instruct developers on how to assign labels correctly.

Early research on this topic proposed exploiting supervised methods that leverage text-based features for the task of automatic issue report classification [1]. More recently, approaches leveraging word embeddings have emerged [4, 5, 6, 7]. In particular, approaches based on BERT [8] and its variants achieved state-of-the-art performance [9, 10, 11].

In our previous work, we conducted an empirical study to investigate to what extent we can leverage pre-trained language models for automatic issue labeling [10]. We experimented with a dataset of more than 800K issue reports from GitHub open-source software projects, labeled by project contributors as bug, enhancement, or question [9]. We fine-tuned the BERT [8] variant RoBERTa [12], achieving state-of-the-art performance (F1 = 0.8591).

Our manual error analysis revealed that the main cause of the misclassification of issues is label inconsistency across different projects. Also, several issue reports in the dataset were tagged with more than one label, which is indeed a source of noise. This evidence is in line with previous studies reporting the impact of data quality on the performance of machine learning models [13]. Informed by the results of our error analysis and by the findings of previous research, we formulate the following research question:
RQ1: To what extent does label consistency impact the performance of supervised issue classification models?

To address it, we investigate the efficacy of few-shot learning for training robust classifiers using a small training dataset with manually validated labels. Specifically, we experiment with SETFIT, an effective methodology for fine-tuning transformer-based models using few-shot learning [14], achieving promising results [15].

Still, manual annotation can be a costly task, both in terms of time and resources, even if done on a small set of manually curated examples. Hence, the need for minimizing the effort associated with data labeling remains. With the advent of recent GPT-like Large Language Models (LLMs), researchers have started investigating their potential for solving software engineering challenges [16, 17]. To better understand how GPT-like LLMs can be leveraged for automated issue labeling in the absence of training data, we formulate and investigate our second research question as follows:

RQ2: To what extent can we leverage GPT-like LLMs to classify issue reports?

To address it, we evaluate GPT-3.5-turbo [18] in a zero-shot learning scenario, where the model is prompted by only providing the task and label descriptions. We compare the performance of classifiers based on GPT-like LLMs with that of fine-tuned BERT-like LLMs [19].

In this paper, we discuss our ongoing work on using LLMs to address software engineering challenges, with a particular focus on the automatic classification of issue reports in a low-resource setting. Specifically, we summarize the findings of two recent studies in which we addressed the research questions formulated above [15, 19].

The remainder of the paper is organized as follows. In Sections 2 and 3, we describe the datasets and methodology adopted in our empirical studies, respectively. Then, we report and discuss the study results in Section 4. The paper is concluded in Section 5, where we also outline directions for future work.
2. Dataset

To address our research questions, we use a dataset of 400 GitHub issues labeled as bug, feature, question, or documentation. The dataset is split into two subsets of 200 issues, which we use as train and test sets, respectively. Both subsets are equally distributed according to the original labels, each including 50 issues per class. Our dataset is obtained by manually labeling 400 randomly selected items from the dataset of 1.4M GitHub issues distributed by the NLBSE'23 tool competition organizers [20]. To manually ensure the consistency of labels in our dataset, three annotators individually categorized each issue report based on the information in its title and body. Each issue report was assigned to two of the annotators. We observed a Cohen's κ of 0.74, which indicates a substantial level of interrater agreement [21]. The annotators then had a joint plenary meeting to discuss and resolve the cases of disagreement. Through this procedure, we ensured the reliability and consistency of the annotations. Table 1 presents the dataset's distribution after the manual labeling. The manually annotated sample is publicly available [22].

Table 1
Distribution of labels in the extracted samples.

  Label          Train set    Test set
  Bug            47 (24%)     53 (27%)
  Documentation  33 (17%)     32 (16%)
  Feature        60 (30%)     55 (28%)
  Question       44 (22%)     47 (24%)
  Discarded      16 (8%)      13 (7%)
  Total          200          200
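To give a concrete sense of the agreement measure used above, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance given each annotator's label distribution. The sketch below illustrates the computation on hypothetical annotations, not on our actual study data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations for eight issues (illustrative only):
a = ["bug", "bug", "feature", "question", "doc", "bug", "feature", "doc"]
b = ["bug", "feature", "feature", "question", "doc", "bug", "feature", "bug"]
print(round(cohens_kappa(a, b), 2))
```

Values above 0.61 are conventionally read as substantial agreement [21], which is the range our observed κ of 0.74 falls into.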
3. Methodology

To address our first research question, we investigate the efficacy of few-shot learning for training robust classifiers using the small manually validated training dataset described in Section 2. In particular, we train and evaluate a model based on SETFIT [14] using the manually labeled train and test sets. Then we compare its performance with the one obtained by fine-tuning RoBERTa [15] using the full dataset of 1.4M crowd-annotated issues [20].

To address our second research question, we compare the performance of the SETFIT classifier with the performance achieved by GPT-3.5 in a zero-shot learning scenario. We highlight that prompting is only used for GPT, while the SETFIT model is trained on the manually labeled data. Both models are evaluated on the test set partition of manually labeled issues.

Preprocessing. For our SETFIT model, we preprocess our dataset as follows. First, non-textual items, such as links, code snippets, and images, are identified and replaced with placeholder tokens in the dataset. Next, we use the ekphrasis Text Pre-Processor (https://github.com/cbaziotis/ekphrasis) to normalize the text by detecting and replacing items such as URLs, email addresses, symbols, phone numbers, mentions, time, date, and numbers with specific tokens.
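A minimal sketch of this normalization step is shown below. It uses plain regular expressions rather than the actual ekphrasis configuration, and the token names are illustrative assumptions:

```python
import re

# Illustrative normalization rules; the study uses the ekphrasis
# TextPreProcessor, which covers more item types (symbols, dates, ...).
RULES = [
    (re.compile(r"```.*?```", re.DOTALL), " <code> "),        # fenced code snippets
    (re.compile(r"https?://\S+"), " <url> "),                 # links
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), " <email> "),
    (re.compile(r"@\w+"), " <user> "),                        # mentions
    (re.compile(r"\b\d+(\.\d+)?\b"), " <number> "),
]

def normalize(text: str) -> str:
    """Replace non-textual and variable items with placeholder tokens."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return re.sub(r"\s+", " ", text).strip()

issue = "Crash when opening https://example.com after 15 seconds, cc @maintainer"
print(normalize(issue))
```

Normalizing highly variable surface forms into shared tokens reduces vocabulary sparsity, which matters when fine-tuning on only a few hundred examples.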
Choice of GPT-like models. Several LLMs have been proposed in the last few years, with GPT-3 [23] being one of the most popular. Studies leveraging GPT-3.5-turbo [24], an instruction-tuned version of GPT-3 that is able to interact as a chatbot, are particularly prevalent. For this reason, we select GPT-3.5-turbo [18] as representative of GPT-like LLMs. We experiment with several versions of GPT-3.5-turbo, with varying context lengths and training dates. Here we only report the results of the model with the best performance. More details can be found in our original work describing this study [19].

Prompting. To instruct the model to perform the classification task, we create a prompt that includes the following items:

• Input Format: the format of the input issues, which includes a title and a body;
• Task Description: a description of the classification task to be performed, including the possible labels that can be assigned to the issues;
• Label Descriptions: a brief description of each label. Label descriptions are generated by ChatGPT and then manually reviewed to ensure they are clear and informative;
• Input Issue: the issue to be classified;
• Output Format Instructions: the desired output format. We ask the model for a JSON object containing a reasoning and the predicted label. This is done to inject some Chain-of-Thought reasoning into the model, as suggested in previous studies about prompting LLMs [25, 26]. However, the reasoning serves as a prompt-engineering strategy and is not used to evaluate the model.

Evaluation. In line with previous work [6, 7, 10, 11], the evaluation of the classifiers on the test set is provided in terms of precision, recall, and F1-measure [15]. For GPT-like LLMs, we parse the JSON response and extract the predicted label. In cases in which the label is not valid or the model did not follow the instructions appropriately, we discard the prediction. This parsing is done with the use of regular expressions. Both models are tested on the manually verified test set [19].
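The prompt assembly and the regex-based extraction of the predicted label can be sketched as follows. The wording of the template and the label descriptions below are illustrative assumptions, not the exact prompt used in the study:

```python
import json
import re

LABELS = {"bug", "documentation", "feature", "question"}

# Illustrative prompt skeleton; the actual task and label descriptions
# used in the study are longer and were manually reviewed.
PROMPT_TEMPLATE = """Issues are given as a title and a body.
Classify the issue into one of: bug, documentation, feature, question.
bug: a report of unexpected or broken behavior.
documentation: a request concerning docs or examples.
feature: a request for new or improved functionality.
question: a request for support or clarification.

Title: {title}
Body: {body}

Answer with a JSON object: {{"reasoning": "...", "label": "..."}}"""

def build_prompt(title: str, body: str) -> str:
    return PROMPT_TEMPLATE.format(title=title, body=body)

def extract_label(response: str):
    """Pull the first JSON object out of the model output; None means discard."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None  # model ignored the output format
    try:
        label = json.loads(match.group(0)).get("label", "").lower()
    except json.JSONDecodeError:
        return None
    return label if label in LABELS else None

reply = 'Sure! {"reasoning": "mentions a crash", "label": "bug"}'
print(extract_label(reply))
```

Returning None for malformed or out-of-vocabulary answers mirrors our discard policy: such predictions are excluded rather than silently mapped to a class.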
4. Results and Discussion

4.1. Impact of label consistency on the classifier performance (RQ1)

In Table 2, we present the results obtained by training the SETFIT classifier on the hand-labeled gold standard and evaluating it on both the hand-labeled test set (a) and the full test set distributed for the challenge (c). To ensure a fair comparison, we compare the SETFIT model's performance with the performance obtained by RoBERTa on the same test set, when trained on the hand-labeled gold standard set (b1). Furthermore, we also include the performance obtained by training the RoBERTa classifier on the full train set distributed by the organizers (b2).

To assess the ability of the models to generalize to a broader dataset, we also include a comparison with the NLBSE'23 challenge baseline [20] (see row (d) of the table) and the SETFIT model's performance on the challenge full test set (see model (c) in the table). It is worth noting that the SETFIT model is designed to learn from a few examples. As such, it was not possible to train it on the raw dataset, since it is not optimized for such a setting and training would have been extremely time-consuming. Instead, the RoBERTa baseline is trained on the full set.

The SETFIT model achieved an F1-micro score of .7767 (see model (c) in Table 2) when trained on the manually labeled gold standard and tested on the raw test set. When trained and evaluated on the manually labeled dataset (a), SETFIT performs better than RoBERTa (b1 and b2), regardless of whether the training set used for RoBERTa is raw or manually labeled. However, when trained on the manually labeled dataset (b1), RoBERTa struggles to deliver good performance due to a shortage of training data. On the other hand, when trained on the raw dataset (b2), RoBERTa achieves competitive performance, but it is unable to outperform SETFIT (a).
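The F1-micro score used above pools true positives, false positives, and false negatives across all classes before computing precision and recall. In single-label classification, where every test item receives exactly one prediction, the pooled false positives and false negatives coincide, so micro-averaged precision, recall, and F1 all reduce to plain accuracy. A small self-contained illustration:

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for single-label data."""
    tp = fp = fn = 0
    classes = set(y_true) | set(y_pred)
    # Pool the per-class counts over all classes.
    for c in classes:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical predictions over five issues (illustrative only):
y_true = ["bug", "bug", "feature", "question", "documentation"]
y_pred = ["bug", "feature", "feature", "question", "bug"]
print(micro_scores(y_true, y_pred))
```

This is why the overall precision, recall, and F1 values reported for each model in our result tables are identical.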
As the manually labeled dataset embodies the ideal labeling criteria for classifiers, comparing SETFIT (a) and RoBERTa (b2) provides a practical scenario in which we must choose between training a classifier on a large volume of data with disregard for data quality and concentrating on a smaller portion of data whose label quality is manually improved. This comparison suggests that data quality might be crucial for ensuring classification accuracy. A potential approach could be to start with a few-shot classifier and gradually switch to a more powerful model like RoBERTa once a fair amount of manually verified data becomes available. By doing so, we can strike a balance between data quantity and quality, ensuring that the classifier performs effectively while minimizing the possibility of inaccurate results caused by inconsistency in the labeling.

Table 2
Performance of the SETFIT model and comparison with the RoBERTa baseline approach. The performance of the model submitted to the challenge is reported in italics. In bold, we highlight the best performance obtained with SETFIT.

  Model                    Train                    Test                     F1
  (a)  SETFIT              Sampled, manual labels   Sampled, manual labels   0.8321
  (b1) RoBERTa             Sampled, manual labels   Sampled, manual labels   0.4348
  (b2) RoBERTa             Full, GitHub labels      Sampled, manual labels   0.8182
  (c)  SETFIT              Sampled, manual labels   Full, GitHub labels      0.7767
  (d)  RoBERTa (baseline)  Full, GitHub labels      Full, GitHub labels      0.8890

4.2. Leveraging GPT for automatic issue report classification (RQ2)

In Table 3, we report the classification performance of GPT compared to the SETFIT model. As already explained in the previous section, we experimented with several versions of GPT-3.5 that were available at the time of the study. For a full report of the results, see Colavito et al. [19]. In this paper, we include only the 16k-0613 model, as it achieves the best performance in terms of a combination of F1 and the percentage of items discarded due to nonsensical model output. Specifically, none of the predictions from this model were discarded. We observe that the Feature class achieves the best F1, while the Documentation class is the most problematic to identify, showing a lower recall than the other classes.

While the zero-shot GPT model achieves a slightly lower performance (F1 = .8155) than SETFIT (F1 = .8321), the models are still comparable. It is worth noting that SETFIT was fine-tuned on a portion of the issue report gold standard dataset, while GPT was evaluated in a zero-shot setting without any task-specific fine-tuning. This implies that GPT is capable of classifying issue reports with only a minor decrease in accuracy compared to fine-tuned BERT-like models. This represents a major benefit of GPT for this application, since it can perform the classification in the absence of labeled data, i.e., without the need for fine-tuning. This evidence could help maintainers of new projects, for which historical data is unavailable or scarce. In such cases, API calls to GPT could be used to classify issue reports, providing a valuable tool for project management. Once the project has accumulated enough labeled data, the maintainer could switch to a fine-tuned model to improve classification accuracy. Although this could be a viable solution for open-source projects, it is worth noting that the cost of API calls and data privacy concerns could limit its practical feasibility in commercial projects. In such cases, project maintainers might consider using open-source models or building and deploying a classifier on-premise. Nonetheless, the construction and maintenance of LLMs is expensive in terms of both resources and time, and this constitutes a barrier to their adoption in most cases.

Table 3
Comparison between SETFIT and GPT-3.5.

                 SETFIT                          GPT-3.5 (16k-0613), zero-shot
  Label          Precision  Recall   F1-Score    Precision  Recall   F1-Score
  Bug            0.8723     0.8472   0.8590      0.7133     0.9811   0.8261
  Documentation  0.9039     0.6594   0.7616      0.8853     0.6191   0.7285
  Feature        0.7494     0.9182   0.8251      0.8861     0.8491   0.8672
  Question       0.8754     0.8319   0.8528      0.8668     0.7719   0.8164
  Overall        0.8321     0.8321   0.8321      0.8155     0.8155   0.8155
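The deployment strategy sketched in this discussion, starting with zero-shot API calls on a new project and switching to a fine-tuned model once enough verified labels exist, can be expressed as a simple policy. The threshold and names below are illustrative assumptions rather than values prescribed by the study; in our experiments, roughly two hundred manually verified issues were already enough to train a competitive SETFIT model:

```python
from dataclasses import dataclass

# Illustrative cut-off; tune per project rather than treating it as fixed.
MIN_LABELED_FOR_FINETUNING = 200

@dataclass
class Project:
    name: str
    num_verified_labels: int

def choose_classifier(project: Project) -> str:
    """Pick a classification strategy based on available verified labels."""
    if project.num_verified_labels >= MIN_LABELED_FOR_FINETUNING:
        return "fine-tuned"  # e.g., a SETFIT or RoBERTa model on project data
    return "zero-shot"       # e.g., prompting a GPT-like model with label descriptions

print(choose_classifier(Project("new-repo", 0)))
print(choose_classifier(Project("mature-repo", 500)))
```

The point of making the policy explicit is that the switch is driven by label quality and quantity, not by model preference alone.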
5. Conclusion and Future Works

In this paper, we summarized the outcomes of our recently published studies on the use of large language models for automated issue classification. Specifically, we investigated the impact of improving data quality on issue classification performance. We trained and evaluated a model based on few-shot learning using SETFIT with a subset of manually verified data. The model achieves better performance than the RoBERTa baseline when trained and tested on data for which label consistency was manually verified [22]. However, RoBERTa generalizes better on the full test dataset when fine-tuned on the full crowd-sourced dataset.

Furthermore, we explored the performance of GPT-like models for automatic issue classification [19] to understand if we can leverage GPT-like LLMs to achieve state-of-the-art performance in the absence of manually annotated issues, i.e., when a gold standard is not available for fine-tuning state-of-the-art approaches based on BERT-like models. Our empirical results show that GPT-like models can achieve performance comparable to the state of the art without the need for fine-tuning. This suggests that when manual annotation is not feasible or a gold standard for training is not available (i.e., on a new project), maintainers could rely on generative AI to successfully address the issue classification task.

However, using LLMs to build issue classifiers might pose important challenges due to licensing and computational limitations. As such, we plan to extend this benchmark with open-source LLMs, also including additional issue-report datasets. This will enable evaluating the generalizability of our findings.

Acknowledgments

This research was co-funded by the NRRP Initiative, Mission 4, Component 2, Investment 1.3 - Partnerships extended to universities, research centres, companies and research D.D. MUR n. 341 del 15.03.2022 – Next Generation EU ("FAIR - Future Artificial Intelligence Research", code PE00000013, CUP H97G22000210007) and by the European Union - NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 ("QualAI: Continuous Quality Improvement of AI-based Systems", grant n. 2022B3BP5S, CUP: H53D23003510006).
References

[1] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, Y.-G. Guéhéneuc, Is it a bug or an enhancement? A text-based approach to classify change requests, in: Proc. of the 2008 Conf. of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, ACM, New York, NY, USA, 2008. doi:10.1145/1463788.1463819.
[2] K. Herzig, S. Just, A. Zeller, It's not a bug, it's a feature: How misclassification impacts bug prediction, in: 2013 35th Int'l Conf. on Software Engineering (ICSE), 2013. doi:10.1109/ICSE.2013.6606585.
[3] N. Pandey, D. Sanyal, A. Hudait, A. Sen, Automated classification of software issue reports using machine learning techniques: an empirical study, Innovations in Systems and Software Engineering (2017). doi:10.1007/s11334-017-0294-1.
[4] O. Levy, Y. Goldberg, Neural word embedding as implicit matrix factorization, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc., 2014.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proc. of the 26th Int'l Conf. on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013.
[6] R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella, Predicting issue types on GitHub, Science of Computer Programming (2021). doi:10.1016/j.scico.2020.102598.
[7] R. Kallis, A. Di Sorbo, G. Canfora, S. Panichella, Ticket Tagger: Machine learning driven issue classification, in: 2019 IEEE Int'l Conf. on Software Maintenance and Evolution (ICSME), IEEE, 2019. doi:10.1109/ICSME.2019.00070.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, 2019. doi:10.18653/v1/N19-1423.
[9] R. Kallis, O. Chaparro, A. Di Sorbo, S. Panichella, NLBSE'22 tool competition, in: Proc. of The 1st Int'l Work. on Natural Language-based Software Eng. (NLBSE'22), 2022.
[10] G. Colavito, F. Lanubile, N. Novielli, Issue report classification using pre-trained language models, in: 2022 IEEE/ACM 1st Int'l Workshop on Natural Language-Based Software Eng. (NLBSE), IEEE Computer Society, USA, 2022. doi:10.1145/3528588.3528659.
[11] M. Izadi, CatIss: An intelligent tool for categorizing issue reports using transformers, in: (NLBSE 2022), 2022. doi:10.1145/3528588.3528662.
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[13] X. Wu, W. Zheng, X. Xia, D. Lo, Data quality matters: A case study on data label correctness for security bug report prediction, IEEE Transactions on Software Engineering (2022). doi:10.1109/TSE.2021.3063727.
[14] L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, O. Pereg, Efficient few-shot learning without prompts, 2022. doi:10.48550/arXiv.2209.11055.
[15] G. Colavito, F. Lanubile, N. Novielli, Few-shot learning for issue report classification, in: 2023 IEEE/ACM 2nd Int'l Workshop on Natural Language-Based Software Eng. (NLBSE), 2023.
[16] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, 2023. arXiv:2308.10620.
[17] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, 2023. arXiv:2310.03533.
[18] OpenAI, ChatGPT: Optimizing language models for dialogue, 2022.
[19] G. Colavito, F. Lanubile, N. Novielli, L. Quaranta, Leveraging GPT-like LLMs to automate issue labeling, in: 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), to appear, 2024. doi:10.1145/3643991.3644903.
[20] R. Kallis, M. Izadi, L. Pascarella, O. Chaparro, P. Rani, The NLBSE'23 tool competition, in: Proc. of The 2nd Int'l Workshop on Natural Language-based Software Engineering (NLBSE'23), 2023.
[21] A. J. Viera, J. M. Garrett, Understanding interobserver agreement: the kappa statistic, Family Medicine (2005).
[22] G. Colavito, F. Lanubile, N. Novielli, Few-shot learning for issue report classification, 2023. doi:10.5281/zenodo.7628150.
[23] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Proc. of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[24] S. Ouyang, J. M. Zhang, M. Harman, M. Wang, LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation, 2023. arXiv:2308.02828.
[25] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837.
[26] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 22199–22213.