How good are you? An empirical classification performance comparison of Large Language Models with traditional Open Set Recognition classifiers

Alexander Grote1,∗,†, Anuja Hariharan2,†, Michael Knierim2,† and Christof Weinhardt2,†

1 FZI Research Center for Information Technology, Haid-und-Neu-Str. 10–14, 76131 Karlsruhe, Germany
2 Karlsruhe Institute of Technology, Kaiserstraße 12, 76131 Karlsruhe, Germany

Abstract
The release of ChatGPT has led to an unprecedented surge in the popularity of generative AI-based Large Language Models (LLMs) among practitioners. These models have gained traction in business processes due to their ability to receive instructions in natural language. However, they suffer from hallucinations, i.e. generated texts that are factually incorrect. Hallucinations also arise in text classification tasks, such as customer support ticket classification or intent classification for chatbots. In such scenarios, the user prompts the model to classify an incoming text into predefined categories. Furthermore, in real-world scenarios it is common to encounter texts that do not fit into any of the predefined categories. It is unclear whether current state-of-the-art LLMs can handle such scenarios and how they compare to existing classifiers designed for these situations. In this paper, we propose a way to evaluate the classification performance of LLMs in an Open Set Recognition (OSR) scenario, where unseen classes can occur at inference time. The evaluation consists of an empirical comparison between GPT-4 and Gemini Pro, two state-of-the-art language models, a fine-tuned version of GPT-3.5, and established OSR classifiers. The results will provide insights into how reliable large language models are for classification purposes and whether they can replace existing OSR classifiers, which typically require a substantial amount of labelled data.

Keywords
Large Language Models, Open Set Recognition, Classification

16th ZEUS Workshop, ZEUS 2024, Ulm, Germany, February 29th – March 1st, 2024
∗ Corresponding author.
† These authors contributed equally.
grote@fzi.de (A. Grote); anuja.hariharan@kit.edu (A. Hariharan); michael.knierim@kit.edu (M. Knierim); weinhardt@kit.edu (C. Weinhardt)
ORCID: 0009-0005-9743-6648 (A. Hariharan); 0000-0001-7148-5138 (M. Knierim)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
S. Böhm and D. Lübke (Eds.): 16th ZEUS Workshop, ZEUS 2024, Ulm, Germany, 29 February–1 March 2024, published at http://ceur-ws.org (CEUR Workshop Proceedings, ISSN 1613-0073).

1. Introduction

Since the release of ChatGPT in November 2022 [1], the adoption of Large Language Models (LLMs) in businesses has experienced significant growth [2]. Especially the ability to interact with these models in natural language has allowed practitioners with little programming knowledge to harness the power of such systems in their daily operations. However, utilising LLMs comes with the risk of factually incorrect generated texts, also known as hallucinations [3]. Often, these hallucinations are undesired and, for example in the intent classification used in chatbot interactions, they might negatively impact customer service quality and potentially harm the company's reputation.
This highlights the need for a robust system that not only classifies the customer intent correctly, but also detects out-of-distribution questions and replies accordingly. The functionality of a system that rejects out-of-distribution data points and classifies known patterns into existing categories has been widely studied under the term Open Set Recognition (OSR) [4]. In particular, deep learning-based OSR classifiers, such as the OpenMax [5] or the DOC [6] algorithm, have shown increased performance on OSR classification tasks [7]. Similarly, the zero-shot [8] and few-shot [9] abilities of LLMs have also been leveraged to solve these tasks. Due to the fast-paced advancements in the realm of LLMs, it is unclear from a practitioner's point of view how well state-of-the-art LLMs with zero- and few-shot strategies compare to established OSR solutions and how reliable they are in a production setting. This ultimately leads to the question of which approach to choose.

In this work, we plan to provide insights into the classification accuracy and the ability to reject unknown instances by conducting an empirical comparison between approaches from these two research areas. We thereby give guidance to practitioners and provide an updated benchmark of current state-of-the-art LLM classification performance.

2. Related Work

Generative Pre-trained Transformer (GPT) models represent a paradigm shift in Natural Language Processing (NLP) [10]. Although these LLMs are typically pre-trained in a self-supervised, task-independent manner, they are known to perform very well on NLP tasks, even without fine-tuning [11]. To use these models for classification tasks, one can either fine-tune the model or use zero- and few-shot techniques for in-context learning. Fine-tuning involves adjusting the weights of a pre-trained model for a particular task and, given a large dataset, surpasses the classification performance of zero- and few-shot strategies [12]. In contrast, zero- and few-shot learning methods utilise the capability of LLMs to categorise new data points effectively, even when they have encountered no or only a minimal number of examples of a specific class. Typically, zero- and few-shot strategies are combined with prompting strategies, such as "Chain of Thought" [13] and "Clue And Reasoning Prompting" [14], to further enhance classification performance. Despite these strategies, Kocoń et al. [15] and Caruccio et al. [16] have demonstrated that LLMs with zero- and few-shot prompting perform worse than supervised machine learning models on classification tasks. In their analyses, however, they assumed a closed set scenario, which is often unrealistic in practice.
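To illustrate the kind of setup such studies evaluate, the following minimal Python sketch prompts a conversational LLM to assign one of several known intents or to reject the input as unknown, the behaviour required in the open set setting discussed next. The category set, prompt wording and model identifier are illustrative assumptions rather than the study's actual configuration; the snippet assumes the openai Python SDK (v1.x) with an API key configured in the environment.

```python
from openai import OpenAI  # assumes the openai Python SDK (v1.x)

client = OpenAI()

# Illustrative intent categories with short descriptions (not the study's actual label set).
CATEGORIES = {
    "card_lost": "The customer reports a lost or stolen card.",
    "transfer_failed": "A transfer did not arrive or was rejected.",
    "exchange_rate": "The customer asks about currency exchange rates.",
}

def zero_shot_classify(text: str) -> str:
    """Return one of the known intent names, or 'unknown' if the model rejects the input."""
    label_block = "\n".join(f"- {name}: {desc}" for name, desc in CATEGORIES.items())
    prompt = (
        "Classify the customer message into exactly one of the following intents:\n"
        f"{label_block}\n"
        "If the message fits none of these intents, answer 'unknown'.\n"
        f"Message: {text}\n"
        "Answer with the intent name only."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic answers simplify evaluation
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in CATEGORIES else "unknown"
```

In a few-shot configuration, the prompt would additionally contain a handful of labelled example messages per category.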
A more realistic scenario than traditional classification is Open Set Recognition [4]. It allows for unknown classes during inference, and the classifier has the additional option to reject data points as unknown. If an incoming data point is not rejected as unknown, the classifier assigns it to one of the known classes. Among the first OSR models were adapted Support Vector Machines [17, 18]. With the rise of neural networks, Bendale and Boult [5] reformulated the final softmax layer to also estimate the probability of a data point being out-of-distribution. Similarly, Shu et al. [6] use a one-versus-rest classification layer to reduce misclassifications in the open space, while Oza and Patel [19] utilise an autoencoder and its reconstruction loss to determine whether a data point is novel.

3. Proposed Approach

To compare the performance of LLMs with Open Set Recognition classifiers, we plan to set up an empirical evaluation as illustrated in Figure 1.

[Figure 1: Machine learning workflow to compare conversational LLMs with OSR classifiers. The workflow covers training, validation and test data, random selection of known (KKC) and unknown (UUC) classes, hyperparameter tuning on the F1 score, and evaluation, repeated ten times for each openness scenario.]

For our experiments, we will use four different text classification datasets: the 20 Newsgroups dataset [20], the Yahoo! Answers dataset [21], the CLINC150 dataset [22] and the BANKING77 dataset [23]. While the first dataset consists of news articles and the second of question-answer pairs from certain categories, the last two represent intent classification tasks. All datasets have at least ten different classes or categories, based on which we will simulate an open set scenario. We follow the data splitting procedure of Geng et al. [7] to select the known and unknown classes for the open set simulation and repeat each simulation ten times to derive statistically meaningful results. Furthermore, we plan to exclude 0, 10, 20 and 30 % of all available classes from training and evaluate the classification performance in each scenario using the F1 score. The F1 score is a commonly used metric in classification problems, measuring the harmonic mean of precision and recall. In OSR scenarios, however, the unknown classes are typically not considered as an additional class when calculating the F1 score [7]. That is why we additionally distinguish between the F1 score on the known and on the unknown classes, providing further insights into the applicability of LLMs to open set scenarios.

In terms of conversational LLMs, we plan to use two state-of-the-art models, GPT-4 [24] from OpenAI and Gemini Pro [25] from Google, with zero-shot and few-shot prompt configurations. In the zero-shot configuration, we provide the LLM with only the category name and description, while in the few-shot setting, we also include examples of each category. Currently, it is not possible to create a custom, fine-tuned model from either of these two models. Instead, we will use OpenAI's GPT-3.5 model and fine-tune it with resources from OpenAI to also investigate the improvements achievable through fine-tuning. We then compare the results to the classification performance of the OpenMax [5] and DOC [6] classifiers. To speed up the training of both OSR classifiers, we first transform the incoming texts into meaningful embeddings using the most advanced text embedding model provided by OpenAI [26] and then train a shallow neural network on the retrieved embeddings. The shallow neural network integrates either the OpenMax or the DOC architecture.
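As a rough illustration of this baseline, the following sketch shows a shallow one-versus-rest classification head in the spirit of DOC [6], operating on pre-computed text embeddings and rejecting inputs whose highest sigmoid score falls below a threshold. It assumes PyTorch; the embedding dimension, hidden layer size and the fixed 0.5 threshold are illustrative choices rather than the study's actual configuration (DOC, for instance, also proposes per-class Gaussian-fitted thresholds).

```python
import torch
import torch.nn as nn

class DOCHead(nn.Module):
    """Shallow one-versus-rest classifier head over pre-computed text embeddings."""

    def __init__(self, embedding_dim: int, num_known_classes: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_known_classes),  # one logit per known class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sigmoid instead of softmax: each known class is scored independently.
        return torch.sigmoid(self.net(x))

def predict_open_set(model: DOCHead, embeddings: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Predict a known class index per row, or -1 ('unknown') if no class is confident enough."""
    with torch.no_grad():
        scores = model(embeddings)              # shape: (batch_size, num_known_classes)
    best_scores, best_classes = scores.max(dim=1)
    best_classes[best_scores < threshold] = -1  # reject as out-of-distribution
    return best_classes
```

Training such a head would minimise a binary cross-entropy loss against one-hot targets for the known classes; an OpenMax-style variant would instead recalibrate the outputs of a closed-set softmax classifier to estimate an additional out-of-distribution probability.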
4. Conclusion

Generative AI models for text generation, like ChatGPT, have proven useful in various tasks. In particular, they can classify an incoming text into predefined categories. In this paper, we propose a study design that compares the classification performance of state-of-the-art LLMs with existing classifiers for Open Set Recognition. The results of this study will provide insights into the reliability of conversational LLMs and whether they are a viable alternative to traditional classification systems.

References

[1] OpenAI, Introducing ChatGPT, 2022. URL: https://openai.com/blog/chatgpt.
[2] W. Hariri, Unlocking the Potential of ChatGPT: A Comprehensive Exploration of its Applications, Advantages, Limitations, and Future Directions in Natural Language Processing (2023). URL: https://arxiv.org/abs/2304.02017. doi:10.48550/ARXIV.2304.02017.
[3] J. Li, X. Cheng, W. X. Zhao, J.-Y. Nie, J.-R. Wen, HaluEval: A large-scale hallucination evaluation benchmark for large language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 6449–6464.
[4] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, T. E. Boult, Toward Open Set Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013) 1757–1772. doi:10.1109/TPAMI.2012.256.
[5] A. Bendale, T. E. Boult, Towards Open Set Deep Networks, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Las Vegas, NV, USA, 2016, pp. 1563–1572. URL: http://ieeexplore.ieee.org/document/7780542/. doi:10.1109/CVPR.2016.173.
[6] L. Shu, H. Xu, B. Liu, DOC: Deep Open Classification of Text Documents, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2911–2916. URL: https://aclanthology.org/D17-1314. doi:10.18653/v1/D17-1314.
[7] C. Geng, S.-J. Huang, S. Chen, Recent Advances in Open Set Recognition: A Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2021) 3614–3631. doi:10.1109/TPAMI.2020.2981604.
[8] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=gEZrGCozdqR.
[9] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[10] T.-X. Sun, X.-Y. Liu, X.-P. Qiu, X.-J. Huang, Paradigm Shift in Natural Language Processing, Machine Intelligence Research 19 (2022) 169–183. URL: https://link.springer.com/10.1007/s11633-022-1331-6. doi:10.1007/s11633-022-1331-6.
[11] K. S. Kalyan, A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4, SSRN Electronic Journal (2023). URL: https://www.ssrn.com/abstract=4593895. doi:10.2139/ssrn.4593895.
[12] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, C. A. Raffel, Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, Advances in Neural Information Processing Systems 35 (2022) 1950–1965.
[13] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.
[14] X. Sun, X. Li, J. Li, F. Wu, S. Guo, T. Zhang, G. Wang, Text Classification via Large Language Models (2023). URL: https://arxiv.org/abs/2305.08377. doi:10.48550/ARXIV.2305.08377.
[15] J. Kocoń, I. Cichecki, O. Kaszyca, M. Kochanek, D. Szydło, J. Baran, J. Bielaniewicz, M. Gruza, A. Janz, K. Kanclerz, A. Kocoń, B. Koptyra, W. Mieleszczenko-Kowszewicz, P. Miłkowski, M. Oleksy, M. Piasecki, L. Radlinski, K. Wojtasik, S. Woźniak, P. Kazienko, ChatGPT: Jack of all trades, master of none, Information Fusion 99 (2023) 101861. URL: https://linkinghub.elsevier.com/retrieve/pii/S156625352300177X. doi:10.1016/j.inffus.2023.101861.
[16] L. Caruccio, S. Cirillo, G. Polese, G. Solimando, S. Sundaramurthy, G. Tortora, Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot, Expert Systems with Applications 235 (2024) 121186. URL: https://linkinghub.elsevier.com/retrieve/pii/S0957417423016883. doi:10.1016/j.eswa.2023.121186.
[17] W. J. Scheirer, L. P. Jain, T. E. Boult, Probability Models for Open Set Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014) 2317–2324. doi:10.1109/TPAMI.2014.2321392.
[18] L. P. Jain, W. J. Scheirer, T. E. Boult, Multi-class Open Set Recognition Using Probability of Inclusion, in: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Computer Vision – ECCV 2014, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2014, pp. 393–409. doi:10.1007/978-3-319-10578-9_26.
[19] P. Oza, V. M. Patel, C2AE: Class conditioned auto-encoder for open-set recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2307–2316.
[20] D. Lewis, Reuters-21578 text categorization collection, UCI Machine Learning Repository, 1997.
[21] X. Zhang, J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, Advances in Neural Information Processing Systems 28 (2015).
[22] S. Larson, A. Mahendran, J. J. Peper, C. Clarke, A. Lee, P. Hill, J. K. Kummerfeld, K. Leach, M. A. Laurenzano, L. Tang, J. Mars, An evaluation dataset for intent classification and out-of-scope prediction, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 1311–1316. URL: https://aclanthology.org/D19-1131. doi:10.18653/v1/D19-1131.
[23] I. Casanueva, T. Temčinas, D. Gerz, M. Henderson, I. Vulić, Efficient intent detection with dual sentence encoders, in: T.-H. Wen, A. Celikyilmaz, Z. Yu, A. Papangelis, M. Eric, A. Kumar, I. Casanueva, R. Shah (Eds.), Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.nlp4convai-1.5. doi:10.18653/v1/2020.nlp4convai-1.5.
[24] OpenAI, GPT-4 Technical Report (2023). URL: https://arxiv.org/abs/2303.08774. doi:10.48550/ARXIV.2303.08774.
[25] Gemini Team, Gemini: A Family of Highly Capable Multimodal Models (2023). URL: https://arxiv.org/abs/2312.11805. doi:10.48550/ARXIV.2312.11805.
[26] OpenAI, New and improved embedding model, 2022. URL: https://openai.com/blog/new-and-improved-embedding-model.