A Review of Text Classification Models from Bayesian to Transformers

Ema Ilic, Mercedes Garcia Martinez and Marina Souto Pastor
Pangeanic, Valencia, Spain

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
ema.ilic9@gmail.com (E. Ilic); m.garcia@pangeanic.com (M. G. Martinez); m.souto@pangeanic.com (M. S. Pastor)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract

This paper reviews different text classification models, both traditional ones and state-of-the-art ones. The simple models under review were Logistic Regression, naïve Bayes, k-Nearest Neighbors, C-Support Vector Classifier, Linear Support Vector Machine Classifier, and Random Forest. The state-of-the-art models were classifiers that include pretrained embedding layers, namely BERT and GPT-2. Results are compared among all of these classification models on two multiclass datasets, 'Text_types' and 'Digital', described later in the paper. These datasets are internal to Pangeanic. The experiments were coded in Python 3.8. The code was executed with various quantities of data, on different servers, and on two different datasets. While BERT was tested both as a multiclass and as a binary model, GPT-2 was used only as a binary model, applied to all the classes of a given dataset. In this paper we showcase the most interesting and relevant results. The results show that, for the datasets at hand, the BERT and GPT-2 models perform best, with BERT outperforming GPT-2 by one percentage point in terms of accuracy. It should be borne in mind, however, that these two models were tested on a binary case, whereas the other ones were tested on a multiclass case. The models that performed best on the multiclass case are the C-Support Vector Classifier and BERT. To establish the absolute best classifier in the multiclass case, further research is needed that would deploy GPT-2 on a multiclass case.

1. Introduction

Text classification is the procedure of assigning predefined labels to text, and is an essential part of many Natural Language Processing (NLP) tasks, such as sentiment analysis [1], topic labeling [2], question answering [3] and dialog act classification [4]. In the era that we live in, massive amounts of data, including textual data, are produced daily. Processing all this information manually is therefore highly inconvenient. Moreover, due to fatigue or a lack of expertise, the accuracy of manual data processing is highly questionable. For these reasons, more and more people and institutions turn to automatic text classification to do the task with increased accuracy and reduced human bias.

The distinction between shallow and deep learning models has already been investigated [4]. Shallow models dominated the text classification field from the 1960s until the early 2010s. Shallow learning refers to statistics-based models, such as Naïve Bayes (NB), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM). These methods had their fair share of success. However, they still require feature engineering, which costs time and financial resources. In addition, they disregard the natural sequential structure and contextual information in textual data. Thus, these models often fail to assign correct semantics to words. In this research paper, we test several different text classification models, some shallow and some deep.

2. Models

The shallow models tested in this paper are the well-explored Naive Bayes, Support Vector Machine and K-Nearest Neighbor. Bayesian classifiers assign the most likely class to a given example described by its feature vector [5]. Support Vector Machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis [6]. Finally, K-Nearest Neighbor is a non-parametric classification method, which is simple but effective in many cases. For a data record t to be classified, its k nearest neighbours are retrieved, and these form a neighbourhood of t [7].

The deep neural models tested use Bidirectional Encoder Representations from Transformers (BERT) [8] and the second-generation Generative Pre-trained Transformer (GPT-2) [9], as implemented in the Huggingface library [10]. Both are based on the transformer architecture and differ fundamentally in that BERT has just the encoder blocks from the transformer, whilst GPT-2 has just the decoder blocks. Moreover, GPT-2 is like a traditional language model that takes word vectors as input and estimates the probability of the next word as output. It is auto-regressive in nature: each token in the sentence has the context of the previous words only. Thus, GPT-2 generates one token at a time [11]. By contrast, BERT is not auto-regressive. It uses the entire surrounding context all at once [12].

3. Methodology

The main idea of this research paper is to compare the results of different classifiers on two datasets, 'Text_types' and 'Digital', described in Section 3.1, with regard to the relevant metrics, namely precision, recall, accuracy, and F1.

3.1. Datasets

Two Pangeanic internal datasets are used for the experiments. The first dataset, called 'Text_types', comprises 8.4M values and is divided into five classes: vernacular, legal, medical, tech and financial text. The second dataset, referred to as 'Digital', comprises 1.3M values and represents digital text content divided into 3 classes: Social Media, Marketing and Email content.

Table 1: Examples from the 'Text_types' dataset

| Text | Label |
|---|---|
| provisions relating to the act of accession of 16 april 2003 | Legal |
| 79 particulars to appear on the outer packaging | Medical |
| desloratadine was not teratogenic in animal studies. | Medical |
| there's no actress in town who can hold a candle to her. | Vernacular |
| each press of this button cycles through the following three indicator display options: | Tech |
| "a further leading interest rate indicator , the eurepo , was established in early 2002 ." | Finances |

Table 2: Examples from the 'Digital' dataset

| Text | Label |
|---|---|
| Averages over the reference period referred to in Article 2(2) of Regulation (EC) No 1249/96: | Email |
| A discount of 10 EUR/t (Article 4(3) of Regulation (EC) No 1249/96). | Marketing |
| "The risk is limited to the explosion of a single article." | Social Media |
| "a rating of 75 Ah, and " | Social Media |

3.2. Tools

The experiments were executed using 24 parallelized CPU units of type x86_64 and an NVIDIA Titan GPU with CUDA version 11.0.

4. Experiments

Numerous experiments, tests and trials were carried out in order to observe the widest possible array of results. Namely, the code was executed with different quantities of data, on different servers, and on different datasets. While BERT was tested both as a multiclass and as a binary model, GPT-2 was used only as a binary model, applied to all the classes of a given dataset.

4.1. Case 1: Simple Classifiers and Grid Search

The 'Text_types' dataset was reduced to a total of 8437 units. Randomized search and grid search with cross-validation, as provided by the scikit-learn library, were applied in order to choose the best hyperparameters for each simple classifier. The selected values are reported below. For K-Nearest Neighbor, a total of 3 nearest neighbors was chosen, with weights inversely proportional to the distance. For Naive Bayes, the grid search selected fitted class priors and an additive smoothing parameter of 0.01. For Logistic Regression, the parameters chosen were the Newton-CG solver, no penalty, and a constant (intercept) added to the decision function. For the C-Support Vector Classifier, the kernel type chosen was 'rbf' and the degree of the polynomial kernel function was 5 (in scikit-learn, the degree parameter applies only to the 'poly' kernel and is ignored by 'rbf').

4.2. Case 2: Binary BERT

The first model tested was the binary BERT model from the Huggingface library. The 'Text_types' dataset was reduced to 1687 samples for the sake of faster execution of the code. The dataset was turned into a binary one, in this case with 'legal' and 'non-legal' text categories. The analysis was conducted with the pretrained BERT-base-uncased model. With this model, an accuracy of 98.46% was achieved, accompanied by an F1 score of 98.72% for the legal class and 98.07% for the non-legal class.

Figure 1: Training and Validation Loss for Legal vs. Non-Legal binary BERT

Figure 2: Confusion Matrix for Legal vs. Non-Legal binary BERT

4.3. Case 3: Multiclass BERT

The pretrained BERT-base-uncased model was also used on the multiclass case of the same dataset ('Text_types') and later on the 'Digital' dataset. The 'Text_types' dataset was tested with 844 samples split into training and validation sets. The model classified correctly 50 of 52 'Vernacular' samples, 10 of 14 'Finances', 12 of 15 'Legal', 24 of 26 'Medical', and 18 of 20 'Tech'. The overall accuracy of the model on the 'Text_types' dataset was therefore 89.76% (114 of 127 validation samples).

For the 'Digital' dataset, 9000 samples were used, which were split into training and validation sets, and the BERT model was fine-tuned with the following results. The email, marketing and social media classes had true positive rates of 416/455 (91.43% accuracy), 417/456 (91.65% accuracy) and 425/455 (93.4% accuracy), respectively. This is a weighted accuracy of 92.05%.

4.4. Case 4: Binary GPT-2

The binary GPT-2 model by OpenAI was tested on 5062 samples of the 'Vernacular' vs. 'Non-Vernacular' classes of the 'Text_types' dataset, obtaining a weighted accuracy of 98%. The training and validation loss for these classes and the confusion matrix are shown in Figures 3 and 5.

Figure 3: Training and Validation Loss for Vernacular vs. non-Vernacular class with GPT-2 on 'Text_types' dataset

The same model was also tested on all three classes of the 'Digital' dataset, with a total of 13336 training samples for each class. As can be observed in Figures 6 to 8, discriminating between the marketing and non-marketing classes gave a weighted average accuracy of 89%. The weighted average accuracy was 96% for the social media vs. non-social media classes and 94% for the email vs. non-email classes. The total weighted average accuracy of the binary GPT-2 model on the 'Digital' dataset was 93%.

4.5. Results

The results of the research are shown in Table 3. K-Nearest Neighbor, Multinomial Naive Bayes, Logistic Regression, C-Support Vector Classifier and Linear Support Vector Machine Classifier were tested on the 'Text_types' dataset with character-level TF-IDF vectorization, whereas the Random Forest model was given word-level TF-IDF vectorization, as the character-level one was incompatible with the classifier.
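The difference between the character-level and word-level TF-IDF features mentioned above can be illustrated with a minimal sketch. It is written in plain Python rather than with the scikit-learn vectorizer used for the experiments (where the choice is made through the `analyzer` option of `TfidfVectorizer`), and everything beyond the two example sentences from Table 1 is an assumption made for illustration: the helper names `word_tokens`, `char_ngrams` and `tfidf` are hypothetical, and the trigram size and the smoothed IDF formula are not settings reported in this paper.

```python
import math
from collections import Counter


def word_tokens(text):
    """Word-level features: lowercased, whitespace-separated tokens."""
    return text.lower().split()


def char_ngrams(text, n=3):
    """Character-level features: overlapping n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs, tokenize):
    """Return one {feature: weight} dict per document.

    Assumption: raw term frequency times a smoothed IDF,
    log((1 + N) / (1 + df)) + 1, without length normalisation.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each feature occurs.
    df = Counter(f for tokens in tokenized for f in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            f: count * (math.log((1 + n_docs) / (1 + df[f])) + 1.0)
            for f, count in tf.items()
        })
    return vectors


# Two example sentences taken from Table 1.
docs = [
    "desloratadine was not teratogenic in animal studies.",
    "particulars to appear on the outer packaging",
]

word_vectors = tfidf(docs, word_tokens)
char_vectors = tfidf(docs, char_ngrams)

print(len(word_vectors[0]), "word features in the first sentence")
print(len(char_vectors[0]), "character trigram features in the first sentence")
```

In this sketch the two sentences share no words, so their word-level vectors do not overlap at all, whereas they do share character trigrams such as "ter" (from "teratogenic" and "outer"). Character-level features of this kind give a classifier some signal even for rare or unseen words, at the cost of a larger feature space.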
The best results in terms of accuracy for the multiclass case were obtained with the BERT model from the Huggingface library and the C-Support Vector Classifier from scikit-learn. The best results for the binary case were obtained with the BERT classifier on the legal vs. non-legal classes.

The absolute best results in terms of precision, recall and F1 were achieved by the binary BERT, whereas the best results on those same metrics for the multiclass case were achieved by the C-Support Vector Classifier from the scikit-learn library. Bear in mind that the precision, recall and F1 for the BERT Pretrained Uncased model remain unknown, and might indeed be greater than for the other classifiers.

Table 3: Outcomes of Different Classification Models on 'Text_types'

| Classifier | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| K-Nearest Neighbor | 0.77 | 0.75-0.92 | 0.68-0.88 | 0.60-0.90 |
| Multinomial Naive Bayes | 0.89 | 0.81-0.93 | 0.72-0.96 | 0.77-0.94 |
| Logistic Regression | 0.89 | 0.76-0.94 | 0.79-0.93 | 0.79-0.93 |
| C-Support Vector Classifier | 0.90 | 0.83-0.99 | 0.74-0.99 | 0.78-0.95 |
| Linear Support Vector Machine Classifier | 0.88 | 0.78-0.92 | 0.80-0.96 | 0.80-0.95 |
| Random Forest | 0.78 | 0.55-0.92 | 0.65-0.92 | 0.60-0.88 |
| BERT Pretrained Uncased | 0.90 | - | - | - |
| BERT binary (Legal/Non-Legal) | 0.99 | 0.98-0.99 | 0.98-0.99 | 0.98-0.99 |
| GPT-2 binary (Vernacular/Non-Vernacular) | 0.98 | 0.96-1.00 | 0.97-0.99 | 0.98 |

Figure 4: Training and Validation Accuracy for Vernacular vs. non-Vernacular class with GPT-2 on 'Text_types' dataset

Figure 5: Confusion Matrix for Vernacular vs. non-Vernacular class with GPT-2 on 'Text_types' dataset

Figure 6: Training and Validation Loss for Marketing vs. non-Marketing class with GPT-2 on 'Digital' dataset

Figure 7: Training and Validation Accuracy for Marketing vs. non-Marketing class with GPT-2 on 'Digital' dataset

Figure 8: Confusion Matrix for Marketing vs. non-Marketing class with GPT-2 on 'Digital' dataset

5. Conclusions

According to our research, BERT and GPT-2 appear to perform excellently in a classification task, although BERT appears to outperform GPT-2 by one percentage point in terms of accuracy. Both of these models significantly outperformed the shallow models, though it should be borne in mind that GPT-2 was only tested on a binary case. This is in line with current research on the performance of large-scale transformer models in classification tasks [13] [14].

Further research might compare the performance of a multiclass GPT-2 with that of BERT on classification tasks. It would be interesting to observe whether BERT always performs better, or whether it only performs better on certain kinds of datasets.

References

[1] A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, C. Potts, Learning word vectors for sentiment analysis, 2011, pp. 142–150.
[2] S. Wang, C. Manning, Baselines and bigrams: Simple, good sentiment and topic classification, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Jeju Island, Korea, 2012, pp. 90–94. URL: https://aclanthology.org/P12-2018.
[3] Q. Mei, X. Shen, C. Zhai, Automatic labeling of multinomial topic models, in: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 490–499.
[4] Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. Yu, L. He, A survey on text classification: From shallow to deep learning, CoRR abs/2008.00364 (2020). URL: https://arxiv.org/abs/2008.00364. arXiv:2008.00364.
[5] I. Rish, An empirical study of the naïve bayes classifier, IJCAI 2001 Work Empir Methods Artif Intell 3 (2001).
[6] C. Cortes, V. Vapnik, Support vector networks, Machine Learning 20 (1995) 273–297.
[7] G. Guo, H. Wang, D. Bell, Y. Bi, Knn model-based approach in classification (2004).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018. URL: https://arxiv.org/abs/1810.04805. doi:10.48550/ARXIV.1810.04805.
[9] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[10] Huggingface website, https://huggingface.co/. Accessed: 2010-09-30.
[11] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners (2018). URL: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[13] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, CoRR abs/1905.05583 (2019). URL: http://arxiv.org/abs/1905.05583. arXiv:1905.05583.
[14] S. González-Carvajal, E. C. Garrido-Merchán, Comparing BERT against traditional machine learning text classification, CoRR abs/2005.13012 (2020). URL: https://arxiv.org/abs/2005.13012. arXiv:2005.13012.