A Review of Text Classification Models from Bayesian to
Transformers
Ema Ilic¹, Mercedes Garcia Martinez¹ and Marina Souto Pastor¹
¹ Pangeanic, Valencia, Spain


Abstract
This paper reviews text classification models, covering both traditional models and state-of-the-art
ones. The traditional models under review are Logistic Regression, Naïve Bayes, k-Nearest Neighbors,
the C-Support Vector Classifier, the Linear Support Vector Machine Classifier, and Random Forest. The
state-of-the-art models are classifiers built on pretrained embedding layers, namely BERT and GPT-2.
All models are compared on two multiclass datasets, ‘Text_types’ and ‘Digital’, which are internal to
Pangeanic and are described in Section 3.1. The experiments were coded in Python 3.8 and run with
various quantities of data, on different servers, and on both datasets. BERT was tested both as a
multiclass and as a binary model, whereas GPT-2 was used as a binary model on all the classes of a
given dataset. We report the most interesting and relevant results. For the datasets at hand, BERT and
GPT-2 perform best, with BERT outperforming GPT-2 by one percentage point in accuracy. It should be
borne in mind, however, that these two models were tested on a binary case, whereas the other models
were tested on a multiclass case. The models that performed best on the multiclass case are the
C-Support Vector Classifier and BERT. Establishing the best classifier for the multiclass case requires
further research that deploys GPT-2 on a multiclass case.


SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
ema.ilic9@gmail.com (E. Ilic); m.garcia@pangeanic.com (M. G. Martinez); m.souto@pangeanic.com (M. S. Pastor)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction
Text classification is the task of assigning predefined labels to text, and it is an essential part of
many Natural Language Processing (NLP) tasks, such as sentiment analysis [1], topic labeling [2],
question answering [3] and dialog act classification [4]. Massive amounts of textual data are produced
daily, so processing all of this information manually is highly impractical. Moreover, fatigue or a
lack of expertise makes the accuracy of manual data processing questionable. For these reasons, more
and more people and institutions turn to automatic text classification to do the task with increased
accuracy and reduced human bias. The distinction between shallow and deep learning models has already
been investigated [4]. Shallow models dominated the text classification field from the 1960s until the
early 2010s. Shallow learning refers to statistics-based models, such as Naïve Bayes (NB), K-Nearest
Neighbor (KNN), and Support Vector Machine (SVM). These methods had their fair share of success.
However, they still require feature engineering, which costs time and financial resources. In
addition, they disregard the natural sequential structure and contextual information in textual data,
so they often fail to assign correct semantics to words. In this research paper, we test several
different text classification models, some shallow and some deep.

2. Models
The shallow models tested in this paper are the well-explored Naïve Bayes, Support Vector Machine and
K-Nearest Neighbor. Bayesian classifiers assign the most likely class to a given example described by
its feature vector [5]. Support Vector Machines are supervised learning models with associated learning
algorithms that analyze data for classification and regression analysis [6]. Finally, K-Nearest
Neighbor is a non-parametric classification method that is simple but effective in many cases. For a
data record 𝑡 to be classified, its 𝑘 nearest neighbours are retrieved, and these form a neighbourhood
of 𝑡 [7].

The deep neural models tested use Bidirectional Encoder Representations from Transformers (BERT) [8]
and the second-generation Generative Pre-trained Transformer (GPT-2) [9], as implemented in the
Huggingface library [10]. Both are based on the transformer architecture and differ fundamentally in
that BERT has only the encoder blocks of the transformer, whilst GPT-2 has only the decoder blocks.
Moreover, GPT-2 works like a traditional language model that takes word vectors as input and estimates
the probability of the next word as output. It is auto-regressive in nature: each token in the
sentence has the context of the previous words. Thus, GPT-2 generates one token at a time [11]. By
contrast, BERT is not auto-regressive. It uses the entire surrounding context all at once [12].
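The difference between the encoder-only BERT and the decoder-only GPT-2 described in Section 2 can be
pictured as a difference in which positions each token is allowed to see. The following is a minimal
sketch in plain Python, not code from the experiments: the function name `visibility_mask` and the
example tokens are illustrative assumptions, and real models apply such masks inside their attention
layers.

```python
def visibility_mask(num_tokens, causal):
    """Return mask[i][j] == 1 if position i may use position j as context."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


# Illustrative tokens only; any tokenized sentence would do.
tokens = ["the", "risk", "is", "limited"]

# BERT-style encoder: every token sees the whole sentence at once.
bidirectional = visibility_mask(len(tokens), causal=False)
# GPT-2-style decoder: every token sees itself and the tokens before it.
autoregressive = visibility_mask(len(tokens), causal=True)

for name, mask in [("bidirectional", bidirectional), ("autoregressive", autoregressive)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>8} {row}")
```

In the auto-regressive mask the first token sees only itself, while the last token sees the full
prefix; in the bidirectional mask every row is filled with ones.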
Table 1
Examples from the ’Text_types’ dataset

           Text                                                                                         Label
           provisions relating to the act of accession of 16 april 2003                                 Legal
           79 particulars to appear on the outer packaging                                              Medical
           desloratadine was not teratogenic in animal studies.                                         Medical
           there’s no actress in town who can hold a candle to her.                                     Vernacular
           each press of this button cycles through the following three indicator display options:      Tech
           ”a further leading interest rate indicator , the eurepo , was established in early 2002 .”   Finances


Table 2
Examples from the ’Digital’ dataset

        Text                                                                                             Label
        Averages over the reference period referred to in Article 2(2) of Regulation (EC) No 1249/96:    Email
        A discount of 10 EUR/t (Article 4(3) of Regulation (EC) No 1249/96).                             Marketing
        ”The risk is limited to the explosion of a single article.”                                      Social Media
        ”a rating of 75 Ah, and ”                                                                        Social Media


3. Methodology
The main idea of this research paper is to compare the results of different classifiers on two
datasets, ’Text_types’ and ’Digital’, described in Section 3.1, with respect to the relevant metrics:
precision, recall, accuracy, and F1.

3.1. Datasets
Two Pangeanic internal datasets are used for the experiments. The first dataset, called ’Text_types’,
comprises 8.4M values and is divided into five classes: vernacular, legal, medical, tech and financial
text. The second dataset, referred to as ’Digital’, comprises 1.3M values and represents digital text
content divided into three classes: Social Media, Marketing and Email content.

3.2. Tools
The experiments were executed on 24 parallelized x86_64 CPU units and an NVIDIA Titan GPU with CUDA
version 11.0.

4. Experiments
Numerous experiments, tests and trials were carried out in order to observe the widest possible array
of results. Namely, the code was executed with different quantities of data, on different servers, and
on different datasets. While BERT was tested both as a multiclass and as a binary model, GPT-2 was used
as a binary model on all the classes of a given dataset.

4.1. Case 1: Simple Classifiers and Grid Search
The ’Text_types’ dataset was reduced to a total of 8437 units. Randomized Search and Grid Search
cross-validation from the scikit-learn library were applied in order to choose the best hyperparameters
for each simple classifier. The chosen values are reported below. For K-Nearest Neighbor, the optimal
parameters were inverse-distance weights and a total of 3 nearest neighbors. For Naive Bayes, the grid
search selected a fitted prior and an additive smoothing parameter of 0.01. For Logistic Regression,
the parameters chosen were the Newton-CG solver, no penalty, and a constant added to the decision
function. For the C-Support Vector Classifier, the kernel type chosen was ’rbf’ and the degree of the
polynomial kernel function was 5.

4.2. Case 2: Binary BERT
The first transformer-based model tested was the binary BERT model from the ’huggingface’ library. The
’Text_types’ dataset was reduced to 1687 samples for the sake of faster execution and was turned into a
binary one, in this case with ’legal’ and ’non-legal’ text categories. With the pretrained
BERT-base-uncased model, an accuracy of 98.46% was achieved, accompanied by an F1 score of 98.72% for
the legal class and 98.07% for the non-legal class.
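Turning a multiclass dataset into a binary one, as done for the binary BERT and GPT-2 experiments,
amounts to a one-vs-rest relabelling: one class is kept and all others are merged into a single
"non-" class. The sketch below illustrates the idea in plain Python. It is not the code used in the
experiments; the helper name `binarize_labels` and the mini-sample of labels are hypothetical.

```python
from collections import Counter


def binarize_labels(labels, positive_class):
    """Map multiclass labels to a one-vs-rest binary labelling."""
    negative_class = f"non-{positive_class}"
    return [
        positive_class if label == positive_class else negative_class
        for label in labels
    ]


# Hypothetical mini-sample using the five 'Text_types' class names.
labels = ["legal", "medical", "medical", "vernacular", "tech", "finances", "legal"]

# 'legal' vs 'non-legal', as in the binary BERT case.
binary = binarize_labels(labels, "legal")
print(binary)
print(Counter(binary))

# Repeating the relabelling once per class yields one binary task per class.
one_vs_rest_tasks = {cls: binarize_labels(labels, cls) for cls in sorted(set(labels))}
print(list(one_vs_rest_tasks))
```

Because the negative class absorbs every other category, the resulting binary task is usually
imbalanced, which is one reason to report per-class F1 alongside accuracy.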
Figure 1: Training and Validation Loss for Legal vs. Non-Legal binary BERT

Figure 2: Confusion Matrix for Legal vs. Non-Legal binary BERT

4.3. Case 3: Multiclass BERT
The pretrained BERT-base-uncased model was also used on the multiclass case, first on the same dataset
(’Text_types’) and later on ’Digital’. The ’Text_types’ dataset was tested with 844 samples split into
training and validation sets. The number of correctly classified samples per class was 50/52 for
’Vernacular’, 10/14 for ’Finances’, 12/15 for ’Legal’, 24/26 for ’Medical’, and 18/20 for ’Tech’. The
overall accuracy of the model on the ’Text_types’ dataset was therefore 89.76%.

For the ’Digital’ dataset, 9000 samples were used, which were split into training and validation sets,
and the BERT model was fine-tuned on them. The email, marketing and social media classes had true
positive rates of 416/455 (91.43%), 417/456 (91.65%) and 425/455 (93.4%), respectively, which amounts to
a weighted accuracy of 92.05%.

4.4. Case 4: Binary GPT-2
The binary GPT-2 model by OpenAI was tested on 5062 samples of the ’Vernacular’ vs. ’Non-Vernacular’
classes of the ’Text_types’ dataset, obtaining a weighted accuracy of 98%. Figures 3 and 4 show the
training and validation loss and accuracy for these classes, and Figure 5 shows the confusion matrix.

Figure 3: Training and Validation Loss for Vernacular vs. non-Vernacular class with GPT-2 on
’Text_types’ dataset

Figure 4: Training and Validation Accuracy for Vernacular vs. non-Vernacular class with GPT-2 on
’Text_types’ dataset

Figure 5: Confusion Matrix for Vernacular vs. non-Vernacular class with GPT-2 on ’Text_types’ dataset

The same model was also tested on all three classes of the ’Digital’ dataset, with a total of 13336
training samples for each class. For discriminating between the marketing and non-marketing classes,
GPT-2 reached a weighted average accuracy of 89%, the lowest of the three classes (Figures 6, 7 and 8).
The weighted average accuracy was 96% for the social media vs. non-social media classes and 94% for the
email vs. non-email classes. The total weighted average accuracy of the binary GPT-2 model on the
’Digital’ dataset was 93%.

4.5. Results
The results of the research are shown in Table 3. K-Nearest Neighbor, Multinomial Naive Bayes, Logistic
Regression, the C-Support Vector Classifier and the Linear Support Vector Machine Classifier were
tested on the ’Text_types’ dataset using character-level TF-IDF vectors, whereas the Random Forest
model used word-level TF-IDF vectorization, as the character-level one was incompatible with the
classifier. The best results in terms of accuracy for the multiclass case were obtained with the BERT
model from the huggingface library and the C-Support Vector Classifier from scikit-learn.

Table 3
Outcomes of Different Classification Models on ’Text_types’

               Classifier                                       Accuracy    Precision   Recall       F1
               K-Nearest Neighbor                               0.77        0.75-0.92   0.68-0.88    0.60-0.90
               Multinomial Naive Bayes                          0.89        0.81-0.93   0.72-0.96    0.77-0.94
               Logistic Regression                              0.89        0.76-0.94   0.79-0.93    0.79-0.93
               C-Support Vector Classifier                      0.90        0.83-0.99   0.74-0.99    0.78-0.95
               Linear Support Vector Machine Classifier         0.88        0.78-0.92   0.80-0.96    0.80-0.95
               Random Forest                                    0.78        0.55-0.92   0.65-0.92    0.60-0.88
               BERT Pretrained Uncased                          0.90        -           -            -
               BERT binary (Legal/Non-Legal)                    0.99        0.98-0.99   0.98-0.99    0.98-0.99
               GPT-2 binary (Vernacular/Non-Vernacular)         0.98        0.96-1.00   0.97-0.99    0.98
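The 0.90 accuracy listed for the multiclass BERT model in Table 3 follows directly from the per-class
counts given in Section 4.3: the overall accuracy is the total number of correct predictions divided by
the total number of samples. The short sketch below reproduces this arithmetic; the counts are taken
from Section 4.3, while the function names are illustrative assumptions.

```python
# (correct, total) per class, as reported for multiclass BERT on 'Text_types' in Section 4.3.
correct_and_total = {
    "Vernacular": (50, 52),
    "Finances": (10, 14),
    "Legal": (12, 15),
    "Medical": (24, 26),
    "Tech": (18, 20),
}


def per_class_rate(counts):
    """Fraction of correctly classified samples for each class."""
    return {label: correct / total for label, (correct, total) in counts.items()}


def overall_accuracy(counts):
    """Total correct predictions divided by total samples."""
    total_correct = sum(correct for correct, _ in counts.values())
    total_samples = sum(total for _, total in counts.values())
    return total_correct / total_samples


rates = per_class_rate(correct_and_total)
accuracy = overall_accuracy(correct_and_total)

for label, rate in rates.items():
    print(f"{label:<10} {rate:.2%}")
print(f"overall    {accuracy:.2%}")  # 114 / 127 -> 89.76%
```

The per-class rates range from about 71% (’Finances’) to about 96% (’Vernacular’), so the single
accuracy figure hides a noticeable spread between classes.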
On the other hand, the best results for the binary case were obtained with the BERT classifier on the
legal vs. non-legal task. The absolute best results in terms of precision, recall and F1 were achieved
by the binary BERT, whereas the best results in those same metrics for the multiclass case were
achieved by the C-Support Vector Classifier from the scikit-learn library. Bear in mind that the
precision, recall and F1 for the BERT Pretrained Uncased model remain unknown, and might indeed be
greater than for the other classifiers.

5. Conclusions
According to our research, BERT and GPT-2 perform excellently in a classification task, although BERT
outperforms GPT-2 by one percentage point in terms of accuracy. Both of these models significantly
outperformed the shallow models, though it should be borne in mind that GPT-2 was only tested on a
binary case. This is in line with the current research on the performance of large-scale transformer
models in classification tasks [13] [14].

Further research might compare the performance of a multiclass GPT-2 with that of BERT on
classification tasks. It would be interesting to observe whether BERT always performs better, or
whether it only performs better on certain kinds of datasets.

Figure 6: Training and Validation Loss for Marketing vs. non-Marketing class with GPT-2 on ’Digital’
dataset

Figure 7: Training and Validation Accuracy for Marketing vs. non-Marketing class with GPT-2 on
’Digital’ dataset

Figure 8: Confusion Matrix for Marketing vs. non-Marketing class with GPT-2 on ’Digital’ dataset

References
 [1] A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, C. Potts, Learning word vectors for sentiment
     analysis, 2011, pp. 142–150.
 [2] S. Wang, C. Manning, Baselines and bigrams: Simple, good sentiment and topic classification, in:
     Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2:
     Short Papers), Association for Computational Linguistics, Jeju Island, Korea, 2012, pp. 90–94.
     URL: https://aclanthology.org/P12-2018.
 [3] Q. Mei, X. Shen, C. Zhai, Automatic labeling of multinomial topic models, in: Proceedings of the
     13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp.
     490–499.
 [4] Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. Yu, L. He, A survey on text classification:
     From shallow to deep learning, CoRR abs/2008.00364 (2020). URL: https://arxiv.org/abs/2008.00364.
     arXiv:2008.00364.
 [5] I. Rish, An empirical study of the naïve bayes classifier, IJCAI 2001 Work Empir Methods Artif
     Intell 3 (2001).
 [6] C. Cortes, V. Vapnik, Support vector networks, Machine Learning 20 (1995) 273–297.
 [7] G. Guo, H. Wang, D. Bell, Y. Bi, Knn model-based approach in classification (2004).
 [8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
     for language understanding, 2018. URL: https://arxiv.org/abs/1810.04805.
     doi:10.48550/ARXIV.1810.04805.
 [9] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are
     unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[10] Huggingface website, https://huggingface.co/, accessed: 2010-09-30.
[11] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised
     multitask learners (2018). URL:
     https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
     for language understanding, 2019. arXiv:1810.04805.
[13] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, CoRR
     abs/1905.05583 (2019). URL: http://arxiv.org/abs/1905.05583. arXiv:1905.05583.
[14] S. González-Carvajal, E. C. Garrido-Merchán, Comparing BERT against traditional machine learning
     text classification, CoRR abs/2005.13012 (2020). URL: https://arxiv.org/abs/2005.13012.
     arXiv:2005.13012.