<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Models from Bayesian to Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ema Ilic</string-name>
          <email>ema.ilic9@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mercedes Garcia Martinez</string-name>
          <email>m.garcia@pangeanic.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marina Souto Pastor</string-name>
          <email>m.souto@pangeanic.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lugano</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pangeanic</institution>
          ,
          <addr-line>Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a review of different text classification models, both traditional ones and state-of-the-art models. The simple models under review were Logistic Regression, Naïve Bayes, K-Nearest Neighbors, C-Support Vector Classifier, Linear Support Vector Machine Classifier, and Random Forest. On the other hand, the state-of-the-art models used were classifiers that include pretrained embedding layers, namely BERT or GPT-2. Results are compared among all of these classification models on two multiclass datasets, 'Text_types' and 'Digital', addressed later on in the paper. These datasets are internal to Pangeanic. The experiments were coded in Python 3.8. The code was executed with various quantities of data, on different servers, and on two different datasets. While BERT was tested both as a multiclass and as a binary model, GPT-2 was used as a binary model on all the classes of a certain dataset. In this paper we showcase the most interesting and relevant results. The results show that for the datasets at hand, the BERT and GPT-2 models perform best, though the BERT model outperforms GPT-2 by one percentage point in terms of accuracy. It should be borne in mind, though, that these two models were compared on a binary case, whereas the other ones were tested on a multiclass case. The models that performed best on the multiclass case are the C-Support Vector Classifier and BERT. To establish the absolute best classifier in a multiclass case, further research is needed that would deploy GPT-2 on a multiclass case.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Text classification is the procedure of designating
predefined labels for text, and is an essential and significant
part of many Natural Language Processing (NLP) tasks,
such as sentiment analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], topic labeling [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
question answering [
        <xref ref-type="bibr" rid="ref7">3</xref>
        ] and dialog act classification [
        <xref ref-type="bibr" rid="ref8">4</xref>
        ]. In the era that we live in, massive amounts of
textual data are produced daily. Thus, it is highly
inconvenient to process all this information manually.
Moreover, due to fatigue or a lack of expertise, the
accuracy of manual data processing is highly questionable.
For these reasons, more and more people and institutions
revert to automatic text classification to do the task with
increased accuracy and reduced human bias. The distinction
between shallow and deep learning models has already
been investigated [
        <xref ref-type="bibr" rid="ref8">4</xref>
        ]. Mainly, shallow models
dominated the text classification field from the 1960s until the
early 2010s. Shallow learning refers to statistics-based
models, such as Naïve Bayes (NB), K-Nearest Neighbor
(KNN), and Support Vector Machine (SVM). These
methods had their fair share of success. However, they still
require feature engineering, which costs time and
financial resources. In addition, they disregard the
natural sequential structure and contextual information in
textual data. Thus, these models often fail to assign
correct semantics to words. In this research paper, we test
both such shallow models and state-of-the-art transformer-based
models, and compare their performance.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Models</title>
      <p>
        The shallow models tested in this paper are the
well-explored Naive Bayes, Support Vector Machine and
K-Nearest Neighbor. Bayesian classifiers assign the most
likely class to a given example described by its feature
vector [
        <xref ref-type="bibr" rid="ref9">5</xref>
        ]. On the other hand, Support Vector
Machines are supervised learning models with associated
learning algorithms that analyze data for classification
and regression analysis [
        <xref ref-type="bibr" rid="ref10">6</xref>
        ]. Finally, the K-Nearest
Neighbor is a non-parametric classification method, which is
simple but effective in many cases. For a data record t to
be classified, its k nearest neighbours are retrieved, and
these form a neighbourhood of t [
        <xref ref-type="bibr" rid="ref11">7</xref>
        ].
      </p>
      <p>
        The deep neural models tested use Bidirectional
Encoder Representations from Transformers (BERT) [
        <xref ref-type="bibr" rid="ref3">8</xref>
        ] and
the second-generation Generative Pre-trained Transformer
(GPT-2) [
        <xref ref-type="bibr" rid="ref4">9</xref>
        ], implemented by the Huggingface library
[
        <xref ref-type="bibr" rid="ref5">10</xref>
        ]. Both of them are models based on the transformer
architecture, and they differ fundamentally in that BERT has just
the encoder blocks of the transformer, whilst GPT-2
has just the decoder blocks. Moreover, GPT-2 is like a
traditional language model that takes
word vectors as input and estimates the probability of the next
token in the sentence given the context of the previous
words. Thus, GPT-2 generates one token at a time [
        <xref ref-type="bibr" rid="ref6">11</xref>
        ].
      </p>
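      <p>To make this architectural contrast concrete, the following minimal sketch loads both models for sequence classification through the Huggingface transformers library; the checkpoint names are the standard public ones, an illustrative assumption rather than our exact setup.</p>
      <preformat>
# Minimal sketch: BERT (encoder-only) and GPT-2 (decoder-only) exposed
# through the same sequence-classification interface in transformers.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# BERT reads the whole sentence bidirectionally.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# GPT-2 processes tokens left to right and has no native padding
# token, so the end-of-sequence token is commonly reused for padding.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2_tok.pad_token = gpt2_tok.eos_token
gpt2 = AutoModelForSequenceClassification.from_pretrained(
    "gpt2", num_labels=2)
gpt2.config.pad_token_id = gpt2_tok.pad_token_id
      </preformat>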
      <p>GPT-2 is auto-regressive in nature: each generated token is added to the sequence of inputs, and the model predicts the next word as output. By contrast, BERT is not auto-regressive; it uses the entire surrounding context at once. Tables 1 and 2 show sample sentences from the two datasets together with their labels.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Sample sentences from the 'Text_types' dataset.</p></caption>
        <table>
          <thead><tr><th>Sentence</th><th>Label</th></tr></thead>
          <tbody>
            <tr><td>provisions relating to the act of accession of 16 april 2003</td><td>Legal</td></tr>
            <tr><td>79 particulars to appear on the outer packaging</td><td>Medical</td></tr>
            <tr><td>desloratadine was not teratogenic in animal studies.</td><td>Medical</td></tr>
            <tr><td>there’s no actress in town who can hold a candle to her.</td><td>Vernacular</td></tr>
            <tr><td>each press of this button cycles through the following three indicator display options:</td><td>Tech</td></tr>
            <tr><td>”a further leading interest rate indicator , the eurepo , was established in early 2002 .”</td><td>Finances</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Sample sentences from the 'Digital' dataset.</p></caption>
        <table>
          <thead><tr><th>Sentence</th><th>Label</th></tr></thead>
          <tbody>
            <tr><td>Averages over the reference period referred to in Article 2(2) of Regulation (EC) No 1249/96:</td><td>Email</td></tr>
            <tr><td>A discount of 10 EUR/t (Article 4(3) of Regulation (EC) No 1249/96).</td><td>Marketing</td></tr>
            <tr><td>”The risk is limited to the explosion of a single article.”</td><td>Social Media</td></tr>
            <tr><td>”a rating of 75 Ah, and ”</td><td>Social Media</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The main idea of this research paper is to compare the
results of different classifiers on two datasets, 'Text_types'
and 'Digital', described in Section 3.1, with
regard to relevant metrics, more precisely precision,
recall, accuracy, and F1.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>Two Pangeanic internal datasets are used for the experiments.
The first dataset, called 'Text_types', is comprised of 8.4M values
and is divided into five classes: vernacular, legal, medical, tech
and financial text. On the other hand, the second dataset is comprised
of 1.3M values and represents digital text content divided into
3 classes: Social Media, Marketing and Email content. The second
dataset is referred to as 'Digital'.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Tools</title>
        <p>The experiments were executed using 24 parallelized
CPU units of type x86_64, and an NVIDIA Titan GPU with
CUDA version 11.0.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>Numerous different experiments, tests and trials have
been conducted in order to observe the widest possible array
of results. Namely, the code was executed with
different quantities of data, on different servers, and on
different datasets. While BERT was tested both as a
multiclass and as a binary model, GPT-2 was used as a
binary model on all the classes of a certain dataset.</p>
      <sec id="sec-4-1">
        <title>4.1. Case 1: Simple Classifiers and Grid Search</title>
        <p>The 'Text_types' dataset was reduced to a total of 8437
units. Randomized Search and Grid Search cross-validation were
applied with the help of the scikit-learn library in order to
choose the best hyperparameters for each simple classifier.
For the K-Nearest Neighbor, the optimal parameters chosen were
the following: the weights were set inversely proportional to
the distance, and a total number of 3 nearest neighbors was
chosen. On the other hand, the optimal parameters chosen in
the grid search for Naive Bayes were a fitted prior and an
additive smoothing parameter of 0.01. For the Logistic
Regression, the parameters chosen were the Newton-CG solver,
no penalty, and a constant added to the decision function.
For the C-Support Vector Classifier, the kernel type chosen
was 'rbf' and the degree of the polynomial kernel function
was 5. A minimal sketch of such a search is shown below.</p>
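        <p>The sketch below illustrates this tuning with scikit-learn's GridSearchCV; the toy corpus and the candidate grids are illustrative assumptions, with the winning values reported above included among the candidates.</p>
        <preformat>
# Minimal sketch: Grid Search over the simple classifiers (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the 8437 'Text_types' units.
texts = ["provisions of the act", "dose of 10 mg", "press the button",
         "rates rose in 2002", "appeal dismissed", "adverse events seen",
         "firmware update applied", "shares fell sharply"]
labels = ["legal", "medical", "tech", "finance",
          "legal", "medical", "tech", "finance"]
X = TfidfVectorizer().fit_transform(texts)

# Candidate grids; the reported winners (distance weights, 3 neighbors,
# fitted prior, alpha=0.01, rbf kernel, degree 5) are among the options.
grids = {
    "knn": (KNeighborsClassifier(),
            {"weights": ["uniform", "distance"], "n_neighbors": [1, 3]}),
    "nb": (MultinomialNB(),
           {"fit_prior": [True, False], "alpha": [0.01, 0.1, 1.0]}),
    "svc": (SVC(), {"kernel": ["rbf", "poly"], "degree": [3, 5]}),
}
for name, (estimator, grid) in grids.items():
    search = GridSearchCV(estimator, grid, cv=2)
    search.fit(X, labels)
    print(name, search.best_params_)
        </preformat>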
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Case 2: Binary BERT</title>
        <p>The first model tested was a binary BERT model from the
'huggingface' library. The 'Text_types' dataset was reduced to
1687 samples for the sake of faster execution of the code. The
dataset was turned into a binary one, in this case with 'legal'
and 'non-legal' text categories. The analysis was conducted with
the pretrained BERT-base-uncased model, and the results were the
following: an accuracy of 98.46% was achieved, accompanied by an
F1 score of 98.72% for the legal class and 98.07% for the
non-legal class.</p>
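        <p>The fine-tuning setup can be summarized in a minimal sketch with the Huggingface Trainer API; the two-sentence dataset below is a hypothetical placeholder for the 1687 legal / non-legal samples, and the training arguments are illustrative rather than the exact ones used.</p>
        <preformat>
# Minimal sketch: fine-tuning bert-base-uncased as a binary classifier.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class TextDataset(Dataset):
    """Wraps tokenized texts and integer labels for the Trainer."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = non-legal, 1 = legal

# Placeholder data; the real experiment used 1687 labeled samples.
train_ds = TextDataset(["provisions relating to the act of accession",
                        "no actress in town can hold a candle to her"],
                       [1, 0], tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-legal", num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()
        </preformat>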
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Case 3: Multiclass BERT</title>
        <p>The BERT-base-uncased pretrained model was also used on a
multiclass case of the same dataset ('Text_types') and later on
the 'Digital' one. The 'Text_types' dataset was tested with 844
samples split into training and validation sets. The 'Vernacular'
class had an accuracy of 50/52, 'Finances' 10/14, 'Legal' 12/15,
'Medical' 24/26, and 'Tech' 18/20; in total, 114 of the 127
validation samples were classified correctly, so the overall
accuracy of the model on the 'Text_types' dataset was
therefore 89.76%.</p>
        <p>For the 'Digital' dataset, on the other hand, 9000 samples
were used, which were later split into training and validation
sets, and the BERT model was fine-tuned with the following
results: the email, marketing and social media classes had true
positive rates of 416/455 (91.43% accuracy), 417/456 (91.65%
accuracy) and 425/455 (93.4% accuracy), respectively. Namely,
this is a weighted accuracy of 92.05%.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Case 4: Binary GPT-2</title>
        <p>A binary GPT-2 model by OpenAI was tested on 5062 samples
of the 'Vernacular' vs. 'Non-Vernacular' classes of the
'Text_types' dataset, with an obtained weighted accuracy of 98%.
Below, one can observe the training and validation loss for the
given classes as well as the confusion matrix. [Figure 3:
Training and validation loss for the Vernacular vs. non-Vernacular
classes with GPT-2 on the 'Text_types' dataset.]</p>
        <p>The same model was also tested on all three classes of the
'Digital' dataset, with a total of 13336 training samples
for each class.</p>
        <p>As can be observed, the results for discriminating
between the marketing and non-marketing classes with the
GPT-2 model were interesting: namely, a weighted
average of 89% was obtained for GPT-2 trained on the 'Digital'
dataset. Below are visual representations of the success
of this model in discriminating between the other two
classes.</p>
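        <p>The weighted figures quoted throughout this section are per-class scores averaged by class support. A minimal sketch of how such numbers and the confusion matrices are obtained with scikit-learn, on hypothetical prediction lists:</p>
        <preformat>
# Minimal sketch: weighted metrics and a confusion matrix from predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical gold labels and model predictions.
y_true = ["marketing", "non-marketing", "marketing", "non-marketing"]
y_pred = ["marketing", "non-marketing", "non-marketing", "non-marketing"]

print(accuracy_score(y_true, y_pred))
# "weighted" averages the per-class scores by class support.
print(precision_recall_fscore_support(y_true, y_pred, average="weighted"))
print(confusion_matrix(y_true, y_pred,
                       labels=["marketing", "non-marketing"]))
        </preformat>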
        <p>The weighted average accuracy between the social
media and non-social media classes was 96%, and for the
email vs. non-email classes it was 94%. The total weighted
average accuracy of the binary GPT-2 model on the 'Digital'
dataset was 93%.</p>
      </sec>
      <p>For the ’Digital’ dataset, on the other hand, 9000 sam- 4.5. Results
ples were used which were later split to training and
validation sets, and the BERT model was fine-tuned with Results of the research may be observed in the Table 3.
the following results. The email, marketing and social K-Nearest Neighbor, Multinomial Naive Bayes, Logistic
media class had the true positive rates of 416/455 (91.43% Regression C-Support Vector Classifier and Linear
accuracy), 417/456 (91.65% accuracy) and 425/455 (93.4% Support Vector Machine Classifier were tested against
accuracy). Namely, this is a weighted accuracy of 92.05%. the ’Text_Type’ dataset, with the vectorization type
chosen being Character level TF-IDF vector, whereas
the Random Forest model was assigned the word
level TF-IDF vectorization as the character one was
incompatible with the classifier. The best results in
terms of accuracy for the multiclass case were obtained
with the BERT model by the huggingface library and
the C-Support vector classifier from the scikit-learn.</p>
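        <p>The two vectorization choices differ only in the unit over which the TF-IDF statistics are computed; a minimal sketch on a hypothetical corpus (whether word-boundary-aware character n-grams were used is an assumption):</p>
        <preformat>
# Minimal sketch: character-level vs. word-level TF-IDF (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["a rating of 75 Ah", "desloratadine was not teratogenic"]

# Character n-grams within word boundaries (assumed for most classifiers).
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
# Plain word unigrams (as assigned to the Random Forest).
word_vec = TfidfVectorizer(analyzer="word")

print(char_vec.fit_transform(corpus).shape)
print(word_vec.fit_transform(corpus).shape)
        </preformat>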
        <p>On the other hand, the best results for the binary case
were obtained with the GPT-2 classifier on the legal vs.
non-legal classes. The absolute best results in terms of
precision, recall and F1 were achieved by the binary BERT,
whereas the best results in terms of those same metrics for
the multiclass case were achieved by the C-Support Vector
Classifier from the scikit-learn library. Bear in mind that
the precision, recall and F1 for the BERT pretrained uncased
model remain unknown, and might indeed be greater than for
the other classifiers.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>According to our research, BERT and GPT-2 appear to
perform excellently in a classification task, although BERT
appears to outperform GPT-2 by one
percentage point in terms of accuracy. Both of these models
significantly outperformed the shallow models, though
it should be borne in mind that GPT-2 was only tested
on a binary case. This is in line with the current research
on the performance of large-scale transformer models
in classification tasks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Some further research might be done comparing the
performance of the multiclass GPT-2 on classification tasks
with that of BERT. It would be interesting to observe
if BERT always performs better, or if it only performs
better on certain kinds of datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Daly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <article-title>Learning word vectors for sentiment analysis</article-title>
          ,
          <year>2011</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Baselines and bigrams: Simple, good sentiment and topic classification, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: approach in classification (</article-title>
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2018</year>
          . URL: https:// arxiv.org/abs/
          <year>1810</year>
          .04805. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1810</year>
          .
          <volume>04805</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <article-title>9</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Huggingface</surname>
            <given-names>website</given-names>
          </string-name>
          , https://huggingface.co/, ???? Accessed:
          <fpage>2010</fpage>
          -09-30.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners (</article-title>
          <year>2018</year>
          ). URL: https://d4mucfpksywv.cloudfront.net/ Short Papers),
          <article-title>Association for Computational Lin- better-language-models/language-models.pdf</article-title>
          . guistics, Jeju Island, Korea,
          <year>2012</year>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>94</lpage>
          . URL: [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert: https://aclanthology.org/P12-2018.
          <article-title>Pre-training of deep bidirectional transformers for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <source>Automatic labeling of language understanding</source>
          ,
          <year>2019</year>
          . arXiv:
          <year>1810</year>
          .04805.
          <article-title>multinomial topic models</article-title>
          , in: Proceedings of the [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>How to 13th ACM SIGKDD international conference on ifne-tune BERT for text classification?, CoRR Knowledge discovery</article-title>
          and data mining,
          <year>2007</year>
          , pp.
          <fpage>abs</fpage>
          /
          <year>1905</year>
          .05583 (
          <year>2019</year>
          ). URL: http://arxiv.org/ 490-
          <fpage>499</fpage>
          . abs/
          <year>1905</year>
          .05583. arXiv:
          <year>1905</year>
          .05583.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          , P. S. [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>González-Carvajal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Garrido-Merchán</surname>
          </string-name>
          , ComYu,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>A survey on text classification: From paring BERT against traditional machine learnshallow to deep learning</article-title>
          , CoRR abs/
          <year>2008</year>
          .00364 ing text classification, CoRR abs/
          <year>2005</year>
          .13012 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2008</year>
          .00364. (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2005</year>
          .13012. arXiv:
          <year>2008</year>
          .00364. arXiv:
          <year>2005</year>
          .13012.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [5]
          <string-name>
            <surname>I. Rish,</surname>
          </string-name>
          <article-title>An empirical study of the naïve bayes classifier</article-title>
          ,
          <source>IJCAI 2001 Work Empir Methods Artif Intell</source>
          <volume>3</volume>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Support vector networks</article-title>
          ,
          <source>Machine Learning</source>
          <volume>20</volume>
          (
          <year>1995</year>
          )
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <article-title>Knn model-based</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>