Adapting Transformers for Multi-Label Text
Classification
Haytame Fallah1,3,* , Patrice Bellot1 , Emmanuel Bruno2 and Elisabeth Murisasco2
1 Aix-Marseille Univ, University of Toulon, CNRS, LIS, Marseille, France
2 University of Toulon, Aix-Marseille Univ, CNRS, LIS, Toulon, France
3 Hyperbios, Toulon, France


Abstract
Pre-trained language models have proven to be effective in multi-class text classification. Our goal is to study and improve this approach for multi-label text classification, a task that has been surprisingly little explored in recent years despite its many real-world applications. In this paper, our originality is to propose architectures for the classification layers that are used on top of transformers to improve their performance for multi-label classification. Our contribution involves the evaluation of thresholding methods on several transformers, either by computing an individual threshold for each label (IT) or a global one (GCT). We also propose two approaches for multi-label text classification. The first consists in adding a parameter for learning the number of labels present in a given example (NHA). The second consists in adding a layer to the classification layers in order to learn the features for selecting the relevant labels while avoiding the use of thresholds (TL). We evaluate these approaches on two English corpora of newspaper articles and scientific papers, and then on a new, publicly available multi-label dataset of French scientific article abstracts. The evaluations show that our proposals outperform state-of-the-art multi-label text classification methods on the evaluated datasets and that they are transposable to any multi-label classification problem.

Keywords
Multi-label classification, Transformers, BERT, French Transformers




1. Introduction
Multi-label classification is a generalization of multi-class classification, in which each instance is associated with exactly one label. In multi-label text classification, the goal is to associate one or more labels with the input text sample. It is an important natural language processing task that has many applications in other NLP tasks such as question answering or entity recognition, but also real-world applications such as information retrieval (e.g. metadata enrichment and analysis in digital libraries) or content recommendation.
   Multi-label text classification is a challenging task because several factors must be taken into account: the dependencies that can exist between the labels, the complexity of

CIRCLE (Joint Conference of the Information Retrieval Communities in Europe) 2022, July 04–07, 2022, Samatan,
FRANCE
* Corresponding author.
haytame.fallah@lis-lab.fr (H. Fallah); patrice.bellot@univ-amu.fr (P. Bellot); emmanuel.bruno@univ-tln.fr (E. Bruno); elisabeth.murisasco@univ-tln.fr (E. Murisasco)
ORCID: 0000-0001-8698-5055 (P. Bellot)
extracting semantic features from the noisy input text that can contain redundant information,
and mapping those features to multiple targets, while also finding the discriminative information
that allows the identification of each label of the document.
Several methods have been proposed to tackle the multi-label classification problem, whether traditional methods such as Binary Relevance [1], or deep learning-based approaches such as CNNs, RNNs (and a combination of both [2]) or the attention mechanism. These methods manage to capture the semantic features of the document but fail to consider the dependencies that can exist between labels. Hierarchical models [3] and graph neural networks [4], as well as other architectures [5], have been introduced to better capture those dependencies. But with the emergence of attention-based transformers [6] and their ability to better extract the semantic representations of text documents, the adaptation of these models for multi-label text classification, which has the potential of achieving overall better results, is yet to be explored.
   Few multi-label text datasets are popular among the papers treating the multi-label problem. AAPD [5] and Reuters [7] seem to be the most used datasets in the literature. This is even more true for the French language: only a few studies have involved French datasets in multi-label text classification [8, 9]. We therefore introduce in this paper a new French multi-label dataset.
   In this article, our main contributions are:

    • The creation of a French corpus of multi-label text classification, MFHAD (for Multilabel
      French HAL Abstracts Dataset), containing the abstracts of scientific articles obtained
      from the open archive HAL (https://hal.archives-ouvertes.fr) that contains more than one
      million papers,
    • The adaptation of available transformer models for multi-label classification,
     • The study of threshold selection methods for more efficient exploitation of the results
       of the transformer models, in particular the choice of a global threshold, or a threshold
       specific to each label for the individual optimization of the labels,
     • The proposal of two alternative approaches to thresholding for the selection of relevant
       labels. The first one consists in introducing a parameter at the last layer of the transformer
       which is trained to compute the number of labels present in an example; the value of
       this parameter is used to select the labels having the strongest activations. The second
       one consists in adding a final layer to the model, with the same number of parameters
       as the second-to-last layer (equal to the number of labels), in order to obtain more
       discriminating activation values for a given example, i.e. high activation if a label is
       present, low activation otherwise.

  The article is organized as follows: Section 2 presents the approaches that address the multi-label classification problem, Sections 3 and 4 describe the proposed approaches, and Section 5 is dedicated to experiments.


2. Related Work
In multi-class classification, each example (instance) 𝑋 in the dataset is associated with a single label. Multi-label classification goes further: each entry can be associated with multiple labels from 𝑌, rather than just one.
2.1. Multi-label Classification Strategies
Multi-label classification methods can be classified into three categories: problem transformation,
adaptation, and ensemble methods.

2.1.1. Problem Transformation (PT)
Problem transformation consists in 'transforming' the dataset to change the problem into a single-label multi-class classification. One such method, the label powerset [10], considers all possible unique combinations of labels and trains a multi-class classifier 𝑀 : 𝑋 → 𝑃 (𝑌 ), where 𝑃 (𝑌 ) is the powerset of 𝑌, the set of unique and distinct subsets of labels. In addition to the high number of possible label sets, which can reach 2|𝑌 | , the challenge lies in finding enough examples for each combination of labels. For a large |𝑌 |, the training and inference time of the models is high. It is important to note that by transforming the problem into a multi-class classification, the dependencies that may exist between the different labels are no longer considered [11].

2.1.2. Ensemble Methods
A set of multi-class classifiers can be combined to create a multi-label classifier. For a given
instance, each classifier will predict a single label and all outputs of these classifiers are then
combined via an ensemble method. One of these methods consists in considering a label as
present if a percentage of classifiers having predicted this label is reached, also called the
discriminative threshold. The 𝑅𝐴𝐾𝐸𝐿 algorithm [12] is another variation of this method: classifiers trained on random subsets of the label powerset are combined into a multi-label classifier, and the predictions of these classifiers go through a voting system for the final prediction. The use of multiple classifiers imposes strong constraints in terms of memory use, as well as the need to optimize a number of models that increases linearly with the number of labels in the dataset.

2.1.3. Problem Adaptation (PA)
Problem adaptation methods do not require a transformation of the dataset but an adaptation of
classification algorithms, such as ML-kNN [13] which extends the kNN algorithm for multi-label
data, or BP-MLL [14] an adaptation of the backpropagation algorithm for neural networks.
   The adaptation of deep learning algorithms for multi-label problems remains, in general, an avenue with few contributions. An adaptation of these approaches could contribute to a significant increase in performance. The use of a single model, without the need for prior data transformation, is an efficient way to address the multi-label problem.

2.2. Thresholding Methods
Thresholding methods directly impact the choice of a label for the multi-label problem. The
threshold can be adjusted in several ways, either to optimize all the labels (a global threshold),
or to optimize each label individually (number of thresholds equal to the number of labels). Let
𝑚 be the number of examples in the test dataset (or validation) and 𝑛𝑦 the number of labels.
The four most commonly used strategies for choosing the threshold(s) are:
   - SCut: Labels are optimized individually; thresholds are chosen based on the validation set, either by maximizing a score or by minimizing a cost function, without guaranteeing a global optimum. This method can also be used to obtain a global threshold;
   - RCut (Rank Cut): Labels are ordered according to their score, and the 𝑡 first labels are chosen as relevant labels. The 𝑡 parameter is either predefined or set from the validation dataset [15];
   - PCut (Proportion Cut): For each label 𝑦𝑖 , the instances of the test dataset are ordered according to the score obtained for this label. The first 𝑘𝑖 instances are assigned to the label 𝑦𝑖 , where 𝑘𝑖 = 𝑃 (𝑦𝑖 ) × 𝑥 × 𝑛𝑦 . 𝑃 (𝑦𝑖 ) is the probability of an instance belonging to the label 𝑦𝑖 (computed from the training set), and 𝑥 is the average number of instances to be assigned per label, set beforehand. If 𝑥 = 𝑛 all instances are taken; for 𝑥 = 0 no instance is considered to be part of the evaluated label [16, 17];
   - MCut (Maximum Cut): The labels are ordered by the scores obtained for an instance of the dataset; the threshold is set to the average of the two contiguous labels for which the score difference is the largest [18].
   Variations of these methods, aiming to overcome the constraints they may impose, have been proposed [15]. The classification score-based methods are the best among the thresholding approaches [19]. In this paper, the proposed approaches are similar in nature to the SCut and RCut methods, but with different implementations.
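   As an illustration of the last strategy, here is a minimal sketch of MCut in Python; the function name and NumPy-based interface are ours, not from [18]:

```python
import numpy as np

def mcut(scores):
    """MCut: for one instance, sort the label scores and place the
    threshold at the midpoint of the largest gap between two
    consecutively ranked scores."""
    s = np.sort(scores)[::-1]        # scores in decreasing order
    gaps = s[:-1] - s[1:]            # differences between neighbouring scores
    i = int(np.argmax(gaps))         # position of the largest gap
    t = (s[i] + s[i + 1]) / 2.0      # midpoint threshold
    return scores > t                # boolean mask of predicted labels
```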



Figure 1: BERT architecture with a dense classification layer, on top of the transformer layers, connected to the [CLS] token. (The input sequence [CLS] ... [SEP] [PAD] goes through the embedding layer and 12 transformer layers; the pooled [CLS] output feeds the classifier, which produces 𝑛𝑦 activations.)




2.3. Deep Learning-Based Approaches
[20] uses a neural network for multi-label classification in spectroscopic multi-composition analysis: a classifier to which a parameter is added for learning an activation threshold. This parameter is optimized against the threshold computed by applying the model to the training set; the target value of the threshold is thus different for each iteration of the training phase, rendering the training process much more difficult.
   On the other hand, MAGNET [4], a graph network implementing the attention mechanism to capture the dependency structure between labels and using BERT's embeddings, explicitly tackles the multi-label text classification problem and achieves good F1 performance on the AAPD and Reuters datasets (cf. section 5.2). DocBERT [21], which is now the state-of-the-art reference, adds a linear network to the head of BERT, but without subsequent processing of the model outputs that could enhance the performance of the transformer (e.g. thresholding techniques to better choose the relevant labels).
   Transformers have been used for the ’extreme’ multi-label classification of text where very
large corpora of texts with a number of labels that can reach tens of thousands are processed.
Such architectures are not well suited for short texts.
   Attempts to use neural networks for text classification do not focus on multi-label classification. Those that do address this problem generally give little importance to how the output layer activations are exploited (e.g. the use of thresholds).


3. BERT Adaptation
BERT introduces bi-directionality in the prediction of masked tokens, where both the left and
right semantic contexts of the word to be predicted are considered. In addition to masked
language modeling, BERT is trained for next-sentence prediction, a task where the model
receives a pair of sentences and tries to predict whether the second sentence follows the first.
BERT introduces a special classification token [CLS] (also having an identifier and an embedding
vector) containing a hidden state of the sentence, updated in each layer of the model.
   A feed-forward neural network (FFNN) of 𝐿 dense layers (usually 𝐿 = 2) is added on top of the last transformer layer of the model, in order to fine-tune the pre-trained transformer for the desired NLP task (text classification in our case). The [CLS] token is the input of this FFNN, which has 𝑛𝑦 outputs corresponding to the labels of the dataset (an approach similar to [21]). Figure 1 shows the architecture of the model.
   For multi-label classification, the values of the activations 𝐴[𝐿] of the output layer 𝐶 [𝐿] can be used to determine the presence of a label. Each activation can be a value between 0 and 1 representing the probability of the presence of the corresponding label. A threshold is then used for the label selection process; the trivial value of 0.5 is usually used for this purpose. The sigmoid function 𝜎 is the activation function suited for this case, alongside binary cross-entropy (BCE) as the loss function.
   This approach cannot be considered entirely as a problem transformation: it does not require a transformation of the dataset or the creation of multiple binary classifiers. Nor is it a complete adaptation of neural networks for multi-label classification: the outputs must still be processed afterwards to obtain the final classifications.
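   As an illustration of this setup, the following is a minimal sketch using PyTorch and Hugging Face's transformers library. The class name, the choice of a Tanh activation between the two dense layers, and the default sizes are our assumptions, not the exact implementation used in our experiments:

```python
from torch import nn
from transformers import AutoModel

class MultiLabelBERT(nn.Module):
    def __init__(self, model_name="bert-base-uncased", n_labels=90):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size      # 768 for the base models
        # FFNN of L = 2 dense layers on top of the [CLS] representation
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Tanh(),                                # assumed intermediate activation
            nn.Linear(hidden, n_labels),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]             # hidden state of the [CLS] token
        return self.classifier(cls)                   # n_y logits, one per label

# Sigmoid + binary cross-entropy as described above;
# BCEWithLogitsLoss fuses the two for numerical stability.
loss_fn = nn.BCEWithLogitsLoss()
```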
Figure 2: Alternative approaches' architectures. (a) Added number-of-labels parameter architecture (NHA approach): the head outputs 𝑛𝑦 activations plus one neuron for the label count, trained with BCE and MAE/MSE under a single optimizer. (b) Added thresholding layer architecture (TL approach): a second layer of 𝑛𝑦 activations on top of the classifier output, trained with a dedicated second optimizer.


4. Thresholding Methods
Thresholding approaches can be applied to the transformer if we consider each activation in the final layer as a binary classifier for the label it represents: if the activation is high, the label is considered valid, and vice-versa.
   We propose two alternative methods to thresholding in an attempt to make more efficient use of the activation values 𝐴[𝐿] of the output layer. These approaches aim to avoid thresholding by learning text-specific features for a better label selection.

4.1. Global Classification Threshold (GCT)
The classification threshold 𝑠 can be chosen to maximize the classification scores. During the training of the model and after each iteration, the value of 𝑠 is varied from 0 to 1, with a step defined beforehand ($10^{-2}$). The micro-F1 score is then computed for each threshold 𝑠. We finally obtain the global optimal threshold 𝑔𝑐𝑡 which provides the best performance on the training dataset. This threshold is then used for the validation and test datasets. The set of labels present for an example 𝑥 can be expressed as:

$$Y_{x \in X} = \bigcup_{y_i \in Y} \{y_i\} : \sigma(a^{[L]}_{y_i}) \ge gct \qquad (1)$$

   $a^{[L]}_{y_i}$ being the activation corresponding to label $y_i$ among the activations $A^{[L]}$. A variant of
the SCut method (with a global threshold) consists in fine-tuning the optimal threshold from
the validation dataset and then applying it to the test dataset. Only the first approach has been
studied in this article.
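   A minimal sketch of this sweep, assuming the sigmoid outputs and the binary label matrix are available as NumPy arrays (the function name is ours):

```python
import numpy as np
from sklearn.metrics import f1_score

def global_classification_threshold(probs, y_true, step=1e-2):
    """Vary a shared threshold from 0 to 1 with the given step and keep
    the value that maximizes micro-F1 (the gct of Eq. 1).
    probs: (m, n_y) sigmoid outputs; y_true: (m, n_y) binary matrix."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.arange(0.0, 1.0, step):
        f1 = f1_score(y_true, probs >= t, average="micro", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```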

4.2. Individual Thresholds (IT)
The use of a shared threshold 𝑠 for all labels assumes that the features associated with the activations 𝐴[𝐿] are the same for each label, which is not the case: the activation intensity of a neuron in the last layer, $a^{[L]}_{y_i}$, when the label 𝑦 is present, varies from label to label, an effect that is accentuated if the dataset is unbalanced.
   We therefore propose to evaluate the SCut method (with individual thresholds) by setting a threshold 𝑖𝑡𝑦 for each label 𝑦 in the dataset, applied to the activations $a^{[L]}_{y_i}$ of 𝐶 [𝐿] . The values of 𝑖𝑡𝑦 are the values that maximize the classification scores for a label 𝑦. These thresholds are computed like 𝑔𝑐𝑡 during the training phase of the model, on the training dataset, by varying each threshold to maximize the F1 score for each label.

$$Y_{x \in X} = \bigcup_{y_i \in Y} \{y_i\} : \sigma(a^{[L]}_{y_i}) \ge it_y \qquad (2)$$
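   The individual-threshold search can be sketched along the same lines, again as a hedged illustration reusing the conventions of the GCT sketch above:

```python
import numpy as np
from sklearn.metrics import f1_score

def individual_thresholds(probs, y_true, step=1e-2):
    """SCut-style search: one threshold per label, each chosen to
    maximize that label's F1 on the training set (the it_y of Eq. 2)."""
    n_y = probs.shape[1]
    thresholds = np.full(n_y, 0.5)
    for j in range(n_y):
        best_f1 = -1.0
        for t in np.arange(0.0, 1.0, step):
            f1 = f1_score(y_true[:, j], probs[:, j] >= t, zero_division=0)
            if f1 > best_f1:
                thresholds[j], best_f1 = t, f1
    return thresholds  # label j is predicted when probs[:, j] >= thresholds[j]
```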

4.3. N Highest Activations (NHA)
Using a threshold to determine the presence of a label leads in several cases to an under-
classification (respectively over-classification) of an instance when the predicted number of
labels is lower (respectively higher) than the actual number of labels. The first proposed alternative consists in introducing a neuron at the last layer which is used only for computing the number of labels present for an instance. Let 𝐴′[𝐿] be the list of the 𝑁 highest activations, 𝑁 being the number of labels present for a given instance:

$$Y_{x \in X} = \bigcup_{y_i \in Y} \{y_i\} : a^{[L]}_{y_i} \in A'^{[L]} \qquad (3)$$
The number of labels present for an instance is the target value; the mean absolute error (MAE) is used as the loss function for this regression problem. We use a single optimizer in the backpropagation; the regression error must therefore be scaled to match the classification error, which is done by reducing the regression error by a factor of 5. The value of this neuron is used to recover the 𝑁 highest activations, which are considered as the predicted labels, similarly to the RCut method. Figure 2a shows the model architecture for this approach.
   The objective of this architecture is to extract from the [CLS] token criteria or features about the number of distinct topics present in the text. This token, the final output of the transformer layers, contains an information-rich representation of the input text.
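   A possible implementation of this head is sketched below, assuming PyTorch; the way the count neuron shares the final linear layer and the rounding at prediction time are our assumptions:

```python
import torch
from torch import nn

class NHAHead(nn.Module):
    """Final layer with n_y label logits plus one extra neuron that
    regresses the number of labels present in the instance."""
    def __init__(self, hidden, n_labels):
        super().__init__()
        self.out = nn.Linear(hidden, n_labels + 1)

    def forward(self, cls):
        z = self.out(cls)                 # cls: the [CLS] representation
        return z[:, :-1], z[:, -1]        # label logits, predicted count

bce, mae = nn.BCEWithLogitsLoss(), nn.L1Loss()

def nha_loss(logits, count_pred, y_true):
    # Single optimizer: the regression (MAE) error is scaled down by a
    # factor of 5 to match the magnitude of the classification error.
    return bce(logits, y_true) + mae(count_pred, y_true.sum(dim=1)) / 5.0

def nha_predict(logits, count_pred):
    # Keep the N highest activations, N being the rounded predicted count.
    preds = torch.zeros_like(logits, dtype=torch.bool)
    for i, k in enumerate(count_pred.round().clamp(min=1).long()):
        preds[i, logits[i].topk(int(k)).indices] = True
    return preds
```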

4.4. Threshold Layer (TL)
The second proposed approach is based on the addition of a dense layer, after the output of the
classifier, for a total of 𝐿 = 3 layers, having 𝑛𝑦 neurons to match the last classification layer
of the model. This additional layer aims to make the values of the final activations as close as
possible to 1 in the case of the presence of the label, or to 0 in the opposite case. Therefore, the
trivial threshold of 0.5 can be used for the classification.
   The addition of this layer could result in an increase in the recall of the classifier as many
activations of the non-present labels will no longer exceed the classification threshold since
they will be as close as possible to 0. A gain in precision can also be expected as the activations
of the present labels will be emphasized more than the activations of the irrelevant labels.
   For this approach, we use two optimizers: the first one optimizes all the layers of the transformer as well as the first two layers of the classifier, and the second one is dedicated to the optimization of the last layer that we added. The error function is the same for both parts of the classifier, the BCE in this case. The final architecture of the classifier is presented in Figure 2b.
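   One plausible wiring of this two-optimizer training is sketched below (assuming PyTorch; the learning rates and the detachment of the intermediate activations before the added layer are our assumptions, not the exact setup of our experiments):

```python
import torch
from torch import nn
from transformers import AutoModel

n_y = 90                                             # illustrative label count
encoder = AutoModel.from_pretrained("bert-base-uncased")
hidden = encoder.config.hidden_size
body = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                     nn.Linear(hidden, n_y))         # first two classifier layers
threshold_layer = nn.Linear(n_y, n_y)                # the added TL layer

# Optimizer 1: transformer + first two classifier layers; Optimizer 2: TL layer.
opt1 = torch.optim.AdamW(list(encoder.parameters()) + list(body.parameters()), lr=2e-5)
opt2 = torch.optim.AdamW(threshold_layer.parameters(), lr=2e-5)
bce = nn.BCEWithLogitsLoss()                         # same error for both parts

def training_step(input_ids, attention_mask, y_true):
    cls = encoder(input_ids=input_ids,
                  attention_mask=attention_mask).last_hidden_state[:, 0]
    logits = body(cls)                               # n_y intermediate activations
    loss1 = bce(logits, y_true)                      # updates encoder + body
    final = threshold_layer(torch.sigmoid(logits.detach()))
    loss2 = bce(final, y_true)                       # updates the added layer only
    opt1.zero_grad(); loss1.backward(); opt1.step()
    opt2.zero_grad(); loss2.backward(); opt2.step()
    # At inference, sigmoid(final) is compared to the trivial 0.5 threshold.
    return loss1.item(), loss2.item()
```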
Table 1
Datasets used; W is the average number of words per abstract.
                                  #Train      #Valid      #Test      Labels      W
               MFHAD              11035        2366       2370        200      140.75
               AAPD               53840        1000       1000        54       163.16
               Reuter-21578        5827        1943       3019        90       127.76




Figure 3: Instance count based on the number of labels for all datasets.




5. Experiments and Results
We present in this section the results of the evaluation of non-deep learning baselines as
well as all the previously mentioned methods on three multi-label text datasets, including the
French dataset "MFHAD" that we have designed and made available. We compare the different
methods, i.e. the thresholding methods as well as the proposed alternatives coupled with BERT
transformer and its variants, to baseline approaches, all put in perspective with optimal target
results (oracle approaches).

5.1. Evaluated Transformers
For the evaluation of the proposed methods, we use Hugging Face's [22] implementation of the uncased-base version of BERT, with 12 transformer layers and embedding vectors of 768 dimensions, as well as the following variants of BERT (uncased-base versions):
  - RoBERTa [23], a variant with an optimized training process, where the Next Sentence Prediction part of BERT is removed. The dataset used is ten times larger than the one used for BERT, which has led to a performance gain over the original version on the natural language processing tasks of the GLUE benchmark;
  - DistilBERT [24], an efficient variant of BERT that uses knowledge distillation to shrink the size of BERT by 40% while keeping 97% of its performance. DistilBERT is based on the fact that after training a very large model, its output distribution can be approximated by a much smaller neural network, using the Kullback-Leibler divergence [25] as the optimization function;
  - DeBERTa [26], the most recent variant of BERT, where words are represented by two vectors that encode their content and their relative position in the sentence. It is also characterized by an optimized decoding of the prediction of hidden tokens, which contributes to a significant gain in efficiency in the pre-training phase of the model, but also in the performance on various natural language processing tasks;
   - CamemBERT [27] is based on RoBERTa, trained on the French part of OSCAR [28];
   - FlauBERT [29] trained on various French sub-corpora of different writing styles, from
formal writings (e.g. Wikipedia and books), to writings extracted from the internet (e.g. Common
Crawl). For this variant, we use the cased-base version.
   The comparison of different transformers is important to evaluate the performance of our
approaches as well as their re-usability and applicability on different transformer architectures.

5.2. Datasets
Multi-label text classification is not present in NLP leaderboards, the GLUE benchmark for
example, nor in recent CLEF or Semeval conferences. There are few multi-label corpora fre-
quently used in the literature that can be used to evaluate and compare models, and this is even
more true for the French language. To remedy this, we have built our own corpus comprised
of abstracts of scientific articles written in French from HAL, a corpus that we make available
to the public. In this section, we provide details about the different datasets used 1 for the
evaluation of the different models :
   - MFHAD2 : Our dataset is comprised of abstracts of French scientific papers collected from 'HAL', an open academic research archive, distributed over three major scientific domains, i.e. computer science, physics, and mathematics. We kept articles published between 1980 and 2021 that have a French abstract, which represents 15 771 documents distributed over 200 different labels;
   - Reuter-21578 3 is a collection of articles from the Reuters newswire from the year 1987. It is a dataset that has often been used to evaluate models for multi-label text classification. An article can belong to one or more of the 90 domains of the dataset;
   - AAPD (ArXiv Academic Paper Dataset) is, not unlike MFHAD, a collection of the abstracts of several scientific publications. An article can have one or more classifications among 54 labels. We use the same training, validation, and test split as [5].
   Table 1 and Figure 3 present the different characteristics of these datasets in more detail.

5.3. Evaluation Method
We will compare the two proposed approaches as well as the coupling of thresholding methods
to transformers with methods that are more explicit on the label selection criteria, but also
with other adaptations of deep learning models to multi-label text classification. We will also
compare all these methods to several upper bounds that represent the optimal results that can
be achieved.




1 All datasets can be downloaded here: https://zenodo.org/record/6344750#.Yio1YH_MK-r
2 The extraction of these abstracts was made using the tool made available by HAL: https://api.archives-ouvertes.fr/docs/search
3 https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection
Table 2
Scores for test dataset from Reuters and AAPD, best scores are in bold blue, the blue up arrow ↑ indicates
when an evaluated approach achieves a gain in performance for the associated model. Orig. refers to
the original result taken from the corresponding paper.
                                               Reuters                                  AAPD
  Models
                                Pr.        R             F1     Acc      Pr.        R           F1      Acc
                                                    Baselines
  Decision Tree               78.36      75.05       76.67      74.23   49.67      46.8        48.19    26.6
  Bagging                     88.12      79.65       83.67      73.34   77.27     48.16        59.34    26.9
  Random Forest               97.18      57.13       71.96      64.06   94.2      25.49        40.12     20
  GradientBoost               88.06      80.56       84.14      74.23   79.73      46.8        58.98    27.1
  SVM                         94.19      79.62       86.29      80.64   80.85     59.98        68.86    36.2
  CNN [30] Orig.                -          -          86.3        -       -         -          66.4       -
  CNN-RNN [2] Orig.             -          -          85.5        -       -         -          66.9       -
  SGM [5] Orig.                 -          -           -          -       -         -          71.0       -
  MAGNET [4] Orig.              -          -          89.9        -       -         -          69.6       -
  DocBERT 𝑏𝑎𝑠𝑒 [21] Orig.       -          -          89.0        -       -         -          73.4       -
  DocBERT 𝑙𝑎𝑟𝑔𝑒 [21] Orig.      -          -          90.7        -       -         -          75.2       -
  BERT 𝑏𝑎𝑠𝑒                   91.36      90.46       90.91      86.18   76.33     71.95        74.07    41.5
  DistilBERT 𝑏𝑎𝑠𝑒             91.29      90.41       90.84      86.52   80.84     66.41        72.92    41.0
  RoBERTa 𝑏𝑎𝑠𝑒                91.23      87.60       89.38      85.29   74.22     71.50        72.83    40.1
  DeBERTa 𝑏𝑎𝑠𝑒                91.63      90.41       91.02      86.78   75.99     71.00        73.41    41.0
                        Thresholding without any specific architecture (GCT and IT)
  BERT 𝑏𝑎𝑠𝑒 + 𝐺𝐶𝑇             90.0     91.74 ↑    90.86      86.45 ↑    75.56     72.65 ↑     74.08 ↑     41.7 ↑
  BERT 𝑏𝑎𝑠𝑒 + 𝐼𝑇              88.89    92.30 ↑    90.56      85.72      75.51     72.61 ↑     74.04       41.3
  DistilBERT 𝑏𝑎𝑠𝑒 + 𝐺𝐶𝑇       90.83    90.78 ↑    90.80      86.29      75.73     72.08 ↑     73.86 ↑     39.8
  DistilBERT 𝑏𝑎𝑠𝑒 + 𝐼𝑇        88.40    91.66 ↑    90.0       85.29      79.06     68.31 ↑     73.3 ↑      41.0
  RoBERTa 𝑏𝑎𝑠𝑒 + 𝐺𝐶𝑇          90.73    89.90 ↑    90.31 ↑    86.28 ↑    74.49 ↑   72.49 ↑     73.47 ↑     40.3 ↑
  RoBERTa 𝑏𝑎𝑠𝑒 + 𝐼𝑇           89.77    90.49 ↑    90.13 ↑    85.92 ↑    75.35     71.09       73.15 ↑     40.2 ↑
  DeBERTa 𝑏𝑎𝑠𝑒 + 𝐺𝐶𝑇          91.63    90.41      91.02      86.78      74.46     73.56 ↑     74.01 ↑     39.4
  DeBERTa 𝑏𝑎𝑠𝑒 + 𝐼𝑇           90.67    90.92 ↑    90.79      86.61      75.87     72.08 ↑     73.92 ↑     40.7
                              Transformer Architecture Adaptation (NHA and TL)
  BERT 𝑏𝑎𝑠𝑒 + 𝑁 𝐻𝐴            92.33 ↑   85.87     88.98     85.92    73.48     66.38        69.75       40.3
  BERT 𝑏𝑎𝑠𝑒 + 𝑇 𝐿              90.60    90.41     90.50     86.12    73.48    72.20 ↑       72.83       39.5
  DistilBERT 𝑏𝑎𝑠𝑒 + 𝑁 𝐻𝐴      92.08 ↑   86.03     88.95     86.15    73.93     66.29        69.90      41.8 ↑
  DistilBERT 𝑏𝑎𝑠𝑒 + 𝑇 𝐿         90.0    89.66     89.83     85.89    73.63    72.32 ↑      72.97 ↑      40.2
  RoBERTa 𝑏𝑎𝑠𝑒 + 𝑁 𝐻𝐴          89.43    83.20     86.20     83.40   75.04 ↑    66.58        70.56      42.0 ↑
  RoBERTa 𝑏𝑎𝑠𝑒 + 𝑇 𝐿           91.17   89.05 ↑ 90.09 ↑ 86.02 ↑ 76.48 ↑ 71.58 ↑             73.94 ↑     42.5 ↑
  DeBERTa 𝑏𝑎𝑠𝑒 + 𝑁 𝐻𝐴         92.22 ↑   86.41     89.21     86.35    75.32     65.67        70.16      41.9 ↑
  DeBERTa 𝑏𝑎𝑠𝑒 + 𝑇 𝐿           90.97   90.43 ↑    90.70     86.12    75.97    71.70 ↑      73.78 ↑     41.8 ↑


5.3.1. Baselines
We will first do a comparison with non-neural approaches, with interpretable machine learning classification criteria, using TF-IDF extracted features (with no maximum number of features for each dataset) as inputs (a sketch of this setup is given after the list below):
   - Decision trees using the "Gini" criterion, without a maximum tree depth, and 2 as the
minimum samples for splitting nodes;
   - Random Forest with 100 estimators (number of trees) and the same parameters used in
the previous method for each tree;
   - Bagging using decision trees as the base estimator (10 estimators);
   - GradientBoosting with logistic regression as the error function, a learning rate of 0.1 and 100 estimators;
   - Support Vector Machine using RBF as the kernel and a regularization parameter of 1.0.
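   For illustration, the SVM baseline can be reproduced with scikit-learn roughly as follows; the data loading is assumed, and the other baselines are obtained by swapping in the corresponding scikit-learn estimator:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# train_texts, test_texts, y_train, y_test are assumed to be loaded
# beforehand; y_* are (m, n_y) binary label indicator matrices.
vectorizer = TfidfVectorizer()                 # no cap on the number of features
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# SVM baseline: RBF kernel, regularization parameter C = 1.0, with one
# binary classifier per label (binary relevance via one-vs-rest).
svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print(f1_score(y_test, svm.predict(X_test), average="micro"))
```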

   We also include deep learning approaches as a comparison for the evaluation of our ap-
proaches:
   - CNN [30] and CNN-RNN [2] which use convolutional neural networks to extract text-
specific features;
   - SGM [5] which applies a sequence generation model with a new decoder structure for
multi-label classification;
   - MAGNET [4] a graph network implementing the attention mechanism to capture depen-
dencies between labels;
   - DocBERT [21]: a fine-tuning of the base and large versions of BERT for document
classification.

   To evaluate the performance gain of our approaches, we compare them to an unchanged
version of each transformer model, where no architecture modification is applied, and the
classification threshold is the trivial value of 0.5.

5.4. Oracle Approaches
The target theoretical optimums (oracle approaches) are represented by the following two
approaches:
  - The oracle approach for the N highest activations, where we consider the number of labels present for an instance as given and then take the 𝑁 highest activations as the present labels;
  - The oracle approach for both SCut thresholding methods, where the calculation of the global threshold and the individual thresholds is done on the test dataset. These are considered as the optimal results towards which these methods should come as close as possible.
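  The NHA oracle, for instance, reduces to a few lines (a sketch assuming NumPy arrays):

```python
import numpy as np

def nha_oracle(probs, y_true):
    """Oracle NHA: take the true number of labels of each instance as
    given and keep that many of the highest activations."""
    preds = np.zeros_like(probs, dtype=bool)
    for i, row in enumerate(probs):
        n = int(y_true[i].sum())                  # true label count (the oracle)
        preds[i, np.argsort(row)[::-1][:n]] = True
    return preds
```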

5.5. Results
In this section, we present the performances of all the approaches mentioned in the previous
sections, tested on the different transformers presented in section 3. We propose the following
notations for these approaches:
   - "GCT": The Global Optimal Threshold method (see section 4.1);
   - "IT": The Individual Threshold method (see section 4.2);
   - "NHA": The N largest activations approach (see section 4.3);
   - "TL": Denotes the "thresholding layer" approach (see section 4.4). As for the oracle
approaches, they will be designated by adding 𝑜𝑟𝑎𝑐𝑙𝑒 after the corresponding approaches.
Table 3
Scores of the oracle approaches for the test set of the Reuters and AAPD datasets (the best scores are in
bold blue).
                                               Reuters                               AAPD
  Models
                                    Pr.       R      F1        Acc      Pr.      R          F1    Acc
  BERT𝑏𝑎𝑠𝑒 +𝐺𝐶𝑇𝑜𝑟𝑎𝑐𝑙𝑒             91.46    90.70    91.08     87.01    80.41   68.85    74.18    43.1
  BERT𝑏𝑎𝑠𝑒 +𝐼𝑇𝑜𝑟𝑎𝑐𝑙𝑒              93.61    91.24    92.41     87.94    83.0    70.59    76.29    44.9
  DistilBERT𝑏𝑎𝑠𝑒 +𝐺𝐶𝑇𝑜𝑟𝑎𝑐𝑙𝑒       91.61    90.17    90.88     86.52    74.50   73.27    73.88    40.0
  DistilBERT𝑏𝑎𝑠𝑒 +𝐼𝑇𝑜𝑟𝑎𝑐𝑙𝑒        92.48    92.04    92.26     87.45    82.59   70.34    75.97    44.0
  RoBERTa𝑏𝑎𝑠𝑒 +𝐺𝐶𝑇𝑜𝑟𝑎𝑐𝑙𝑒          91.21    89.56    90.38     86.32    74.12   71.83    72.96    39.9
  RoBERTa𝑏𝑎𝑠𝑒 +𝐼𝑇𝑜𝑟𝑎𝑐𝑙𝑒           93.18    90.57    91.86     87.78    81.95   70.14    75.58    44.2
  DeBERTa𝑏𝑎𝑠𝑒 +𝐺𝐶𝑇𝑜𝑟𝑎𝑐𝑙𝑒          92.74    90.73    91.72     87.41    75.21   71.95    73.55    40.4
  DeBERTa𝑏𝑎𝑠𝑒 +𝐼𝑇𝑜𝑟𝑎𝑐𝑙𝑒           93.74    91.56    92.64     88.27    82.82   70.26    76.02    44.0
  BERT𝑏𝑎𝑠𝑒 +𝑁 𝐻𝐴𝑜𝑟𝑎𝑐𝑙𝑒            92.55    92.55    92.55     92.45    74.06   74.06    74.06    51.8
  DistilBERT𝑏𝑎𝑠𝑒 +𝑁 𝐻𝐴𝑜𝑟𝑎𝑐𝑙𝑒      91.99    91.99    91.99     91.95    75.09   75.09    75.09    54.7
  RoBERTa𝑏𝑎𝑠𝑒 +𝑁 𝐻𝐴𝑜𝑟𝑎𝑐𝑙𝑒         91.69    91.69    91.69     91.45    73.36   73.36    73.36    51.5
  DeBERTa𝑏𝑎𝑠𝑒 +𝑁 𝐻𝐴𝑜𝑟𝑎𝑐𝑙𝑒         92.31    92.31    92.31     92.41    73.48   73.48    73.48    52.7


   The maximum length of the sequences used is 512 tokens for all datasets. Tables 2 and 4 show the micro-F1 (with precision and recall) and accuracy scores for the English and French corpus test datasets respectively. Tables 3 and 5 present the optimum scores for the different approaches. The results of the thresholding methods as well as the proposed architectures were obtained from the base versions of the transformers used.
   Transformers outperform all other methods, whether classical methods or other deep learning methods. Our baseline versions of BERT and its variants achieve a better performance than the base version of DocBERT (the current state of the art in multi-label text classification for the AAPD and Reuters corpora), and, for Reuters, better than its large version. This is due to the longer training in the fine-tuning process: 150 epochs for the Reuters dataset vs 30 in the case of DocBERT, and 40 epochs for the AAPD dataset vs 20. For transformers, a longer training process generally yields better performance with a low risk of over-fitting.
   We also note that SVMs and random forests obtain the highest micro-precision scores, at the expense of the recall rate, thus lowering the micro-F1 score. But precision alone is not a reliable factor for performance measurement in multi-label classification. SVM can be considered the best performing non-neural approach due to its high accuracy score, but it falls short of the neural methods tested.
   The GCT thresholding technique seems to be the best among the evaluated approaches: it manages to get a better micro-F1 score than its thresholding counterpart, the IT approach, with a score of 91.02 vs 90.70 for the Reuters dataset, and 74.08 vs 74.04 for AAPD (see Table 2). For the Reuters dataset, the thresholding techniques and the proposed alternatives do not manage to achieve a gain in micro-F1 scores compared to the baseline versions, except for the RoBERTa model. This does not apply to the AAPD dataset, where gains in performance can be perceived using thresholding techniques or the TL approach, especially for the RoBERTa model, where the gain is the highest. The large version of DocBERT remains the best performing
Table 4
Scores for the test dataset from MFHAD.
                                                                MFHAD
                   Models
                                              Pr.           R        F1      Acc
                                              Baselines
                   Decision Tree             45.29      42.03       43.60    36.84
                   Bagging                   74.99      38.86       51.19    35.49
                   Random Forest             92.80      32.85       48.52    34.43
                   GradientBoost             58.10      40.53       47.75    30.42
                   SVM                       84.79      46.08       59.71    43.33
                   CamemBERT                 71.27      58.85       64.47    49.87
                   FlauBERT                  70.10      62.21       65.92    52.44
                    Thresholding without any specific architecture (GCT and IT)
                   CamemBERT+𝐺𝐶𝑇           70.14     59.31 ↑     64.27    49.74
                   CamemBERT+𝐼𝑇            70.41     59.71 ↑ 64.62 ↑ 50.00 ↑
                   FlauBERT+𝐺𝐶𝑇           71.30 ↑     61.40     66.14 ↑ 52.74 ↑
                   FlauBERT+𝐼𝑇             65.04     64.52 ↑     64.78    50.71
                         Transformer Architecture Adaptation (NHA and TL)
                   CamemBERT+𝑁 𝐻𝐴         60.33     53.60     56.76     48.43
                   CamemBERT+𝑇 𝐿         71.45 ↑    58.18     64.14   50.97 ↑
                   FlauBERT+𝑁 𝐻𝐴          61.64     54.81     58.02     49.16
                   FlauBERT+𝑇 𝐿          70.37 ↑    60.95     65.32     51.05


Table 5
Scores of the oracle approaches for MFHAD.
                                                                 MFHAD
                  Models
                                                     Pr.        R    F1       Acc
                  CamemBERT+𝐺𝐶𝑇𝑜𝑟𝑎𝑐𝑙𝑒               71.37   59.68    65.01   51.09
                  CamemBERT+𝐼𝑇𝑜𝑟𝑎𝑐𝑙𝑒                81.05   56.70    66.73   52.24
                  FlauBERT+𝐺𝐶𝑇𝑜𝑟𝑎𝑐𝑙𝑒                75.44   59.12    66.29   52.87
                  FlauBERT+𝐼𝑇𝑜𝑟𝑎𝑐𝑙𝑒                 80.54   59.45    68.40   54.55
                  CamemBERT+𝑁 𝐻𝐴𝑜𝑟𝑎𝑐𝑙𝑒              66.85   66.85    66.85   62.83
                  FlauBERT+𝑁 𝐻𝐴𝑜𝑟𝑎𝑐𝑙𝑒               67.49   67.49    67.49   63.37

model for AAPD. The complex nature of the scientific vocabulary of this corpus underlines the possible benefits of increasing the size of the transformer models (adding several layers and increasing the embedding dimension).
   The difference in performance gains between the datasets can be due to their complexity level. The proportion of shared tokens/words between the dictionary of the transformers and the Reuters corpus is much higher than for AAPD, due to the latter's scientific nature, which makes the training process much more difficult on it, therefore rendering the thresholding and alternative approaches more useful for gaining performance on the multi-label classification task. Those techniques are not as useful if the learning process yields a good overall semantic understanding of the dataset; in that case, a threshold of 0.5 is generally more than enough.
   The RoBERTa model is the lowest performing variant of BERT, but the performance gain on RoBERTa is the most noticeable for all the evaluated datasets. This can be explained by the fact that for this variant of BERT, the Next Sentence Prediction objective is removed in the pre-training process. The goal of NSP is to learn long dependencies that can exist across sentences, whereas Masked Language Modeling is more focused on understanding relationships at the word level. For multi-label text classification, distinguishing the difference between domains across multiple sentences is vital, and NSP is certainly a good contributing factor for this purpose.
   The same conclusions can be drawn from the performance data on the MFHAD dataset, where the baseline version of FlauBERT outperforms its CamemBERT counterpart (65.92 vs 64.47 in F1 score). This comes as no surprise knowing that the latter is a variant based on RoBERTa that also lacks the NSP objective, and that FlauBERT is pre-trained, in addition to text crawled from the internet (Common Crawl), on Wikipedia articles and books, thus getting more vocabulary coverage of MFHAD than CamemBERT, which is pre-trained on a corpus derived from Common Crawl only. The evaluated approaches fail to obtain a gain in performance for FlauBERT, which already achieved good performance with its baseline version, but allow CamemBERT to get closer to FlauBERT's performance.
   The TL architecture performs better than the base versions of the DocBERT and MAGNET models, and matches the large version of the former with a score of 90.70 for the DeBERTa model on the Reuters dataset. The NHA approach does not succeed in reaching its theoretical optimum, even though it manages to obtain a score exceeding the base version of DocBERT for the same dataset (see the Transformer Architecture Adaptation part of Table 2). These two architectures require less training time than the two thresholding methods (1.5× faster), but have a higher inference cost, especially for TL, where it is almost twice that of the other approaches, due to the extra layer that this architecture requires.
   As shown in Tables 2 and 4, an increase in recall is observed for almost every model and every dataset for the two thresholding methods. This is due to the fact that modifying the classification thresholds (especially by lowering them) can lead to more labels being predicted as present, but can also lead to a decrease in precision. The same can be said for the TL alternative, where adding the additional layer contributed in some cases to an increase in the final activation values; thus, using a threshold of 0.5 may include some labels that could not otherwise be predicted. The opposite effect is observed for the NHA method, where the model predicts the most frequent number of labels present in the unbalanced dataset, which is always the smallest. This leads to an increase in precision (fewer labels are predicted, thus fewer false positives), but at the cost of recall.
   The IT𝑜𝑟𝑎𝑐𝑙𝑒 approach is the best performing among the evaluated oracle approaches. Computing an individual threshold for each label from the test set leads to a globally optimal score over all labels, surpassing the optimum of the GCT method. But this is not the case for its experimental equivalent: GCT remains the method that comes closest to its theoretical optimum (cf. Tables 3 and 2). This can be explained by the fact that for the IT method several thresholds must be computed, one per label, which does not guarantee a global optimum over all the labels of the test set. This effect is amplified if the number of labels is large.
   For the theoretical optimum of the NHA method, providing the actual number of relevant labels allows new labels to be considered as present, which increases true positives and decreases false negatives. But this can also increase false positives, as these new labels are in some cases non-valid predictions. The instances for which this approach reduces the number of labels that would otherwise have been predicted by the model (using 0.5 as a threshold) are rare. Thus the micro-F1 score may in some cases be lower than the theoretical optimum of the thresholding approaches. On the other hand, a considerable increase in accuracy is observed: it is in fact the oracle approach that achieves the highest accuracy scores, as shown in Tables 3 and 5. The actual implementation of this approach did not achieve the same performance as its theoretical optimum. The hidden state of the sentence contained in the [CLS] token may not be sufficient to estimate the number of existing labels, a task that is made harder by the unbalanced nature of the datasets (regarding the number of labels per instance). Exploiting the attention scores obtained in the different layers of the model could be a more efficient way to accomplish this task.
    DeBERTa is the best performing variant of BERT among all the evaluated models: the changes made by this variant, namely adding the relative position of the word in the sentence as an input, seem to improve the performance of BERT. This performance improvement comes at the expense of the training speed of the model (18% slower than the other variants on average, but with similar inference speeds). As for DistilBERT, it achieves results at the same level as its original version, despite its reduced size. Knowledge distillation seems to be an efficient method to overcome the major drawback of transformers and neural networks: the necessity of a long training time.


6. Conclusion and Future Work
As far as the application domain is concerned, multi-label classification is a relevant task.
However, multi-label text classification is not a common task and, regrettably, it is not included
in the most prominent benchmarks such as GLUE.
   Transformer-based language models outperform other deep neural architectures and provide
a strong adaptable foundation for a multitude of NLP tasks and text classification, which is the
direction we followed in this study.
   First, we have tested and shown in this paper that thresholding approaches can achieve a performance gain when the learning process does not yield good results for datasets of a complex nature. Computing a global threshold achieves higher results than computing an individual threshold for each class: the optimization done on each class does not guarantee a general optimum over all labels. We then proposed modifications to the transformer architecture, done by adding an additional layer, with a number of activations equal to the number of classes, at the end of the model, to avoid optimizing thresholds. This approach is, on average, as effective as the thresholding approaches. The analysis of the optimum to be reached showed that using the number of classes for the selection of labels significantly increases performance. However, the difference between the optimum and the actual results of our experiments shows that our proposal needs to be improved in the case of unbalanced datasets. The language of the text corpora, English for AAPD and Reuters, or French for the MFHAD corpus we have built, does not seem to be a factor that impacts the performance of the proposed approaches. These approaches can be used for any multi-label classification problem. Each BERT variant seeks to improve on the constraining aspects of the original model. DistilBERT, despite its reduced size, obtains results as high as the original version. DeBERTa is the best performing variant, together with FlauBERT, which outperforms CamemBERT.
   We have shown the potential of approaches for adapting the transformer model for multi-label text classification. All these approaches are not limited to text classification but can be used for any other multi-label classification task. Our future work will focus on exploring other methods of adapting neural networks for multi-label text classification. The thresholds could become learned parameters of the model during the training phase. Finding a more effective method for computing the number of classes present for the NHA method, and finding a better way to represent the dependencies between the labels for the TL approach, are other areas we will explore in the future.



References
 [1] M.-L. Zhang, Y.-K. Li, X.-Y. Liu, X. Geng, Binary relevance for multi-label learning: an
     overview, Frontiers of Computer Science 12 (2018).
 [2] G. Chen, D. Ye, Z. Xing, J. Chen, E. Cambria, Ensemble application of convolutional and
     recurrent neural networks for multi-label text categorization, in: 2017 International Joint
     Conference on Neural Networks (IJCNN), IEEE, Anchorage, AK, USA, 2017, pp. 2377–2383.
 [3] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical Attention Networks for
     Document Classification, in: Proceedings of the 2016 Conference of the North American
     Chapter of the Association for Computational Linguistics: Human Language Technologies,
     Association for Computational Linguistics, San Diego, California, 2016, pp. 1480–1489.
 [4] A. Pal, M. Selvakumar, M. Sankarasubbu, Multi-Label Text Classification using Attention-
     based Graph Neural Network, in: ICAART, 2020.
 [5] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, H. Wang, SGM: Sequence Generation Model for Multi-
     label Classification, in: Proceedings of the 27th International Conference on Computational
     Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018,
     pp. 3915–3926.
 [6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polo-
     sukhin, Attention is All you Need, in: Advances in Neural Information Processing Systems,
     volume 30, Curran Associates, Inc., 2017.
 [7] D. D. Lewis, Y. Yang, T. G. Rose, F. Li, RCV1: A New Benchmark Collection for Text
     Categorization Research, Journal of Machine Learning Research 5 (2004) 361–397.
 [8] A. Bailly, C. Blanc, T. Guillotin, Classification multi-label de cas cliniques avec Camem-
     BERT (Multi-label classification of clinical cases with CamemBERT ), in: Actes de la 28e
     Conférence sur le Traitement Automatique des Langues Naturelles. Atelier DÉfi Fouille de
     Textes (DEFT), ATALA, Lille, France, 2021.
 [9] A. Imane, B. A. Mohamed, Multi-label Categorization of French Death Certificates using
     NLP and Machine Learning, in: Proceedings of the 2nd international Conference on Big
     Data, Cloud and Applications, BDCA’17, Association for Computing Machinery, New York,
     NY, USA, 2017, pp. 1–4.
[10] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining Multi-label Data, in: O. Maimon, L. Rokach
     (Eds.), Data Mining and Knowledge Discovery Handbook, Springer US, Boston, MA, 2010,
     pp. 667–685.
[11] O. Luaces, J. Díez, J. Barranquero, J. J. del Coz, A. Bahamonde, Binary relevance efficacy
     for multilabel classification, Progress in Artificial Intelligence 1 (2012) 303–313.
[12] G. Tsoumakas, I. Vlahavas, Random k-Labelsets: An Ensemble Method for Multilabel
     Classification, volume 4701, 2007, pp. 406–417.
[13] M.-L. Zhang, Z.-H. Zhou, ML-KNN: A lazy learning approach to multi-label learning,
     Pattern Recognit. (2007).
[14] M.-L. Zhang, Z.-H. Zhou, Multilabel Neural Networks with Applications to Functional
     Genomics and Text Categorization, IEEE Transactions on Knowledge and Data
     Engineering 18 (2006) 1338–1351.
[15] Y. Yang, A study of thresholding strategies for text categorization, in: Proceedings of
     the 24th annual international ACM SIGIR conference on Research and development in
     information retrieval, SIGIR ’01, Association for Computing Machinery, New York, NY,
     USA, 2001, pp. 137–145.
[16] Y. Yang, An evaluation of statistical approaches to text categorization, Technical Report,
     1997.
[17] D. D. Lewis, M. Ringuette, A Comparison of Two Learning Algorithms for Text
     Categorization, Third Annual Symposium on Document Analysis and Information
     Retrieval (1994).
[18] C. Largeron, C. Moulin, M. Géry, MCut: A Thresholding Strategy for Multi-label Classifi-
     cation, volume 7619, 2012.
[19] R. Al-Otaibi, P. A. Flach, M. Kull, Multi-label Classification: A Comparative Study on
     Threshold Selection Methods (2014).
[20] L. Gan, B. Yuen, T. Lu, Multi-label Classification with Optimal Thresholding for Multi-
     composition Spectroscopic Analysis, Mach. Learn. Knowl. Extr. (2019).
[21] A. Adhikari, A. Ram, R. Tang, J. Lin, DocBERT: BERT for Document Classification,
     arXiv:1904.08398 [cs] (2019).
[22] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, HuggingFace’s Transformers: State-of-the-art
     Natural Language Processing, Technical Report arXiv:1910.03771, arXiv, 2020.
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
     anov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv:1907.11692
     [cs] (2019).
[24] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller,
     faster, cheaper and lighter, arXiv:1910.01108 [cs] (2020).
[25] S. Kullback, R. A. Leibler, On Information and Sufficiency, The Annals of Mathematical
     Statistics 22 (1951) 79–86.
[26] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with Disentangled
     Attention, arXiv:2006.03654 [cs] (2021). URL: http://arxiv.org/abs/2006.03654.
[27] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, E. V. d. l. Clergerie, D. Seddah,
     B. Sagot, CamemBERT: a Tasty French Language Model, 2020.
[28] P. J. O. Suárez, B. Sagot, L. Romary, Asynchronous Pipeline for Processing Huge Corpora
     on Medium to Low Resource Infrastructures, Leibniz-Institut für Deutsche Sprache, 2019.
[29] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbe,
     L. Besacier, D. Schwab, FlauBERT: Unsupervised Language Model Pre-training for French,
     in: LREC, 2020.
[30] Y. Kim, Convolutional Neural Networks for Sentence Classification, in: Proceedings of
     the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
     Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1746–1751.