<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profiling Irony and Stereotype Spreaders with Language Models and Bayes' Theorem</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xinting Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Copenhagen</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>8</lpage>
      <abstract>
<p>The goal of profiling irony and stereotype spreaders is to classify authors as irony spreaders or not, based on a certain number of their tweets. In this paper, we present our novel system, which is different from typical approaches to classification tasks. Instead of extracting features and training machine learning classifiers, we exploit Bayes' theorem and the properties and functions of language models to derive a classification system. We explain in detail why our system is able to make the right predictions, and we also explore the characteristics of our new system through further experiments. Finally, experimental results show that our approach can effectively classify ironic and non-ironic users and achieves good performance on the shared task IROSTEREO.</p>
      </abstract>
      <kwd-group>
<kwd>author profiling</kwd>
        <kwd>irony detection</kwd>
        <kwd>language models</kwd>
        <kwd>Bayes' theorem</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Social media plays an increasingly important role in our daily lives. It has many advantages,
but it is also a hotbed of offensive and aggressive speech. It is therefore of great significance
to profile users on social media. In the task Profiling Irony and Stereotype Spreaders on Twitter
(IROSTEREO) 2022 [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], we focus on profiling ironic authors on Twitter, especially those who
use irony to spread stereotypes. The goal is to classify users on Twitter as ironic or not, given a
certain number of their tweets. The task has to deal with the subtlety and complexity of human
language: when using irony, language can be used to mean the opposite of its literal
meaning. Moreover, a stereotype can be well hidden, so that detecting it requires sufficient knowledge
of human society and a good understanding of the context. The task is therefore
quite challenging.
      </p>
      <p>
        In this paper, we present our solution to the IROSTEREO task. We propose a novel approach
to classifying users based on their tweets. Instead of extracting features and training a classifier
to explicitly perform classification on the data, we make use of the function of language models
and apply Bayes’ theorem to make predictions. In this way, we avoid selecting and extracting
features manually, which is otherwise quite necessary when classifying very long text inputs.
Experimental results show that our approach is effective on the IROSTEREO task: it can
classify users as ironic or not with high accuracy. Our approach achieves 92.78% accuracy on
the official test set on the TIRA platform [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        There is a lot of research in the field of author profiling. Many similar tasks have been presented
[
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ] in recent years, with different focuses such as author gender profiling, bot author
detection, fake news spreader detection, and hate speech spreader detection. While they
focus on different aspects of authors, the general goal is to classify authors on social media
based on the text content they write. In these tasks, both traditional machine learning
models and deep learning models are commonly exploited. Besides, external resources like word
embeddings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have been widely used to extract features from raw texts. The typical framework
for these tasks can be summarized as follows: 1) extract features from the text, which include word
n-grams, TF-IDF features, word embeddings, sentence embeddings, representations from large-scale
pretrained language models, and sometimes linguistic features like POS tags [
        <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8, 9, 10, 11</xref>
        ];
2) feed these features to classifiers such as SVMs, BiLSTMs, or pretrained language models
[
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ] in order to classify the authors. In these works, many kinds of
features and classifiers are usually considered, and many different combinations of features and classifiers
are tested to select the best system. While this framework usually requires arduous work, the
resulting system can usually achieve very good performance on the task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Framework</title>
        <p>First of all, let x be the tweets of a user, and let y be the label of the user. We intend to predict
the label of a user given his/her tweet data. In other words, we try to estimate the probability
distribution P(y | x). Instead of directly building a model for this task, we apply Bayes’
theorem to the target distribution.</p>
        <p>P(y | x) ∝ P(x | y) P(y) (1)
where the target distribution P(y | x) (the posterior) is proportional to the likelihood P(x | y) times
the prior P(y). In this way, we convert the desired probability into two components. Once we
know P(x | y) and P(y), we know P(y | x) and thus which label should be assigned to the
given tweets.</p>
        <p>The prior P(y) is the probability distribution of labels in the dataset, which can be easily
obtained by counting the number n_y of instances with a specific label y and the total number of
instances N; then P(y) = n_y / N.</p>
        <p>The likelihood P(x | y) expresses how likely the tweets are to appear given the label (e.g., given that the
tweets are produced by an irony and stereotype spreader). We can use language models to
estimate this probability, since language models are able to predict the probability of any given
sequence of words.</p>
        <p>During the training stage, we separate the training data according to the labels: more
specifically, one collection of data consists of all the ironic tweets (we will use "ironic" as a shorthand for
tweets produced by irony and stereotype spreaders), and the other collection consists of all the
non-ironic tweets. Namely, D_I = {x | y = "I"} and D_NI = {x | y = "NI"}, where "I" denotes
ironic and "NI" denotes non-ironic. We then train two separate language models on the two
collections of data respectively, which yields an ironic-domain language model LM_I and a
non-ironic-domain language model LM_NI. Since the language model LM_I is trained on all the
ironic tweets, it can predict the probability of a given sequence of words appearing in the ironic
domain. In other words, LM_I can estimate the probability P(x | y = "I"), that is, given that the
writer is an ironic user, how likely the sequence is to appear. Similarly, LM_NI can estimate the
probability P(x | y = "NI"). To be clear, in the training stage we do not explicitly predict
the labels; we only use the labels to divide all the tweets into two collections.</p>
        <p>When using our models to predict the label of a given set of tweets, we follow the framework
of Bayes’ theorem specified above. First, the tweets x are fed into LM_I and LM_NI, giving
rise to P(x | y = "I") and P(x | y = "NI") respectively. Then each likelihood is multiplied
by P(y), which can be computed as the frequency of each label in the training data, producing
P(x | y = "I") P(y = "I") and P(x | y = "NI") P(y = "NI"). By comparing these two, we choose
the label that produces the greater posterior as our prediction. Note that in the training data the
labels are evenly distributed: the number of ironic users is equal to the number of non-ironic
users, namely P(y = "I") = P(y = "NI"). Thus we can drop the term P(y) and
compare only the likelihoods P(x | y).</p>
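        <p>To make the decision rule concrete, the following is a minimal sketch in Python. The log_prob method on the two language-model objects is a hypothetical interface standing in for the probability estimates described above, and we work in log space to avoid numerical underflow when many token probabilities are multiplied.</p>
        <preformat>
import math

def predict_label(tweets, lm_ironic, lm_non_ironic, prior_ironic=0.5):
    # lm_ironic / lm_non_ironic are assumed to expose a log_prob(text) method
    # returning log P(tweets | label); this interface is hypothetical.
    scores = {
        "I": lm_ironic.log_prob(tweets) + math.log(prior_ironic),
        "NI": lm_non_ironic.log_prob(tweets) + math.log(1.0 - prior_ironic),
    }
    # With a balanced training set the priors cancel and only the likelihoods matter.
    return max(scores, key=scores.get)
        </preformat>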
        <p>For example, given the sentence "I love COVID!", which is clearly ironic, the probability of it
occurring in the ironic domain is larger than the probability of it occurring in the non-ironic domain: it
is unlikely that a normal user would say this, while it is likely that an irony spreader would write
sentences like it. Thus for x = "I love COVID!", P(x | y = "I") &gt; P(x | y = "NI") if the two
language models work effectively, and we can then predict that the corresponding label is "I", ironic.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Language Models</title>
        <p>
          As for the language model, one of our choices is transformer-based neural language
models. Specifically, we mainly use causal language models, GPT2 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and DistilGPT2
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Given a sequence of tokens, the models are able to predict the next token based on
the previous tokens. In other words, the models can estimate the probability of each token,
P(t_i | t_1, t_2, ..., t_{i-1}), where (t_1, t_2, ..., t_{i-1}, t_i, t_{i+1}, ..., t_n) is a sequence of tokens. The
probability of a whole sequence can be obtained as
        </p>
        <p>P(t_1, t_2, ..., t_n) = ∏_{i=1}^{n} P(t_i | t_1, t_2, ..., t_{i-1}) (2)</p>
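        <p>As an illustration, the per-token factorisation of equation 2 can be computed with a HuggingFace causal language model roughly as follows. This is only a sketch: for simplicity the unconditional probability of the very first token is ignored.</p>
        <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

@torch.no_grad()
def sequence_log_prob(text):
    # Sum of log P(t_i | t_1, ..., t_{i-1}) over the tokens of `text`, as in equation 2.
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1, :]   # the logits at position i predict token i+1
    targets = ids[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()
        </preformat>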
        <p>In the task IROSTEREO, we need to predict the label for a given set of tweets generated by
one user (200 tweets for each user). To estimate the probability of all the tweets, P(tweets),
we do it in one of two alternative ways: 1) P(tweets) = ∏_{i=1}^{m} P(tweet_i), where P(tweet_i) is the
probability of a single tweet, computed using equation 2, and m is the number of tweets; in this way, the tweets
are treated as independent. 2) Concatenate all the tweets into a single sequence, then divide it
into pieces whose length is equal to the maximum input length of the language model. Since
the maximum input length is quite long (e.g., 1024 tokens for DistilGPT2), one piece typically
contains multiple tweets, and the probability is computed as P(tweets) = ∏_{j=1}^{l} P(piece_j), where l is
the number of pieces. In this way, the language model can attend to a much longer context on average when estimating
the probability of each token. In the following, we refer to our approach as BayLMs,
which always involves two language models; we call the variant that trains and predicts with 1)
the per-tweet variant, and the one using 2) the concatenation variant.</p>
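        <p>The two aggregation variants can then be sketched on top of the sequence_log_prob function and tokenizer from the previous snippet. Log-probabilities are summed instead of multiplying probabilities, and re-encoding the decoded pieces is a simplification made for brevity.</p>
        <preformat>
def user_log_prob_per_tweet(tweets):
    # Variant 1): treat the tweets as independent and sum their log-probabilities.
    return sum(sequence_log_prob(t) for t in tweets)

def user_log_prob_concatenated(tweets, max_len=1024):
    # Variant 2): concatenate all tweets, split into pieces of at most max_len
    # tokens, and sum the log-probability of each piece.
    ids = tokenizer(" ".join(tweets)).input_ids
    pieces = [ids[i:i + max_len] for i in range(0, len(ids), max_len)]
    return sum(sequence_log_prob(tokenizer.decode(p)) for p in pieces)
        </preformat>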
        <p>Another important kind of language model we use is the n-gram language model. The main
idea of the n-gram language model is that instead of computing the probability of a word
(we now say "word" instead of "token" because words are the basic units in this case) based on
its whole history (i.e., P(w_i | w_1, w_2, ..., w_{i-1})), it approximates the history with the few words
immediately before it (e.g., P(w_i | w_{i-2}, w_{i-1}) in the case of a trigram). In other words, an n-gram language
model assumes the probability of a word depends only on its previous n-1 words. Therefore, an
n-gram language model estimates the probability of a sequence as</p>
        <p>P(w_1, w_2, ..., w_n) = ∏_{i=1}^{n} P(w_i | w_{i-N+1}, ..., w_{i-1}) (3)
where N denotes the order of the n-gram.</p>
        <p>To compute the probability of all the tweets written by a single user, we simply concatenate
all the tweets into a single sequence, then apply the n-gram language model as in equation 3.</p>
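        <p>For the n-gram models, a small add-k smoothed implementation along the lines of equation 3 might look as follows. This is only a sketch; the vocab argument can be set to the union of both corpora's vocabularies, as described later in section 4.2.</p>
        <preformat>
import math
from collections import Counter

class AddKNgramLM:
    """Add-k smoothed n-gram language model, following equation 3."""

    def __init__(self, n, k=1.0, vocab=()):
        self.n, self.k = n, k
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set(vocab)   # e.g., the union of both domains' vocabularies

    def _context(self, tokens, i):
        # Up to n-1 preceding words (shorter at the start of the sequence).
        return tuple(tokens[max(0, i - self.n + 1):i])

    def train(self, tokens):
        self.vocab.update(tokens)
        for i, w in enumerate(tokens):
            self.ngram_counts[self._context(tokens, i) + (w,)] += 1
            self.context_counts[self._context(tokens, i)] += 1

    def log_prob(self, tokens):
        v = len(self.vocab)
        total = 0.0
        for i, w in enumerate(tokens):
            c = self._context(tokens, i)
            total += math.log((self.ngram_counts[c + (w,)] + self.k)
                              / (self.context_counts[c] + self.k * v))
        return total
        </preformat>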
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Further Analysis</title>
        <p>In sum, we do not train the models to do classification explicitly. We train two separate language
models on two different corpora; with their ability to estimate the probability of any given
sequence, the predictions can be derived following Bayes’ theorem.</p>
        <p>With this approach, we avoid manually selecting and extracting features. Otherwise, due to
the long input (200 tweets) of each instance, we would have to carefully extract features
to reduce the input length, which is tricky and risks losing information. However, this
framework is only suitable when there are just a few classes ("I" and "NI" in our case), since for
each class we have to train a language model.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <sec id="sec-4-0">
        <title>4.1. Data</title>
        <p>Since we only have ground-truth labels for the training data, we evaluate our approach
on the training data. Because of the small number of instances in the dataset, we need to evaluate
multiple times to obtain a reliable result and reduce randomness. Therefore, we use 5-fold cross
validation on the training data. More specifically, there are 420 instances in the training data, with
equal numbers of ironic and non-ironic instances, so each fold includes 84 instances. Note
that we also keep the numbers of ironic and non-ironic instances equal within each fold, so in every round of
training and validation the numbers of ironic and non-ironic instances are always equal.</p>
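        <p>These balanced folds can be produced, for example, with scikit-learn's StratifiedKFold. This is only a sketch, since the paper does not name the tooling; users and labels are placeholders for the 420 training instances and their "I"/"NI" labels.</p>
        <preformat>
from sklearn.model_selection import StratifiedKFold

# A fixed random_state keeps the 5 folds identical across all schemes.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(users, labels):
    # Each validation fold holds 84 users, half ironic and half non-ironic.
    train_users = [users[i] for i in train_idx]
    valid_users = [users[i] for i in valid_idx]
        </preformat>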
        <p>To better compare the performance of the different schemes, we keep the 5 folds
fixed. That is, we first randomly divide the training data into 5 folds, then we always use these same 5
folds to test the different schemes and never re-divide them.</p>
        <p>Table 1 reports the mean cross-validation accuracy for each of the following schemes: BayLMs with GPT2 (per-tweet and concatenation variants), BayLMs with DistilGPT2 (per-tweet and concatenation variants), and BayLMs with unigram, bigram, trigram, and 4-gram language models.</p>
      </sec>
      <sec id="sec-4-1">
        <title>4.2. Implementation</title>
        <p>
          Our neural language models are implemented using HuggingFace Transformers [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. They
are pretrained on large-scale text corpora beforehand and perform causal language modeling during
our training process (the same task as in the pretraining stage). They are optimized by
stochastic gradient descent with AdamW [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] as the optimizer, with a weight decay of 0.01. The
language models are trained for just one epoch with a constant learning rate. The batch size
and learning rate are adjusted to suit each scheme: 8 and 1e-4 for one BayLMs variant, and 4 and 2e-5 for the other.
        </p>
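        <p>For reference, a fine-tuning setup along these lines could be written with the Transformers Trainer as below. This is a sketch under the stated hyper-parameters, not the exact training script; ironic_dataset is a placeholder for a tokenized dataset built from the ironic tweet collection, and the same procedure is repeated for the non-ironic one.</p>
        <preformat>
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

args = TrainingArguments(
    output_dir="lm_ironic",
    num_train_epochs=1,               # one epoch, as in the paper
    learning_rate=1e-4,
    lr_scheduler_type="constant",     # constant learning rate
    weight_decay=0.01,                # weight decay for the default AdamW optimizer
    per_device_train_batch_size=8,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ironic_dataset,     # placeholder: tokenized ironic tweets
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
        </preformat>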
        <p>As for the n-gram language models, we apply add-k smoothing (Laplace smoothing when k=1).
We also set the vocabulary of LM_I and LM_NI to the union of the vocabulary of the ironic tweets
and the vocabulary of the non-ironic tweets, in order to make the comparison between LM_I and
LM_NI fair when making predictions. We conduct experiments with unigram, bigram,
trigram, and 4-gram language models. For each language model, we select the best k value,
ranging from 0.01 to 10.0, for add-k smoothing according to the cross-validation accuracy.
The selected k for each model is: 0.01 (unigram), 3.0 (bigram), 0.1 (trigram), 0.01 (4-gram).</p>
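        <p>The selection of k can be sketched as a simple grid search. The specific grid below is an assumption, since the paper states only the 0.01 to 10.0 range, and cross_val_accuracy is a placeholder for running the 5-fold evaluation of section 4.1 with an order-n model and smoothing value k.</p>
        <preformat>
# Candidate smoothing values; the exact grid is an assumption.
K_GRID = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]

def select_k(n):
    # cross_val_accuracy(n, k) is a placeholder that trains LM_I and LM_NI
    # with order-n add-k smoothing and returns the mean 5-fold accuracy.
    scores = {k: cross_val_accuracy(n, k) for k in K_GRID}
    return max(scores, key=scores.get)
        </preformat>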
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Results</title>
        <p>The results are presented in table 1. In each cell, the number is the mean accuracy over the
cross-validation procedure. The number in brackets is 2 times the standard deviation of the
accuracy values, which covers the range that most of the accuracy values would theoretically fall in,
assuming the values are normally distributed. We can see that the accuracy varies widely in
cross validation. This is because of the small number of instances in the training data, which results
in only 84 instances in each validation round. In the table, GPT2 means using pretrained GPT2
as the language models, DistilGPT2 means using pretrained DistilGPT2 as the language models,
and n-gram LM means using n-gram language models.</p>
        <p>First of all, we can see that even though the models are not trained on a classification task,
they are able to distinguish ironic spreaders from normal users effectively, making correct
predictions for the large majority of users.</p>
        <p>The results also show that our framework is able to work with entirely different language
models, since both BayLMs with GPT2/DistilGPT2 and BayLMs with n-gram language models
achieve good performance.</p>
        <p>It is interesting that the n-gram language models achieve results comparable to the large-scale
pretrained language models, even though they are quite simple, contain no neural
parameters, and require no gradient-based optimization. They achieve good results simply by counting
frequencies in the training data, while the neural network-based language models have millions of
parameters and require costly optimization.</p>
        <p>As for the neural network language models, using DistilGPT2 is slightly better
than using GPT2 in general, even though DistilGPT2 has fewer parameters. One BayLMs variant is
slightly better than the other when both use GPT2, while the two variants have
similar performance when using DistilGPT2. In general, even though the neural network language
models perform slightly better than the n-gram language models, the performance under the different
settings is similar, which means the choice of setting does not matter significantly. Our final result
on the test set of the IROSTEREO task is 92.78% accuracy, obtained by training BayLMs with
DistilGPT2 on the whole training data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Other Experiments</title>
      <sec id="sec-5-0">
        <title>5.1. Data</title>
        <p>Some other results from our experiments are also noteworthy. In this section,
the experiments use a different way of dividing the training data. We randomly divide
the training instances into two sets: a training set consisting of 320 users and a valid
set consisting of 100 users. Again, the numbers of ironic and non-ironic instances are equal in both sets.
As before, we use the same training and valid sets throughout this section
and do not re-divide them. We no longer perform cross validation, so the experiments are
much less time-consuming. In this setting, the valid set contains only 100 instances and we
validate only once, so the accuracy is not as reliable as that in table 1, but it is still enough
to reflect a change in the real accuracy when a drastic change occurs.</p>
      </sec>
      <sec id="sec-5-1">
        <title>5.2. The Effect of Pretraining</title>
        <p>In our approach, we compare the probabilities of the given tweets produced by the two language
models. The prediction of our approach depends on the relative magnitude of these probabilities:
P(x | y = "I") and P(x | y = "NI") can both increase or decrease, but as long as their relative relationship
(e.g., P(x | y = "I") &gt; P(x | y = "NI")) does not change, the prediction will not change.</p>
        <p>In this experiment, we use pretrained and un-pretrained DistilGPT2 as our language models.
Un-pretrained DistilGPT2 means that the parameters of the language model are randomly
initialized and the language models are then trained from scratch using our training data,
whereas the pretrained DistilGPT2 language models start from parameters that have been sufficiently
optimized on a large-scale corpus. We use the BayLMs framework. As for hyper-parameters, we use a batch
size of 32, a constant learning rate of 1e-4, and a weight decay of 0.01, and train the language models
for one epoch. The results are shown in table 2: BayLMs with un-pretrained DistilGPT2 and BayLMs
with pretrained DistilGPT2 both reach 91% accuracy. Note that the accuracy is obtained on the valid
set. The loss values in the table are the training losses of the two language models on their respective training
corpora.</p>
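        <p>The two initializations can be obtained from the same architecture; a minimal sketch of the distinction, assuming the standard Transformers API, follows.</p>
        <preformat>
from transformers import AutoConfig, AutoModelForCausalLM

# Pretrained DistilGPT2: parameters already optimized on a large-scale corpus.
pretrained_lm = AutoModelForCausalLM.from_pretrained("distilgpt2")

# "Un-pretrained" DistilGPT2: same architecture, randomly initialized parameters,
# subsequently trained from scratch on our tweet collections only.
config = AutoConfig.from_pretrained("distilgpt2")
scratch_lm = AutoModelForCausalLM.from_config(config)
        </preformat>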
        <p>From table 2 we can see that using the un-pretrained language model achieves
accuracy comparable to using the pretrained language model. While pretraining does affect the
model's performance on the language modeling task itself, as the training loss is much larger when
the model is not pretrained, pretraining does not affect the performance of our BayLMs to
a significant extent. This unexpected result follows from the intuition we mentioned above.
Without pretraining, the language models are worse at estimating the probability of a given
sequence; in other words, they fail to assign a high probability to the observed sequence. Without
pretraining, both P(x | y = "I") and P(x | y = "NI") are likely to decrease, while their relative
relationship is not likely to change.</p>
          <p>Note that in other parts of this paper, we use pretrained language models in our BayLMs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.3. Masked Language Models</title>
        <p>
          It is also possible to use masked language models in BayLMs (e.g., BERT [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]). Even though
we cannot strictly estimate the probability of the whole set of tweets, we can do it in a way
analogous to causal language models. During training, the masked language models are trained to
predict the masked tokens in the sequence. When we use them in BayLMs to make predictions,
we randomly mask tokens in the input tweets, then the identical masked inputs are fed into the
two masked language models, giving rise to probabilities for the masked tokens, which are then
multiplied together to produce P(x | y = "I") and P(x | y = "NI").
        </p>
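        <p>A sketch of this masked-token scoring is given below, using one of the checkpoints mentioned in the next paragraph. The masking pattern is fixed by a seed so that both domain models see the identical masked input, as described above; this is an illustrative approximation, not the exact scoring code.</p>
        <preformat>
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilroberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
mlm.eval()

@torch.no_grad()
def masked_log_prob(text, mask_rate=0.15, seed=0):
    # Randomly mask tokens, then sum the log-probabilities the model assigns
    # to the original tokens at the masked positions. The same seed must be
    # used for both domain models so that their scores are comparable.
    ids = tok(text, return_tensors="pt").input_ids
    torch.manual_seed(seed)
    mask = torch.rand(ids.shape).lt(mask_rate)
    special = torch.tensor(
        tok.get_special_tokens_mask(ids[0].tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    ).unsqueeze(0)
    mask = mask.logical_and(~special)         # never mask special tokens
    masked_ids = ids.clone()
    masked_ids[mask] = tok.mask_token_id
    log_probs = torch.log_softmax(mlm(masked_ids).logits, dim=-1)
    return log_probs.gather(-1, ids.unsqueeze(-1)).squeeze(-1)[mask].sum().item()
        </preformat>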
        <p>
          Specifically, we use pretrained DistilRoBERTa-base [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and BERTweet-base [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The latter
is pretrained on a large-scale tweet corpus and has special tokens for emojis, user mentions,
and URL links, so it has a strong ability to understand tweets. We again use the BayLMs
framework. As for hyper-parameters, both DistilRoBERTa and BERTweet are trained with batch
size=32, constant learning rate=2e-5, and weight decay=0.01 for one epoch. The masking rate for
the random masking of inputs is 0.15 during both training and evaluation. The results are shown in
table 3.
        </p>
        <p>We can see that DistilRoBERTa and BERTweet give comparable accuracy, but the
performance of both models is noticeably worse than that of BayLMs with causal language models. This may be
because they fail to estimate the probability of all the tokens, and thus the probability of the
whole input tweets.</p>
        <sec id="sec-5-2-1">
          <title>BayLMs with DistilRoBERTa-base BayLMs with BERTweet-base</title>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.4. Effect of Training Epochs</title>
        <p>In most cases, training for only one epoch is considered insufficient. But since our approach is
different from the typical method, the situation can be a bit different here. We use BayLMs
with pretrained DistilGPT2 and train the language models with exactly the same hyper-parameters
(batch size=8, constant learning rate=1e-4, weight decay=0.01) but for different numbers of
epochs, ranging from 1 to 6. The results are shown in table 4.</p>
        <p>As we can see, once the number of epochs exceeds 1, the accuracy starts to drop. BayLMs
trained for 6 epochs performs significantly worse than BayLMs trained for 1 epoch. Typically,
machine learning models are trained for many more epochs to achieve the best results, their
performance starts to decline only after training for many epochs, and the decline is usually
mild. While the decline in other machine learning models can be attributed to over-fitting,
the decline in our case cannot be explained as over-fitting, since the models do
not perform the same task in evaluation as in training. In our approach, it is necessary to train
the models, since P(x | y = "I") and P(x | y = "NI") would always be equal if the language
models were not trained. But the accuracy will not necessarily increase with more training. As
mentioned in section 5.2, the prediction accuracy depends on the relative relationship between
P(x | y = "I") and P(x | y = "NI"), not on the absolute magnitude of the probabilities. It is also
noteworthy that our system has a strong bias towards the class "NI" when trained for too many
epochs. In the case of 6 epochs, 38 out of the 39 wrong predictions are "NI" (the corresponding
ground truths are "I"), while in the case of 1 epoch, 5 wrong predictions are "NI" and
3 are "I".</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This paper has presented a novel approach to classification tasks. We present a
simple scheme that only makes use of language models and trains them with the language modeling
task. The proposed system is simple to implement while achieving good performance on
the task.</p>
      <p>In theory, our approach is compatible with any kind of language model, as long as the language
model is able to give the probability of a given sequence. This point is well demonstrated by
our experimental results. Thus, we can test other language models in future work,
including different types of neural network language models and other traditional ones. In
addition, for n-gram language models there are more sophisticated smoothing methods
that are usually believed to achieve better results than simple add-k smoothing, so we can also
experiment with them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ortega-Bueno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <article-title>Profiling Irony and Stereotype Spreaders on Twitter (IROSTEREO) at PAN 2022, in: CLEF 2022 Labs and Workshops, Notebook Papers, CEUR-WS</article-title>
          .org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kredens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ortega-Bueno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pezik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolska</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2022:
          <article-title>Authorship Verification, Profiling Irony and Stereotype Spreaders, and Style Change Detection</article-title>
          , in: A. Barron-Cedeno, G. Da San Martino, et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF</source>
          <year>2022</year>
          ), volume
          <volume>13390</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.),
          <source>Information Retrieval Evaluation in a Changing World, The Information Retrieval Series</source>
          , Springer, Berlin Heidelberg New York,
          <year>2019</year>
          . doi:10.1007/978-3-030-22948-1_5.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the 6th author profiling task at pan 2018: multimodal gender identification in twitter</article-title>
          ,
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of the 7th author profiling task at pan 2019: bots and gender profiling in twitter</article-title>
          ,
          <source>in: Working Notes Papers of the CLEF 2019 Evaluation Labs Volume 2380 of CEUR Workshop</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. H. H.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of the 8th author profiling task at pan 2020: Profiling fake news spreaders on twitter</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>2696</volume>
          ,
          Sun SITE Central Europe,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>De la Peña Sarracén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Profiling hate speech spreaders on twitter task at pan 2021</article-title>
          ., in: CLEF (Working Notes),
          <year>2021</year>
          , pp.
          <fpage>1772</fpage>
          -
          <lpage>1789</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <year>2013</year>
          . URL: https://arxiv.org/abs/1301.3781. doi:10.48550/ARXIV.1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Glove:
          <article-title>Global vectors for word representation</article-title>
          ,
          <source>in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , S.-y. Kong,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            R. S. John,
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          , M. GuajardoCespedes, S. Yuan,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          , et al.,
          <source>Universal sentence encoder</source>
          , arXiv preprint arXiv:1803.11175 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , Deep contextualized word representations,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1802.05365. doi:10.48550/ARXIV.1802.05365.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Support-vector networks</article-title>
          ,
          <source>Machine learning 20</source>
          (
          <year>1995</year>
          )
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation 9</source>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , T. Wolf,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1910.01108. doi:10.48550/ARXIV.1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Huggingface's transformers: State-of-the-art natural language processing</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1910.03771. doi:10.48550/ARXIV.1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          , Decoupled weight decay regularization,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1711.05101. doi:10.48550/ARXIV.1711.05101.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D. Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vu</surname>
          </string-name>
          , A. T. Nguyen,
          <article-title>BERTweet: A pre-trained language model for English Tweets</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>