<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>In Unity, There Is Strength: On Weighted Voting Ensembles for Hurtful Humour Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javier Cruz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucas Elvira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Tabernero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel Segura-Bedmar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Carlos III de Madrid (UC3M University), Av. de la Universidad</institution>
          ,
          <addr-line>30, 28911 Leganés</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This paper describes our participation in the HUrtful HUmour (HUHU) task at IberLEF 2023, geared towards detecting prejudice-fostering humour on Twitter. A novel weighted voting system of ensembles composed of different popular transformer models is proposed. We empirically demonstrate that ensembles exceed individual transformers in humour and prejudice detection. Our system ranked 12th (beating 46 teams), 1st (beating 48 teams) and 22nd (beating 26 teams) in the binary classification, multilabel classification and regression tasks, respectively. We conclude that combining state-of-the-art transformer models depicts a promising research direction to yield robust systems for detecting humour spreading prejudice in social media. The code is publicly available online: https://github.com/mtabernerop/JUJUNLP.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Humour Detection</kwd>
        <kwd>Ensemble Learning</kwd>
        <kwd>Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Hurtful humour refers to a form of humour targeted at a particular individual or group with the
objective of causing emotional pain or offense. It often involves making derogatory, insulting,
or offensive comments about the physical appearance, beliefs, culture, race, gender, sexual
orientation, or other personal attributes of an individual or group. Hurtful humour can contribute
to the perpetuation of harmful stereotypes and discrimination, thus leading to feelings of
humiliation, shame and marginalization in the target [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
The detection of offensive comments disguised behind the mask of humour and protected by the
subjectivity of the latter poses a challenging yet worthwhile task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This becomes particularly
interesting in social media platforms such as Twitter, where content can be shared widely and
quickly, potentially reaching millions of users across the globe. In this context, hurtful humour
is often used to reinforce negative stereotypes and discriminatory attitudes towards minorities
such as women, the LGBTIQ community or immigrants, among others [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Hence, identifying
this content in tweets becomes a crucial step towards ensuring a more inclusive and respectful
online environment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The HUHU@IberLEF 2023 shared task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] motivates research aimed at identifying prejudice
and stereotyping towards marginalized groups (specifically, women and feminists, the
LGBTIQ community, immigrants and individuals who have experienced racial discrimination, or
those who are overweight) through the use of humour in Twitter posts, which can be used to
disseminate hurtful messages and avoid moral judgment.
      </p>
      <p>This paper describes the participation of our group, jujunlp, at the HUHU@IberLEF 2023
competition. Our work proposes the use of ensembles of state-of-the-art transformer models to
detect, by joint weighted voting, humorous content, prejudiced groups and degree of prejudice in
Spanish-language tweets, which to the best of our knowledge represents a novel approach to identifying
hurtful humour in social media content. The empirical results show that combining
transformer predictions, weighted by their individual performance on the task, yields
competitive results in the aforementioned context.</p>
      <p>The rest of the paper is organized as follows. Section 2 reviews the most significant aspects
of the HUHU@IberLEF 2023 task. Then, a summary of related work is provided in Section 3.
Section 4 thoroughly covers the proposed approach. Sections 5 and 6 describe the empirical
evaluation and discuss the results. Finally, Section 7 presents the main conclusions and an
outline of open avenues for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Overview</title>
      <p>
        HUHU@IberLEF 2023 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a competition to boost research on the detection of humorous tweets
expressing prejudice in social networks towards minorities, including women and feminists, the
LGBTIQ community, immigrants and racially discriminated people, and overweight people. For
this purpose, the organizers have created a dataset containing a wide spectrum of texts written
in Spanish from Twitter. We now describe the subtasks that are defined in this competition and
the provided dataset.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Subtasks</title>
        <p>This IberLEF 2023 track allows participation in three different subtasks. The main specifications
of each one are listed below:
HUrtful HUmour Detection (Task 1) Binary classification task aimed at determining
whether a prejudicial tweet is meant to be humorous or not. The metric employed
will be the F1-score over the positive class.</p>
        <p>Prejudice Target Detection (Task 2A) Multilabel classification task where the objective is
to identify the aforementioned minority groups on each tweet. The participating systems
will be evaluated and ranked using the macro F1-score.</p>
        <p>Degree of Prejudice Prediction (Task 2B) Regression task where systems must predict (on
a continuous scale from 1 to 5) how prejudicial a tweet is on average among minority
groups (5 corresponds to the maximum level of prejudice). The predictions will be
assessed using the Root Mean Squared Error (RMSE).</p>
        <p>The core of this article describes the participation of our group in all subtasks.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dataset</title>
        <p>
          This section provides a more thorough insight into the dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] provided
by the organizers of HUHU@IberLEF 2023 shared task. The training and test sets contain 2,671
and 778 hand-annotated tweets in Spanish, respectively. Each instance has an ID, a text, a
binary value that identifies whether the tweet is humorous or not, a real value from 1 to 5
that determines how prejudicial the message is on average among minority groups, and four
binary values to identify the groups that are prejudiced in the text; the target minorities are
women and feminists, the LGBTIQ community, immigrants and racially discriminated people,
and overweight people. Table 1 illustrates two instances from the dataset with humorous and
non-humorous content.
        </p>
        <p>In the training set, 67% (1,802) of the instances were labeled as non-humorous, while the
remaining 33% (869) were considered humorous. This distribution is maintained in the
test division (522 tweets marked as “no-humor” and 256 tagged as “humor”), thus evidencing a
strongly imbalanced class distribution in favour of non-humorous texts. Figure 1 shows
the distribution of tweet length for humour and non-humour instances in both dataset divisions.
Funny tweets tend to be longer, with 27.5 ± 16.0 and 29.9 ± 15.4 tokens on average in the
training and test datasets, while non-humorous texts are slightly shorter, with 23.8 ± 9.9 and
24.4 ± 9.6 tokens, respectively.</p>
        <p>Regarding the class distribution in the multilabel classification task, a particularly
interesting aspect was identified. In the provided training dataset, all tweets were labeled as targeting
at least one minority group and at most two, i.e., either one or two of the four labels take
the value 1, while the others take the value 0. However, this is not the case in the test
dataset, where many instances are marked as prejudicial towards three or even all four groups.
Considering this feature, Figure 2 plots the number of instances labeled with each class in the
training and test datasets. For simplicity, each label is referred to by a representative capital
letter, namely “W” for women and feminists, “L” for the LGBTIQ community, “I” for immigrants
and racially discriminated people, and “G” for fatphobic prejudices (“gordofobia” in the original
datasets); this label encoding is maintained throughout the rest of this paper. Note that in the
analysis of the training set the main diagonal contains the instances that are only tagged with a
single class (see Figure 2a); recall that since subtask 2A is posed as a multilabel classification
problem, the matrix trace does not necessarily have to be equal to the total number of instances
in the dataset (and in fact it is not), provided that several instances are labeled as prejudicial
towards more than one minority group. Here, a clear correlation between labels can be seen; for
instance, tweets that are offensive to women and feminists have a high probability of expressing
prejudice against overweight people as well. In the test set, texts offensive towards women are
the most common. Tweets that at the same time target this group and the LGBTIQ community
are frequent as well.</p>
        <p>Lastly, Figure 3 plots a density graph of the prejudice scores in the training and test datasets,
separated into two curves to differentiate between humorous tweets and those with only
hurtful content (i.e., “no-humor” class). It is straightforward to determine that humorous
texts tend to register a higher prejudice score. This again emphasizes the relevance of the
HUHU@IberLEF 2023 shared task: identifying offensive comments in social media content that
may be hidden behind humorous undertones is essential to ensure a safe, respectful, and diverse
online environment.</p>
        <p>
          During the development phase, the training dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] provided by the organizers was divided
into three splits with a ratio of 70:20:10, i.e., 1,870 tweets for training, 534 for validation, and 267
for testing. Data stratification was performed for the binary and multilabel classification tasks
in order to preserve the class distribution of the original dataset in each split. For regression,
the data subsets obtained through random sampling proved representative enough.
        </p>
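        <p>As an illustration, the following minimal sketch shows how such a stratified 70:20:10 division can
be obtained with scikit-learn. It is not the exact script used in our experiments; the file name and the
is_humor column are hypothetical placeholders for the binary humour label.</p>
        <preformat>
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the HUHU training data (hypothetical file and column names).
data = pd.read_csv("huhu_train.csv")

# First split off 70% for training, stratifying on the binary humour label.
train_df, rest_df = train_test_split(
    data, train_size=0.7, stratify=data["is_humor"], random_state=42
)

# Split the remaining 30% into validation (20% of the total) and test (10%),
# again preserving the class distribution.
val_df, test_df = train_test_split(
    rest_df, train_size=2 / 3, stratify=rest_df["is_humor"], random_state=42
)

print(len(train_df), len(val_df), len(test_df))  # roughly 1870, 534, 267
        </preformat>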
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>
        The computational detection of humour is a well-established and actively researched topic
within the field of Natural Language Processing (NLP) [
        <xref ref-type="bibr" rid="ref2 ref6 ref7">6, 7, 2</xref>
        ]. In 2017, Zhang et al. introduced
the concept of Contextual Knowledge and diverse features to capture the emotionality and
subjectivity behind humorous content [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Recent efforts have also been directed towards
detecting humour in social media content, such as the work from Zhang and Liu (2014) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
which resorts to the use of Machine Learning (ML) techniques to handle sentiment analysis and
opinion mining to distinguish humorous texts from non-humorous posts.
      </p>
      <p>
        However, given the dynamic nature of language, cultural context and the subtleties
involved in detecting sarcasm, irony and other forms of non-benign humour, its automatic
recognition is far from trivial [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In fact, the humour may sometimes be imperceptible even
to a human reader. To involve scientists and foster research in this field,
various annual events on humour recognition are held. SemEval-2015 Task 11 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] was oriented
to the study of three broad classes of figurative language: irony, sarcasm and metaphor. Further,
Task 6 of the 11th edition of this workshop [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] was aimed towards capturing the specific sense
of humour in tweets submitted to a comedy show. HAHA@IberLEF 2018 [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] was the
first Spanish-language humour detection challenge, followed by a new edition of the same
competition in 2019 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Here, humour detection was posed as a binary classification task and
the funniness of crowd-annotated tweets had to be scored as a regression problem. Additionally,
SemEval-2021 Task 7 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] later extended the participating tracks to recognize offensive content
in controversial humorous posts. In the same line, SemEval-2017 Task 7 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] channeled studies
towards the analysis of puns.
      </p>
      <p>
        There is little doubt that the introduction of Transformers [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] marked the beginning of a
new thrilling chapter in the NLP domain. After the proposal of BERT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the “blue-eyed boy”
of this emerging era, multiple alternative architectures have been designed to handle complex
language processing tasks [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>
        ]. It is no wonder that their application has ranged through
various practical scenarios, including the recognition of humorous content. For instance, Weller
and Seppi (2019) proposed a transformer-based method that acquires the ability to recognize
jokes by analyzing ratings obtained from Reddit pages [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. They empirically demonstrated
that this contribution outperformed previous approaches in this domain, obtaining an F1-score
of 93.1% and 98.6% for two datasets of puns (32,003 instances) and jokes (231,657 instances),
respectively.
      </p>
      <p>
        Reasonably, the individual success of transformer models raises the question of whether
their combination could potentially ease humour detection. In this context, ensembles that use
multiple ML techniques jointly have shown robust performance in humour detection tasks.
In particular, hitachi [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], the winning team at the SemEval-2020 Task 7 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], uses stacking
in ensembles of pre-trained language models (PLMs) to compute the final predictions. In this
workshop, two tasks were defined: scoring of funniness in the range [0, 3] and prediction of the
funnier headline between pairs of these. hitachi ranked first in both subtasks, achieving an
RMSE of 0.449 and an accuracy of 67.4%, respectively. The dataset originally contained a total
of 5,000 news headlines.
      </p>
      <p>
        Furthermore, the winner of the HAHA@IberLEF 2019 shared task [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] introduced an
ensembling system of a fine-tuned multilingual version of BERT and a Naïve Bayes classifier,
yielding an F1-score of 82.1% and a 0.736 RMSE for humour detection (binary classification)
and funniness score prediction (regression) tasks, respectively. The dataset consisted of 30,000
hand-annotated Spanish tweets, out of which 38.7% were labeled as humorous.
      </p>
      <p>
        The top system of the 2021 edition of the HAHA competition was jocoso [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], an ensemble
of diverse transformer architectures (plus a Naive Bayes classifier) fine-tuned on the dataset
provided by the organizers. The training and development splits of the latter matched the
training and test sets of the 2019 edition; in addition, a new test split of 6,000 tweets was
provided. jocoso ranked first in the (binary) humour classification task (F1-score = 88.5% ) while
performing competitively in the rest: it obtained the third place in the (regression) humour
rating task (0.6296 RMSE) and was runner-up in the (multiclass) humour logic mechanism
classification and (multilabel) humour target classification tasks, with F1-scores of 29.1% and
35.8%, respectively. These last works have heavily inspired the notions presented in this paper.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. System Overview</title>
      <sec id="sec-4-1">
        <title>4.1. Models</title>
        <p>We now provide a brief description of the state-of-the-art transformers that were used during
the development phase.</p>
        <p>
          Presented in 2018, BERT [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is one of the most popular transformer models due to its outstanding
performance in many NLP tasks. Since its release, many state-of-the-art transformers have been
developed based on it, including RoBERTa [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], ALBERT [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] and DistilBERT [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], among many
others. BERT was trained under two tasks: masked language modeling (MLM) and next sentence
prediction (NSP). In particular, the multilingual version used in this work was pre-trained in a
self-supervised fashion on the 104 languages with the largest Wikipedias.
        </p>
        <p>
          DistilBERT [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is a smaller version of BERT that can be trained faster. This is achieved
through distillation, i.e., the number of layers in the initial version of BERT is reduced by a factor
of 2, and token embeddings and poolers are removed to yield a cheaper and lighter transformer
model. In this work, a multilingual version of DistilBERT is assessed.
        </p>
        <p>
          RoBERTa [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] seeks to provide a highly optimized version of BERT by tweaking various
methodological parameters. It was originally trained using texts from five English language
datasets: the BookCorpus dataset [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], the English Wikipedia, the CC-News data, the Stories
dataset [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], and the OpenWebText corpus.
        </p>
        <p>
          Lastly, BETO [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] is a variation of BERT trained on a big Spanish corpus [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] with the Whole
Word Masking technique, which masks, in addition to the original token, all tokens of the same
word.
        </p>
        <p>For the sake of completeness, this study uses both the cased and uncased versions of BERT
and BETO. In the remainder of the document, this aspect will be specified with subscript “c” or
“uc” for cased and uncased, respectively.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ensemble voting system</title>
        <p>We present a novel system for detecting hurtful humour in tweets by using ensembles of the
transformer models described in subsection 4.1. Our system combines the raw predictions
of transformer models (which were fine-tuned separately) using a weighted voting system,
resulting in joint predictions that leverage the strengths and address the weaknesses of individual
transformers. The weights assigned to each transformer correspond to the normalized value of
the task score, which is computed over the validation set. Thus, greater importance is given to
the predictions of the transformers that ofer better performance in the task at hand, without
disregarding the contributions of other transformers that would allow a stronger prediction
consensus to be reached. Accordingly, we employ the F1-score over the positive class for the
binary classification task (subtask 1) and the macro F1-score for the multilabel classification
problem (subtask 2A), while for the regression task (subtask 2B) we use the inverse of the RMSE,
all calculated over the validation set. By summing the weighted predictions, a raw output
is yielded. Note that subtasks 1 and 2A are posed as binary and multilabel classification
tasks, respectively; hence, this raw value is rounded to the nearest binary value in these
scenarios.</p>
        <p>Figure 4 illustrates how the label of a given instance is predicted in the binary classification
task. The ensemble used in this case is composed of four models. The weights refer to the
normalized F1-scores, so that the sum of all the resulting values equals 100%; thus, these weights
represent the percentage of importance assigned to each transformer. Further, the ensemble
output has to be rounded off to produce the final prediction.</p>
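        <p>The following is a minimal sketch of the weighted voting scheme described above (the function and
variable names are ours and are not taken from any library): the validation scores are normalized into
weights, the per-model predictions are combined as a weighted sum, and the result is rounded for the
classification subtasks.</p>
        <preformat>
import numpy as np

def weighted_vote(predictions, scores, classification=True):
    """Combine per-model predictions with weights proportional to validation scores.

    predictions: array of shape (n_models, n_instances) with raw model outputs
                 (0/1 labels for classification, real values for regression).
    scores: validation score of each model (F1-based for classification,
            inverse RMSE for regression), used to derive the voting weights.
    """
    scores = np.asarray(scores, dtype=float)
    weights = scores / scores.sum()          # normalized weights sum to 1 (100%)
    raw = weights @ np.asarray(predictions)  # weighted sum per instance

    # For the binary and multilabel subtasks the raw value is rounded to the
    # nearest binary label; for regression it is returned as-is.
    return np.rint(raw).astype(int) if classification else raw

# Toy example mirroring Figure 4: four models voting on a single tweet.
preds = [[1], [1], [0], [1]]
f1_scores = [0.80, 0.75, 0.70, 0.78]
print(weighted_vote(preds, f1_scores))  # prints [1]
        </preformat>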
        <p>For the sake of completeness, all possible ensembles are defined as the non-empty subsets
of the 6 transformer models studied (BERTc,uc, RoBERTa, DistilBERT, and
BETOc,uc). Each ensemble can be represented by a binary string of 6 bits (since
we are evaluating 6 transformers), where bit i determines whether model
i is present in the ensemble at hand (1) or not (0). This results in a total of
2^6 − 1 = 63 ensembles (one is subtracted since the empty ensemble, encoded by the string
containing six 0’s, is neglected). It is important to note that, since each transformer is
fine-tuned only once, evaluating all ensembles only requires recombining the individual
predictions with the corresponding weights in the voting phase. For each subtask, the ensemble
model that performed best on the test division represented the architecture used to estimate the
predictions that were submitted to the competition.</p>
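        <p>Under the same assumptions as the previous sketch, the enumeration of the 2^6 − 1 = 63 non-empty
ensembles can be expressed as follows; since each transformer is fine-tuned only once, each candidate
ensemble merely recombines the cached predictions with its own weights.</p>
        <preformat>
from itertools import combinations

MODELS = ["BERT_c", "BERT_uc", "RoBERTa", "DistilBERT", "BETO_c", "BETO_uc"]

# Every non-empty subset of the six models defines one candidate ensemble,
# equivalent to a 6-bit string where bit i marks whether model i is included.
ensembles = [
    subset
    for r in range(1, len(MODELS) + 1)
    for subset in combinations(MODELS, r)
]

print(len(ensembles))  # 63 = 2**6 - 1
        </preformat>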
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>
        Unlike rule-based models or recurrent neural networks (RNNs), transformer models can learn
complex language patterns without extensive preprocessing or human intervention [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Their
architecture, including multi-head attention and encoding-decoding layers, allows them to
handle various NLP tasks without modifying the input data. Transformers excel in unstructured or
loosely structured language scenarios, thus eliminating the need for extensive data preparation.
For the participation in the HUHU@IberLEF 2023 competition, preprocessing tasks such as
tokenization, stemming, lemmatization or stop-word removal did not give better results than
when using the raw data; consequently, the texts were kept in their original form. A noteworthy
detail is that hashtags, URLs, and mentions of other Twitter users were already
handled in the dataset provided by the organizers, being represented in a unified way in the training
and test sets by the tokens “HASHTAG”, “URL” and “MENTION”, respectively.
      </p>
      <p>All transformer models were individually trained for 10 epochs with a batch size of 8 on the
training split. To avoid overfitting, early stopping was applied with a patience of 3 epochs. To
achieve better performance of the transformers in the tasks, hyperparameter
tuning was done via grid search using different learning rates ({2e-5, 4e-5, 8e-5}) and optimizers
({AdamW, Adafactor}). The best hyperparameter values were chosen by evaluating the models on
the validation split. All transformer models were trained on NVIDIA T4 Tensor Core GPUs on
Google Colab.</p>
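      <p>As a hedged illustration (not our exact training script), the sketch below shows how one of the
transformers could be fine-tuned and evaluated with simpletransformers under the settings listed above.
The checkpoint name corresponds to the cased BETO model, and train_df and val_df stand for our training
and validation splits (pandas DataFrames with “text” and “labels” columns); the grid and metric follow
the binary classification subtask.</p>
      <preformat>
from itertools import product

from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.metrics import f1_score

# Hypothetical grid mirroring the setup described above.
learning_rates = [2e-5, 4e-5, 8e-5]
optimizers = ["AdamW", "Adafactor"]

results = []
for lr, opt in product(learning_rates, optimizers):
    args = ClassificationArgs(
        num_train_epochs=10,
        train_batch_size=8,
        learning_rate=lr,
        optimizer=opt,
        use_early_stopping=True,
        early_stopping_patience=3,      # stop after 3 evaluations without improvement
        evaluate_during_training=True,  # required for early stopping
        overwrite_output_dir=True,
    )
    # Cased BETO as an example; the other checkpoints are plugged in the same way.
    model = ClassificationModel("bert", "dccuchile/bert-base-spanish-wwm-cased", args=args)
    model.train_model(train_df, eval_df=val_df)
    result, _, _ = model.eval_model(val_df, f1=f1_score)  # F1 over the positive class
    results.append((result["f1"], lr, opt))

best_f1, best_lr, best_opt = max(results)
print(best_f1, best_lr, best_opt)
      </preformat>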
      <p>We performed an extensive evaluation of all possible ensembles with different
hyperparameters on our test split. As the organizers allowed competitors to send two submissions of
predictions for each subtask, we chose the two approaches that reported the best scores on our
test split. Table 2 summarizes these approaches.</p>
      <p>
        The work presented in this paper is implemented in Python 3.10. The development of the
ensembles of transformers is primarily based on the simpletransformers library (version
0.63.9) [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], which makes it possible to quickly train, evaluate and make predictions with fine-tuned
state-of-the-art transformer models in a few lines of code. To date, it offers support for various
NLP tasks including text and token classification, question answering, language modeling and
generation, multi-modal classification, conversational AI, and text representation generation.
      </p>
      <p>Many other libraries have been used to plot, visualize and evaluate the dataset and the
empirical results. Some of these include but are not limited to (in alphabetical order) matplotlib,
numpy, pandas, seaborn and scikit-learn.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>The HUHU@IberLEF 2023 shared task allowed for the submission of two diferent runs of
predictions per task. Hence, our team, jujunlp, participated in the three subtasks by using
the two best-performing ensembles in each. These alongside the hyperparameter values that
yielded the best results on the test split (the 10% of the training dataset held out for experimentation)
are reported in Table 2.</p>
      <p>After the organizers published the labeled test dataset, we were able to evaluate all
transformers on this set (see Table 3). As explained above, our experimentation during
the development phase showed that individual transformers perform worse than some ensembles
when evaluated on our test split, which was created by random stratified sampling.
However, the assessment of all transformers and ensembles on the final test dataset does not
show the same behavior. In fact, some transformers, such as BETOc or BETOuc (both fine-tuned
using the AdamW optimizer and a learning rate of 4e-05), outperform the best ensembles that we found
during the development phase (see Table 2). We have pointed out in subsection 2.2 that the
class distribution for subtask 2A (multilabel text classification) differs between the training and
test datasets provided by the organizers. As previously stated, all tweets in the training dataset
were labeled with at most two out of the four labels. However, many tweets in the test dataset
are annotated with three or even four labels.</p>
      <p>Additional valuable conclusions can be drawn from Table 2 regarding the use of ensembles
of transformers: although transformer models were also evaluated individually in each subtask,
none exhibited better performance than when they were combined through the voting system
described above. This shows that ensembles achieve better results in the hurtful humour
detection tasks of this competition than any transformer model on its own. Regarding the
individual transformer models, it can be observed that BETO is present in all the best ensembles of
each task, while BERT is absent only from the second-best system for the regression track.
On the other hand, DistilBERT exhibited the worst performance among the 6 state-of-the-art
models, being present only in the second-best ensemble for subtask 2B.</p>
      <p>Based on the official results reported by the organizers of the HUHU@IberLEF 2023 shared
task, 58, 49 and 48 teams participated in subtasks 1 (binary classification), 2A (multilabel
classification) and 2B (regression), respectively. Table 4 summarizes the participation of our team
(jujunlp) in the competition, which managed to rank 12th, 1st and 22nd in the aforementioned
tracks with the first run of predictions, while the second run finished 27th, 4th and 25th. In
the following analysis, jujunlpi stands for the system that produced the results for run i (see
Table 2). Notice that jujunlp1 for subtask 1 (BERTc+BETOc) differs, for instance, from jujunlp1
for subtask 2B (BERTc+BETOc+BETOuc); the task being analyzed will be explicitly mentioned
as appropriate.</p>
      <p>Regarding subtask 1, our team did not make the top 10. The performance of our system leaves
clear room for improvement, since the difference in F1-scores between retuyt-inco1 and jujunlp1 is
almost 5 points. Further, the best-performing baseline ranked above both of our submitted
entries in this subtask, recording a value of the aforementioned metric almost 2 points higher.</p>
      <p>In subtask 2A, we outperformed the rest of the participants. Here, the ensembles of
transformers used appear well suited to detecting prejudiced groups in Spanish tweets, since jujunlp2
also achieved a strong position in the ranking (4th). Further, our team exceeds all baseline
approaches in this track. A plausible justification for this fact is that ensembles make it possible to
incorporate diverse perspectives and to take into account the independence of labels in the context
of multilabel classification. This diversity helps ensembles handle cases where labels
are interdependent or co-occur in complex ways. In addition, ensemble systems can handle
noisy labels (due to the inherent subjectivity of those who label the data) more effectively by
leveraging predictions from multiple models.</p>
      <p>Lastly, in subtask 2B jujunlp obtained an RMSE 0.079 units worse than that of the winner of
this task (0.084 for run 2). In fact, all baseline approaches outperform our system, including
beto which ranked 2nd. Although BETO is also present in jujunlp1 and jujunlp2, a different
dataset division or model fine-tuning process (among other options) may have been followed
by the organizers, which would explain the notable differences in the results achieved by our
approach.</p>
      <p>For the sake of performing a thorough error analysis and further experiments with our
systems, Figures 5 to 8 evaluate the results of jujunlp1 and jujunlp2 on the test dataset, whose
labels were publicly released after the deadline for submitting prediction runs had passed.</p>
      <p>Figure 5 shows a similar performance by jujunlp1 and jujunlp2. Both display a lower
precision (0.806 and 0.739) in comparison to their recall (0.875 and 0.897), implying that multiple
non-humorous tweets are incorrectly marked as humour. However, funny tweets are correctly
identified in almost 90% of the cases. These can be considered competitive results, since
several errors are attributed to instances that are on the borderline of humour (and, in fact,
could give rise to discussion). As an example, the humorous tweet “Lo géneros son como las
torres gemelas, antes eran dos pero ahora es un tema sensible.” (roughly, “Genders are like the
twin towers, there used to be two but now it is a sensitive topic.”) is classified by our systems as
“no-humor”.</p>
      <p>Figure 5: (a) Predictions computed by jujunlp1. (b) Predictions computed by jujunlp2.</p>
      <p>The confusion matrices corresponding to the four binary classes that comprise subtask 2A
based on the results achieved by jujunlp1 and jujunlp2 are portrayed in Figures 6 and 7. Again,
both systems exhibit a similar performance. In every class, the precision achieved by the
ensembles is higher than the recall (calculated over the positive class). In other words, the
number of False Negatives (FN) is higher than the number of False Positives (FP). jujunlp1
and jujunlp2 excel in determining whether a tweet is offensive towards overweight people,
i.e., “Gordofobia” (fatphobia) is the label where jujunlp1 and jujunlp2 achieve their highest
F1-scores: 0.930 and 0.898, respectively. The second class in which they show decent results is
“prejudice_woman”. The opposite scenario is presented by the “prejudice_lgbtiq” class (0.683
and 0.669 F1-scores), where these ensembles struggle to distinguish tweets that
express prejudice towards the LGBTIQ community. jujunlp1 and jujunlp2 also behave poorly
in recognizing texts that offend immigrants or express racial discrimination. A
remarkable aspect identified in this subtask is that both ensembles tend to set the
“prejudice_woman” label to true (1) whenever the text at hand mentions women, even when it
does not apply; this explains why the proportion of FPs in the first class is higher than in any
other label. In addition, if the word “negro” appears in a tweet, the ensemble systems directly
mark it as racist. Such scenarios, which give rise to incorrect predictions, could likely
be mitigated if a larger and more varied training dataset were available.</p>
      <p>Finally, the regression task (subtask 2B) posed arguably the most complex prediction scenario.
Figure 8 plots the predicted versus actual scores of prejudice degree for the tweets in the
test dataset. Note that correct predictions lie on the main diagonal. The number of points
falling below this line evidences that the systems tend to find the content of the tweets in
the dataset more prejudicial than their annotators deemed it. Handling this task by
ensembling transformer models does not seem to offer many benefits. In fact, BETO (used as
a baseline approach in the competition) yielded a lower RMSE on the test dataset. A possible
explanation is that in this subtask the ensembles find it rather difficult to distinguish between
humorous and non-humorous texts. For this reason, when their content is offensive, the tweets tend
to be rated as highly prejudicial, even though a human reader may not find them so hurtful.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we introduced a novel approach based on transformer ensembles that use a weighted
voting system to make predictions on Spanish tweets. In particular, it has been applied to detect
hurtful humour, prejudiced groups, and degree of prejudice in the form of binary classification,
multilabel classification, and regression tasks, respectively. We empirically assessed the
performance of several state-of-the-art transformer models, including RoBERTa, DistilBERT, BERT
and BETO (the last two were evaluated on both cased and uncased versions). The predictions
of the ensemble systems were calculated as the sum of the individual transformer predictions,
weighted by the (normalized) value of the metric they achieved in each task. The
experimentation carried out on our test split (a random and stratified sample from the training dataset)
reveals that ensembles consistently exceed individual transformers in all studied tasks.</p>
      <p>Our participation in the HUHU@IberLEF 2023 competition obtained
competitive results. In particular, our ensembles achieved an F1-score of 77.2% in hurtful humour
detection, an F1-score of 79.6% in prejudice target detection, and an RMSE of 0.934 in degree of
prejudice prediction, thus ranking 12th (27th), 1st (4th) and 22nd (25th), respectively. Specifically,
they exhibited strong performance on the multilabel classification task (subtask 2A),
outperforming the rest of competitors, by leveraging model specialization, balancing the accuracy and
recall of the individual models, managing label imbalance, mitigating biases, and improving the
generalization capability of the system to unseen cases during training. We must emphasize
that the class distribution in the training dataset for subtask 2A (multilabel) seems to be quite
different from the class distribution of the test set. Further, certain instances of the test dataset
register a prejudice score smaller than 1 even though this is not contemplated in the description
of subtask 2B. Overall, we consider these results encouraging and we firmly believe that
with a larger corpus formed of training and test datasets with similar class distributions, the
learning process could be considerably improved.</p>
      <p>As future work, we plan to include more transformer models pre-trained on Spanish text as
part of the ensemble mechanism. We seek to empirically compare the weighted voting system
described in this work with alternative ensemble methods, including classical (soft) voting,
stacking, bagging and boosting.</p>
      <p>In addition, a motivating line of research is the translation of Spanish tweets into English, thus
opening the possibility of using state-of-the-art transformers pre-trained on large English corpora.
Back translation also emerges as a promising approach for data augmentation.
The fundamental metrics considered in this work during the evaluation of the performance of
the transformer ensembles are described below:
• F1-score ranges from 0 to 1 and represents the harmonic mean of precision and recall:</p>
      <p>F1-score = 2 × (precision × recall) / (precision + recall)</p>
      <p>• Macro F1-score provides the arithmetic mean of the F1-scores over the different classes:</p>
      <p>Macro F1-score = (1/C) × Σ_{c=1}^{C} F1-score_c,
where F1-score_c is the F1-score of class c and C is the number of classes.</p>
      <p>• Root Mean Squared Error (RMSE) is the square root of the mean squared difference between the
actual and predicted values:</p>
      <p>RMSE = √( (1/N) × Σ_{i=1}^{N} (ŷ_i − y_i)² ),
where ŷ_i and y_i are the predicted and actual values for instance i, respectively, and N is
the data split size.</p>
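      <p>For reference, all three metrics can be computed directly with scikit-learn (mentioned in Section 5);
the arrays below are purely illustrative.</p>
      <preformat>
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Subtask 1: F1-score over the positive ("humor") class.
y_true_bin = [1, 0, 1, 1, 0]
y_pred_bin = [1, 0, 0, 1, 1]
print(f1_score(y_true_bin, y_pred_bin, pos_label=1))

# Subtask 2A: macro F1-score over the four prejudice labels (multilabel indicator format).
y_true_ml = np.array([[1, 0, 0, 1], [0, 1, 0, 0]])
y_pred_ml = np.array([[1, 0, 0, 0], [0, 1, 1, 0]])
print(f1_score(y_true_ml, y_pred_ml, average="macro"))

# Subtask 2B: RMSE between predicted and annotated degrees of prejudice.
y_true_reg = [1.5, 3.0, 4.2]
y_pred_reg = [2.0, 2.5, 4.0]
print(mean_squared_error(y_true_reg, y_pred_reg, squared=False))
      </preformat>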
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. I.</given-names>
            <surname>Merlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>When humour hurts: linguistic features to foster explainability</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>70</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Meaney</surname>
          </string-name>
          , S. Wilson,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          , W. Magdy,
          <article-title>SemEval 2021 task 7: HaHackathon, detecting and rating humor and ofense</article-title>
          ,
          <source>in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>119</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .semeval-
          <volume>1</volume>
          .9. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .semeval-
          <volume>1</volume>
          .9.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ortega-Bueno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , E. Fersini,
          <article-title>Profiling irony and stereotype spreaders on twitter (irostereo).</article-title>
          ,
          <source>in: PAN</source>
          <year>2022</year>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Labadie-Tamayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Everybody Hurts, Sometimes. Overview of HUrtful HUmour at IberLEF 2023:
          <article-title>Detection of Humour Spreading Prejudice in Twitter</article-title>
          ,
          <source>in: Procesamiento del Lenguaje Natural (SEPLN)</source>
          , volume
          <volume>71</volume>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Labadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          , P. Rosso,
          <article-title>HUrtful HUmour (HUHU): Detection of humour spreading prejudice in Twitter, 2023</article-title>
          . URL: https://doi.org/10.5281/zenodo.7967255. doi:
          <volume>10</volume>
          .5281/ zenodo.7967255.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Etcheverry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Prada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosá</surname>
          </string-name>
          , Overview of haha at iberlef 2019:
          <article-title>Humor analysis based on human annotation</article-title>
          ., in: IberLEF@ SEPLN,
          <year>2019</year>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Góngora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosá</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Meaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          , Overview of haha at iberlef 2021:
          <article-title>Detecting, rating and analyzing humor in spanish</article-title>
          .,
          <source>Procesamiento de Lenguaje Natural</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>132</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Song, L. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Investigations in automatic humor recognition</article-title>
          ,
          <source>in: 2017 10th International Symposium on Computational Intelligence and Design (ISCID)</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>272</fpage>
          -
          <lpage>275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , N. Liu,
          <article-title>Recognizing humor on twitter</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM international conference on conference on information and knowledge management</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>889</fpage>
          -
          <lpage>898</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Veale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shutova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barnden</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Reyes, SemEval-2015 task 11:
          <article-title>Sentiment analysis of figurative language in Twitter</article-title>
          ,
          <source>in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval</source>
          <year>2015</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Denver, Colorado,
          <year>2015</year>
          , pp.
          <fpage>470</fpage>
          -
          <lpage>478</lpage>
          . URL: https://aclanthology. org/S15-2080. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>S15</fpage>
          -2080.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Potash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romanov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Rumshisky, SemEval
          <article-title>-2017 task 6: #HashtagWars: Learning a sense of humor</article-title>
          ,
          <source>in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>57</lpage>
          . URL: https://aclanthology.org/S17-2004. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>S17</fpage>
          -2004.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosa</surname>
          </string-name>
          ,
          <article-title>Overview of the haha task: Humor analysis based on human annotation at ibereval 2018</article-title>
          , in: CEUR workshop proceedings, volume
          <volume>2150</volume>
          ,
          <year>2018</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosá</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moncecchi</surname>
          </string-name>
          ,
          <article-title>A crowd-annotated Spanish corpus for humor analysis</article-title>
          ,
          <source>in: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media</source>
          , Association for Computational Linguistics, Melbourne, Australia,
          <year>2018</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>11</lpage>
          . URL: https://aclanthology.org/W18-3502. doi:
          <volume>10</volume>
          . 18653/v1/
          <fpage>W18</fpage>
          -3502.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hempelmann</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          , SemEval
          <article-title>-2017 task 7: Detection and interpretation of English puns</article-title>
          ,
          <source>in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>68</lpage>
          . URL: https://aclanthology.org/S17-2005. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>S17</fpage>
          -2005.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/
          <year>1810</year>
          .04805 (
          <year>2018</year>
          ). URL: http://arxiv. org/abs/
          <year>1810</year>
          .04805. arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , CoRR abs/1907.11692 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          , CoRR abs/1910.01108 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.01108. arXiv:1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chaperon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Spanish pre-trained BERT model and evaluation data</article-title>
          ,
          <source>in: PML4DC at ICLR 2020</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seppi</surname>
          </string-name>
          ,
          <article-title>Humor detection: A transformer gets the last laugh</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3621</fpage>
          -
          <lpage>3625</lpage>
          . URL: https://aclanthology.org/D19-1372. doi:10.18653/v1/D19-1372.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Morishita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Morio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ozaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miyoshi</surname>
          </string-name>
          ,
          <article-title>Hitachi at SemEval-2020 task 7: Stacking at scale with heterogeneous language models for humor recognition</article-title>
          ,
          <source>in: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>791</fpage>
          -
          <lpage>803</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krumm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kautz</surname>
          </string-name>
          ,
          <article-title>SemEval-2020 task 7: Assessing humor in edited news headlines</article-title>
          ,
          <source>in: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          , International Committee for Computational Linguistics, Barcelona (online)
          ,
          <year>2020</year>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>758</lpage>
          . URL: https://aclanthology.org/2020.semeval-1.98. doi:10.18653/v1/2020.semeval-1.98.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ismailov</surname>
          </string-name>
          ,
          <article-title>Humor analysis based on human annotation challenge at IberLEF 2019: First-place solution</article-title>
          , in: IberLEF@SEPLN,
          <year>2019</year>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <article-title>HAHA@IberLEF 2021: Humor analysis using ensembles of simple transformers</article-title>
          , in: IberLEF@SEPLN,
          <year>2021</year>
          , pp.
          <fpage>883</fpage>
          -
          <lpage>890</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>
          , CoRR abs/1909.11942 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1909.11942. arXiv:1909.11942.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books</article-title>
          ,
          <source>arXiv preprint arXiv:1506.06724</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Trinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>A simple method for commonsense reasoning</article-title>
          , CoRR abs/1806.02847 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1806.02847. arXiv:1806.02847.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          ,
          <source>Compilation of large Spanish unannotated corpora</source>
          ,
          <year>2019</year>
          . URL: https://doi.org/10.5281/zenodo.3247731. doi:10.5281/zenodo.3247731.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Rajapakse</surname>
          </string-name>
          , Simple Transformers, https://github.com/ThilinaRajapakse/simpletransformers,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>