<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Humor Analysis Based on Human Annotation Challenge at IberLEF 2019: First-place Solution</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>160</fpage>
      <lpage>164</lpage>
      <abstract>
<p>This paper describes the winning solution to the Humor Analysis based on Human Annotation (HAHA) task at IberLEF 2019. The main classification task is solved using an ensemble of a fine-tuned multilingual BERT (Bidirectional Encoder Representations from Transformers) model and a naive Bayes classifier. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain.</p>
      </abstract>
      <kwd-group>
<kwd>Natural Language Processing</kwd>
        <kwd>Deep Neural Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
The Humor Analysis based on Human Annotation (HAHA) challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposes
two tasks: to classify tweets in Spanish as humorous or not, and rate how funny
they are on a given scale. This paper describes the winning solution for both of
these tasks.
      </p>
      <p>
The main classification task is solved using an ensemble of a fine-tuned
multilingual BERT (Bidirectional Encoder Representations from Transformers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ])
model and a naive Bayes classifier. The solution achieves an F1 score of 0.821,
with the second-place team scoring 0.816.
      </p>
      <p>The regression task is also solved by fine-tuning a multilingual BERT model.
The final submission is a weighted average of the regression BERT model and
a LightGBM model (https://github.com/microsoft/LightGBM) estimated on
TFIDF features. The solution achieves a root-mean-square error (RMSE)
of 0.736, with the second-place team achieving an RMSE of 0.746.</p>
      <p>
        The rest of the paper is organised as follows: Section 2 describes the challenge, Section 3 describes the solution for each of the tasks, and Section 4 concludes.
      </p>
      <p>
        The challenge asks participants to classify tweets in Spanish as humorous or not, and to rate how funny they are on a scale from one (not humorous) to five. The dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
is a corpus of crowd-annotated Spanish-language tweets split into a train and a
test set. The train set consists of 24,000 tweets, of which 38.6% are considered
humorous, with an average rating of 2.05. The test set comprises 6,000 tweets for
which only the text is given. There are two tasks:
      </p>
      <p>Humour Detection: the goal is to classify tweets into jokes (humour intended
by the author) and non-jokes. Performance is measured using the F1 score.</p>
      <p>Funniness Score Prediction: the goal is to predict a funniness score (the average
of crowd-sourced ratings) for a tweet, supposing it is a joke. Performance is
measured using the root-mean-square error.</p>
    </sec>
    <sec id="sec-2">
      <title>Solution Description</title>
      <p>For every model the final predictions were obtained by averaging predictions
from each fold of a five-fold cross-validation scheme; in other words, by
averaging predictions from five models, each estimated on 80% of the data. The
validation scores reported below are F1 scores and RMSEs calculated on
out-of-fold predictions for the full train set.</p>
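<p>The averaging scheme can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the fit_predict callback is a hypothetical stand-in for training any of the models described below.</p>

```python
import numpy as np

def five_fold_average(X, y, X_test, fit_predict, n_folds=5, seed=0):
    """Average test-set predictions over models trained on each fold of a
    K-fold split, and collect out-of-fold predictions for the train set.

    fit_predict(X_train, y_train) must return a function that maps a
    feature matrix to predictions."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    oof = np.zeros(len(X))             # out-of-fold predictions, for validation scores
    test_pred = np.zeros(len(X_test))
    for val_idx in folds:
        train_idx = np.setdiff1d(idx, val_idx)        # each model sees 80% of the data
        predict = fit_predict(X[train_idx], y[train_idx])
        oof[val_idx] = predict(X[val_idx])
        test_pred += predict(X_test) / n_folds        # average over the five fold models
    return oof, test_pred
```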
      <p>Classification task. The results are summarised in Table 1. The baseline provided by the organisers
labels a tweet as a joke at random with probability 0.5.</p>
      <p>
        The main model used in the final solution is a fine-tuned multilingual cased
BERT model (12-layer, 768-hidden, 12-heads, 110M parameters) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that
supports 104 languages, including Spanish. We use the PyTorch implementation by
HuggingFace1, which also provides pretrained weights and the vocabulary, and build on
top of it. We tokenize the text with basic tokenization followed by WordPiece
tokenization, following the original implementation, and apply no other
preprocessing to the text.
      </p>
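<p>WordPiece tokenization splits each word by greedy longest-match-first lookup against the model vocabulary, prefixing continuation pieces with "##". A minimal sketch of the algorithm, using a toy vocabulary rather than BERT's actual 104-language vocabulary:</p>

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece: repeatedly take the longest
    vocabulary entry that matches a prefix of the remaining characters;
    pieces after the first are prefixed with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation piece
            if piece in vocab:
                cur = piece
                break
            end -= 1                      # shrink the candidate and retry
        if cur is None:
            return [unk]                  # no piece matched: emit unknown token
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary for illustration only
vocab = {"jug", "##ar", "##ando", "hola"}
```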
      <p>
        For the classification task we use a binary cross-entropy loss and a one-cycle
learning-rate schedule [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Given the small dataset and the large model capacity,
overfitting is a major issue; to combat it we follow [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and apply differential
learning rates across the layers, with the maximum learning rate for the first layer
set at half of the maximum learning rate for the last layer, the classification
head (set at 2e-5). The batch size is 32; we train the model on each fold
for four epochs and use the checkpoint from the epoch with the best validation
F1 score for the test-set predictions. The differential learning rates and the one-cycle
learning-rate schedule are implemented using the FastAI library2. This base model
achieves a score of 0.818 on cross-validation and 0.807 on the test set.
1 https://github.com/huggingface/pytorch-pretrained-BERT
2 https://docs.fast.ai/
      </p>
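<p>The two learning-rate devices can be illustrated in pure Python. The geometric spacing between layer groups is an assumption (the paper only pins the first layer at half the head's rate), and the linear warmup/decay shape is a simplification of the one-cycle policy of [4]:</p>

```python
def layer_lrs(n_groups, head_lr=2e-5, first_frac=0.5):
    """Per-group maximum learning rates, geometrically spaced so that the
    first (lowest) layer gets first_frac * head_lr and the head gets head_lr."""
    if n_groups == 1:
        return [head_lr]
    ratio = first_frac ** (1.0 / (n_groups - 1))
    return [head_lr * ratio ** (n_groups - 1 - i) for i in range(n_groups)]

def one_cycle_lr(step, total_steps, max_lr, pct_warmup=0.3):
    """Simplified one-cycle schedule: linear warmup to max_lr,
    then linear decay back towards zero."""
    warm = int(total_steps * pct_warmup)
    if step < warm:
        return max_lr * step / max(warm, 1)
    return max_lr * (total_steps - step) / max(total_steps - warm, 1)
```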
      <p>
        To further reduce overfitting we fine-tune the pretrained weights using
unsupervised learning on text data from both the train and test sets. The idea
comes from the ULMFiT paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where unsupervised fine-tuning on text
from the same domain as the target task before the classification step significantly
improved results. However, while in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] the unsupervised task is predicting the
next word, here we use the tasks used for training the original BERT language
model: a combination of masked language modelling and next-sentence prediction
losses, again by adapting the HuggingFace implementation to this task. We
fine-tune the language model on the domain for ten epochs, use the obtained
weights in place of the original pretrained weights for the classification task, and
repeat the steps above. This model achieves 0.829 on cross-validation and 0.815
on the test set, a significant improvement over the base model.
      </p>
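<p>The masked-language-modelling corruption follows the original BERT recipe: roughly 15% of positions are selected for prediction, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged. A self-contained sketch with a toy vocabulary (not the HuggingFace implementation actually used):</p>

```python
import random

MASK = "[MASK]"
VOCAB = ["hola", "que", "risa", "jaja", "tweet"]  # toy vocabulary for illustration

def mask_tokens(tokens, rng, mask_prob=0.15):
    """BERT-style MLM corruption. Returns the corrupted sequence and
    per-position labels (None where no prediction is required)."""
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # model must predict the original token
            r = rng.random()
            if r < 0.8:
                out.append(MASK)          # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                out.append(tok)           # 10%: keep the token unchanged
        else:
            labels.append(None)
            out.append(tok)
    return out, labels
```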
      <p>
        We then average these predictions with predictions from a naive Bayes model
following [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: a logistic regression3 estimated on uni- and bi-gram features
weighted using TFIDF.
      </p>
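<p>The naive Bayes component of [5] is built on the log-count ratio: a smoothed ratio of per-class feature counts whose log reweights the features before the logistic regression. A numpy sketch of that construction:</p>

```python
import numpy as np

def nb_log_count_ratio(X, y, alpha=1.0):
    """Naive-Bayes log-count ratio r from Wang & Manning (2012):
    smoothed feature counts for the positive vs. negative class are
    normalised and log-divided. Multiplying the feature matrix by r
    before fitting a logistic regression gives the 'NB-SVM' style model."""
    p = alpha + X[y == 1].sum(axis=0)   # smoothed positive-class counts
    q = alpha + X[y == 0].sum(axis=0)   # smoothed negative-class counts
    return np.log((p / p.sum()) / (q / q.sum()))
```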
      <p>The naive Bayes model alone achieves only 0.771 on cross-validation; however, an ensemble
of the BERT model above and the naive Bayes model, with weights 0.72 and 0.28
respectively, scores 0.833 on cross-validation and 0.821 on the leaderboard. The
optimal weights are derived from cross-validation. This ensemble was the final
solution for the classification task.</p>
      <p>Regression task. The results are summarised in Table 2. Here the baseline is a constant prediction
of 3 assigned to all items in the test set. For the regression task we also ensemble two
models: a fine-tuned BERT model and a LightGBM model based on
gradient-boosted trees.</p>
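<p>One way to derive the ensemble weights from cross-validation is a grid search over convex combinations of the two models' out-of-fold predictions, maximising F1. The exact search procedure is not specified in the paper, so the following is an assumption:</p>

```python
import numpy as np

def f1(y_true, y_prob, thresh=0.5):
    """F1 score for probability predictions binarised at a threshold."""
    y_pred = (y_prob >= thresh).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return 2 * tp / max(2 * tp + fp + fn, 1)

def best_blend_weight(y_true, oof_a, oof_b, grid=np.linspace(0, 1, 101)):
    """Pick the weight w for the blend w*a + (1-w)*b that maximises F1
    on out-of-fold predictions."""
    scores = [f1(y_true, w * oof_a + (1 - w) * oof_b) for w in grid]
    return grid[int(np.argmax(scores))]
```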
      <p>The only difference between the BERT model for the regression task and
the BERT model for the classification task is the loss function: for the former we
use a mean-squared loss, given that the challenge metric is RMSE. This model
achieves an RMSE of 0.726 on cross-validation and 0.746 on the test set.
3 Based on the code from https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline</p>
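<p>For reference, the RMSE metric used to score this task is simply the square root of the mean-squared loss the regression head is trained on:</p>

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error, the challenge metric for the funniness task."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```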
      <p>The second model is a LightGBM model estimated on the same features as
the naive Bayes model above. The loss function is also the mean-squared error, and
we set the bagging and feature-fraction parameters to 0.7 to add regularisation. This
model scores 0.795 on cross-validation.</p>
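<p>A hypothetical LightGBM parameter dictionary matching this description; bagging_fraction, bagging_freq, and feature_fraction are real LightGBM parameters, while the remaining choices (such as bagging_freq=1) are assumptions not stated in the paper:</p>

```python
# Sketch of the regularised LightGBM configuration described above.
lgb_params = {
    "objective": "regression",   # mean-squared-error loss
    "metric": "rmse",            # matches the challenge metric
    "bagging_fraction": 0.7,     # row subsampling, as described
    "bagging_freq": 1,           # assumption: bagging applied every iteration
    "feature_fraction": 0.7,     # column subsampling, as described
}
# Training would then look like (requires the lightgbm package):
# model = lightgbm.train(lgb_params, lightgbm.Dataset(X_tfidf, y))
```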
      <p>For the final solution we linearly average the predictions from these two models,
with a weight of 0.71 assigned to the prediction from the BERT model and 0.29
to the LightGBM model, the weights again coming from cross-validation. This
ensemble scores 0.712 on cross-validation and 0.736 on the test set.</p>
      <p>This paper describes the winning solution for both the classification and regression
tasks of the Humor Analysis based on Human Annotation challenge at IberLEF
2019, which consists of an ensemble of a fine-tuned BERT model and a
complementary model estimated on TFIDF features derived from uni- and bi-grams.</p>
      <p>Firstly, we can see that a high score can be achieved solely by appropriately
fine-tuning a BERT model. In fact, that model alone would have ranked second in this
competition, and with a longer hyperparameter search or longer language-model
fine-tuning, possibly first.</p>
      <p>Secondly, unsupervised learning can help reduce overfitting and improve
the score significantly when the dataset is small.</p>
      <p>And, finally, adding weaker but diverse models to the ensemble
helps boost the score, as demonstrated above.</p>
      <p>Further work can be done both on improving the quality of the predictions
and on studying how the models arrive at them. For instance, it would be interesting
to examine predictions from different layers of the neural network to study where
attention falls and which token combinations make the model decide whether
a tweet is humorous or not.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moncecchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A Crowd-Annotated Spanish Corpus for Humor Analysis</article-title>
          .
          <source>Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media</source>
          ,
          <fpage>7</fpage>
          -
          <lpage>11</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etcheverry</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prada</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          : Overview of HAHA at IberLEF 2019:
          <article-title>Humor Analysis based on Human Annotation</article-title>
          .
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ).
          <source>CEUR Workshop Proceedings</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . arXiv preprint arXiv:1801.06146 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Baselines and bigrams: Simple, good sentiment and topic classification</article-title>
          .
          <source>Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2</source>
          ,
          <fpage>90</fpage>
          -
          <lpage>94</lpage>
          . Association for Computational Linguistics. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>