<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bobak Farzin</string-name>
          <email>bfarzin@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piotr Czapla</string-name>
          <email>Piotr.Czapla@n-waves.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeremy Howard</string-name>
          <email>j@fast.ai</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>USF Data Institute, WAMRI Visiting Scholar</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of San Francisco &amp; Fast.ai</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>n-waves</institution>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>172</fpage>
      <lpage>179</lpage>
      <abstract>
        <p>Our entry into the HAHA 2019 Challenge placed 3rd in the classification task and 2nd in the regression task. We describe our system and innovations, as well as comparing our results to a Naive Bayes baseline. A large Twitter-based corpus allowed us to train a language model from scratch focused on Spanish and transfer that knowledge to our competition model. To overcome the inherent errors in some labels, we reduce our class confidence with label smoothing in the loss function. All the code for our project is included in a GitHub repository for easy reference and to enable replication by others.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Humor Classification</kwd>
        <kwd>Transfer Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>- ¡Socorro, me ha picado una víbora!
- ¿Cobra?
- No, gratis.
- Help, a viper has bitten me!
- A cobra? ("Cobra" also means "does it charge?")
- No, it's free.</p>
      <p>
        Humor does not translate well because it often relies on double meaning or
a subtle play on word choice, pronunciation, or context. These issues are further
exacerbated where space is at a premium (as is frequent on social media
platforms), often leading to the use and development of shorthand, in-jokes, and
self-reference. Thus, building a system to classify the humor of tweets is a difficult
task. However, with transfer learning and the Fast.ai library, we can build a
high-quality classifier in a foreign language. Our system outperforms a Naive Bayes
Support Vector Machine (NBSVM) baseline, which is frequently considered a
"strong baseline" for many Natural Language Processing (NLP) related tasks
(see Wang et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]).
      </p>
      <p>Rather than hand-crafted language features, we have taken an "end-to-end"
approach, building from the raw text to a final model that achieves the tasks
as presented. Our paper lays out the details of the system, and our code can be
found in a GitHub repository for use by other researchers to extend the state of
the art in sentiment analysis.</p>
      <p>Contribution Our contributions are threefold. First, we apply transfer learning
of a language model based on a larger corpus of tweets. Second, we use a
label-smoothed loss, which provides regularization and allows full training of the final
model without gradual unfreezing. Third, we select the best model for each task
based on cross-validation and 20 random-seed initializations in the final network
training step.</p>
    </sec>
    <sec id="sec-2">
      <title>Task and Dataset Description</title>
      <p>
        The Humor Analysis based on Human Annotation (HAHA) 2019 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] competition
asked for analysis of two tasks in the Spanish language based on a corpus of
publicly collected data described in Castro et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
- Task 1, Humor Detection: determine whether a tweet is humorous. System
ranking is based on F1 score, which balances precision and recall.
- Task 2, Funniness Score: if humorous, what is the average humor rating
of the tweet? System ranking is based on root mean squared error (RMSE).
The HAHA dataset includes labeled data for 24,000 tweets and a test set of
6,000 tweets (an 80%/20% train/test split). Each record includes the raw tweet
text (including accents and emoticons), a binary humor label, the number of
votes for each of five star ratings, and a "Funniness Score" that is the average
of the 1-to-5-star votes cast. Examples and data can be found on the CodaLab
competition webpage.
4 https://github.com/bfarzin/haha_2019_final, Accessed on 19 June 2019
5 https://www.fluentin3months.com/spanish-jokes/, Accessed on 19 June 2019
6 https://docs.fast.ai/, Accessed on 19 June 2019
7 http://competitions.codalab.org/competitions/22194/, Accessed on 19 June 2019
      </p>
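      <p>The two ranking metrics are standard and can be computed directly. A minimal sketch in Python (function names are ours, not part of the competition tooling):

```python
import math

def f1_score(y_true, y_pred):
    # F1 is the harmonic mean of precision and recall for the
    # positive (humorous) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rmse(y_true, y_pred):
    # Root mean squared error for the funniness-score regression task.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```
</p>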
    </sec>
    <sec id="sec-3">
      <title>System Description</title>
      <p>
        We modify the method of Universal Language Model Fine-tuning for Text
Classification (ULMFiT) presented in Howard and Ruder [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The primary steps are:
1. Train a language model (LM) on a large corpus of data.
2. Fine-tune the LM on the target task's language data.
3. Replace the final layer of the LM with a softmax or linear output layer and
then fine-tune on the particular task at hand (classification or regression).
Below we give more detail on each step and the parameters used to generate
our system.
      </p>
      <sec id="sec-3-1">
        <title>Data, Cleanup &amp; Tokenization</title>
      </sec>
      <sec id="sec-3-2">
        <title>Additional Data</title>
        <p>We collected a corpus for our LM from Spanish Twitter using tweepy,
run for three 4-hour sessions, collecting any tweet containing any of the terms
'el', 'su', 'lo', 'y' or 'en'. We excluded retweets to minimize repeated examples in
our language model training. In total, we collected 475,143 tweets, a data set
nearly 16 times larger than the text provided by the competition alone. The
frequency of terms, punctuation, and vocabulary used on Twitter can be quite
different from the standard Wikipedia corpus that is often used to train an LM
from scratch.</p>
        <p>In the fine-tuning step, we combined the train and test text data, without
labels, from the contest data.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Cleaning</title>
        <p>We applied a list of default cleanup functions in sequence (see list below). They
are close to the standard clean-up included in the Fast.ai library, with the
addition of one function for the Twitter dataset. Cleanup of data is key to expressing
information in a compact way so that the LM can use the relevant data when
trying to predict the next word in a sequence.
1. Replace more than 3 repetitions of the same character (e.g., grrrreat becomes
g xxrep 4 r eat).
2. Replace repetition at the word level (similar to above).
3. Deal with ALL CAPS words by replacing them with a token and converting to
lower case.
4. Add spaces between special characters (e.g., !!! to ! ! !).
5. Remove useless spaces (remove more than 2 spaces in sequence).
6. Addition: Move all text onto a single line by replacing new-lines inside a
tweet with a reserved token (i.e., \n to xxnl).
8 http://github.com/tweepy/tweepy, Accessed on 19 June 2019
The following example shows the application of this data cleaning to a single
tweet:
Saber, entender y estar convencides que la frase
#LaESILaDefendemosEntreTodes es nuestra linea es nuestro eje.
#AlertaESI!!!!
Vamos por mas!!! e invitamos a todas aquellas personas que quieran
se parte.
xxbos saber , entender y estar convencides que la frase
# laesiladefendemosentretodes es nuestra linea es nuestro eje.
xxnl # alertaesi xxrep 4 ! xxnl vamos por mas ! ! ! e invitamos a
todas aquellas personas que quieran se parte.</p>
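      <p>Several of these rules can be sketched with a few regular expressions. The following is our own illustration of rules 1, 4, 5 and 6, not the Fast.ai implementation; the xxrep token order (count before character, as in xxrep 4 !) follows the cleaned example above:

```python
import re

def replace_char_repetition(text):
    # Rule 1: collapse more than 3 repeats of a character into an
    # "xxrep" token plus a count, e.g. "grrrreat" -> "g xxrep 4 r eat".
    def _sub(m):
        return f" xxrep {len(m.group(0))} {m.group(1)} "
    return re.sub(r"(\S)\1{3,}", _sub, text)

def space_special_chars(text):
    # Rule 4: put spaces around punctuation, "!!!" -> "! ! !".
    return re.sub(r"([!?.])", r" \1 ", text)

def one_line(text):
    # Rule 6: replace newlines inside a tweet with a reserved token.
    return text.replace("\n", " xxnl ")

def normalize_spaces(text):
    # Rule 5: collapse runs of whitespace to a single space.
    return re.sub(r"\s{2,}", " ", text).strip()
```

Applied in sequence, normalize_spaces runs last so the extra spaces introduced by the earlier rules are collapsed.</p>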
      </sec>
      <sec id="sec-3-4">
        <title>Tokenization</title>
        <p>
          We used sentencepiece [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to parse into sub-word units and reduce the possible
out-of-vocabulary terms in the data set. We selected a vocab size of 30,000 and
used the byte-pair encoding (BPE) model. To our knowledge, this is the first time
that BPE tokenization has been used with ULMFiT in a competition model.
        </p>
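        <p>sentencepiece handles the BPE training internally; the core merge step of byte-pair encoding can be illustrated in a few lines of plain Python (toy corpus and function names are ours):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, words):
    # Merge every occurrence of the chosen pair into a single symbol.
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in words.items()}

# Toy corpus: words pre-split into characters, with frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "l o w e s t": 3}
for _ in range(2):  # two merge steps: "l o" -> "lo", then "lo w" -> "low"
    corpus = merge_pair(most_frequent_pair(corpus), corpus)
```

Repeating this merge step until a target vocabulary size is reached yields the sub-word inventory; rare words then decompose into known pieces instead of becoming out-of-vocabulary tokens.</p>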
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Training and Results</title>
      <sec id="sec-4-1">
        <title>LM Training and Fine-tuning</title>
        <p>
          We train the LM using a 90/10 training/validation split, reporting the validation
loss and accuracy of next-word prediction on the validation set. For the LM, we
selected an ASGD Weight-Dropped Long Short Term Memory (AWD LSTM,
described in Merity et al.[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) model included in Fast.ai. We replaced the typical
Long Short Term Memory (LSTM) units with Quasi Recurrent Neural Network
(QRNN, described in Bradbury et al.[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]) units. Our network has 2,304
hidden states, 3 layers, and a softmax layer to predict the next word. We tied the
embedding weights[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] on the encoder and decoder for training. We performed some
simple tests with LSTM units and a Transformer language model, finding all
models were similar in performance during LM training. We thus chose QRNN
units due to their improved training speed compared to the alternatives. This
model has about 60 million trainable parameters.
        </p>
        <p>Parameters used for training and fine-tuning are shown in Table 1. For all
networks we applied a dropout multiplier, which scales the dropout used
throughout the network. We used the Adam optimizer with weight decay as indicated
in the table.</p>
        <p>
          Following the work of Smith[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] we found the largest learning rate that we
could apply and then ran a one-cycle policy for a single epoch. This largest
rate is shown in Table 1 under "Learning Rate." Subsequent training epochs
were run with one-cycle and the lower learning rates indicated in Table 1 under
"Continued Training."
Again, following the play-book from Howard and Ruder[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we change the
pretrained network's head to a softmax or linear output layer (as appropriate for the
transfer task) and then load the LM weights for the layers below. We train just
the new head from random initialization, then unfreeze the entire network and
train with differential learning rates. We lay out our training parameters in Table
2.
        </p>
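        <p>The one-cycle policy can be sketched as a simple schedule function. This is our own illustration of the idea from Smith [11]; the parameter defaults (pct_start, div, final_div) are chosen for illustration, not taken from our training runs:

```python
import math

def cos_interp(start, end, t):
    # Cosine interpolation from start to end as t goes 0 -> 1.
    return start + (end - start) * (1 - math.cos(math.pi * t)) / 2

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.3, div=25.0, final_div=1e4):
    # Warm up from lr_max/div to lr_max over the first pct_start of training,
    # then anneal down to lr_max/final_div for the remainder.
    warmup = int(total_steps * pct_start)
    if step >= warmup:  # annealing phase
        t = (step - warmup) / max(1, total_steps - warmup)
        return cos_interp(lr_max, lr_max / final_div, t)
    return cos_interp(lr_max / div, lr_max, step / max(1, warmup))
```

The schedule peaks at the largest usable learning rate found by the range test, which is why finding that rate first matters.</p>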
        <p>
          With the same learning rate and weight decay, we apply 5-fold
cross-validation on the outputs and take the mean across the folds as our ensemble. We
sample 20 random seeds (see more in Section 4.3) to find the best initialization
for our gradient descent search. From these samples, we select the best validation
F1 metric or Mean Squared Error (MSE) for use in our test submission.
Classifier setup For the classifier, we have a hidden layer and softmax head.
We over-sample the minority class to balance the outcomes for better training,
using the Synthetic Minority Oversampling Technique (SMOTE, described in Chawla
et al.[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]). Our loss is label smoothing as described in Pereyra et al.[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] applied to the
flattened cross-entropy loss. In ULMFiT, gradual unfreezing allows us to avoid
catastrophic forgetting, focuses each stage of training, and prevents over-fitting of
the parameters to the training cases. We take an alternative approach to
regularization: in our experiments we found that we got similar results with label
smoothing, but without the separate steps and learning-rate refinement required
by gradual unfreezing.
        </p>
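        <p>Label smoothing replaces the one-hot target with a softened distribution. A minimal sketch (uniform smoothing in the style of Pereyra et al. [9]; function names are ours):

```python
import math

def label_smoothed_ce(probs, target, eps=0.1):
    # Cross-entropy against a smoothed target: the true class gets weight
    # (1 - eps) and the remaining eps is spread uniformly over all classes,
    # so the model is never pushed toward full confidence on noisy labels.
    n = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        q = (1 - eps) + eps / n if i == target else eps / n
        loss -= q * math.log(p)
    return loss
```

With eps = 0 this reduces to ordinary cross-entropy; with eps greater than 0, an over-confident prediction on the true class is penalized relative to a moderately confident one.</p>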
        <p>Regression setup For the regression task, we fill all #N/A labels with scores of
0. We add a hidden layer, a linear output head, and an MSE loss function.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Random Seed as a Hyperparameter</title>
        <p>For classification and regression, the random seed sets the initial random weights
of the head layer. This initialization affects the final F1 metric achievable.</p>
        <p>
          Across each of the 20 random seeds, we average the 5 folds and obtain a
single F1 metric on the validation set. The histogram of 20-seed outcomes is
shown in Figure 1 and covers a range from 0.820 to 0.825 over the validation
set. We selected our single best random seed for the test submission. With more
exploration, a better seed could likely be found. Though we only use a single
seed for the LM training, one could do a similar search with random seeds for
LM pre-training, and further select the best down-stream seed, similar to Czapla
et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
Table 3 gives three results from our submissions in the competition. The first is
the baseline NBSVM solution, with an F1 of 0.7548. Second is our first random
seed selected for the classifier, which produces a 0.8083 result. While this is already
better than the NBSVM solution, we pick the best validation F1 from the 20 seeds
we tried. This produced our final submission of 0.8099. Our best model achieved a
five-fold average F1 of 0.8254 on the validation set, shown in Figure 1, but a test-set
F1 of 0.8099, a drop of 0.0155 in F1 for the true out-of-sample data. Also note
that our third-place entry was 1.1% worse in F1 score than first place, but 1.2%
better in F1 than the 4th-place entry.
This paper describes our implementation of a neural net model for classification
and regression in the HAHA 2019 challenge. Our solution placed 3rd in Task
1 and 2nd in Task 2 in the final competition standings. We describe the data
collection, pre-training, and final model-building steps for this contest. Twitter
has slang and abbreviations that are unique to the short format, as well as
generous use of emoticons. To capture these features, we collected our own dataset
based on Spanish tweets that is 16 times larger than the competition data set
and allowed us to pre-train a language model. Humor is subtle, and using a
label-smoothed loss prevented us from becoming overconfident in our predictions and
let us train more quickly, without the gradual unfreezing required by ULMFiT. We
have open-sourced all code used in this contest to further enable research on this
task in the future.
        </p>
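        <p>The seed search described above amounts to a small loop. A sketch with a stand-in training function (names and scores are illustrative only, not our actual runs):

```python
import random

def seed_search(train_fn, seeds):
    # Treat the random seed as a hyper-parameter: train once per seed and
    # keep the seed whose run scores best on the validation metric.
    best_seed, best_score = None, float("-inf")
    for seed in seeds:
        random.seed(seed)          # fixes the head-layer initialization
        score = train_fn(seed)     # returns validation F1 for this run
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score

# Stand-in for a real training run: the score varies with the seed.
def fake_train(seed):
    return 0.820 + (seed % 7) / 1000.0

best_seed, best_f1 = seed_search(fake_train, range(20))
```

In our setting, train_fn would run the full 5-fold cross-validation and return the mean validation F1 for that seed.</p>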
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Author Contributions</title>
      <p>BF was the primary researcher. PC contributed suggestions for the random
seed as a hyper-parameter and label smoothing to speed up training. JH
contributed the suggestion of higher dropout throughout the network for more
generalization.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The authors would like to thank all the participants on the fast.ai forums for
their ideas and suggestions. Also, Kyle Kastner for his edits, suggestions, and
recommendations in writing up these results.
9 http://forums.fast.ai</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bradbury</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merity</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
          </string-name>
          , R.:
          <article-title>Quasi-recurrent neural networks</article-title>
          .
          <source>CoRR abs/1611</source>
          .01576 (
          <year>2016</year>
          ), http://arxiv.org/abs/1611.01576
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moncecchi</surname>
          </string-name>
          , G.:
          <article-title>A crowd-annotated spanish corpus for humor analysis</article-title>
          .
          <source>In: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media</source>
          . pp.
          <volume>7</volume>
          –
          <issue>11</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowyer</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>L.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kegelmeyer</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          : SMOTE:
          <article-title>Synthetic minority over-sampling technique</article-title>
          .
          <source>J. Artif. Int. Res</source>
          .
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <volume>321</volume>
          –357 (Jun
          <year>2002</year>
          ), http://dl.acm.org/citation.cfm?id=
          <volume>1622407</volume>
          .
          <fpage>1622416</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etcheverry</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prada</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          : Overview of HAHA at IberLEF 2019:
          <article-title>Humor Analysis based on Human Annotation</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ).
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS, Bilbao, Spain (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Czapla</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kardas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning with subword tokenization for Polish</article-title>
          . CoRR abs/
          <year>1810</year>
          .10222 (
          <year>2018</year>
          ), http://arxiv.org/abs/
          <year>1810</year>
          .10222
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . CoRR abs/
          <year>1801</year>
          .06146 (
          <year>2018</year>
          ), http://arxiv.org/abs/
          <year>1801</year>
          .06146
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kudo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richardson</surname>
          </string-name>
          , J.:
          <article-title>Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing</article-title>
          . CoRR abs/
          <year>1808</year>
          .06226 (
          <year>2018</year>
          ), http://arxiv.org/abs/
          <year>1808</year>
          .06226
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Merity</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keskar</surname>
            ,
            <given-names>N.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
          </string-name>
          , R.:
          <article-title>Regularizing and optimizing LSTM language models</article-title>
          .
          <source>CoRR abs/1708</source>
          .02182 (
          <year>2017</year>
          ), http://arxiv.org/abs/1708.02182
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pereyra</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tucker</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chorowski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Regularizing neural networks by penalizing confident output distributions</article-title>
          .
          <source>CoRR abs/1701</source>
          .06548 (
          <year>2017</year>
          ), http://arxiv.org/abs/1701.06548
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Press</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Using the output embedding to improve language models</article-title>
          .
          <source>CoRR abs/1608</source>
          .05859 (
          <year>2016</year>
          ), http://arxiv.org/abs/1608.05859
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>L.N.:</given-names>
          </string-name>
          <article-title>A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay</article-title>
          . CoRR abs/
          <year>1803</year>
          .09820 (
          <year>2018</year>
          ), http://arxiv.org/abs/
          <year>1803</year>
          .09820
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Baselines and bigrams: Simple, good sentiment and topic classification</article-title>
          .
          <source>In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2</source>
          . pp.
          <volume>90</volume>
          –
          <fpage>94</fpage>
          . ACL '
          <volume>12</volume>
          ,
          Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2012</year>
          ), http://dl.acm.org/citation.cfm?id=
          <volume>2390665</volume>
          .
          <fpage>2390688</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>