<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IberLEF</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A BERT-based Approach for Automatic Humor Detection and Scoring</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jihang Mao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wanli Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>TAJ Technologies, Inc.</institution>
          ,
          <addr-line>7910 Woodmont Ave 1214</addr-line>
          ,
          <addr-line>Bethesda, MD 20814</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <addr-line>University Blvd E</addr-line>
          ,
          <addr-line>Silver Spring, MD 20901</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>24</volume>
      <fpage>197</fpage>
      <lpage>202</lpage>
      <abstract>
        <p>In this paper we report our participation in the 2019 HAHA task, in which a corpus of crowd-annotated tweets is provided and systems must tell whether a tweet is a joke or not and predict a funniness score for it. Our approach utilizes BERT, a multi-layer bidirectional transformer encoder that learns deep bidirectional representations; the pre-trained model is fine-tuned on the HAHA training data. The representation of a tweet is fed into an output layer for classification. To predict the funniness score, we apply another output layer that generates scores from float labels and is trained with the mean squared error between the predicted scores and the labels. Our best F-score on the test set for Task 1 is 0.784 and our RMSE for Task 2 is 0.910. We find that our approach is competitive and applicable to multilingual text classification tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Classification</kwd>
        <kwd>Score Prediction</kwd>
        <kwd>Multilingual Model</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Semeval-2015 Task 11 proposed to work on figurative language, such as
metaphors and irony, but focused on sentiment analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Semeval-2017 Task 6
presented a task similar to this one as well. The majority of research on social media texts
focuses on English. However, Schroeder’s work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] shows that a high percentage of
these texts are in non-English languages. HAHA - Humor Analysis based on Human
Annotation - is a task to classify tweets in Spanish as humorous or not, and to determine
how funny they are [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The 2019 edition of HAHA is the second round of the shared
task. The aim of the task is to gain better insight into what is humorous and what causes
laughter.
      </p>
      <p>
        Based on tweets written in Spanish, the HAHA task comprises two subtasks: Humor
Detection (Task 1) - telling whether a tweet is a joke or not (humor intended by the author
or not) - and Humor Scoring (Task 2) - predicting a funniness score value (average stars)
for a tweet in a 5-star ranking. We participated in both subtasks this year. A corpus of
crowd-annotated tweets based on [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is provided, divided into 80% for training and
20% for testing. Submissions must include a row for each of the
6000 tweets in the test corpus. Every tweet is classified as humorous or not humorous
by the “is_humor” column. For Task 2, every row carries a “funniness_average” with the
predicted score.
      </p>
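      <p>As a sketch of how such a submission file can be produced (the “is_humor” and “funniness_average” column names are given above; the tweet id column, the exact file layout, and the helper name are our own assumptions), using Python’s csv module:</p>

```python
import csv
import io

def write_submission(rows, out):
    """Write one row per test tweet: a hypothetical tweet id, the binary
    "is_humor" label, and the predicted "funniness_average" score."""
    writer = csv.writer(out)
    writer.writerow(["id", "is_humor", "funniness_average"])
    for tweet_id, is_humor, score in rows:
        writer.writerow([tweet_id, int(is_humor), round(score, 3)])

# Two dummy predictions, one per test tweet.
buf = io.StringIO()
write_submission([(1, True, 2.451), (2, False, 0.873)], buf)
print(buf.getvalue())
```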
      <p>A brief description of our method for the HAHA task is presented in Section 2. In Section
3 we show the results of our method on the official HAHA test datasets. In Section 4
we present a discussion of the results and conclusions of our participation in this
challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        For the HAHA task, our approach builds on BERT, which has obtained state-of-the-art
performance on most NLP tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. More specifically, given a tweet, our method
first obtains its token representation from the pre-trained BERT model using a
case-preserving WordPiece model, including the maximal document context provided by the
data. Next, we formulate this as a single-sentence classification task by feeding the
representation into an output layer, a binary classifier over the class labels. Finally, we
apply another output layer to generate scores from float labels and train it with the
mean squared error between the predicted scores and the labels.
      </p>
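      <p>The two output layers can be sketched as follows - a minimal NumPy illustration, not the actual implementation: the 4-dimensional pooled vector and the random weights are toy stand-ins for the 768-dimensional BERT [CLS] representation and the trained parameters.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 4                       # toy size; BERT-Base actually uses 768

# Pooled representation of one tweet (a stand-in for the BERT [CLS] vector).
cls_vec = rng.normal(size=hidden)

# Task 1 head: binary classifier over the class labels.
W_cls, b_cls = rng.normal(size=(2, hidden)), np.zeros(2)
logits = W_cls @ cls_vec + b_cls
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over two classes
is_humor = int(np.argmax(probs))

# Task 2 head: a scorer trained with mean squared error against float labels.
w_reg, b_reg = rng.normal(size=hidden), 0.0
score = float(w_reg @ cls_vec + b_reg)
label = 2.5                                      # a float funniness label
mse = (score - label) ** 2                       # loss for this one example

print(is_humor, round(mse, 3))
```

<p>In the full system both heads sit on top of the fine-tuned BERT encoder; only each head’s forward computation and the regression loss are shown here.</p>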
      <p>
        BERT utilizes a multi-layer bidirectional transformer encoder which can learn deep
bidirectional representations and can later be fine-tuned for a variety of tasks such as
NER. Before BERT, deep learning models such as bi-directional Long Short-Term
Memory (Bi-LSTM) and convolutional neural networks (CNN) had greatly improved
text classification performance over the preceding few years [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. More recently,
ULMFiT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] demonstrated the effectiveness of transfer learning for NLP tasks and achieved
strong results.
      </p>
      <p>
        The pre-trained BERT models are trained on a large corpus (Wikipedia + BookCorpus),
and several pre-trained models have been released. For HAHA, we chose the BERT-Base
multilingual cased model for the following reasons. First, a multilingual model is better
suited to the Spanish documents in HAHA, because the English-only model splits tokens
not present in its vocabulary into sub-tokens, which affects the accuracy of the classification
task. Second, although BERT-Large generally outperforms BERT-Base on English
NLP tasks, BERT-Large versions of the multilingual models have not been released. Third,
the multilingual cased model fixes normalization issues in many languages, so it is
recommended for languages with non-Latin alphabets (and is often better for most
languages with Latin alphabets as well). In Task 1, we use the final hidden state corresponding to
a special token ([CLS]) as the aggregate sequence representation, then feed it into an
output layer for classification (Figure 1).
      </p>
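      <p>To illustrate how WordPiece splits out-of-vocabulary tokens into sub-tokens, here is a simplified greedy longest-match sketch with a tiny hypothetical vocabulary (not BERT’s real tokenizer or vocabulary):</p>

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first sub-token split, WordPiece style.
    Continuation pieces carry the '##' prefix; unknown words map to [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

# A toy vocabulary with no entry for the Spanish word "chiste" (joke):
# the word is shattered into sub-tokens instead of kept whole.
vocab = {"chi", "##ste", "joke"}
print(wordpiece("chiste", vocab))   # ['chi', '##ste']
print(wordpiece("joke", vocab))     # ['joke']
```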
    </sec>
    <sec id="sec-3">
      <title>Output Layer</title>
      <p>[Figure 1: BERT for single-sentence classification - the input text tokens ([CLS], Tok1, Tok2, ..., TokN) are embedded (E[CLS], E1, E2, ..., EN), encoded by BERT into final hidden states (C, T1, T2, ..., TN), and the [CLS] state C is fed into the output layer to produce the class label.]</p>
      <p>For Task 2, we make several changes to the above output layer. First, we change the
label type to a float instead of an int. Second, we change the training measures to
mean squared error and the Pearson and Spearman correlations instead of accuracy. Finally,
we change the output to a scorer instead of a classifier. Since high recall (0.825) and low
precision (0.724) were observed when submitting results, we utilize the scores from Task
2 to optimize the F1 score in Task 1, i.e. a tweet classified as “Is humorous”
in Task 1 is changed to “Not humor” if it received a low score in Task 2.</p>
    </sec>
    <sec id="sec-4">
      <title>Results &amp; Discussion</title>
      <p>The HAHA corpus has been divided into two subsets: the training set and the test set. The
training set contains 24000 tweets, and the test set 6000 tweets.</p>
      <p>
        In Task 1, the results are measured using accuracy and F-measure for the humorous
category; F-measure is the main measure for this task. In Task 2, the results are
measured using root mean squared error. Here we present the results on the test set. In our
best submission, the model was fine-tuned using the hyperparameter values suggested
in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]: learning rate (Adam) = 2e-5, number of epochs = 3, max sequence length = 256,
and batch size = 16. When fine-tuning the model for the HAHA task, we randomly sample
the training set into two subsets: 18000 tweets for training and 6000 tweets for
development. We use a new set of random seeds each time to prevent over-fitting. In
addition, tweets classified as “Is humorous” with a prediction score &lt; 0.2 are reclassified as
“Not humor”, while tweets classified as “Not humor” with a prediction score &gt; 1.7 are
reclassified as “Is humorous” in the final submission. Table 1 shows the improvement
obtained by using the scores of Task 2 to optimize Task 1.
      </p>
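      <p>This score-based reclassification can be sketched as follows (the 0.2 and 1.7 thresholds are those given above; the function name and signature are our own):</p>

```python
def reclassify(is_humor, score, low=0.2, high=1.7):
    """Use the Task 2 funniness score to override the Task 1 label:
    a confidently low score flips "Is humorous" to "Not humor", and a
    confidently high score flips "Not humor" to "Is humorous"."""
    if is_humor and score < low:
        return False
    if not is_humor and score > high:
        return True
    return is_humor

assert reclassify(True, 0.1) is False    # low score: drop the humor label
assert reclassify(False, 2.0) is True    # high score: add the humor label
assert reclassify(True, 1.0) is True     # mid-range scores are left alone
print("all reclassification checks pass")
```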
      <p>As shown in Table 2 and Table 3, our best submission significantly outperformed the
baseline “hahaPLN” in both F-measure for Task 1 and root mean squared error for Task
2, while the F1 score of our submission is close to the highest score for Task 1 (-0.037).
We placed in the first third of all participants in both Task 1 and Task 2, which
demonstrates the good performance of our system in detecting humorous tweets in Spanish
and predicting their funniness scores.</p>
      <p>We have described our BERT-based approach that participated in the HAHA - Humor
Analysis based on Human Annotation - task at IberLEF 2019. Compared to previous
methods, our approach has several significant differences, from system architecture to
processing flow. It is a general and robust framework and showed competitive
performance in the HAHA evaluations. As more training corpora become available,
we plan to explore its application to other text classification tasks in future work.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The authors would like to thank Dr. Yutao Zhang for providing Jihang Mao the intern
opportunity at George Mason University and valuable suggestions and comments on
the manuscript. The authors would also like to thank the HAHA task organizers for
providing the data of the task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Attardo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Linguistic theories of humor</article-title>
          , volume
          <volume>1</volume>
          . Walter de Gruyter (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Raz</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Automatic humor classification on twitter</article-title>
          .
          <source>In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop</source>
          , pages
          <fpage>66</fpage>
          -
          <lpage>70</lpage>
          . Association for Computational Linguistics. (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Making Computers Laugh: Investigations in Automatic Humor Recognition</article-title>
          .
          <source>In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. HLT '05</source>
          , (pp.
          <fpage>531</fpage>
          -
          <lpage>538</lpage>
          ).
          <article-title>Association for Computational Linguistics</article-title>
          , Vancouver, British Columbia, Canada. (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Sjöbergh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Araki</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Recognizing Humor Without Recognizing Meaning</article-title>
          .
          <source>In WILF</source>
          , (pp.
          <fpage>469</fpage>
          -
          <lpage>476</lpage>
          ). Springer, (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cubero</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Moncecchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Is This a Joke? Detecting Humor in Spanish Tweets</article-title>
          .
          <source>In Ibero-American Conference on Artificial Intelligence</source>
          (pp.
          <fpage>139</fpage>
          -
          <lpage>150</lpage>
          ). Springer International Publishing. (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Reyes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Veale</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>A multidimensional approach for detecting irony in twitter</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>1</volume>
          -
          <fpage>30</fpage>
          . (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Opinion Mining and Sentiment Analysis</article-title>
          .
          <source>Found. Trends Information Retrieval</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          -2):
          <fpage>1</fpage>
          -
          <lpage>135</lpage>
          . (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veale</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shutova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnden</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Reyes</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Semeval-2015 task 11:
          <article-title>Sentiment analysis of figurative language in twitter</article-title>
          .
          <source>In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval</source>
          <year>2015</year>
          ) (pp.
          <fpage>470</fpage>
          -
          <lpage>478</lpage>
          ). (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Schroeder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Half of messages on Twitter aren't in English [STATS]</article-title>
          . (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etcheverry</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prada</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Rosá</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Overview of HAHA at IberLEF 2019:
          <article-title>Humor Analysis based on Human Annotation</article-title>
          .
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ), CEUR Workshop Proceedings, Bilbao, Spain, (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosá</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Moncecchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>A crowd-annotated Spanish corpus for humor analysis</article-title>
          .
          <source>In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media</source>
          (pp.
          <fpage>7</fpage>
          -
          <lpage>11</lpage>
          ). (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          <year>2019</year>
          , pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Minneapolis, Minnesota, USA, (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Learning structured representation for text classification via reinforcement learning</article-title>
          .
          <source>In Thirty-Second AAAI Conference on Artificial Intelligence</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Universal language model fine-tuning for text classification</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
          . Association for Computational Linguistics. (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>