<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BERT-based Ensembles for Modeling Disclosure and Support in Conversational Social Media Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kartikey Pant</string-name>
          <email>kartikey.pant@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanvi Dadu</string-name>
          <email>tanvid.co.16@nsit.net.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radhika Mamidi</string-name>
          <email>radhika.mamidi@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Netaji Subhas Institute of Technology</institution>
          ,
          <addr-line>Delhi</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>There is a growing interest in understanding how humans initiate and hold conversations. The affective understanding of conversations focuses on the problem of how speakers use emotions to react to a situation and to each other. In the CL-Aff Shared Task, the organizers released the Get it #OffMyChest dataset, which contains Reddit comments from casual and confessional conversations, labeled for their disclosure and supportiveness characteristics. In this paper, we introduce a predictive ensemble model exploiting the finetuned contextualized word embeddings, RoBERTa and ALBERT. We show that our model outperforms the base models in all considered metrics, achieving an improvement of 3% in the F1 score. We further conduct statistical analysis and outline deeper insights into the given dataset while providing a new characterization of impact for the dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>emotion recognition</kwd>
        <kwd>sentiment analysis</kwd>
        <kwd>natural language processing</kwd>
        <kwd>social media analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The word `Affective' refers to emotions, mood, sentiment, personality,
subjective evaluations, opinions, and attitude. Affect analysis refers to the techniques
used to identify and measure the `experience of emotion' in multimodal content
containing text, audio, images, and videos.[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] Affect has become an essential
part of the human experience, which directly influences people's reactions towards
a particular situation. Therefore, it has become crucial to analyze how speakers
use emotions and sentiment to react to different situations and to each other.
      </p>
      <p>This paper addresses the challenge put forward in the CL-Aff Shared Task
at the AAAI-2020 Workshop on Affective Content Analysis to Model Affect in
Response (AffCon 2020). The theme of this task is to study affect in response
to interactive content which grows over time. The task offers two datasets (a
small labeled dataset and a large unlabeled dataset) sampled from casual and
confessional conversations on Reddit in the subreddits /r/CasualConversations
and /r/OffMyChest. This shared task comprises two subtasks. The first
subtask is a semi-supervised text classification task predicting Disclosure and
Supportiveness labels based on the given two datasets. The second subtask
is an open-ended task, which requires authors to propose new characterizations
and insights to capture conversation dynamics. (The first two authors
contributed equally to the work.)</p>
      <p>
        Recent works in the task of text classification have used pre-trained
contextualized word representations rather than context-independent word
representations. Some of these representations include BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], RoBERTa[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and
ALBERT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. These models perform contextualized word representation and are
pre-trained using bidirectional transformers[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These BERT-based pre-trained
models have outperformed many existing techniques on most NLP tasks with
minimal task-specific architectural changes.
      </p>
      <p>Ensemble models exploiting features learned from multiple pre-trained
models are hypothesized to perform competitively. In this work, we propose an
ensemble-based model exploiting pre-trained BERT-based word representations.
We document the experimental results of our proposed model for the CL-Aff
Shared Task in comparison to the baseline models. We further perform
attribute-based statistical analysis using attributes like word count, day of the week, and
comments per parent post. We conclude the paper by proposing impact as a new
characterization to model conversation dynamics.</p>
    </sec>
    <sec id="sec-2">
      <title>Our Model</title>
      <p>In this section, we introduce our predictive model, which uses transfer learning
in the form of pretrained BERT-based models. We propose an ensemble of two
pre-trained models: RoBERTa and ALBERT. We first outline the
pre-trained models incorporated and then discuss the ensemble technique used.</p>
      <sec id="sec-2-1">
        <title>Preliminaries</title>
        <p>
          Transfer learning is the process of extracting knowledge from a source
problem or domain and applying it to a different target problem or domain. Recent
works on text classification use transfer learning in the form of pre-trained
embeddings.[
          <xref ref-type="bibr" rid="ref13 ref14 ref15">13,14,15</xref>
          ] These pre-trained embeddings have outperformed many
of the existing techniques with minimal architectural changes. The use of
pretrained embeddings reduces the need for annotated data and allows one to
perform the downstream task with minimal resources for finetuning the model.
        </p>
        <p>
          Devlin et al.[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] introduced BERT, a contextualized word representation,
pretrained using a bi-directional Transformer-based encoder. These embeddings use
a linear combination of the masked language modeling and next sentence
prediction objectives. It is pre-trained on 3.3B words from various sources, including
BooksCorpus [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and the English Wikipedia.
        </p>
        <p>
          Liu et al. introduced RoBERTa, a replication study of BERT with
carefully tuned hyperparameters and more extensive training data[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. It is trained
with a batch size eight times larger for half as many optimization steps, thus
taking significantly less time to train in comparison. It is trained on more
than twelve times the data used to train BERT-large, using data from the
OpenWebText [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], CC-News [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], and STORIES [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] datasets. These optimizations lead
the RoBERTa-large pre-trained model to perform better than the BERT-large
model in all benchmarking tests, including SQuAD [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and GLUE [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Lan et al. introduced ALBERT, a BERT-based model with two
parameter-reduction techniques: factorized embedding parameterization and cross-layer
parameter sharing.[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] These techniques help in lowering memory consumption
and increasing training speed. Moreover, this model uses a self-supervised loss
that focuses on modeling inter-sentence coherence and improves on downstream
tasks with multi-sentence input. ALBERT-xxlarge-v2 achieves significant
improvements over BERT-large on multiple tasks.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Our Approach</title>
        <p>
          Ensemble methodology entails constructing a predictive model by integrating
multiple models in order to improve prediction performance. Ensembles are
meta-algorithms that combine several machine learning and deep learning classifiers
into one predictive model to decrease variance and bias and to improve predictions.
Recent works show that ensemble-based classifiers utilizing contextual
embeddings outperform single-model classifiers.[
          <xref ref-type="bibr" rid="ref14 ref15">14,15</xref>
          ] Hence, we use ensembling
techniques to combine predictions from multiple models to make a
prediction for the given task.
        </p>
      <p>Figure 1 depicts our proposed ensemble model. In this model, a sentence is
processed in parallel by RoBERTa and ALBERT, each finetuned for predicting a
given label. The results from these base models are then combined using a
weighted-average based ensembling technique to predict the final label set, which includes
predictions for the six labels.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>In this section, we outline the experimental setup, the baselines for the task, and
a comparative analysis of our proposed ensemble model with the two base
models finetuned for the task, RoBERTa-large and ALBERT-xxlarge-v2. We further
compare our ensemble model with four other ensemble models and show that
our model performs the best among all the models in four out of five evaluation
metrics using 10-fold cross-validation.</p>
      <p>For our baselines, we finetune RoBERTa-large and ALBERT-xxlarge-v2
models for three epochs with a maximum sequence length of 50 and a batch size of
16 for predicting each label separately. We finetune the model with a learning
rate of 2e-5, a weight decay of 0.01, and 20 steps for warm-up. We evaluate
all the models on the following metrics: Accuracy, F1, Precision-1, Recall-1, and
the mean of Accuracy and F1, denoted as Acc&amp;F1 from hereon.</p>
      <p>Model/Metrics     Accuracy Precision-1 Recall-1 F1    Acc&amp;F1
RoBERTa-large     84.86%   0.585       0.514    0.541 0.695
ALBERT-xxlarge-v2 84.90%   0.596       0.472    0.524 0.686
Our Model         85.55%   0.623       0.515    0.558 0.707
Table 1. Label-averaged values for each metric for RoBERTa, ALBERT, and our best
performing ensemble model.</p>
      <p>From Table 1, we can discern that our ensemble-based model achieves the
best results when compared with the base models, RoBERTa and ALBERT. We
observe a significant increase in Accuracy, Precision-1, and F1 and a slight increase
in Recall-1 and Acc&amp;F1 in our best-performing ensemble model as compared to
the base models.</p>
      <p>Label/Metrics            Accuracy Precision-1 Recall-1 F1    Acc&amp;F1
Informational Disclosure 74.12%   0.710       0.551    0.620 0.681
Emotional Disclosure     74.20%   0.636       0.510    0.566 0.654
Support                  84.38%   0.685       0.724    0.704 0.774
General Support          95.42%   0.483       0.241    0.322 0.638
Informational Support    91.30%   0.592       0.485    0.533 0.723
Emotional Support        93.86%   0.632       0.577    0.603 0.771
Table 2. Label-wise values for each metric for our best performing ensemble model.</p>
      <p>Table 2 further shows the performance of our ensemble-based model on
individual labels. Its performance on different labels is evaluated using the above
metrics.</p>
      <p>Labels/Model             Model 1 Model 2 Model 3 Model 4 Model 5
Informational Disclosure 0.0,1.0 0.5,0.5 0.0,1.0 0.0,1.0 0.1,0.9
Emotional Disclosure     0.0,1.0 0.5,0.5 0.5,0.5 0.5,0.5 0.5,0.5
Support                  1.0,0.0 0.5,0.5 1.0,0.0 1.0,0.0 1.0,0.0
General Support          0.0,1.0 0.5,0.5 0.5,0.5 0.6,0.4 0.6,0.4
Informational Support    1.0,0.0 0.5,0.5 1.0,0.0 1.0,0.0 1.0,0.0
Emotional Support        1.0,0.0 0.5,0.5 0.5,0.5 0.5,0.5 0.5,0.5
Table 3. Weights assigned to each model in different ensemble models. Each cell
contains a pair (x, y), where x denotes the weight assigned to RoBERTa and y denotes
the weight assigned to ALBERT.</p>
      <p>We further performed a comparative study on ensembling techniques by
choosing different weights for RoBERTa and ALBERT, as given in Table 3.
It shows different combinations of weights assigned for each label to RoBERTa
and ALBERT, respectively. This gives rise to five different models, which are
then compared using the above metrics.</p>
      <p>Table 4 depicts the results of the comparative study conducted on the five
different ensemble models. We discern that Model 5 performs the best on the
Accuracy, Precision-1, F1, and Acc&amp;F1 metrics, and Model 1 performs the best on the
Recall-1 metric among all the compared models. Since Model 5 outperforms all
other models in four out of five metrics, it is the best predictive model for the
task and is referred to as Our Model in the paper.</p>
      <p>For the shared task, our System Run 1 to System Run 5 are predictions
generated by Model 1 to Model 5, respectively. System Run 6 and System Run 7
are the predictions generated by the finetuned RoBERTa-large and
ALBERT-xxlarge-v2 models, respectively.</p>
      <p>Model/Metrics Accuracy Precision-1 Recall-1 F1    Acc&amp;F1
Model 1       85.18%   0.595       0.516    0.547 0.699
Model 2       85.42%   0.622       0.490    0.544 0.699
Model 3       85.47%   0.619       0.514    0.557 0.706
Model 4       85.48%   0.622       0.480    0.557 0.706
Model 5       85.54%   0.623       0.515    0.558 0.707
Table 4. Label-averaged values for each metric for different ensemble models.</p>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>In this section, we provide a comprehensive statistical analysis of the
Get it #OffMyChest dataset, which comprises comments and parent posts from the
subreddits /r/CasualConversations and /r/OffMyChest. We further propose new
characterizations and outline semantic features for the given dataset.</p>
      <sec id="sec-4-1">
        <title>Analysis</title>
        <p>Statistical analysis of the labels Emotional Disclosure, Informational
Disclosure, Support, General Support, Informational Support, and Emotional Support
shows significant variation in the numbers of positive and negative labels. The
percentage of positive labels is maximum for Informational Disclosure with 37.99%
and minimum for General Support with 5.37%. Therefore, the given dataset is
highly imbalanced, which makes the training of predictive models a strenuous
task.</p>
        <p>Further analysis of the labeled dataset shows that there are 3,511 parent
posts for 11,573 comments. We observe an average of 3.29 comments per parent
post, ranging from one comment per parent post to 52 comments per parent
post. In the given dataset, there are 6,999 unique users with an average of 1.653
comments per user and a significant variation in the number of comments per
user, ranging from 1 to 159 comments per user, with a standard deviation of
2.669. From this, we conclude that multiple comments within the same parent
post and by the same author may be related to each other.</p>
        <p>
          We also observe significant variations in the word count of the comments,
with an average comment being 14.7 words long, which translates to around one
sentence[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, the comment length varies significantly, from 3 words to 151
words per comment, with the distribution having a standard deviation of 9.670.
The dataset is thus well-rounded and represents a realistic discourse setting with
participants exchanging comments of varying lengths.
        </p>
        <p>We intuitively proceeded to predict the effect of the day of the week on
the characterized labels representing disclosure and support in a comment. It
was expected that the users would behave differently as the week progresses.
However, as illustrated in Table 5 (label frequencies for Emotional Disclosure,
Informational Disclosure, Support, General Support, Informational Support, and
Emotional Support, broken down by day of the week; values omitted here), we do
not see any significant variation in the existing characterizations with a change
in the day of the week. Thus, we conclude that, in this dataset, the day of the
week does not lead users to be either more supportive or to disclose more
information.</p>
        <p>The score assigned to a comment quantifies its Impact since, on Reddit, it is the
difference between the upvotes and downvotes that it obtains. We observed the
posts to have a moderately positive Impact of 10.938 on average. We also see
that the breadth of the spectrum in the Impact is captured well by the dataset,
with a standard deviation of 57.198, and a range of 49 to 2,637. This paves
the way for a need to characterize and predict the Impact of a post.</p>
        <p>
          Upon performing a correlation study between Impact and the previously
characterized labels using Pearson's correlation coefficient [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], we observe a very
small positive correlation between the variables. As illustrated in Table 6, the
maximum of 0.046, between Impact and Emotional Disclosure, indicates that
Impact is characteristically distinct from the previously predicted labels.
        </p>
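        <p>Pearson's correlation coefficient used in this study can be computed directly; the sketch below is self-contained, and the toy vectors are illustrative rather than taken from the dataset.
```python
# Minimal Pearson correlation, as used for the Impact-vs-label study.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

impact = [3, -1, 12, 0]              # score = upvotes - downvotes (toy values)
emotional_disclosure = [1, 0, 1, 0]  # binary label (toy values)
r = pearson(impact, emotional_disclosure)
```
</p>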
        <p>
          We further analyze the influence of Impact, characterized by the score, on the
semantic structure of the comments. We perform a correlation study between
Impact and selected semantic features, as explored previously in Yang et al.[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
Semantic structure is captured by the following features:
1. Positive words: The number of occurrences of positive words in a comment.
2. Negative words: The number of occurrences of negative words in a comment.
3. Positive Polarity Confidence: The probability that a sentence is positive.
        </p>
        <p>
          This metric is used to capture the polarity of comments and is calculated
using fastText[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
4. Subjective words: The number of occurrences of subjectivity-oriented
words in a comment. It is used to capture the linguistic expression of people's
opinions, beliefs, and speculations.
5. Sense Combination: It is computed as log(∏_{i=1}^{k} n_{w_i}), where n_{w_i} is the
total number of senses of word w_i.
6. Sense Farmost: The largest Path Similarity of any word sense in a sentence.
7. Sense Closest: The smallest Path Similarity of any word sense in a sentence.
        </p>
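        <p>A few of these features can be illustrated with small stand-in resources. In the sketch below, the positive/negative lexicons and the per-word sense counts are hypothetical placeholders for real resources such as a sentiment lexicon and WordNet; Positive Polarity Confidence, Sense Farmost, and Sense Closest would additionally require a trained polarity classifier and a word-sense similarity measure, so they are omitted.
```python
# Hedged sketch of the lexicon- and sense-based features listed above.
# POSITIVE, NEGATIVE, and SENSES are toy stand-ins for real resources.
from math import log

POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "sad", "awful"}
SENSES = {"good": 21, "day": 10, "bad": 14}  # illustrative sense counts

def semantic_features(comment):
    words = comment.lower().split()
    return {
        "positive_words": sum(w in POSITIVE for w in words),
        "negative_words": sum(w in NEGATIVE for w in words),
        # Sense Combination: log of the product of per-word sense counts,
        # accumulated as a sum of logs over words with a known count.
        "sense_combination": sum(log(SENSES[w]) for w in words if w in SENSES),
    }

feats = semantic_features("good day")
```
</p>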
        <p>
          From Table 7, we observe a minimal correlation between Impact and the
selected semantic features. The maximum of 0.040, between Impact and the
feature Sense Closest [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] indicates that the new characterization is distinct from the
semantic features of the comment.
        </p>
        <p>Although predicting Impact would be beneficial for
numerous applications, such as finance and product marketing, and would provide insights into
social dynamics, it is a hard problem that depends on various factors. Our attempt
to capture relationships between Impact and selected semantic features did
not establish a strong correlation between them. This implies
that the use of sophisticated architectures for the task of Impact prediction would
be valuable.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper presents a novel BERT-based predictive ensemble model to predict the
given labels: Emotional Disclosure, Informational Disclosure, Support, General
Support, Informational Support, and Emotional Support. Our model gives
competitive results for label prediction on the given dataset, Get it #OffMyChest.
Analysis of the dataset shows the highly imbalanced distribution of the given labels
and high variation in features like score, word count, comments per parent
post, and comments per user. We further discerned that the day of the week has no
significant impact on the frequency of Disclosure- and Support-based comments
on Reddit. Future work may involve exploring more ensembling techniques and
sophisticated architectures to predict the impact of a comment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information (</article-title>
          <year>2016</year>
          ), http://arxiv.org/abs/1607.04606
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cutts</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Oxford Guide to Plain English (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding (</article-title>
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gokaslan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Openwebtext corpus</article-title>
          . http://Skylion007.github.io/OpenWebTextCorpus (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nagel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : CC-News (
          <year>2016</year>
          ), http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rajendran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdul-Mageed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Happy together: Learning and understanding appraisal from natural language</article-title>
          .
          <source>In: A Con@AAAI</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Zhang, J.,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>arXiv preprint arXiv:1606.05250</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rodgers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicewander</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Thirteen ways to look at the correlation coefficient</article-title>
          .
          <source>The American Statistician</source>
          <volume>42</volume>
          (
          <issue>1</issue>
          ),
          <fpage>59</fpage>
          -
          <lpage>66</lpage>
          (
          <year>1988</year>
          ), http://www.jstor.org/stable/2685263
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Trinh</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>A simple method for commonsense reasoning</article-title>
          . CoRR abs/1806.02847 (
          <year>2018</year>
          ), http://dblp.uni-trier.de/db/journals/corr/corr1806.html#abs-1806-02847
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>GLUE: A multitask benchmark and analysis platform for natural language understanding</article-title>
          .
          <source>In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          . pp.
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          . Association for Computational Linguistics, Brussels, Belgium (Nov
          <year>2018</year>
          ). https://doi.org/10.18653/v1/W18-5446, https://www.aclweb.org/anthology/W18-5446
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.H.</given-names>
          </string-name>
          :
          <article-title>Humor recognition and humor anchor extraction</article-title>
          . In: Marquez,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Pighin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Marton</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y</surname>
          </string-name>
          . (eds.) EMNLP. pp.
          <fpage>2367</fpage>
          –
          <lpage>2376</lpage>
          . The Association for Computational Linguistics (
          <year>2015</year>
          ), http://dblp.uni-trier.de/db/conf/emnlp/emnlp2015.html#YangLDH15
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carbonell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>XLNet: Generalized autoregressive pretraining for language understanding (</article-title>
          <year>2019</year>
          ), http://arxiv.org/abs/1906.08237
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . In: Submitted to International Conference on Learning Representations (
          <year>2020</year>
          ), https://openreview.net/forum?id=SyxS0T4tvS, under review
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>
          . In: Submitted to International Conference on Learning Representations (
          <year>2020</year>
          ), https://openreview.net/forum?id=H1eA7AEtvS, under review
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books (</article-title>
          <year>2015</year>
          ), http://arxiv.org/abs/1506.06724
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>