<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Polignano</string-name>
          <email>marco.polignano@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>pierpaolo.basile@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco de Gemmis</string-name>
          <email>marco.degemmis@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>giovanni.semeraro@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <email>valerio.basile@unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bari A. Moro, Dept. Computer Science</institution>
          ,
          <addr-line>E. Orabona 4</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Turin, Dept. Computer Science</institution>
          ,
          <addr-line>Via Verdi 8</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. Recent scientific studies on natural language processing (NLP) report the outstanding effectiveness observed in the use of context-dependent and task-free language understanding models such as ELMo, GPT, and BERT. Specifically, they have proved to achieve state of the art performance in numerous complex NLP tasks such as question answering and sentiment analysis in the English language. Following the great popularity and effectiveness that these models are gaining in the scientific community, we trained a BERT language understanding model for the Italian language (AlBERTo). In particular, AlBERTo is focused on the language used in social networks, specifically on Twitter. To demonstrate its robustness, we evaluated AlBERTo on the EVALITA 2016 task SENTIPOLC (SENTIment POLarity Classification) obtaining state of the art results in subjectivity, polarity and irony detection on Italian tweets. The pre-trained AlBERTo model will be publicly distributed through the GitHub platform at the following web address: https://github.com/marcopoli/AlBERTo-it in order to facilitate future research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Copyright © 2019 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>
        The recent spread of pre-trained text
representation models has enabled important progress in
Natural Language Processing. In particular,
numerous tasks such as part of speech tagging,
question answering, machine translation, and text
classification have obtained significant contributions
in terms of performance through the use of
distributional semantics techniques such as word
embedding. Mikolov et al. (2013) notably
contributed to the genesis of numerous strategies for
representing terms based on the idea that
semantically related terms have similar vector
representations. Technologies such as Word2Vec
        <xref ref-type="bibr" rid="ref6 ref9">(Mikolov
et al., 2013)</xref>
        , Glove
        <xref ref-type="bibr" rid="ref10">(Pennington et al., 2014)</xref>
        , and
FastText
        <xref ref-type="bibr" rid="ref5">(Bojanowski et al., 2017)</xref>
        suffer from the
problem that multiple concepts associated with
the same term are not represented by different
word-embedding vectors in the distributional space
(they are context-free). New strategies such as ELMo
        <xref ref-type="bibr" rid="ref11">(Peters et al., 2018)</xref>
        , GPT/GPT-2
        <xref ref-type="bibr" rid="ref13">(Radford et al.,
2019)</xref>
        , and BERT
        <xref ref-type="bibr" rid="ref7">(Devlin et al., 2019)</xref>
        overcome
this limit by learning a language understanding
model for a contextual and task-independent
representation of terms. In their multilingual version,
they mainly use a mix of text obtained from large
corpora in different languages to build a general
language model to be reused for every application
in any language. As reported by the BERT
documentation, “the Multilingual model is somewhat
worse than a single-language model. However, it
is not feasible for us to train and maintain dozens
of single-language models.” This entails significant
limitations related to the type of language learned
(with respect to the document style) and the size
of the vocabulary. These reasons have led us to
create the equivalent of the BERT model for the
Italian language, specifically for the language
style used on Twitter: AlBERTo. This idea was
supported by the intuition that many of the NLP
tasks for the Italian language are carried out for
the analysis of social media data, both in business
and research contexts.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        A Task-Independent Sentence Understanding
Model is based on the idea of creating a deep
learning architecture, particularly an encoder and
a decoder, so that the encoding level can be used
in more than one NLP task. In this way, it is
possible to obtain a decoding level with weights
optimized for the specific task (fine-tuning). A
general-purpose encoder should, therefore, be able
to provide an efficient representation of the terms,
their position in the sentence, the context, the
grammatical structure of the sentence, and the semantics of the
terms. One of the first systems able to satisfy
these requirements was ELMo
        <xref ref-type="bibr" rid="ref11">(Peters et al., 2018)</xref>
        based on a large biLSTM neural network (2
biLSTM layers with 4096 units and 512-dimensional
projections and a residual connection from the first
to the second layer) trained for 10 epochs on the
1B Word Benchmark
        <xref ref-type="bibr" rid="ref6">(Chelba et al., 2013)</xref>
        . The
goal of the network was to predict the same
starting sentence in the same initial language (like an
autoencoder). It guaranteed the correct
handling of the polysemy of terms, demonstrating
its efficacy on six different NLP tasks for which
it obtained state-of-the-art results: Question
Answering, Textual Entailment, Semantic Role
Labeling, Coreference Resolution, Named Entity
Extraction, and Sentiment Analysis. Following the
basic idea of ELMo, another language model called
GPT has been developed in order to improve the
performance on the tasks included in the GLUE
benchmark
        <xref ref-type="bibr" rid="ref15">(Wang et al., 2018)</xref>
        . GPT replaces
the biLSTM network with a Transformer
architecture
        <xref ref-type="bibr" rid="ref14">(Vaswani et al., 2017)</xref>
        . A Transformer
is an encoder-decoder architecture that is mainly
based on feed-forward and multi-head attention
layers. Moreover, in Transformers, terms are
provided as input without a specific order;
consequently, a positional vector is added to the term
embeddings. Unlike ELMo, in GPT, for each
new task, the weights of all levels of the
network are optimized, and the complexity of the
network (in terms of parameters) remains almost
constant. Moreover, during the learning phase, the
network does not limit itself to working on a
single sentence but splits the text into spans to
improve the predictive capacity and the
generalization power of the network. The deep neural
network used is a 12-layer decoder-only
transformer with masked self-attention heads (768
dimensional states and 12 attention heads) trained
for 100 epochs on the BooksCorpus dataset
        <xref ref-type="bibr" rid="ref16">(Zhu
et al., 2015)</xref>
        . This strategy proved to be
successful compared to the results obtained by ELMo on
the same NLP tasks. BERT (Bidirectional
Encoder Representations from Transformers)
        <xref ref-type="bibr" rid="ref7">(Devlin et al., 2019)</xref>
        was developed to work with a
strategy very similar to GPT. In its basic version,
it is also trained on a Transformer network with
12 layers, 768-dimensional states, and 12 attention
heads, for a total of 110M parameters, and
trained on BooksCorpus
        <xref ref-type="bibr" rid="ref16">(Zhu et al., 2015)</xref>
        and
English Wikipedia for 1M steps. The main
difference is that the learning phase is performed
by scanning the span of text in both directions,
from left to right and from right to left, as was
already done in biLSTMs. Moreover, BERT uses
a “masked language model”: during the training,
random terms are masked in order to be predicted
by the net. Jointly, the network is also designed
to potentially learn the next span of text from the
one given in input. These variations on the GPT
model allow BERT to be the current state of the art
language understanding model. Larger versions of
BERT (BERT large) and GPT (GPT-2) have been
released and score better results than the
normal-scale models, but they require much more
computational power. The base BERT model for the English
language is exactly the same as the one used for learning the
Italian Language Understanding Model (AlBERTo),
but we are considering the possibility of developing
a large version of it soon.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 AlBERTo</title>
      <p>
        As pointed out in the previous sections, the aim
of this work is to create a linguistic resource for
Italian that would follow the most recent
strategies used to address NLP problems in English. It
is well known that the language used on social
networks is different from the formal one, also as a
consequence of the presence of mentions,
uncommon terms, links, and hashtags that are not present
elsewhere. Moreover, language models in
their multilingual version do not perform well
in every specific language, especially with a
writing style different from that of books and
encyclopedic descriptions
        <xref ref-type="bibr" rid="ref12">(Polignano et al., 2019)</xref>
        .
AlBERTo aims to be the first Italian language
understanding model to represent the social media
language, Twitter in particular, written in Italian. The
model proposed in this work is based on the
software distributed through GitHub by Devlin et al.
(2019) (https://github.com/google-research/bert/) with the endorsement of Google. It has
been trained, without issues, on text spans
containing typical social media content,
including emojis, links, hashtags, and mentions.
      </p>
      <p>
        Figure 1 shows the BERT and AlBERTo
strategy of learning. The “masked learning” is
applied on a 12x Transformer Encoder, where, for
each input, a percentage of terms is hidden and
then predicted for optimizing network weights in
back-propagation. In AlBERTo, we implement
only the “masked learning” strategy, excluding the
step based on “next following sentence”. This is a
crucial aspect to be aware of because, in the case
of tweets, we do not have knowledge of a flow of
tweets as happens in a dialog. For this reason,
we are aware that AlBERTo is not suitable for the
task of question answering, where this property is
essential. On the contrary, the model is well suited
for classification and prediction tasks. The
decision to train AlBERTo excluding the “next
following sentence” strategy makes the model similar
in purpose to ELMo. Differently from ELMo, BERT
and AlBERTo use a Transformer architecture instead
of biLSTMs, which has been demonstrated to
perform better in natural language processing tasks.
In any case, we are considering the possibility of
learning an Italian ELMo model and comparing it
with the model proposed here.
In order to tailor the tweet text to BERT’s
input structure, it is necessary to carry out
preprocessing operations. More specifically, using
Python as the programming language, two
libraries were mainly adopted: Ekphrasis
        <xref ref-type="bibr" rid="ref4">(Baziotis et al., 2017)</xref>
        and SentencePiece (https://github.com/google/sentencepiece)
        <xref ref-type="bibr" rid="ref8">(Kudo, 2018)</xref>
        .
Ekphrasis is a popular tool comprising an NLP
pipeline for text extracted from Twitter. It has been
used for:
      </p>
      <p>Normalizing URLs, emails, mentions,
percentages, money, times, dates, phone numbers,
numbers, and emoticons;</p>
      <p>Tagging and unpacking hashtags.</p>
      <p>The normalization phase consists of replacing
each term with a fixed token &lt;[entity type]&gt;.
The tagging phase consists of enclosing hashtags
between two tags, &lt;hashtag&gt; ... &lt;/hashtag&gt;,
representing their beginning and end in the
sentence. Whenever possible, the hashtag has been
unpacked into known words. The text is cleaned
and made easily readable by the network by
converting it to lowercase, and all characters
except emojis, !, ?, and accented characters have
been deleted. An example of a preprocessed tweet
is shown in Figure 2.</p>
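      <p>As an illustration of the pipeline described above, the following is a minimal sketch built with Ekphrasis&#8217;s TextPreProcessor; the exact options used for AlBERTo are not fully reported here, so the configuration shown is only an assumption.</p>
      <preformat>
# Illustrative sketch of a Twitter normalization pipeline with Ekphrasis.
# The option values are assumptions that mirror the entities listed above.
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # entities replaced by fixed placeholder tokens (url, user, number, ...)
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    annotate={"hashtag"},   # surround hashtags with opening/closing tags
    unpack_hashtags=True,   # split hashtags into known words when possible
    fix_html=True,
    segmenter="twitter",
    corrector="twitter",
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts=[emoticons],
)

tweet = "Che bella giornata a #Roma! https://t.co/xyz @mario"
print(" ".join(text_processor.pre_process_doc(tweet)))
      </preformat>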
      <p>SentencePiece is a segmentation algorithm used
for learning the best strategy for splitting text
into terms in an unsupervised and
language-independent way. It can process up to 50k
sentences per second and generate an extensive
vocabulary. The vocabulary includes the most common terms
in the training set and the subwords which
occur in the middle of words, annotated with
’##’ in order to also encode slang,
incomplete, or uncommon words. An example of a
piece of the vocabulary generated for AlBERTo
is shown in Figure 3. SentencePiece also
produced a tokenizer, used to generate a list of tokens
for each tweet, further processed by BERT’s
create_pretraining_data.py module.</p>
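      <p>A minimal sketch of how such a vocabulary can be learned with the SentencePiece Python API follows; file names and trainer options are placeholders, and the conversion of the resulting pieces to the ’##’-style notation expected by BERT is not shown.</p>
      <preformat>
# Minimal sketch: learning a subword vocabulary with SentencePiece.
# File names and options are placeholders, not the authors' exact settings.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=tweets_preprocessed.txt "   # one preprocessed tweet per line
    "--model_prefix=alberto_sp "
    "--vocab_size=128000"                # matches vocab_size in the AlBERTo config
)

sp = spm.SentencePieceProcessor()
sp.Load("alberto_sp.model")
# Segment a tweet into subword pieces (rare words are split into subwords).
print(sp.EncodeAsPieces("che giornata meravigliosa a bari"))
      </preformat>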
      <sec id="sec-3-1">
        <title>2https://github.com/google/</title>
        <p>
          sentencepiece
The dataset used for the learning phase of
AlBERTo is TWITA
          <xref ref-type="bibr" rid="ref3">(Basile et al., 2018)</xref>
          , a huge
corpus of Tweets in the Italian language collected
from February 2012 to the present day through
Twitter’s official streaming API. In our configuration,
we randomly selected 200 million Tweets,
removing re-tweets, and processed them with the
preprocessing pipeline described previously. In total,
we obtained 191 GB of raw data.
        </p>
        <sec id="sec-3-1-1">
          <title>Learning Configuration</title>
          <p>The AlBERTo model has been trained using the
following configuration:</p>
          <preformat>
bert_base_config = {
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 128000
}

# Input data pipeline config
TRAIN_BATCH_SIZE = 128
MAX_PREDICTIONS = 20
MAX_SEQ_LENGTH = 128
MASKED_LM_PROB = 0.15

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000
SAVE_CHECKPOINTS_STEPS = 2500
NUM_TPU_CORES = 8
          </preformat>
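          <p>As an illustration of the MASKED_LM_PROB and MAX_PREDICTIONS values above, the following minimal sketch masks a fraction of the tokens of a tweet, as in BERT’s masked language model objective; it is a simplified illustration, not the code used by create_pretraining_data.py.</p>
          <preformat>
# Simplified illustration of the masked-LM objective: hide about 15% of the
# tokens (at most MAX_PREDICTIONS per sequence) so the network learns to
# predict them during pre-training.
import random

def mask_tokens(tokens, mask_prob=0.15, max_predictions=20):
    n_to_mask = min(max_predictions, max(1, round(len(tokens) * mask_prob)))
    positions = random.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    for pos in positions:
        masked[pos] = "[MASK]"
    return masked, positions

tokens = "oggi e una bella giornata a bari".split()
masked, positions = mask_tokens(tokens)
print(masked, positions)
          </preformat>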
          <p>The training has been performed on the
Google Colaboratory environment (Colab, https://colab.research.google.com),
using an 8-core Google TPU-V2 (https://cloud.google.com/tpu/) and a Google Cloud
Storage bucket (https://cloud.google.com/storage/). In total, it took 50 hours to
create a complete AlBERTo model. More
technical details are available in the notebook “Italian
Pre-training BERT from scratch with cloud TPU”
in the project repository.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation and Discussion of Results</title>
      <p>
        We evaluate AlBERTo on a task of sentiment
analysis for the Italian language. In
particular, we decided to use the data released for
the SENTIPOLC (SENTIment Polarity
Classification) shared task
        <xref ref-type="bibr" rid="ref1">(Barbieri et al., 2016)</xref>
        carried
out at EVALITA 2016
        <xref ref-type="bibr" rid="ref1 ref2">(Basile et al., 2016)</xref>
        , whose
tweets come from a distribution different from
the one used for training AlBERTo. It includes three
subtasks:
      </p>
      <sec id="sec-4-1">
        <title>Subjectivity Classification: “a system must</title>
        <p>decide whether a given message is subjective
or objective”;</p>
      </sec>
      <sec id="sec-4-2">
        <title>Polarity Classification: “a system must de</title>
        <p>cide whether a given message is of positive,
negative, neutral or mixed sentiment”;
Irony Detection: “a system must decide
whether a given message is ironic or not”.</p>
      <p>Data provided for training and test are tagged
with six fields containing values related to manual
annotation: subj, opos, oneg, iro, lpos, lneg.
These labels describe, respectively, whether the sentence
is subjective, positive, negative, ironic, literally
positive, and literally negative. For each of these classes,
the value is 1 when the sentence satisfies the label, and
0 otherwise.</p>
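      <p>For clarity, a single annotated instance can be thought of as follows; the text and label values are invented for the example.</p>
      <preformat>
# Illustrative only: one annotated instance with the six manual-annotation
# fields described above (text and values are invented for the example).
example = {
    "text": "che bella giornata a roma!",
    "subj": 1,   # subjective
    "opos": 1,   # overall positive
    "oneg": 0,   # overall negative
    "iro": 0,    # ironic
    "lpos": 1,   # literally positive
    "lneg": 0,   # literally negative
}
      </preformat>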
      <p>The last two labels, “lpos” and “lneg”, which describe
the literal polarity of the tweet, have not been
considered in the current evaluation (nor in the
official shared task evaluation). In total, 7410
tweets have been released for training and 2000
for testing. We did not use any validation set
because we did not perform any phase of model
selection during the fine-tuning of AlBERTo. The
evaluation was performed considering precision
(p), recall (r), and F1-score (F1) for each class and
for each classification task.</p>
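      <p>A minimal sketch of this per-class evaluation, using scikit-learn in place of the official evaluation script, is shown below; the labels are toy values.</p>
      <preformat>
# Per-class precision, recall, and F1 for one binary subtask.
# scikit-learn is used here only to illustrate the metrics; the reported
# scores come from the official SENTIPOLC 2016 evaluation script.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 1, 0, 1, 0]   # gold labels (toy values)
y_pred = [0, 1, 0, 0, 1, 0]   # classifier predictions (toy values)

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
for cls in (0, 1):
    print(f"class {cls}: p={p[cls]:.3f} r={r[cls]:.3f} F1={f1[cls]:.3f}")
      </preformat>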
        <sec id="sec-4-2-1">
          <title>3https://colab.research.google.com 4https://cloud.google.com/tpu/ 5https://cloud.google.com/storage/</title>
      <p>[Table 1: precision, recall, and F1 per class obtained by AlBERTo and by the SENTIPOLC 2016 systems Unitor.1.u, Unitor.2.u, samskara.1.c, and ItaliaNLP.2.c on the Subjectivity, Polarity Positive, Polarity Negative, and Irony subtasks.]</p>
      <p>
AlBERTo fine-tuning. We fine-tuned AlBERTo
four different times, in order to obtain one
classifier for each task, except for polarity, where
we have two of them. In particular, we created
one classifier for Subjectivity Classification,
one for Polarity Positive, one for Polarity
Negative, and one for Irony Detection. Each time
we re-trained the model for three epochs,
using a learning rate of 2e-5 with 1000 steps per
loop on batches of 512 examples from the training
set of the specific task. For the fine-tuning of the
Irony Detection classifier, we increased the
number of training epochs to ten, having observed low
performance using only three epochs as for the other
classification tasks. The fine-tuning process lasted
4 minutes every time.</p>
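      <p>The following is a minimal sketch of an equivalent binary fine-tuning loop with the hyper-parameters reported above; the authors used the official BERT TensorFlow code, so the Hugging Face-style API and the checkpoint path shown here are only illustrative assumptions.</p>
      <preformat>
# Illustrative sketch of binary fine-tuning (one classifier per subtask).
# The checkpoint path is a placeholder; the authors used the official
# BERT TensorFlow code rather than the Hugging Face API shown here.
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_DIR = "path/to/alberto_checkpoint"   # placeholder
tokenizer = BertTokenizer.from_pretrained(MODEL_DIR, do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(MODEL_DIR, num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)   # learning rate from the paper

def fine_tune(texts, labels, epochs=3, batch_size=512):
    model.train()
    for _ in range(epochs):
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, max_length=128,
                              return_tensors="pt")
            batch["labels"] = torch.tensor(labels[i:i + batch_size])
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
      </preformat>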
      <sec id="sec-4-3">
        <title>Discussion of the results</title>
        <p>The results reported in Table 1 show the output obtained from the
official evaluation script of SENTIPOLC 2016. It is
important to note that the values on the individual
classes for precision, recall, and F1 are not
compared with those of the systems that participated in
the competition, because they are not reported in
the overview paper of the task. Nevertheless, some
considerations can be drawn. The classifier based
on AlBERTo achieves, on average, high recall on
class 0 and low values on class 1. The opposite
situation is instead observed for precision, where
for class 1 it is on average higher than the
recall values. This suggests that the system is
selective in assigning class 1 and, when
it does, it is confident of the prediction made, even at the
cost of generating false negatives.</p>
        <p>On each of the sub-tasks of SENTIPOLC, it
can be observed that AlBERTo has obtained state
of the art results without any heuristic tuning of the
learning parameters (the model as it is after
fine-tuning), except in the case of irony
detection, where it was necessary to increase the
number of epochs of the fine-tuning phase.
Comparing AlBERTo with the best system of each
subtask, we observe an increase in results of between
7% and 11%. The results obtained are, from our
point of view, encouraging for further future work.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>In this work, we described AlBERTo, the first
Italian language understanding model based on
the social media writing style. The model has been
trained using the official BERT source code on
a Google TPU-V2 on 200M tweets in the Italian
language. The pre-trained model has been
fine-tuned on the data available for the classification
task SENTIPOLC 2016, showing SOTA results.
The results allow us to promote AlBERTo as a
starting point for future research in this direction.
Model repository: https://github.com/marcopoli/AlBERTo-it</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>The work of Marco Polignano is funded by the project
“DECiSION” (grouping code: BQS5153),
under the Apulian INNONETWORK programme,
Italy. The work of Valerio Basile is partially
funded by Progetto di Ateneo/CSP 2016
(Immigrants, Hate and Prejudice in Social Media,
S1618 L2 BOSC 01).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Barbieri</surname>
          </string-name>
          , Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and
          <string-name>
            <given-names>Viviana</given-names>
            <surname>Patti</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the evalita 2016 sentiment polarity classification task</article-title>
          .
          <source>In Proceedings of third Italian conference on computational linguistics (CLiCit</source>
          <year>2016</year>
          )
          <article-title>&amp; fifth evaluation campaign of natural language processing and speech tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Pierpaolo</given-names>
            <surname>Basile</surname>
          </string-name>
          , Franco Cutugno, Malvina Nissim, Viviana Patti,
          <string-name>
            <given-names>Rachele</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Evalita 2016: Overview of the 5th evaluation campaign of natural language processing and speech tools for italian</article-title>
          .
          <source>In 3rd Italian Conference on Computational Linguistics</source>
          , CLiC-it
          <year>2016</year>
          and
          <article-title>5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          ,
          <source>EVALITA</source>
          <year>2016</year>
          , volume
          <volume>1749</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . CEUR-WS.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Mirko Lai, and
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Long-term social media data collection at the university of turin</article-title>
          .
          <source>In Fifth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2018</year>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . CEUR-WS.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Christos</given-names>
            <surname>Baziotis</surname>
          </string-name>
          , Nikos Pelekis, and
          <string-name>
            <given-names>Christos</given-names>
            <surname>Doulkeridis</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Datastories at semeval-2017 task 4: Deep lstm with attention for message-level and topic-based sentiment analysis</article-title>
          .
          <source>In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          , pages
          <fpage>747</fpage>
          -
          <lpage>754</lpage>
          , Vancouver, Canada, August. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>5</volume>
          :
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Ciprian</given-names>
            <surname>Chelba</surname>
          </string-name>
          , Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and
          <string-name>
            <given-names>Tony</given-names>
            <surname>Robinson</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>One billion word benchmark for measuring progress in statistical language modeling</article-title>
          .
          <source>arXiv preprint arXiv:1312.3005</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Taku</given-names>
            <surname>Kudo</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Subword regularization: Improving neural network translation models with multiple subword candidates</article-title>
          . arXiv preprint arXiv:1804.10959.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Peters</surname>
          </string-name>
          , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          . pages
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          , June.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Polignano</surname>
          </string-name>
          , Pierpaolo Basile, Marco de Gemmis, and
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Semeraro</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A comparison of word-embeddings in emotion detection from text using bilstm, cnn and self-attention</article-title>
          .
          <source>In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization</source>
          , pages
          <fpage>63</fpage>
          -
          <lpage>68</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Jeff Wu, Rewon Child, David Luan,
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Language models are unsupervised multitask learners</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amapreet</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel R</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Glue: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          . arXiv preprint arXiv:1804.07461.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Yukun</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and
          <string-name>
            <given-names>Sanja</given-names>
            <surname>Fidler</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>