AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets

Marco Polignano (marco.polignano@uniba.it), Pierpaolo Basile (pierpaolo.basile@uniba.it), Marco de Gemmis (marco.degemmis@uniba.it), Giovanni Semeraro (giovanni.semeraro@uniba.it)
University of Bari A. Moro, Dept. Computer Science, E. Orabona 4, Italy

Valerio Basile (valerio.basile@unito.it)
University of Turin, Dept. Computer Science, Via Verdi 8, Italy

Abstract

English. Recent scientific studies on natural language processing (NLP) report the outstanding effectiveness observed in the use of context-dependent and task-free language understanding models such as ELMo, GPT, and BERT. Specifically, they have proved to achieve state-of-the-art performance in numerous complex NLP tasks such as question answering and sentiment analysis in the English language. Following the great popularity and effectiveness that these models are gaining in the scientific community, we trained a BERT language understanding model for the Italian language (AlBERTo). In particular, AlBERTo is focused on the language used in social networks, specifically on Twitter. To demonstrate its robustness, we evaluated AlBERTo on the EVALITA 2016 task SENTIPOLC (SENTIment POLarity Classification), obtaining state-of-the-art results in subjectivity, polarity, and irony detection on Italian tweets. In order to facilitate future research, the pre-trained AlBERTo model will be publicly distributed through the GitHub platform at the following web address: https://github.com/marcopoli/AlBERTo-it

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The recent spread of pre-trained text representation models has enabled important progress in Natural Language Processing. In particular, numerous tasks such as part-of-speech tagging, question answering, machine translation, and text classification have obtained significant performance improvements through the use of distributional semantics techniques such as word embeddings. Mikolov et al. (2013) notably contributed to the genesis of numerous strategies for representing terms based on the idea that semantically related terms have similar vector representations. Technologies such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017) suffer from the problem that multiple concepts associated with the same term are not represented by different word embedding vectors in the distributional space (they are context-free). New strategies such as ELMo (Peters et al., 2018), GPT/GPT-2 (Radford et al., 2019), and BERT (Devlin et al., 2019) overcome this limit by learning a language understanding model for a contextual and task-independent representation of terms. In their multilingual version, they mainly use a mix of text obtained from large corpora in different languages to build a general language model to be reused for every application in any language. As reported in the BERT documentation, "the Multilingual model is somewhat worse than a single-language model. However, it is not feasible for us to train and maintain dozens of single-language models." This entails significant limitations related to the type of language learned (with respect to the document style) and the size of the vocabulary.
These reasons have led us to create the equivalent of the BERT model for the Italian language, and specifically for the language style used on Twitter: AlBERTo. This idea was supported by the intuition that many NLP tasks for the Italian language are carried out on social media data, both in business and research contexts.

2 Related Work

A task-independent sentence understanding model is based on the idea of creating a deep learning architecture, particularly an encoder and a decoder, so that the encoding level can be reused in more than one NLP task. In this way, it is possible to obtain a decoding level with weights optimized for the specific task (fine-tuning). A general-purpose encoder should, therefore, be able to provide an efficient representation of the terms, their position in the sentence, the context, the grammatical structure of the sentence, and the semantics of the terms. One of the first systems able to satisfy these requirements was ELMo (Peters et al., 2018), based on a large biLSTM neural network (2 biLSTM layers with 4096 units, 512-dimensional projections, and a residual connection from the first to the second layer) trained for 10 epochs on the 1B Word Benchmark (Chelba et al., 2013). The goal of the network was to predict the same starting sentence in the same initial language (like an autoencoder). It guarantees the correct handling of the polysemy of terms, and it demonstrated its efficacy on six different NLP tasks for which it obtained state-of-the-art results: question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis.

Following the basic idea of ELMo, another language model called GPT was developed in order to improve the performance on the tasks included in the GLUE benchmark (Wang et al., 2018). GPT replaces the biLSTM network with a Transformer architecture (Vaswani et al., 2017). A Transformer is an encoder-decoder architecture mainly based on feed-forward and multi-head attention layers. Moreover, in Transformers, terms are provided as input without a specific order, and consequently a positional vector is added to the term embeddings. Unlike ELMo, in GPT the weights of all levels of the network are optimized for each new task, and the complexity of the network (in terms of parameters) remains almost constant. Moreover, during the learning phase, the network does not limit itself to working on a single sentence but splits the text into spans to improve the predictive capacity and the generalization power of the network. The deep neural network used is a 12-layer decoder-only Transformer with masked self-attention heads (768-dimensional states and 12 attention heads) trained for 100 epochs on the BooksCorpus dataset (Zhu et al., 2015). This strategy proved to be successful compared to the results obtained by ELMo on the same NLP tasks.

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) was developed to work with a strategy very similar to GPT. In its basic version, it is also trained on a Transformer network with 12 levels, 768-dimensional states, and 12 attention heads, for a total of 110M parameters, trained on BooksCorpus (Zhu et al., 2015) and English Wikipedia for 1M steps. The main difference is that the learning phase scans each span of text in both directions, from left to right and from right to left, as was already done in biLSTMs. Moreover, BERT uses a "masked language model": during training, random terms are masked in order to be predicted by the net. Jointly, the network is also trained to predict whether a span of text follows the one given as input. These variations on the GPT model allow BERT to be the current state-of-the-art language understanding model. Larger versions of BERT (BERT large) and GPT (GPT-2) have been released and score better results than the base-scale models, but they require much more computational power. The base BERT model for English is exactly the same architecture used for learning the Italian language understanding model (AlBERTo), but we are considering the possibility of developing a large version of it soon.
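To make the "masked language model" strategy concrete, the following is a minimal illustrative sketch, not taken from the BERT or AlBERTo code: each token is hidden with a fixed probability and the network is trained to recover it. The 15% probability mirrors the MASKED_LM_PROB value reported later in Section 3.3; the function name and the sample tweet are assumptions made only for illustration.

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Illustrative masking step of a masked language model: each token
    is hidden with probability `mask_prob`, and the model is trained to
    predict the original token at the masked positions."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # target the network must recover
        else:
            masked.append(tok)
            labels.append(None)     # position not used in the loss
    return masked, labels

# Example usage on an (already tokenized) Italian tweet
tokens = ["buongiorno", "a", "tutti", "gli", "amici", "di", "twitter"]
print(mask_tokens(tokens))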
3 AlBERTo

As pointed out in the previous sections, the aim of this work is to create a linguistic resource for Italian that follows the most recent strategies used to address NLP problems in English. It is well known that the language used on social networks differs from the formal one, also as a consequence of the presence of mentions, uncommon terms, links, and hashtags that are not found elsewhere. Moreover, many language models, in their multilingual version, do not perform well in every specific language, especially with a writing style different from that of books and encyclopedic descriptions (Polignano et al., 2019). AlBERTo aims to be the first Italian language understanding model to represent the social media language, Twitter in particular, written in Italian. The model proposed in this work is based on the software distributed through GitHub by Devlin et al. (2019) with the endorsement of Google (https://github.com/google-research/bert/). It has been trained, without issues, on text spans containing typical social media characters, including emojis, links, hashtags, and mentions.

[Figure 1: BERT and AlBERTo learning strategy]

Figure 1 shows the BERT and AlBERTo learning strategy. The "masked learning" is applied on a 12x Transformer encoder, where, for each input, a percentage of terms is hidden and then predicted in order to optimize the network weights through back-propagation. In AlBERTo, we implement only the "masked learning" strategy, excluding the step based on the "next following sentence". This is a crucial aspect to be aware of because, in the case of tweets, we do not have cognition of a flow of tweets as happens in a dialog. For this reason, we are aware that AlBERTo is not suitable for the task of question answering, where this property is essential. On the contrary, the model is well suited for classification and prediction tasks. The decision to train AlBERTo excluding the "next following sentence" strategy makes the model similar in purpose to ELMo. Differently from it, BERT and AlBERTo use a Transformer architecture instead of a biLSTM, which has been demonstrated to perform better in natural language processing tasks. In any case, we are considering the possibility of learning an Italian ELMo model and comparing it with the model proposed here.

3.1 Text Preprocessing

In order to tailor the tweet text to BERT's input structure, it is necessary to carry out preprocessing operations. More specifically, using Python as the programming language, two libraries were mainly adopted: Ekphrasis (Baziotis et al., 2017) and SentencePiece (Kudo, 2018; https://github.com/google/sentencepiece). Ekphrasis is a popular tool comprising an NLP pipeline for text extracted from Twitter. It has been used for:

• Normalizing URLs, emails, mentions, percents, money, time, dates, phone numbers, numbers, and emoticons;

• Tagging and unpacking hashtags.

The normalization phase consists in replacing each matched term with a fixed tag of the form <[entity type]>. The tagging phase consists in enclosing hashtags between two tags, <hashtag> ... </hashtag>, marking their beginning and end in the sentence. Whenever possible, the hashtag has been unpacked into known words. The text is cleaned and made easily readable by the network by converting it to lowercase; all characters except emojis, "!", "?", and accented characters have been deleted. An example of a preprocessed tweet is shown in Figure 2, and a sketch of this pipeline is given below.

[Figure 2: Example of preprocessed tweet]
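The following is a minimal sketch of how such a pipeline can be set up with Ekphrasis' TextPreProcessor. It is not the exact configuration used for AlBERTo: the specific options shown (which entities to normalize, hashtag annotation and unpacking) are assumptions derived from the description above, and the sample tweet is invented.

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

# Assumed configuration: normalize the entity types listed above and
# annotate/unpack hashtags, as described in Section 3.1.
text_processor = TextPreProcessor(
    normalize=['url', 'email', 'user', 'percent', 'money', 'time',
               'date', 'phone', 'number'],
    annotate={'hashtag'},          # wraps hashtags with <hashtag> ... </hashtag>
    unpack_hashtags=True,          # splits hashtags into known words when possible
    segmenter='twitter',
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

tweet = "Buongiorno #buonagiornata a tutti! https://example.com @amico"
print(" ".join(text_processor.pre_process_doc(tweet)))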
SentencePiece is a segmentation algorithm used for learning the best strategy to split text into terms in an unsupervised and language-independent way. It can process up to 50k sentences per second and generate an extensive vocabulary. The vocabulary includes the most common terms in the training set and the subwords occurring in the middle of words, the latter annotated with "##", in order to be able to encode also slang, incomplete, or uncommon words. An extract of the vocabulary generated for AlBERTo is shown in Figure 3. SentencePiece also produces a tokenizer, used to generate a list of tokens for each tweet, which are further processed by BERT's create_pretraining_data.py module. A minimal example of this segmentation step is sketched below.

[Figure 3: An extract of the vocabulary created by SentencePiece for AlBERTo]
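As an illustration of this step, the sketch below trains a small SentencePiece model on a file of preprocessed tweets and segments one of them. The file name, model prefix, vocabulary size, and sample sentence are placeholder assumptions (the actual AlBERTo vocabulary has 128k entries, as reported in Section 3.3).

import sentencepiece as spm

# Train a subword model on the preprocessed tweets (one tweet per line).
# "tweets.txt" and the 32k vocabulary are placeholders for illustration.
spm.SentencePieceTrainer.train(
    input="tweets.txt",
    model_prefix="alberto_sp",
    vocab_size=32000,
)

# Load the resulting model and segment a tweet into subword tokens.
sp = spm.SentencePieceProcessor(model_file="alberto_sp.model")
print(sp.encode("buongiorno a tutti gli amici di twitter", out_type=str))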
3.2 Dataset

The dataset used for the learning phase of AlBERTo is TWITA (Basile et al., 2018), a huge corpus of tweets in the Italian language collected from February 2012 to the present day through Twitter's official streaming API. In our configuration, we randomly selected 200 million tweets, removing re-tweets, and processed them with the preprocessing pipeline described previously. In total, we obtained 191 GB of raw data.

3.3 Learning Configuration

The AlBERTo model has been trained using the following configuration:

bert_base_config = {
    "attention_probs_dropout_prob": 0.1,
    "directionality": "bidi",
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pooler_fc_size": 768,
    "pooler_num_attention_heads": 12,
    "pooler_num_fc_layers": 3,
    "pooler_size_per_head": 128,
    "pooler_type": "first_token_transform",
    "type_vocab_size": 2,
    "vocab_size": 128000
}

# Input data pipeline config
TRAIN_BATCH_SIZE = 128
MAX_PREDICTIONS = 20
MAX_SEQ_LENGTH = 128
MASKED_LM_PROB = 0.15

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000
SAVE_CHECKPOINTS_STEPS = 2500
NUM_TPU_CORES = 8

The training has been performed on the Google Colaboratory environment (Colab, https://colab.research.google.com), using an 8-core Google TPU-V2 (https://cloud.google.com/tpu/) and a Google Cloud Storage bucket (https://cloud.google.com/storage/). In total, it took ∼50 hours to create a complete AlBERTo model. More technical details are available in the notebook "Italian Pre-training BERT from scratch with cloud TPU" in the project repository.

4 Evaluation and Discussion of Results

We evaluate AlBERTo on a sentiment analysis task for the Italian language. In particular, we decided to use the data released for the SENTIPOLC (SENTIment POLarity Classification) shared task (Barbieri et al., 2016) carried out at EVALITA 2016 (Basile et al., 2016), whose tweets come from a distribution different from the one used for training AlBERTo. It includes three subtasks:

• Subjectivity Classification: "a system must decide whether a given message is subjective or objective";

• Polarity Classification: "a system must decide whether a given message is of positive, negative, neutral or mixed sentiment";

• Irony Detection: "a system must decide whether a given message is ironic or not".

The data provided for training and test are tagged with six fields containing values related to manual annotation: subj, opos, oneg, iro, lpos, lneg. These labels describe, respectively, whether the sentence is subjective, positive, negative, ironic, literally positive, and literally negative. For each of these classes, the value is 1 where the sentence satisfies the label and 0 otherwise. The last two labels, "lpos" and "lneg", which describe the literal polarity of the tweet, have not been considered in the current evaluation (nor in the official shared task evaluation). In total, 7,410 tweets have been released for training and 2,000 for testing. We did not use any validation set because we did not perform any model selection during the fine-tuning of AlBERTo. The evaluation was performed considering precision (p), recall (r), and F1-score (F1) for each class and for each classification task.

AlBERTo fine-tuning. We fine-tuned AlBERTo four different times, in order to obtain one classifier for each task, except for polarity, where we have two of them. In particular, we created one classifier for Subjectivity Classification, one for Polarity Positive, one for Polarity Negative, and one for Irony Detection. Each time we re-trained the model for three epochs, using a learning rate of 2e-5 with 1,000 steps per loop on batches of 512 examples from the training set of the specific task. For the fine-tuning of the Irony Detection classifier, we increased the number of training epochs to ten, after observing low performance with the three epochs used for the other classification tasks. The fine-tuning process lasted ∼4 minutes every time.
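For clarity on how the scores in the following tables relate to each other: the task-level F reported in Table 2 matches the average of the class-0 and class-1 F1 scores in Table 1 (and, for polarity, the further average over the positive and negative classifiers). The short sketch below is for illustration only and simply reproduces this aggregation from the published values; it is our reading of the official scoring, not the authors' evaluation code.

def f1(precision, recall):
    """Standard F1-score from precision and recall."""
    return 2 * precision * recall / (precision + recall)

def task_f(f1_class0, f1_class1):
    """Task-level F as the average of the per-class F1 scores."""
    return (f1_class0 + f1_class1) / 2

# Class-0 F1 for subjectivity, recomputed from Table 1 precision/recall.
print(round(f1(0.6838, 0.8058), 4))  # -> 0.7398

# Subjectivity task-level F from the two per-class F1 scores.
print(task_f(0.7398, 0.8415))        # -> 0.79065, reported as 0.7906 in Table 2

# Polarity combines the positive and negative classifiers.
f_pos = task_f(0.8755, 0.5554)       # 0.71545 -> 0.7155
f_neg = task_f(0.8277, 0.6305)       # 0.7291
print((f_pos + f_neg) / 2)           # -> 0.7223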
                 Prec. 0   Rec. 0    F1 0
Subjectivity     0.6838    0.8058    0.7398
Polarity Pos.    0.9262    0.8301    0.8755
Polarity Neg.    0.7537    0.9179    0.8277
Irony            0.9001    0.9853    0.9408

                 Prec. 1   Rec. 1    F1 1
Subjectivity     0.8857    0.8015    0.8415
Polarity Pos.    0.5818    0.5314    0.5554
Polarity Neg.    0.7988    0.5208    0.6305
Irony            0.6176    0.1787    0.2772

Table 1: Results obtained using the official evaluation script of SENTIPOLC 2016 (class 1 = label present, class 0 = label absent)

System            Obj       Subj      F
AlBERTo           0.7398    0.8415    0.7906
Unitor.1.u        0.6784    0.8105    0.7444
Unitor.2.u        0.6723    0.7979    0.7351
samskara.1.c      0.6555    0.7814    0.7184
ItaliaNLP.2.c     0.6733    0.7535    0.7134

System            Pos       Neg       F
AlBERTo           0.7155    0.7291    0.7223
UniPI.2.c         0.6850    0.6426    0.6638
Unitor.1.u        0.6354    0.6885    0.6620
Unitor.2.u        0.6312    0.6838    0.6575
ItaliaNLP.1.c     0.6265    0.6743    0.6504

System            Non-Iro   Iro       F
AlBERTo           0.9408    0.2772    0.6090
tweet2check16.c   0.9115    0.1710    0.5412
CoMoDI.c          0.8993    0.1509    0.5251
tweet2check14.c   0.9166    0.1159    0.5162
IRADABE.2.c       0.9241    0.1026    0.5133

Table 2: Comparison of results with the best systems of SENTIPOLC for each classification task

Discussion of the results. The results reported in Table 1 show the output obtained from the official evaluation script of SENTIPOLC 2016. It is important to note that the per-class values of precision, recall, and F1 cannot be compared with those of the systems that participated in the competition, because they are not reported in the overview paper of the task. Nevertheless, some considerations can be drawn. The classifier based on AlBERTo achieves, on average, high recall on class 0 and low values on class 1. The opposite situation is observed for precision, where for class 1 it is on average superior to the recall values. This suggests that the system is very good at recognizing a phenomenon and, when it does, it is sure of the prediction made, even at the cost of generating false negatives.

On each of the SENTIPOLC subtasks, it can be observed that AlBERTo has obtained state-of-the-art results without any heuristic tuning of the learning parameters (the model is used as it is after fine-tuning), except in the case of irony detection, where it was necessary to increase the number of epochs of the fine-tuning phase. Comparing AlBERTo with the best system of each subtask, we observe an improvement between 7% and 11%. The results obtained are, from our point of view, encouraging for further future work.

5 Conclusion

In this work, we described AlBERTo, the first Italian language understanding model based on the social media writing style. The model has been trained using the official BERT source code on a Google TPU-V2 on 200M tweets in the Italian language. The pre-trained model has been fine-tuned on the data available for the classification task SENTIPOLC 2016, showing state-of-the-art results. The results allow us to promote AlBERTo as the starting point for future research in this direction. Model repository: https://github.com/marcopoli/AlBERTo-it

Acknowledgment

The work of Marco Polignano is funded by the project "DECiSION" (codice raggruppamento: BQS5153), under the Apulian INNONETWORK programme, Italy. The work of Valerio Basile is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618_L2_BOSC_01).
References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, Rachele Sprugnoli, et al. 2016. EVALITA 2016: Overview of the 5th evaluation campaign of natural language processing and speech tools for Italian. In 3rd Italian Conference on Computational Linguistics, CLiC-it 2016, and 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, EVALITA 2016, volume 1749, pages 1–4. CEUR-WS.

Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. Long-term social media data collection at the University of Turin. In Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 1–6. CEUR-WS.

Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754, Vancouver, Canada, August. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. pages 2227–2237, June.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, and Giovanni Semeraro. 2019. A comparison of word-embeddings in emotion detection from text using BiLSTM, CNN and self-attention. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pages 63–68. ACM.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.