AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets

Marco Polignano (marco.polignano@uniba.it), Pierpaolo Basile (pierpaolo.basile@uniba.it), Marco de Gemmis (marco.degemmis@uniba.it), Giovanni Semeraro (giovanni.semeraro@uniba.it)
University of Bari A. Moro, Dept. Computer Science, E. Orabona 4, Italy

Valerio Basile (valerio.basile@unito.it)
University of Turin, Dept. Computer Science, Via Verdi 8, Italy

Abstract

English. Recent scientific studies on natural language processing (NLP) report the outstanding effectiveness observed in the use of context-dependent and task-free language understanding models such as ELMo, GPT, and BERT. Specifically, they have proved to achieve state-of-the-art performance in numerous complex NLP tasks such as question answering and sentiment analysis in the English language. Following the great popularity and effectiveness that these models are gaining in the scientific community, we trained a BERT language understanding model for the Italian language (AlBERTo). In particular, AlBERTo is focused on the language used in social networks, specifically on Twitter. To demonstrate its robustness, we evaluated AlBERTo on the EVALITA 2016 task SENTIPOLC (SENTIment POLarity Classification), obtaining state-of-the-art results in subjectivity, polarity, and irony detection on Italian tweets. In order to facilitate future research, the pre-trained AlBERTo model will be publicly distributed through the GitHub platform at the following web address: https://github.com/marcopoli/AlBERTo-it

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The recent spread of pre-trained text representation models has enabled important progress in Natural Language Processing. In particular, numerous tasks such as part-of-speech tagging, question answering, machine translation, and text classification have obtained significant performance improvements through the use of distributional semantics techniques such as word embeddings. Mikolov et al. (2013) notably contributed to the genesis of numerous strategies for representing terms based on the idea that semantically related terms have similar vector representations. Technologies such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017) suffer from the problem that multiple concepts associated with the same term are not represented by different word embedding vectors in the distributional space (they are context-free). New strategies such as ELMo (Peters et al., 2018), GPT/GPT-2 (Radford et al., 2019), and BERT (Devlin et al., 2019) overcome this limit by learning a language understanding model for a contextual and task-independent representation of terms. In their multilingual version, they mainly use a mix of text obtained from large corpora in different languages to build a general language model to be reused for every application in any language. As reported in the BERT documentation, "the Multilingual model is somewhat worse than a single-language model. However, it is not feasible for us to train and maintain dozens of single-language models." This entails significant limitations related to the type of language learned (with respect to the document style) and the size of the vocabulary.
These reasons have led us to create the equivalent of the BERT model for the Italian language, and specifically for the language style used on Twitter: AlBERTo. This idea was supported by the intuition that many NLP tasks for the Italian language are carried out on social media data, both in business and research contexts.

2 Related Work

A task-independent sentence understanding model is based on the idea of creating a deep learning architecture, particularly an encoder and a decoder, so that the encoding level can be reused in more than one NLP task. In this way, it is possible to obtain a decoding level with weights optimized for the specific task (fine-tuning). A general-purpose encoder should, therefore, be able to provide an efficient representation of the terms, their position in the sentence, the context, the grammatical structure of the sentence, and the semantics of the terms. One of the first systems able to satisfy these requirements was ELMo (Peters et al., 2018), based on a large biLSTM neural network (2 biLSTM layers with 4096 units, 512-dimensional projections, and a residual connection from the first to the second layer) trained for 10 epochs on the 1B Word Benchmark (Chelba et al., 2013). The goal of the network was to predict the same starting sentence in the same initial language (like an autoencoder). It guarantees the correct handling of the polysemy of terms, and it demonstrated its efficacy on six different NLP tasks for which it obtained state-of-the-art results: question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis.

Following the basic idea of ELMo, another language model called GPT was developed in order to improve the performance on the tasks included in the GLUE benchmark (Wang et al., 2018). GPT replaces the biLSTM network with a Transformer architecture (Vaswani et al., 2017). A Transformer is an encoder-decoder architecture mainly based on feed-forward and multi-head attention layers. Moreover, in Transformers, terms are provided as input without a specific order, and consequently a positional vector is added to the term embeddings. Unlike ELMo, in GPT the weights of all levels of the network are optimized for each new task, and the complexity of the network (in terms of parameters) remains almost constant. Moreover, during the learning phase, the network does not limit itself to working on a single sentence but splits the text into spans to improve the predictive capacity and the generalization power of the network. The deep neural network used is a 12-layer decoder-only Transformer with masked self-attention heads (768-dimensional states and 12 attention heads) trained for 100 epochs on the BooksCorpus dataset (Zhu et al., 2015). This strategy proved to be successful compared to the results obtained by ELMo on the same NLP tasks.

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) was developed to work with a strategy very similar to GPT. In its basic version, it is also trained on a Transformer network with 12 levels, 768-dimensional states, and 12 attention heads, for a total of 110M parameters, trained on BooksCorpus (Zhu et al., 2015) and English Wikipedia for 1M steps. The main difference is that the learning phase scans each span of text in both directions, from left to right and from right to left, as was already done in biLSTMs. Moreover, BERT uses a "masked language model": during training, random terms are masked in order to be predicted by the net. Jointly, the network is also trained to predict whether a span of text follows the one given as input. These variations on the GPT model allow BERT to be the current state-of-the-art language understanding model. Larger versions of BERT (BERT large) and GPT (GPT-2) have been released and score better results than the base-scale models, but they require much more computational power. The base BERT model for English is exactly the same architecture used for learning the Italian language understanding model (AlBERTo), but we are considering the possibility of developing a large version of it soon.
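To make the "masked language model" strategy concrete, the following is a minimal illustrative sketch, not taken from the BERT or AlBERTo code: each token is hidden with a fixed probability and the network is trained to recover it. The 15% probability mirrors the MASKED_LM_PROB value reported later in Section 3.3; the function name and the sample tweet are assumptions made only for illustration.

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Illustrative masking step of a masked language model: each token
    is hidden with probability `mask_prob`, and the model is trained to
    predict the original token at the masked positions."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # target the network must recover
        else:
            masked.append(tok)
            labels.append(None)     # position not used in the loss
    return masked, labels

# Example usage on an (already tokenized) Italian tweet
tokens = ["buongiorno", "a", "tutti", "gli", "amici", "di", "twitter"]
print(mask_tokens(tokens))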
3 AlBERTo

As pointed out in the previous sections, the aim of this work is to create a linguistic resource for Italian that follows the most recent strategies used to address NLP problems in English. It is well known that the language used on social networks differs from the formal one, also as a consequence of the presence of mentions, uncommon terms, links, and hashtags that are not found elsewhere. Moreover, many language models, in their multilingual version, do not perform well in every specific language, especially with a writing style different from that of books and encyclopedic descriptions (Polignano et al., 2019). AlBERTo aims to be the first Italian language understanding model to represent the social media language, Twitter in particular, written in Italian. The model proposed in this work is based on the software distributed through GitHub by Devlin et al. (2019) with the endorsement of Google (https://github.com/google-research/bert/). It has been trained, without issues, on text spans containing typical social media characters, including emojis, links, hashtags, and mentions.

[Figure 1: BERT and AlBERTo learning strategy]

Figure 1 shows the BERT and AlBERTo learning strategy. The "masked learning" is applied on a 12x Transformer encoder, where, for each input, a percentage of terms is hidden and then predicted in order to optimize the network weights through back-propagation. In AlBERTo, we implement only the "masked learning" strategy, excluding the step based on the "next following sentence". This is a crucial aspect to be aware of because, in the case of tweets, we do not have cognition of a flow of tweets as happens in a dialog. For this reason, we are aware that AlBERTo is not suitable for the task of question answering, where this property is essential. On the contrary, the model is well suited for classification and prediction tasks. The decision to train AlBERTo excluding the "next following sentence" strategy makes the model similar in purpose to ELMo. Differently from it, BERT and AlBERTo use a Transformer architecture instead of a biLSTM, which has been demonstrated to perform better in natural language processing tasks. In any case, we are considering the possibility of learning an Italian ELMo model and comparing it with the model proposed here.

3.1 Text Preprocessing

In order to tailor the tweet text to BERT's input structure, it is necessary to carry out preprocessing operations. More specifically, using Python as the programming language, two libraries were mainly adopted: Ekphrasis (Baziotis et al., 2017) and SentencePiece (Kudo, 2018; https://github.com/google/sentencepiece). Ekphrasis is a popular tool comprising an NLP pipeline for text extracted from Twitter. It has been used for:

• Normalizing URLs, emails, mentions, percents, money, time, dates, phone numbers, numbers, and emoticons;

• Tagging and unpacking hashtags.

The normalization phase consists in replacing each matched term with a fixed tag of the form <[entity type]>. The tagging phase consists in enclosing hashtags between two tags, <hashtag> ... </hashtag>, marking their beginning and end in the sentence. Whenever possible, the hashtag has been unpacked into known words. The text is cleaned and made easily readable by the network by converting it to lowercase; all characters except emojis, "!", "?", and accented characters have been deleted. An example of a preprocessed tweet is shown in Figure 2, and a sketch of this pipeline is given below.

[Figure 2: Example of preprocessed tweet]
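The following is a minimal sketch of how such a pipeline can be set up with Ekphrasis' TextPreProcessor. It is not the exact configuration used for AlBERTo: the specific options shown (which entities to normalize, hashtag annotation and unpacking) are assumptions derived from the description above, and the sample tweet is invented.

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

# Assumed configuration: normalize the entity types listed above and
# annotate/unpack hashtags, as described in Section 3.1.
text_processor = TextPreProcessor(
    normalize=['url', 'email', 'user', 'percent', 'money', 'time',
               'date', 'phone', 'number'],
    annotate={'hashtag'},          # wraps hashtags with <hashtag> ... </hashtag>
    unpack_hashtags=True,          # splits hashtags into known words when possible
    segmenter='twitter',
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

tweet = "Buongiorno #buonagiornata a tutti! https://example.com @amico"
print(" ".join(text_processor.pre_process_doc(tweet)))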
SentencePiece is a segmentation algorithm used for learning the best strategy to split text into terms in an unsupervised and language-independent way. It can process up to 50k sentences per second and generate an extensive vocabulary. The vocabulary includes the most common terms in the training set and the subwords occurring in the middle of words, the latter annotated with "##", in order to be able to encode also slang, incomplete, or uncommon words. An extract of the vocabulary generated for AlBERTo is shown in Figure 3. SentencePiece also produces a tokenizer, used to generate a list of tokens for each tweet, which are further processed by BERT's create_pretraining_data.py module. A minimal example of this segmentation step is sketched below.

[Figure 3: An extract of the vocabulary created by SentencePiece for AlBERTo]
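As an illustration of this step, the sketch below trains a small SentencePiece model on a file of preprocessed tweets and segments one of them. The file name, model prefix, vocabulary size, and sample sentence are placeholder assumptions (the actual AlBERTo vocabulary has 128k entries, as reported in Section 3.3).

import sentencepiece as spm

# Train a subword model on the preprocessed tweets (one tweet per line).
# "tweets.txt" and the 32k vocabulary are placeholders for illustration.
spm.SentencePieceTrainer.train(
    input="tweets.txt",
    model_prefix="alberto_sp",
    vocab_size=32000,
)

# Load the resulting model and segment a tweet into subword tokens.
sp = spm.SentencePieceProcessor(model_file="alberto_sp.model")
print(sp.encode("buongiorno a tutti gli amici di twitter", out_type=str))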
3.2 Dataset

The dataset used for the learning phase of AlBERTo is TWITA (Basile et al., 2018), a huge corpus of tweets in the Italian language collected from February 2012 to the present day through Twitter's official streaming API. In our configuration, we randomly selected 200 million tweets, removing re-tweets, and processed them with the preprocessing pipeline described previously. In total, we obtained 191 GB of raw data.

3.3 Learning Configuration

The AlBERTo model has been trained using the following configuration:

bert_base_config = {
    "attention_probs_dropout_prob": 0.1,
    "directionality": "bidi",
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pooler_fc_size": 768,
    "pooler_num_attention_heads": 12,
    "pooler_num_fc_layers": 3,
    "pooler_size_per_head": 128,
    "pooler_type": "first_token_transform",
    "type_vocab_size": 2,
    "vocab_size": 128000
}

# Input data pipeline config
TRAIN_BATCH_SIZE = 128
MAX_PREDICTIONS = 20
MAX_SEQ_LENGTH = 128
MASKED_LM_PROB = 0.15

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000
SAVE_CHECKPOINTS_STEPS = 2500
NUM_TPU_CORES = 8

The training has been performed on the Google Colaboratory environment (Colab, https://colab.research.google.com), using an 8-core Google TPU-V2 (https://cloud.google.com/tpu/) and a Google Cloud Storage bucket (https://cloud.google.com/storage/). In total, it took ∼50 hours to create a complete AlBERTo model. More technical details are available in the notebook "Italian Pre-training BERT from scratch with cloud TPU" in the project repository.

4 Evaluation and Discussion of Results

We evaluate AlBERTo on a sentiment analysis task for the Italian language. In particular, we decided to use the data released for the SENTIPOLC (SENTIment POLarity Classification) shared task (Barbieri et al., 2016) carried out at EVALITA 2016 (Basile et al., 2016), whose tweets come from a distribution different from the one used for training AlBERTo. It includes three subtasks:

• Subjectivity Classification: "a system must decide whether a given message is subjective or objective";

• Polarity Classification: "a system must decide whether a given message is of positive, negative, neutral or mixed sentiment";

• Irony Detection: "a system must decide whether a given message is ironic or not".

The data provided for training and test are tagged with six fields containing values related to manual annotation: subj, opos, oneg, iro, lpos, lneg. These labels describe, respectively, whether the sentence is subjective, positive, negative, ironic, literally positive, and literally negative. For each of these classes, the value is 1 where the sentence satisfies the label and 0 otherwise. The last two labels, "lpos" and "lneg", which describe the literal polarity of the tweet, have not been considered in the current evaluation (nor in the official shared task evaluation). In total, 7,410 tweets have been released for training and 2,000 for testing. We did not use any validation set because we did not perform any model selection during the fine-tuning of AlBERTo. The evaluation was performed considering precision (p), recall (r), and F1-score (F1) for each class and for each classification task.

AlBERTo fine-tuning. We fine-tuned AlBERTo four different times, in order to obtain one classifier for each task, except for polarity, where we have two of them. In particular, we created one classifier for Subjectivity Classification, one for Polarity Positive, one for Polarity Negative, and one for Irony Detection. Each time we re-trained the model for three epochs, using a learning rate of 2e-5 with 1,000 steps per loop on batches of 512 examples from the training set of the specific task. For the fine-tuning of the Irony Detection classifier, we increased the number of training epochs to ten, after observing low performance with the three epochs used for the other classification tasks. The fine-tuning process lasted ∼4 minutes every time.
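For clarity on how the scores in the following tables relate to each other: the task-level F reported in Table 2 matches the average of the class-0 and class-1 F1 scores in Table 1 (and, for polarity, the further average over the positive and negative classifiers). The short sketch below is for illustration only and simply reproduces this aggregation from the published values; it is our reading of the official scoring, not the authors' evaluation code.

def f1(precision, recall):
    """Standard F1-score from precision and recall."""
    return 2 * precision * recall / (precision + recall)

def task_f(f1_class0, f1_class1):
    """Task-level F as the average of the per-class F1 scores."""
    return (f1_class0 + f1_class1) / 2

# Class-0 F1 for subjectivity, recomputed from Table 1 precision/recall.
print(round(f1(0.6838, 0.8058), 4))  # -> 0.7398

# Subjectivity task-level F from the two per-class F1 scores.
print(task_f(0.7398, 0.8415))        # -> 0.79065, reported as 0.7906 in Table 2

# Polarity combines the positive and negative classifiers.
f_pos = task_f(0.8755, 0.5554)       # 0.71545 -> 0.7155
f_neg = task_f(0.8277, 0.6305)       # 0.7291
print((f_pos + f_neg) / 2)           # -> 0.7223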
                 Prec. 0   Rec. 0    F1 0
Subjectivity     0.6838    0.8058    0.7398
Polarity Pos.    0.9262    0.8301    0.8755
Polarity Neg.    0.7537    0.9179    0.8277
Irony            0.9001    0.9853    0.9408

                 Prec. 1   Rec. 1    F1 1
Subjectivity     0.8857    0.8015    0.8415
Polarity Pos.    0.5818    0.5314    0.5554
Polarity Neg.    0.7988    0.5208    0.6305
Irony            0.6176    0.1787    0.2772

Table 1: Results obtained using the official evaluation script of SENTIPOLC 2016 (class 1 = label present, class 0 = label absent)

System            Obj       Subj      F
AlBERTo           0.7398    0.8415    0.7906
Unitor.1.u        0.6784    0.8105    0.7444
Unitor.2.u        0.6723    0.7979    0.7351
samskara.1.c      0.6555    0.7814    0.7184
ItaliaNLP.2.c     0.6733    0.7535    0.7134

System            Pos       Neg       F
AlBERTo           0.7155    0.7291    0.7223
UniPI.2.c         0.6850    0.6426    0.6638
Unitor.1.u        0.6354    0.6885    0.6620
Unitor.2.u        0.6312    0.6838    0.6575
ItaliaNLP.1.c     0.6265    0.6743    0.6504

System            Non-Iro   Iro       F
AlBERTo           0.9408    0.2772    0.6090
tweet2check16.c   0.9115    0.1710    0.5412
CoMoDI.c          0.8993    0.1509    0.5251
tweet2check14.c   0.9166    0.1159    0.5162
IRADABE.2.c       0.9241    0.1026    0.5133

Table 2: Comparison of results with the best systems of SENTIPOLC for each classification task

Discussion of the results. The results reported in Table 1 show the output obtained from the official evaluation script of SENTIPOLC 2016. It is important to note that the per-class values of precision, recall, and F1 cannot be compared with those of the systems that participated in the competition, because they are not reported in the overview paper of the task. Nevertheless, some considerations can be drawn. The classifier based on AlBERTo achieves, on average, high recall on class 0 and low values on class 1. The opposite situation is observed for precision, where for class 1 it is on average superior to the recall values. This suggests that the system is very good at recognizing a phenomenon and, when it does, it is sure of the prediction made, even at the cost of generating false negatives.

On each of the SENTIPOLC subtasks, it can be observed that AlBERTo has obtained state-of-the-art results without any heuristic tuning of the learning parameters (the model is used as it is after fine-tuning), except in the case of irony detection, where it was necessary to increase the number of epochs of the fine-tuning phase. Comparing AlBERTo with the best system of each subtask, we observe an improvement between 7% and 11%. The results obtained are, from our point of view, encouraging for further future work.

5 Conclusion

In this work, we described AlBERTo, the first Italian language understanding model based on the social media writing style. The model has been trained using the official BERT source code on a Google TPU-V2 on 200M tweets in the Italian language. The pre-trained model has been fine-tuned on the data available for the classification task SENTIPOLC 2016, showing state-of-the-art results. The results allow us to promote AlBERTo as the starting point for future research in this direction. Model repository: https://github.com/marcopoli/AlBERTo-it

Acknowledgment

The work of Marco Polignano is funded by the project "DECiSION" (codice raggruppamento: BQS5153), under the Apulian INNONETWORK programme, Italy. The work of Valerio Basile is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618_L2_BOSC_01).
References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, Rachele Sprugnoli, et al. 2016. EVALITA 2016: Overview of the 5th evaluation campaign of natural language processing and speech tools for Italian. In 3rd Italian Conference on Computational Linguistics, CLiC-it 2016, and 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, EVALITA 2016, volume 1749, pages 1–4. CEUR-WS.

Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. Long-term social media data collection at the University of Turin. In Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 1–6. CEUR-WS.

Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754, Vancouver, Canada, August. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. pages 2227–2237, June.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, and Giovanni Semeraro. 2019. A comparison of word-embeddings in emotion detection from text using BiLSTM, CNN and self-attention. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pages 63–68. ACM.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.