bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)

Egon W. Stemle
Institute for Specialised Communication and Multilingualism
EURAC Research
Bolzano/Bozen, Italy
egon.stemle@eurac.edu

Abstract

English. This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language, EVALITA 2016.
The work is a continuation of Stemle (2016) with minor modifications to the system and different data sets. It combines a small selection of trending techniques, which implement matured methods, from NLP and ML to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Italian UD corpus, DiDi and PoSTWITA) and unlabelled data (Italian C4Corpus and PAISÀ) were used for training.
The system is available under the APLv2 open-source license.

Italiano. Questo articolo descrive il sistema che ha partecipato al task POS tagging for Italian Social Media Texts (PoSTWITA) nell'ambito di EVALITA 2016, la 5ª campagna di valutazione periodica del Natural Language Processing (NLP) e delle tecnologie del linguaggio.
Il lavoro è un proseguimento di quanto descritto in Stemle (2016), con modifiche minime al sistema e insiemi di dati differenti. Il lavoro combina alcune tecniche correnti che implementano metodi comprovati dell'NLP e del Machine Learning, per raggiungere risultati competitivi nel PoS tagging dei testi italiani di Twitter. In particolare il sistema utilizza strategie di word embedding e di rappresentazione character-level di inizio e fine parola, in un'architettura LSTM RNN. Dati etichettati (Italian UD corpus, DiDi e PoSTWITA) e dati non etichettati (Italian C4Corpus e PAISÀ) sono stati utilizzati in fase di training.
Il sistema è disponibile sotto licenza open source APLv2.

1 Introduction

Part-of-speech (PoS) tagging is an essential processing stage for virtually all NLP applications. Subsequent tasks, like parsing, named-entity recognition, event detection, and machine translation, often utilise PoS tags and benefit (directly or indirectly) from accurate tag sequences.
For many years, however, work on PoS tagging concentrated mainly on standardized texts, and frequent phenomena in computer-mediated communication (CMC) and Web corpora such as emoticons, acronyms, interaction words, iteration of letters, graphostylistics, shortenings, addressing terms, spelling variations, and boilerplate (Androutsopoulos, 2007; Bernardini et al., 2008; Beißwenger, 2013) still deteriorate the performance of PoS taggers (Giesbrecht and Evert, 2009; Baldwin et al., 2013).
On the other hand, the interest in automatic evaluation of social media texts, in particular microblogging texts such as tweets, has been growing considerably, and specialised tools for Twitter data have become available for different languages. Italian, however, completely lacks such resources, both regarding annotated corpora and specific PoS-tagging tools [1]. To this end, the POS tagging for Italian Social Media Texts (PoSTWITA) task was proposed for EVALITA 2016, concerning the domain adaptation of PoS taggers to Twitter texts.
Our system combined word2vec (w2v) word embeddings (WEs) with a single-layer Long Short-Term Memory (LSTM) recurrent neural network (RNN) architecture. The sequence of unlabelled w2v representations of words is accompanied by the sequence of n-grams of the word beginnings and endings, and is fed into the RNN, which in turn predicts PoS labels.
The paper is organised as follows: we present our system design in Section 2, the implementation in Section 3, and its evaluation in Section 4. Section 5 concludes with an outlook on possible implementation improvements.

[1] http://www.evalita.it/2016/tasks/postwita
2 Design

Overall, our design takes inspiration from as far back as Benello et al. (1989), who used four preceding words and one following word in a feed-forward neural network with backpropagation for PoS tagging; it builds upon the strong foundation laid down by Collobert et al. (2011) for a neural network (NN) architecture and learning algorithm that can be applied to various natural language processing tasks; and ultimately it is a variation of Nogueira dos Santos and Zadrozny (2014), who trained a NN for PoS tagging with character-level and WE representations of words.
Also note that an earlier version of the system was used in Stemle (2016) to participate in the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media (Beißwenger et al., 2016).

2.1 Word Embeddings

Recently, state-of-the-art results on various linguistic tasks were accomplished by architectures using neural-network based WEs. Baroni et al. (2014) conducted a set of experiments comparing the popular w2v (Mikolov et al., 2013a; Mikolov et al., 2013b) implementation for creating WEs to other distributional methods with state-of-the-art results across various (semantic) tasks. These results suggest that the word embeddings substantially outperform the other architectures on semantic similarity and analogy detection tasks. Subsequently, Levy et al. (2015) conducted a comprehensive set of experiments and comparisons that suggest that much of the improved results are due to the system design and parameter optimizations, rather than the selected method. They conclude that "there does not seem to be a consistent significant advantage to one approach over the other".
Word embeddings provide high-quality, low-dimensional vector representations of words from large corpora of unlabelled data, and the representations, typically computed using NNs, encode many linguistic regularities and patterns (Mikolov et al., 2013b).

2.2 Character-Level Sub-Word Information

The morphology of a word is opaque to WEs, and the relatedness of the meaning of a lemma's different word forms, i.e. its different string representations, is not systematically encoded. This means that in morphologically rich languages with long-tailed frequency distributions, even some WE representations for word forms of common lemmata may become very poor (Kim et al., 2015).
We agree with Nogueira dos Santos and Zadrozny (2014) and Kim et al. (2015) that sub-word information is very important for PoS tagging, and therefore we augment the WE representations with character-level representations of the word beginnings and endings; thereby, we also stay language agnostic (at least, as much as possible) by avoiding the need for often language-specific morphological pre-processing.
2.3 Recurrent Neural Network Layer

Language models are a central part of NLP. They are used to place distributions over word sequences that encode systematic structural properties of the sample of linguistic content they are built from, and can then be used on novel content, e.g. to rank it or to predict some feature on it. For a detailed overview of language modelling research see Mikolov (2012).
A straightforward approach to incorporating WEs into feature-based language models is to use the embeddings' vector representations as features [2]. Having said that, WEs are also used in NN architectures, where they constitute (part of) the input to the network.
Neural networks consist of a large number of simple, highly interconnected processing nodes in an architecture loosely inspired by the structure of the cerebral cortex of the brain (O'Reilly and Munakata, 2000). The nodes receive weighted inputs through these connections and fire according to their individual thresholds of the shared activation function. A firing node passes on an activation to all successive connected nodes. During learning, the input is propagated through the network and the output is compared to the desired output. Then the weights of the connections (and the thresholds) are adjusted step-wise so as to more closely resemble a configuration that would produce the desired output. After all input cases have been presented, the process typically starts over again, and the output values will usually be closer to the correct values.
RNNs are NNs in which the connections between the elements form directed cycles, i.e. the networks have loops, and this enables them to model sequential dependencies of the input. However, regular RNNs have fundamental difficulties learning long-term dependencies, and special kinds of RNNs need to be used (Hochreiter, 1991); a very popular kind is the so-called long short-term memory (LSTM) network proposed by Hochreiter and Schmidhuber (1997).
Overall, with this design we not only benefit from available labelled data but also from available general or domain-specific unlabelled data.

[2] For an overview see, e.g., Turian et al. (2010).

3 Implementation

We maintain the implementation in a source code repository at https://github.com/bot-zen/. The version tagged as 1.1 comprises the version that was used to generate the results submitted to the shared task (ST).
Our system feeds WEs and character-level sub-word information into a single-layer RNN with an LSTM architecture.

3.1 Word Embeddings

When computing WEs we take into consideration Levy et al. (2015): they observed that one specific configuration of w2v, namely the skip-gram model with negative sampling (SGNS), "is a robust baseline. While it might not be the best method for every task, it does not significantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory consumption". Coincidentally, Mikolov et al. (2013b) also suggest using SGNS. We incorporate w2v's original C implementation for learning WEs [3] in an independent pre-processing step, i.e. we pre-compute the WEs. Then, we use gensim [4], a Python tool for unsupervised semantic modelling from plain text, to load the pre-computed data and to compute the vector representations of input words for our NN.

[3] https://code.google.com/archive/p/word2vec/
[4] https://radimrehurek.com/gensim/
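As a minimal sketch of this two-step setup (not the released code), the pre-computed vectors could be loaded and queried with gensim roughly as follows; the file name and the helper function are illustrative assumptions only.

import gensim

# Load vectors that were pre-computed with the original word2vec C tool
# (hypothetical file name; the C tool writes this binary format).
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(
    'didi_ud_postwita_500.bin', binary=True)

def embedding_for(token):
    """Return the 500-dimensional vector for a token, or None if it is unknown.

    Unknown words are handled separately when the network input is built
    (see Section 3.3).
    """
    if token in word_vectors:
        return word_vectors[token]
    return None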
3.2 Character-Level Sub-Word Information

Our implementation uses a one-hot encoding with a few additional features for representing sub-word information. The one-hot encoding transforms a categorical feature into a vector where the categories are represented by equally many dimensions with binary values. We convert a letter to lower-case and use the sets of ASCII characters, digits, and punctuation marks as categories for the encoding. Then, we add dimensions to represent further binary features: 'uppercase' (was upper-case prior to conversion), 'digit' (is a digit), 'punctuation' (is a punctuation mark), 'whitespace' (is white space other than the new-line character; note that this category is usually empty, because we expect our tokens not to include white-space characters), and 'unknown' (any other character, e.g. diacritics). This results in vectors with more than a single one-hot dimension.
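The sketch below illustrates this per-character encoding. The exact category set, and therefore the exact dimensionality (80 per character in the feature counts of Section 3.3), as well as the helper name, are illustrative assumptions rather than the released code.

import string

# One-hot categories: lower-cased ASCII letters, digits, and punctuation;
# the trailing dimensions hold the additional binary flags described above.
CATEGORIES = list(string.ascii_lowercase + string.digits + string.punctuation)
FLAGS = ['uppercase', 'digit', 'punctuation', 'whitespace', 'unknown']

def encode_char(ch):
    """Encode a single character as a binary feature vector (sketch)."""
    vec = [0] * (len(CATEGORIES) + len(FLAGS))
    low = ch.lower()
    if low in CATEGORIES:
        vec[CATEGORIES.index(low)] = 1
    if ch.isupper():
        vec[len(CATEGORIES) + FLAGS.index('uppercase')] = 1
    if ch.isdigit():
        vec[len(CATEGORIES) + FLAGS.index('digit')] = 1
    if ch in string.punctuation:
        vec[len(CATEGORIES) + FLAGS.index('punctuation')] = 1
    if ch.isspace() and ch != '\n':
        vec[len(CATEGORIES) + FLAGS.index('whitespace')] = 1
    if not any(vec):
        # Anything else, e.g. diacritics, falls into the 'unknown' dimension.
        vec[len(CATEGORIES) + FLAGS.index('unknown')] = 1
    return vec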
3.3 Recurrent Neural Network Layer

Our implementation uses Keras, a high-level NN library written in Python and capable of running on top of either TensorFlow or Theano (Chollet, 2015). In our case it runs on top of Theano, a Python library that allows one to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently (The Theano Development Team et al., 2016).
The inputs to our network are sequences of the same length as the sentences we process. During training, we group sentences of the same length into batches and process the batches in order of increasing sentence length. Each single word in a sequence is represented by its sub-word information and two WEs that come from two sources (see Section 4). For unknown words, i.e. words without a pre-computed WE, we first try to find the most similar WE considering 10 surrounding words. If this fails, the unknown word is mapped to a randomly generated vector representation. In total, each word is represented by 2,280 features: two times 500 (WEs), and sixteen times 80 for the two 8-grams (word beginning and ending). If words are shorter than 8 characters, their 8-grams are zero-padded.
This sequential input is fed into an LSTM layer that, in turn, projects to a fully connected output layer with a softmax activation function. During training we use dropout for the projection into the output layer, i.e. we set a fraction (0.5) of the input units to 0 at each update, which helps prevent overfitting (Srivastava et al., 2014). We use categorical cross-entropy as the loss function and backpropagation in conjunction with the RMSprop optimization for learning. At the time of writing, this was the Keras default, or the explicitly documented option to be used, for our type of architecture.
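A minimal Keras sketch of this architecture is given below, assuming the 2,280-dimensional per-token input described above. The hidden-layer size and the size of the tag set are placeholders (the paper does not report them), and the code is illustrative rather than the released implementation.

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, TimeDistributed

N_FEATURES = 2280  # 2 x 500 WE dimensions + 16 x 80 character features
N_TAGS = 22        # placeholder: size of the UD + Twitter-specific tag set
HIDDEN = 128       # placeholder: the hidden size is not reported in the paper

model = Sequential([
    # One time step per token; batches contain sentences of equal length.
    LSTM(HIDDEN, return_sequences=True, input_shape=(None, N_FEATURES)),
    Dropout(0.5),  # dropout (0.5) on the projection into the output layer
    TimeDistributed(Dense(N_TAGS, activation='softmax')),
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')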
4 Results

We used our slightly modified implementation to participate in the POS tagging for Italian Social Media Texts (PoSTWITA) shared task (ST) of the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language, EVALITA 2016. First, we describe the corpora used for training, and then the specific system configuration(s) for the ST.

4.1 Training Data for w2v and PoS Tagging

4.1.1 DiDi-IT (PoS, w2v)

didi-it (Frey et al., 2016) (version September 2016) is the Italian sub-part of the DiDi corpus, a corpus of South Tyrolean German and Italian from Facebook (FB) users' wall posts, comments on wall posts, and private messages.
The Italian part consists of around 100,000 tokens collected from 20 profiles of Facebook users residing in South Tyrol. This version has about 20,000 PoS tags semi-automatically corrected by a single annotator.
The anonymised corpus is freely available for research purposes.

4.1.2 Italian UD (PoS, w2v)

Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages [5].
The italian-UD corpus [6] (version from January 2015) was originally obtained by conversion from ISDT (Italian Stanford Dependency Treebank) and released for the dependency parsing ST of EVALITA 2014 (Bosco et al., 2014). The corpus has semi-automatically converted PoS tags from the original two Italian treebanks, which differ both in corpus composition and in the adopted annotation schemes.
The corpus contains around 317,000 tokens in around 13,000 sentences from different sources and genres. It is available under the CC BY-NC-SA 3.0 license [7].

[5] http://universaldependencies.org/
[6] http://universaldependencies.org/it/overview/introduction.html
[7] Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported, i.e. the data can be copied, redistributed, and adapted for purposes other than commercial ones. See https://creativecommons.org/licenses/by-nc-sa/3.0/ for more details.

4.1.3 PoSTWITA (PoS and w2v)

postwita is the Twitter data made available by the organizers of the ST. It contains tweets from the EVALITA 2014 SENTIPLOC corpus: the development and test sets plus additional tweets from the same period of time were manually annotated, for a global amount of 6,438 tweets (114,967 tokens), and were distributed as the development set. The data is PoS tagged according to UD, but with the additional insertion of seven Twitter-specific tags. All the annotations were carried out by three different annotators. The data was only distributed to the task participants.

4.1.4 C4Corpus (w2v)

c4corpus [8] is a full-document Italian Web corpus that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date. See Habernal et al. (2016) for details about the corpus construction pipeline and other information about the corpus.
The corpus contains about 670m tokens in 22m sentences. The data is available under the CreativeCommons license family.

[8] https://github.com/dkpro/dkpro-c4corpus

4.1.5 PAISÀ (w2v)

paisa (Lyding et al., 2014) is a corpus of authentic contemporary Italian texts from the web (harvested in September/October 2010). It was created in the context of the project PAISÀ (Piattaforma per l'Apprendimento dell'Italiano Su corpora Annotati) with the aim to provide a large resource of freely available Italian texts for language learning by studying authentic text materials.
The corpus contains about 270m tokens in about 8m sentences. The data is available under the CC BY-NC-SA 3.0 license [9].

[9] https://creativecommons.org/licenses/by-nc-sa/3.0/

4.2 PoSTWITA shared task

For the ST we used one overall configuration for the system but three different corpus configurations for training; however, only one corpus configuration was entered into the ST. We used PoS tags from didi-it + postwita (run 1), from italian-UD (run 2), and from both (run 3).
For w2v we trained a 500-dimensional skip-gram model on didi-it + italian-UD + postwita that ignored all words with less than 2 occurrences within a window size of 10; it was trained with negative sampling (value 15). We also trained a 500-dimensional skip-gram model on c4corpus + paisa that ignored all words with less than 33 occurrences within a window size of 10; it was also trained with negative sampling (value 15). The other w2v parameters were left at their default settings [10].

[10] -sample 1e-3 -iter 5 -alpha 0.025
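The models themselves were trained with the original word2vec C tool; the following gensim call is only a sketch of an approximately equivalent configuration for the didi-it + italian-UD + postwita model (the c4corpus + paisa model would use min_count=33). The sentence iterator and the output file name are assumptions.

from gensim.models import Word2Vec

# `sentences` is assumed to be an iterable of tokenised sentences
# (lists of token strings) built from the respective corpora.
model = Word2Vec(
    sentences,
    vector_size=500,  # 500-dimensional vectors
    sg=1,             # skip-gram
    negative=15,      # negative sampling (value 15)
    window=10,        # window size 10
    min_count=2,      # ignore words with fewer than 2 occurrences
    sample=1e-3,      # remaining w2v defaults: -sample 1e-3
    epochs=5,         #                         -iter 5
    alpha=0.025,      #                         -alpha 0.025
)
model.wv.save_word2vec_format('didi_ud_postwita_500.bin', binary=True)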
The evaluation of the systems was done by the organisers on unlabelled but pre-tokenised data (4,759 tokens in 301 tweets) and was based on a token-by-token comparison. The considered metric was accuracy, i.e. the number of correctly assigned PoS tags divided by the total number of tokens.

Run  Training data (PoS)                     Accuracy (%)
(1)  didi-it + postwita                      76.00
(2)  italian-UD                              80.54
(3)  didi-it + postwita + italian-UD         81.61
     Winning Team                            93.19

Table 1: Official result(s) of our PoS tagger for the three runs on the PoSTWITA ST data.

We believe the unexpectedly small performance gain from utilizing the much larger italian-UD data over the rather small didi-it + postwita data may be rooted in the insertion of Twitter-specific tags into the data (see 4.1.3), something we did not account for, i.e. 18,213 of 289,416 and, more importantly, 7,778 of 12,677 sentences had imperfect information during training.

5 Conclusion & Outlook

We presented our submission to the PoSTWITA task of EVALITA 2016, where we participated with moderate results. In the future, we will try to rerun the experiment with training data that takes into consideration the Twitter-specific tags of the task.

Acknowledgments

The computational results presented have been achieved in part using the Vienna Scientific Cluster (VSC).

References

Jannis K. Androutsopoulos. 2007. Neue Medien – neue Schriftlichkeit? Mitteilungen des Deutschen Germanistenverbandes, 1:72–97.

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356–364, Nagoya, Japan, October. Asian Federation of Natural Language Processing. http://aclweb.org/anthology/I13-1041.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247. Association for Computational Linguistics. http://www.aclweb.org/anthology/P14-1023.

Michael Beißwenger, Sabine Bartsch, Stefan Evert, and Kay-Michael Würzner. 2016. EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication, Social Media and Web Corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pages 78–90, Berlin, Germany. Association for Computational Linguistics.

Michael Beißwenger. 2013. Das Dortmunder Chat-Korpus: ein annotiertes Korpus zur Sprachverwendung und sprachlichen Variation in der deutschsprachigen Chat-Kommunikation. LINSE - Linguistik Server Essen, pages 1–13.

Julian Benello, Andrew W. Mackie, and James A. Anderson. 1989. Syntactic category disambiguation with neural networks. Computer Speech & Language, 3(3):203–217, July. http://www.sciencedirect.com/science/article/pii/0885230889900181.

Silvia Bernardini, Marco Baroni, and Stefan Evert. 2008. A WaCky Introduction. In Wacky! Working papers on the Web as Corpus, pages 9–40. GEDIT, Bologna, Italy. http://wackybook.sslmit.unibo.it/pdfs/bernardini.pdf.

Cristina Bosco, Felice Dell'Orletta, Simonetta Montemagni, Manuela Sanguinetti, and Maria Simi. 2014. The EVALITA 2014 Dependency Parsing Task. In Proceedings of CLiC-it 2014 and EVALITA 2014, pages 1–8. Pisa University Press.

François Chollet. 2015. Keras: Deep Learning library for Theano and TensorFlow. https://github.com/fchollet/keras.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537. https://arxiv.org/abs/1103.0398.

Jennifer-Carmen Frey, Aivars Glaznieks, and Egon W. Stemle. 2016. The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts. Upcoming.

Eugenie Giesbrecht and Stefan Evert. 2009. Is Part-of-Speech Tagging a Solved Task? An Evaluation of POS Taggers for the German Web as Corpus. In Web as Corpus Workshop (WAC5). http://sigwac.org.uk/raw-attachment/wiki/WAC5/WAC5_proceedings.pdf#page=27.

Ivan Habernal, Omnia Zayed, and Iryna Gurevych. 2016. C4Corpus: Multilingual Web-size corpus with free license. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), page (to appear), Portorož, Slovenia, May. European Language Resources Association (ELRA).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780, November. http://dx.doi.org/10.1162/neco.1997.9.8.1735.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU München.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-Aware Neural Language Models. CoRR, abs/1508.06615. http://arxiv.org/abs/1508.06615.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225. https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/570.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 36–43, Gothenburg, Sweden. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781. http://arxiv.org/abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. CoRR, abs/1310.4546, October. http://arxiv.org/abs/1310.4546.

Tomáš Mikolov. 2012. Statistical Language Models Based on Neural Networks. Ph.D. thesis, Brno University of Technology. http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning Character-level Representations for Part-of-Speech Tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826. http://jmlr.org/proceedings/papers/v32/santos14.pdf.

Randall C. O'Reilly and Yuko Munakata. 2000. Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain. MIT Press. http://books.google.com/books?id=BLf34BFTaIUC.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958.

Egon W. Stemle. 2016. bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data). In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pages 115–119. Association for Computational Linguistics.

The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. CoRR, abs/1605.02688. http://arxiv.org/abs/1605.02688.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 384–394, Stroudsburg, PA, USA. Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1858681.1858721.