(Better than) State-of-the-Art PoS-tagging for Italian Texts

Fabio Tamburini
FICLIT - University of Bologna, Italy
fabio.tamburini@unibo.it

Abstract

This paper presents some experiments for the construction of a high-performance PoS-tagger for Italian using deep neural network (DNN) techniques integrated with a powerful Italian morphological analyser. The results obtained by the proposed system on standard datasets taken from the EVALITA campaigns show large accuracy improvements when compared with previous systems from the literature.

1 Introduction

In recent years a large number of works have tried to push the accuracy of the PoS-tagging task forward using new techniques, mainly from the deep learning domain (Collobert et al., 2011; Søgaard, 2011; dos Santos and Zadrozny, 2014; Huang et al., 2015; Wang et al., 2015; Chiu and Nichols, 2016).

All these studies are mainly devoted to showing how to find the best combination of new neural network structures and character/word embeddings for reaching the highest classification performance, and typically present solutions that do not make any use of specific language resources (e.g. morphological analysers, gazetteers, guessing procedures for unknown words, etc.). This is, in general, a very desirable feature because it allows for the production of tools not tied to any specific language, but in various evaluation campaigns, at least for highly-inflected languages such as Italian, the results showed quite clearly that this task would benefit from the use of specific and rich language resources (Tamburini, 2007; Attardi and Simi, 2009).

In this study, still work in progress, we set up a PoS-tagger for Italian able to reach the highest classification performance by using any available language resource and the most up-to-date DNNs. We used AnIta (Tamburini and Melandri, 2012), one of the most powerful morphological analysers for Italian, based on a wide lexicon (about 110,000 lemmas), to provide the PoS-tagger with a large set of useful information.

2 Input features

The set of input features for each token is basically formed by two different components: the word embedding and some morphological information.

2.1 Word Embeddings

All the embeddings used in our experiments were extracted from the CORIS corpus (Rossini Favretti et al., 2002), a 130Mw synchronic reference corpus for Italian, by using the tool word2vec (https://code.google.com/archive/p/word2vec/) (Mikolov et al., 2013). We added two special tokens to mark the beginning and the end of each sentence.

2.2 Morphological features

One of the most useful kinds of information for increasing the performance of PoS-taggers is the list of all possible tags for a single word-form. Having a restricted list of possibilities enables the tagger to reduce the search space and forces it to take reasonable decisions. The results obtained in past PoS-tagger evaluations on Italian agree in suggesting that powerful morphological analysers based on large lexica are invaluable resources to increase tagger accuracy. For these reasons, we extended the word embeddings, computed in a completely unsupervised way, by concatenating to them a vector containing the possible PoS-tags provided by the AnIta analyser. This tool is also able to identify, through the use of simple regular expressions, numbers, dates, URLs, emails, etc., and assign them the proper tag(s).

2.3 Unknown words handling and Sentence padding

The source of most tagging errors is certainly the presence of the so-called ‘unknown words’, word-forms for which the tagger did not receive any information during the training phase. A morphological analyser based on a large lexicon can certainly alleviate this problem by providing information also for word-forms not belonging to the training set, but there are large classes of tokens that cannot be successfully handled by the analyser, for example proper names, foreign words, etc.

In a previous work (Tamburini, 2007b) we showed that, using such a powerful morphological analyser, 95% of the word-forms not covered by it in real texts belong to the classes of proper names, adjectives and common nouns, and that a simple heuristic correctly assigns most of the cases. In this way AnIta always provides one or more PoS-tag hypotheses for each word-form, which can be transformed into a binary vector with 1s in correspondence of possible PoS-tags and 0s otherwise. However, if the word-form did not have a computed embedding, the first part of the input features would not be defined. To solve this problem, instead of using the common solution of assigning a random vector to all unknown words, we averaged the embeddings of all the other words presenting exactly the same combination of possible PoS-tags.

It is also common practice to pad sentences, at the beginning and at the end, using random vectors, but we instead used the real embeddings computed for the special sentence-beginning and sentence-ending tokens, added for this purpose, with the respective tags ‘BoS’ and ‘EoS’. Due to the internal structuring of the tensor manipulation library we used (see later), we were also forced to add an out-of-sentence vector to pad sentences to their maximal length, with the corresponding tag ‘OoS’.

2.4 Data structuring

We experimented with two different ways of structuring the input features for processing:

• Win: this mode of organising input data is based on a sliding window that starts from the beginning of each sentence and concatenates word feature vectors into one single vector. Padding is inserted at sentence borders.
• Seq: each sentence is managed as one single sequence padded at the borders.

Each network tested in this study uses one of these two data structuring types.

3 (Deep) Learning Blocks

All the experiments presented in this paper have been performed using Keras (https://github.com/fchollet/keras/tree/master/keras), “a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano”, two widely used tensor manipulation libraries. Keras provides some basic neural network blocks as well as different learning procedures for the desired network configuration and simple tools for writing new blocks. In our experiments we used some of them, namely multilayer perceptrons (MLP) and Long Short-Term Memory (LSTM), and we wrote a new block to handle Conditional Random Fields (CRF).

MLPs are simple feedforward neural networks with one or more fully-connected hidden layers. We obtained maximum performance using only one hidden layer.

LSTM networks (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) are a kind of recurrent neural network which has received a lot of attention in recent years due to its ability to produce good classification results for sequence problems. Their ability to counter the vanishing (and exploding) gradient problem that affects standard recurrent neural networks made them the default choice for solving sequence classification problems inside the DNN framework. Usually these units are arranged to form a bidirectional chain (BiLSTM) for gathering information both from the past and from the future of the input data sequence, a very desirable feature for this kind of classification problem. In all our experiments using BiLSTM we obtained maximum performance by stacking two layers of them, with a dropout layer after each of them (Srivastava et al., 2014), and a final dense softmax layer, or a time-distributed dense softmax layer, fed by the BiLSTM output.

Linear CRFs are the simplest Probabilistic Graphical Models (PGM) and have been successfully used in NLP for sequence classification problems (Lafferty et al., 2001). We did some experiments stacking them after the softmax layer.

Figure 1 shows the most complex DNN structure used in our experiments.

Figure 1: The most complex DNN used in our experiments.

4 Experiments

All the experiments presented in this paper to test the effectiveness of the proposed system refer to two evaluation campaigns organised inside the EVALITA (http://www.evalita.it/) framework. In particular, in 2007 and 2009 specific tasks were organised to test the performance of Italian PoS-taggers.

4.1 The EVALITA 2007 evaluation

Two separate data sets were provided: the Development Set (DS), composed of 133,756 tokens, was used for system development and for the training phase, while a Test Set (TS), composed of 17,313 tokens, was used as a reference for system evaluation. Both contain various documents belonging mainly to journalistic and narrative genres, with small sections containing academic and legal/administrative prose. Each participant was allowed to use any available resource or could freely induce it from the training data.

The original PoS-tagging task involved two different tagsets, but our experiments used only the tags and the annotation named ‘EAGLES-like’. The evaluation metrics were based on a token-by-token comparison and only one tag was allowed for each token. The EVALITA metric considered in this study is the Tagging Accuracy, defined as the number of correct PoS-tag assignments divided by the total number of tokens in the TS. See (Tamburini, 2007) for further details.

4.2 The EVALITA 2009 evaluation

The DS consisted of 113,895 word forms (already divided into a training set of 108,874 tokens and a validation set of 5,021 tokens). The TS consisted of 5,066 word forms. The training set is formed by newspaper articles from ‘La Repubblica’, while the validation and test sets contain documents extracted from the Italian Wikipedia. This tests the degree of system adaptation to new domains.

The organisers evaluated the results using a coarse-grained (37 tags) and a morphed (336 tags) tagset within a closed/open task framework, but in this study all the results refer to the open task (one can use external resources) on the coarse-grained tagset. The evaluation metric is the same as described in section 4.1. See (Attardi and Simi, 2009) for further details.

4.3 Hyper-Parameters

Considering the large number of hyper-parameters involved in the whole procedure, we did not test all the possible combinations; we used, instead, the most common set-up of parameters gathered from the literature. Table 1 outlines the whole set-up for the unmodified hyper-parameters.

word2vec Embed.            Feature extraction
Hyperpar.     Value        Hyperpar.      Value
type          SkipGr.      window         5
size          100          Learning Params.
(1/2) win.    5            batch (win)    1/4*NU
neg. sampl.   25           batch (seq)    1
sample        1e-4         Opt. Alg.      Adam
iter          15           Loss Func.     Categ.CE

Table 1: Unmodified hyper-parameters and algorithms used in our experiments. NU means the number of hidden or LSTM units per layer (the same for all layers). For Adam refer to (Kingma and Ba, 2015).

4.4 The Early Stopping Drama

There are some interesting studies (Bengio, 2012; Prechelt, 2012) dealing with the problem of stopping the learning process at the right point; this issue is known as the ‘early stopping’ problem. Choosing the correct epoch to stop the learning process helps to avoid overfitting on the training set and usually produces systems that generalise better. However, choosing the correct epoch is not simple. The suggestion given in various studies on this topic is to consider a validation set and stop the learning process when the performance on this set no longer increases, or even decreases, a clear hint of overfitting.

The usual way to set up an experiment following this suggestion involves splitting the gold standard into three different instance sets: the training set, for training, the validation set, to determine the stopping point, and the test set, to evaluate the system. However, we are testing our systems on real evaluation data that has already been split by the organisers into development and test sets. Thus, we can divide the development set into training/validation sets for optimising the hyper-parameters and defining the stopping epoch, but, for the final evaluation, we would like to train the final system on the complete development set to adhere to the evaluation constraints and to benefit from using more training data.

Having two different training procedures for the optimisation and evaluation phases leads to a more complex procedure for determining the stopping epoch. Moreover, the typical accuracy profile for DNN systems is not smooth and oscillates heavily during training. To avoid any problem in determining the stopping point we smoothed all the profiles using a Bézier spline. The procedure we adopted to determine the stopping epoch is the following (see Fig. 2): (1) find the first maximum in the smoothed validation profile - A; (2) find the corresponding value of accuracy on the smoothed training profile - B; (3) find the point in the smoothed development set profile having the same accuracy as in B - C; (4) select the epoch corresponding to point C as the stopping epoch - D.

Figure 2: The early stopping procedure.

4.5 Results

Table 2 outlines the systems’ accuracies for different configurations for both datasets. We can observe that using AnIta morphological information, as well as all the techniques described in section 2.3, improves the systems’ results by more than one percentage point. Considering the data structuring described in section 2.4, the management of an entire sentence as a complete sequence allows recurrent configurations to work with larger contexts, producing better results. Adding a CRF layer after the BiLSTM seems to slightly improve the performance, but not in a significant way.

SYSTEM              TA (E07)   TA (E09)   Notes
MLP-256             96.45      95.57      Win=5
MLP-256             97.75      96.84      M,Win=5
2-BiLSTM-256        98.12      97.30      M,Win=5
2-BiLSTM-256        98.14      97.45      M,Seq
2-BiLSTM-256-CRF    98.18      97.48      M,Seq

Table 2: Tagging accuracies (TA) for different configurations for both datasets, EVALITA 2007 (E07) and EVALITA 2009 (E09). ‘M’ marks the use of AnIta morphological information.

Table 3 shows the performance of our best system, namely AnIta-BiLSTM-CRF, compared with the three best systems of the considered EVALITA campaigns. In both cases the proposed system ranks first, improving on the best participant by 0.14 points on EVALITA 2007 and by 0.45 points on EVALITA 2009.

EVALITA 2007
SYSTEM                TA
AnIta-BiLSTM-CRF      98.18
FBKirst Zanoli        98.04
UniTn Baroni          97.89
ILCcnrUniPi Lenci     97.65

EVALITA 2009
SYSTEM                TA
AnIta-BiLSTM-CRF      97.48
UniPi SemaWiki 2      97.03
UniPi SemaWiki 1      96.73
UniPi SemaWiki 4      96.67

Table 3: Participants’ results with respect to Tagging Accuracy (TA) at EVALITA 2007 and 2009.

5 Conclusions

The proposed system for PoS-tagging, integrating DNNs and a powerful morphological analyser, exhibited very good accuracy results when applied to standard Italian evaluation datasets from the EVALITA campaigns. The information from AnIta proved to be crucial to reach such accuracy values, as did stacked BiLSTM networks processing entire sentence sequences.

We have to further test different DNN configurations and their integration with other kinds of PGMs, as well as run more experiments with different hyper-parameters.

References

Giuseppe Attardi and Maria Simi. 2009. Overview of the EVALITA 2009 Part-of-Speech Tagging Task. In Proc. of Workshop Evalita 2009.

Yoshua Bengio. 2012. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade: Second Edition, pages 437–478. Springer Berlin Heidelberg, Berlin, Heidelberg.

Jason Chiu and Eric Nichols. 2016. Sequential Labeling with Bidirectional LSTM-CNNs. In Proc. International Conf. of Japanese Association for NLP, pages 937–940.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.

Cicero dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proc. of the 31st International Conference on Machine Learning, JMLR, volume 32. JMLR W&CP.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. ArXiv e-prints, 1508.01991.

D.P. Kingma and J.L. Ba. 2015. Adam: a method for stochastic optimization. In Proc. International Conference on Learning Representations - ICLR, pages 1–13.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proc. of Workshop at ICLR.

Lutz Prechelt. 2012. Early Stopping - But When? In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade: Second Edition, pages 53–67. Springer Berlin Heidelberg, Berlin, Heidelberg.

Rema Rossini Favretti, Fabio Tamburini, and Cristiana De Santis. 2002. CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model. In Andrew Wilson, Paul Rayson, and Tony McEnery, editors, A Rainbow of Corpora: Corpus Linguistics and the Languages of the World, pages 27–38. Lincom-Europa, Munich.

Anders Søgaard. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 48–52, Portland, Oregon, USA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Fabio Tamburini and Matias Melandri. 2012. AnIta: a powerful morphological analyser for Italian. In Proc. 8th International Conference on Language Resources and Evaluation - LREC 2012, pages 941–947, Istanbul.

Fabio Tamburini. 2007. EVALITA 2007: the Part-of-Speech Tagging Task. Intelligenza Artificiale, IV(2):4–7.

Fabio Tamburini. 2007b. CORISTagger: a high-performance PoS tagger for Italian. Intelligenza Artificiale, IV(2):14–15.

Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. 2015. A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding. ArXiv e-prints, 1511.00215.
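
As a rough illustration of the four-step stopping-epoch selection described in section 4.4, the following Python sketch operates on accuracy profiles that are assumed to be already smoothed (the smoothing itself is not shown). The function names, the toy values, the zero-based epoch indexing, and the choice of taking the first epoch whose development-set accuracy reaches B (epochs are discrete, so an exact match rarely exists) are assumptions made for this example, not details taken from the paper.

```python
def first_local_maximum(profile):
    """Return the index of the first local maximum of an accuracy profile.

    Falls back to the global maximum when the profile has no interior peak
    (e.g. it is still increasing at the last epoch).
    """
    for i in range(1, len(profile) - 1):
        if profile[i - 1] < profile[i] >= profile[i + 1]:
            return i
    return max(range(len(profile)), key=profile.__getitem__)


def stopping_epoch(val_profile, train_profile, dev_profile):
    """Select the stopping epoch from three smoothed accuracy profiles.

    val_profile   -- validation accuracy per epoch (training/validation split)
    train_profile -- training accuracy per epoch (same run as val_profile)
    dev_profile   -- accuracy per epoch when training on the full development set
    """
    # Step 1: point A, the first maximum of the validation profile.
    epoch_a = first_local_maximum(val_profile)
    # Step 2: value B, the training accuracy at the same epoch.
    accuracy_b = train_profile[epoch_a]
    # Step 3: point C, approximated here as the first epoch at which the
    # development-set profile reaches accuracy B (an assumption of this sketch).
    for epoch, accuracy in enumerate(dev_profile):
        if accuracy >= accuracy_b:
            # Step 4: epoch D, the stopping epoch for the final training run.
            return epoch
    # B is never reached: train for all available epochs.
    return len(dev_profile) - 1


# Hypothetical smoothed profiles, one value per epoch (epochs count from 0).
val = [0.90, 0.94, 0.96, 0.955, 0.957]
train = [0.91, 0.95, 0.975, 0.985, 0.99]
dev = [0.90, 0.94, 0.96, 0.97, 0.98, 0.99]

print(stopping_epoch(val, train, dev))
```

With these toy values the validation profile peaks at epoch 2, where the training accuracy is 0.975; the development-set profile first reaches that value at epoch 4, which is returned as the stopping epoch.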