<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">(Better than) State-of-the-Art PoS-tagging for Italian Texts</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Fabio</forename><surname>Tamburini</surname></persName>
							<email>fabio.tamburini@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">FICLIT -University of Bologna</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">(Better than) State-of-the-Art PoS-tagging for Italian Texts</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">531FA426A5CE903CA86E684FA0AE3531</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T23:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract xml:lang="it">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>English. This paper presents some experiments for the construction of a high-performance PoS-tagger for Italian using deep neural network techniques (DNN) integrated with a powerful morphological analyser for Italian.</p><p>The results obtained by the proposed system on standard datasets taken from the EVALITA campaigns show large accuracy improvements when compared with previous systems from the literature.</p><p>Italiano. Questo contributo presenta alcuni esperimenti per la costruzione di un PoS-tagger ad alte prestazioni per l'italiano utilizzando reti neurali 'deep' integrate con un potente analizzatore morfologico. I risultati ottenuti sui dataset delle campagne EVALITA da parte del sistema proposto mostrano incrementi di accuratezza piuttosto rilevanti in confronto ai precedenti sistemi in letteratura.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In recent years a large number of works have tried to push the accuracy of the PoS-tagging task forward using new techniques, mainly from the deep learning domain <ref type="bibr" target="#b3">(Collobert et al., 2011;</ref><ref type="bibr" target="#b12">Søgaard, 2011;</ref><ref type="bibr" target="#b4">dos Santos and Zadrozny, 2014;</ref><ref type="bibr" target="#b6">Huang et al., 2015;</ref><ref type="bibr" target="#b17">Wang et al., 2015;</ref><ref type="bibr" target="#b2">Chiu and Nichols, 2016)</ref>.</p><p>All these studies are mainly devoted to showing how to find the best combination of new neural network structures and character/word embeddings to reach the highest classification performance, and they typically present solutions that do not make any use of specific language resources (e.g. morphological analysers, gazetteers, guessing procedures for unknown words, etc.). This is, in general, a very desirable feature because it allows for the production of tools not tied to any specific language, but in various evaluation campaigns, at least for highly inflected languages such as Italian, the results showed quite clearly that this task benefits from the use of specific and rich language resources <ref type="bibr" target="#b15">(Tamburini, 2007;</ref><ref type="bibr" target="#b0">Attardi and Simi, 2009)</ref>.</p><p>In this study, still a work in progress, we set up a PoS-tagger for Italian designed to reach the highest classification performance by using any available language resource and the most up-to-date DNNs. We used AnIta <ref type="bibr" target="#b14">(Tamburini and Melandri, 2012)</ref>, one of the most powerful morphological analysers for Italian, based on a wide lexicon (about 110,000 lemmas), to provide the PoS-tagger with a large set of useful information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Input features</head><p>The set of input features for each token is basically formed by two different components: the word embedding and some morphological information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Word Embeddings</head><p>All the embeddings used in our experiments were extracted from the CORIS corpus <ref type="bibr" target="#b11">(Rossini Favretti et al., 2002)</ref>, a 130Mw synchronic reference corpus of Italian, using the tool word2vec<ref type="foot" target="#foot_0">1</ref> <ref type="bibr" target="#b9">(Mikolov et al., 2013)</ref>. We added two special tokens, '&lt;s&gt;' and '&lt;/s&gt;', to mark the beginning and the end of each sentence.</p></div>
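The boundary-token preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the toy corpus are hypothetical, and the prepared sentences would then be fed to word2vec so that '&lt;s&gt;' and '&lt;/s&gt;' receive embeddings like any other word form.

```python
def add_boundaries(sentences):
    """Wrap each tokenised sentence with the special boundary tokens
    '<s>' and '</s>' so that word2vec learns embeddings for them
    alongside the ordinary word forms."""
    return [['<s>'] + sent + ['</s>'] for sent in sentences]

# hypothetical toy corpus of tokenised Italian sentences
corpus = [['il', 'gatto', 'dorme'], ['piove']]
prepared = add_boundaries(corpus)
# prepared[0] == ['<s>', 'il', 'gatto', 'dorme', '</s>']
```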
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Morphological features</head><p>One of the most useful kinds of information for increasing the performance of PoS-taggers is the list of all possible tags for a single word-form. Having a restricted list of possibilities enables the tagger to reduce the search space and forces it to make reasonable decisions. The results obtained in past PoS-tagger evaluations on Italian agree in suggesting that powerful morphological analysers based on large lexica are invaluable resources for increasing tagger accuracy. For these reasons, we extended the word embeddings, computed in a completely unsupervised way, by concatenating to them a vector containing the possible PoS-tags provided by the AnIta analyser. This tool is also able to identify numbers, dates, URLs, emails, etc. through simple regular expressions, and to assign them the proper tag(s).</p></div>
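The concatenation of an embedding with the analyser's tag information can be sketched as below. The tagset shown is an illustrative subset, not the actual tagset used in the evaluations, and the helper names and toy values are hypothetical.

```python
# illustrative subset of a PoS tagset (not the actual EVALITA tagset)
TAGSET = ['ADJ', 'ADV', 'ART', 'CONJ', 'NOUN', 'PREP', 'PRON', 'VERB']

def tag_vector(possible_tags):
    """Binary vector with 1s at the positions of the PoS-tags that the
    morphological analyser allows for a word form, 0s elsewhere."""
    return [1.0 if t in possible_tags else 0.0 for t in TAGSET]

def input_features(embedding, possible_tags):
    """Concatenate the word embedding with the analyser's tag vector."""
    return embedding + tag_vector(possible_tags)

# 'amo' is ambiguous in Italian (noun 'hook' / verb 'I love'):
feats = input_features([0.12, -0.05, 0.33], {'NOUN', 'VERB'})
```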
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Unknown words handling and Sentence padding</head><p>The source of most tagging errors is certainly the presence of the so-called 'unknown words', word-forms for which the tagger did not receive any information during the training phase. A morphological analyser based on a large lexicon can certainly alleviate this problem by providing information also for word-forms not belonging to the training set, but there are large classes of tokens that cannot be successfully handled by the analyser, for example proper names, foreign words, etc. In a previous work <ref type="bibr" target="#b16">(Tamburini, 2007b)</ref> we showed that, when using such a powerful morphological analyser, 95% of the word-forms in real texts not covered by it belong to the classes of proper names, adjectives and common nouns, and a simple heuristic correctly handles most of these cases. In this way AnIta always provides one or more PoS-tag hypotheses for each word-form, which can be transformed into a binary vector with 1s at the positions of the possible PoS-tags and 0s elsewhere; but if the word-form does not have a computed embedding, the first part of the input features would not be defined. To solve this problem, instead of using the common solution of assigning a random vector to all unknown words, we averaged all the embeddings of the other words presenting exactly the same combination of possible PoS-tags.</p><p>It is also common practice to pad sentences, at the beginning and at the end, using random vectors; we, instead, used the real embeddings computed for the special tokens '&lt;s&gt;' and '&lt;/s&gt;', added for this purpose, with the respective tags 'BoS' and 'EoS'. Due to the internal structure of the tensor-manipulation library we used (see later), we were also forced to add an out-of-sentence vector to pad sentences to their maximal length, with the corresponding tag 'OoS'.</p></div>
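The averaging strategy for out-of-vocabulary word-forms can be sketched as follows. This is a simplified illustration, not the authors' implementation: the lexicon layout, function name and toy embeddings are hypothetical, and the fallback behaviour when no known word shares the tag set is an assumption.

```python
def unknown_embedding(possible_tags, lexicon):
    """Embedding for an out-of-vocabulary word-form: the average of the
    embeddings of all known words sharing exactly the same set of
    possible PoS-tags (None when no such word exists)."""
    matches = [emb for emb, tags in lexicon.values() if tags == possible_tags]
    if not matches:
        return None
    dim = len(matches[0])
    return [sum(v[i] for v in matches) / len(matches) for i in range(dim)]

# hypothetical mini-lexicon: word -> (embedding, possible PoS-tags)
lexicon = {
    'casa':  ([1.0, 0.0], frozenset({'NOUN'})),
    'libro': ([0.0, 1.0], frozenset({'NOUN'})),
    'ama':   ([0.5, 0.5], frozenset({'VERB'})),
}
oov = unknown_embedding(frozenset({'NOUN'}), lexicon)  # [0.5, 0.5]
```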
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Data structuring</head><p>We experimented with two different ways of structuring the input features for processing:</p><p>• Win: this mode of organising input data is based on a sliding window that starts from the beginning of each sentence and concatenates the word feature vectors into one single vector.</p><p>Padding is inserted at sentence borders.</p><p>• Seq: each sentence is managed as one single sequence padded at the borders.</p><p>Each network tested in this study uses one of these two data-structuring types.</p></div>
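The 'Win' structuring can be sketched as below: a window slides over the sentence and the feature vectors inside it are concatenated, with the boundary-token vectors used as padding. The function name, window size and toy one-dimensional features are illustrative assumptions.

```python
def windows(sent_feats, bos, eos, half_win=2):
    """'Win' structuring: slide a window of 2*half_win+1 tokens over
    the sentence and concatenate the feature vectors inside it;
    border positions are padded with the '<s>'/'</s>' vectors."""
    padded = [bos] * half_win + sent_feats + [eos] * half_win
    size = 2 * half_win + 1
    out = []
    for i in range(len(sent_feats)):
        concat = []
        for vec in padded[i:i + size]:
            concat.extend(vec)
        out.append(concat)
    return out

# toy 1-dimensional feature vectors for a 3-token sentence
sent = [[1.0], [2.0], [3.0]]
X = windows(sent, bos=[-1.0], eos=[-2.0], half_win=1)
# X[0] == [-1.0, 1.0, 2.0]  (window centred on the first token)
```

In the 'Seq' mode, by contrast, the whole list of per-token feature vectors is kept as one sequence and only padded at the borders.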
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">(Deep) Learning Blocks</head><p>All the experiments presented in this paper have been performed using Keras<ref type="foot" target="#foot_1">2</ref>, "a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano", two widely used tensor manipulation libraries. Keras provides some basic neural network blocks, as well as different learning procedures for the desired network configuration and simple tools for writing new blocks. In our experiments we used some of them, namely multilayer perceptrons (MLP) and Long Short-Term Memory (LSTM) networks, and we wrote a new block to handle Conditional Random Fields (CRF).</p><p>MLPs are simple feedforward neural networks with one or more fully-connected hidden layers. We obtained maximum performance using only one hidden layer.</p><p>LSTM networks <ref type="bibr" target="#b5">(Hochreiter and Schmidhuber, 1997;</ref><ref type="bibr" target="#b4">Graves and Schmidhuber, 2005)</ref> are a kind of recurrent neural network that has received a lot of attention in recent years due to its ability to produce good classification results for sequence problems. Their property of preventing the vanishing (and exploding) gradient problem that affects standard recurrent neural networks has made them the default choice for solving sequence classification problems within the DNN framework. Usually units of this kind are arranged to form a bidirectional chain (BiLSTM) to gather information both from the past and from the future of the input data sequence, a very desirable property for this kind of classification problem.
In all our experiments using BiLSTMs we obtained maximum performance by stacking two layers of them, with a dropout layer after each <ref type="bibr" target="#b13">(Srivastava et al., 2014)</ref>, and a final dense softmax layer, or a time-distributed dense softmax layer, fed by the BiLSTM output.</p><p>Linear CRFs are the simplest Probabilistic Graphical Models (PGM) and have been successfully used in NLP for sequence classification problems <ref type="bibr" target="#b8">(Lafferty et al., 2001)</ref>. We did some experiments stacking them after the softmax layer.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the most complex DNN structure used in our experiments.</p></div>
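The stacked BiLSTM configuration described above can be sketched in Keras roughly as follows. This is a minimal illustration, not the authors' actual code: it assumes the current Keras API, uses the 256-unit layers reported in the experiments, and treats the dropout rate, maximal sequence length, feature dimension and tagset size as illustrative values; the custom CRF block mentioned in the text is omitted.

```python
from keras import Input
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dropout, TimeDistributed, Dense

MAXLEN, FEAT_DIM, N_TAGS = 50, 120, 37   # illustrative sizes

# Two stacked BiLSTM layers, each followed by dropout, and a final
# time-distributed softmax over the tagset; the paper's custom CRF
# block would be stacked after the softmax and is not shown here.
model = Sequential([
    Input(shape=(MAXLEN, FEAT_DIM)),
    Bidirectional(LSTM(256, return_sequences=True)),
    Dropout(0.5),                         # rate not stated in the paper
    Bidirectional(LSTM(256, return_sequences=True)),
    Dropout(0.5),
    TimeDistributed(Dense(N_TAGS, activation='softmax')),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
```

The Adam optimiser and categorical cross-entropy loss match the learning parameters reported in Table 1.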
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>All the experiments presented in this paper to test the effectiveness of the proposed system refer to two evaluation campaigns organised within the EVALITA 3 framework. In particular, in 2007 and in 2009 specific tasks were organised to test the performance of Italian PoS-taggers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">The EVALITA 2007 evaluation</head><p>Two separate data sets were provided: the Development Set (DS), composed of 133,756 tokens, was used for system development and for the training phase, while a Test Set (TS), composed of 17,313 tokens, was used as a reference for system evaluation. Both contain various documents belonging mainly to the journalistic and narrative genres, with small sections containing academic and legal/administrative prose. Each participant was allowed to use any available resource or could freely induce one from the training data. The original PoS-tagging task involved two different tagsets, but our experiments used only the tags and the annotation named 'EAGLES-like'.</p><p>The evaluation metrics were based on a token-by-token comparison and only one tag was allowed for each token. The EVALITA metric considered in this study is the Tagging Accuracy, defined as the number of correct PoS-tag assignments divided by the total number of tokens in the TS. See <ref type="bibr" target="#b15">(Tamburini, 2007)</ref> for further details.</p></div>
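The Tagging Accuracy metric defined above is straightforward to compute; the sketch below is an illustration with a hypothetical function name and toy tag sequences.

```python
def tagging_accuracy(gold, predicted):
    """EVALITA Tagging Accuracy: correct PoS-tag assignments divided by
    the total number of tokens in the test set (one tag per token)."""
    assert len(gold) == len(predicted)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

acc = tagging_accuracy(['NOUN', 'VERB', 'ADJ', 'NOUN'],
                       ['NOUN', 'VERB', 'ADV', 'NOUN'])
# acc == 0.75
```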
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">The EVALITA 2009 evaluation</head><p>The DS consisted of 113,895 word forms (already divided into a training set of 108,874 tokens and a validation set of 5,021 tokens). The TS consisted of 5,066 word forms. The training set is formed of newspaper articles from 'La Repubblica', while the validation and test sets contain documents extracted from the Italian Wikipedia. This tests the degree of adaptation of the systems to new domains.</p><p>The organisers evaluated the results using a coarse-grained (37 tags) and a morphed (336 tags) tagset within a closed/open task framework, but in this study all the results refer to the open task (in which external resources may be used) on the coarse-grained tagset. The evaluation metric is the same as described in section 4.1. See <ref type="bibr" target="#b0">(Attardi and Simi, 2009)</ref> for further details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Hyper-Parameters</head><p>Considering the large number of hyper-parameters involved in the whole procedure, we did not test all the possible combinations; instead, we used the most common parameter set-up gathered from the literature. Table <ref type="table" target="#tab_0">1</ref> outlines the whole set-up for the unmodified hyper-parameters; for the Adam optimiser refer to <ref type="bibr" target="#b7">(Kingma and Ba, 2015)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">The Early Stopping Drama</head><p>There are some interesting studies <ref type="bibr" target="#b1">(Bengio, 2012;</ref><ref type="bibr" target="#b10">Prechelt, 2012)</ref> dealing with the problem of stopping the learning process at the right point; this issue is known as the 'early stopping' problem.</p><p>Choosing the correct epoch at which to stop the learning process helps avoid overfitting on the training set and usually produces systems exhibiting better generalisation. However, choosing the correct epoch is not simple. The suggestion given in various studies on this topic is to consider a validation set and stop the learning process when the performance on this set no longer increases, or even decreases, a clear hint of overfitting. The usual way to set up an experiment following this suggestion involves splitting the gold standard into three different instance sets: the training set, for training; the validation set, to determine the stopping point; and the test set, to evaluate the system. However, we are testing our systems on real evaluation data that have already been split by the organisers into a development set and a test set. Thus, we can divide the development set into training/validation sets for optimising the hyper-parameters and defining the stopping epoch, but, for the final evaluation, we would like to train the final system on the complete development set, both to adhere to the evaluation constraints and to benefit from using more training data.</p><p>Having two different training procedures for the optimisation and evaluation phases leads to a more complex procedure for determining the stopping epoch. Moreover, the typical accuracy profile of DNN systems is not smooth and oscillates heavily during training. To avoid any problem in determining the stopping point we smoothed all the profiles using a Bézier spline. The procedure we adopted to determine the stopping epoch is (please look at Fig.
<ref type="figure">2):</ref> (1) find the first maximum in the smoothed validation profile -A; (2) find the corresponding accuracy value on the smoothed training profile -B; (3) find the point in the smoothed development-set profile having the same accuracy as in B -C; (4) select the epoch corresponding to point C as the stopping epoch -D.</p></div>
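The four steps above can be sketched as follows. This is an illustration, not the authors' implementation: a simple moving average stands in for the Bézier-spline smoothing used in the paper, the function names are hypothetical, and the toy accuracy profiles are invented for the example.

```python
def smooth(profile, k=3):
    """Moving-average stand-in for the Bezier-spline smoothing used in
    the paper (window of k epochs, shrunk at the borders)."""
    half = k // 2
    return [sum(profile[max(0, i - half):i + half + 1]) /
            len(profile[max(0, i - half):i + half + 1])
            for i in range(len(profile))]

def stopping_epoch(val_profile, train_profile, dev_profile):
    """(1) first maximum of the smoothed validation profile (A);
    (2) training accuracy at that epoch (B);
    (3) first epoch where the smoothed full-development-set profile
        reaches B, selected as the stopping epoch (C/D)."""
    val, train, dev = smooth(val_profile), smooth(train_profile), smooth(dev_profile)
    a = max(range(len(val)), key=lambda i: val[i])   # first argmax
    b = train[a]
    for epoch, acc in enumerate(dev):
        if acc >= b:
            return epoch
    return len(dev) - 1

# toy accuracy profiles over five epochs (invented values)
epoch = stopping_epoch([0.90, 0.95, 0.96, 0.95, 0.94],
                       [0.92, 0.96, 0.98, 0.99, 0.995],
                       [0.93, 0.96, 0.97, 0.985, 0.99])
```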
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Results</head><p>Table 2 outlines the systems' accuracies for different configurations on both datasets. We can observe that using AnIta morphological information, as well as all the techniques described in section 2.3, improves the systems' results by more than 1% (Figure <ref type="figure">2</ref> shows the early stopping procedure). Considering the data structuring described in section 2.4, the management of an entire sentence as a complete sequence allows recurrent configurations to work with larger contexts, producing better results. Adding a CRF layer after the BiLSTM seems to improve performance slightly, but not significantly. In Table <ref type="table" target="#tab_2">3</ref> we can see the performance of our best system, namely AnIta-BiLSTM-CRF, compared with the three best systems of the considered EVALITA campaigns. As can be seen, in both cases the proposed system ranked first, improving the scores by a large margin.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>The proposed system for PoS-tagging, integrating DNNs and a powerful morphological analyser, exhibited very good accuracy when applied to the standard Italian evaluation datasets from the EVALITA campaigns. The information from AnIta, as well as the stacked BiLSTM networks processing entire sentence sequences, proved crucial to reaching such accuracy values. We still have to test further DNN configurations and their integration with other kinds of PGMs, as well as run more experiments with different hyper-parameters.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: The most complex DNN used in our experiments.</figDesc><graphic coords="3,77.67,221.94,206.92,185.72" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>3 http://www.evalita.it/</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1:</head><label>1</label><figDesc>Unmodified hyper-parameters and algorithms used in our experiments. NU means the number of hidden or LSTM units per layer (the same for all layers).</figDesc><table><row><cell>word2vec Embed.</cell><cell></cell><cell>Feature extraction</cell><cell></cell></row><row><cell>Hyperpar.</cell><cell>Value</cell><cell>Hyperpar.</cell><cell>Value</cell></row><row><cell>type</cell><cell>SkipGr.</cell><cell>window</cell><cell>5</cell></row><row><cell>size</cell><cell>100</cell><cell>Learning Params.</cell><cell></cell></row><row><cell>(1/2) win.</cell><cell>5</cell><cell>batch (win)</cell><cell>1/4*NU</cell></row><row><cell>neg. sampl.</cell><cell>25</cell><cell>batch (seq)</cell><cell>1</cell></row><row><cell>sample</cell><cell>1e-4</cell><cell>Opt. Alg.</cell><cell>Adam</cell></row><row><cell>iter</cell><cell>15</cell><cell>Loss Func.</cell><cell>Categ.CE</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Tagging accuracies (TA) for different configurations for both datasets. ('M' marks the use of AnIta morphological information).</figDesc><table><row><cell></cell><cell>TA</cell><cell>Notes</cell></row><row><cell></cell><cell>E07</cell><cell>E09</cell></row><row><cell>MLP-256</cell><cell cols="2">96.45 95.57 Win=5</cell></row><row><cell>MLP-256</cell><cell cols="2">97.75 96.84 M,Win=5</cell></row><row><cell>2-BiLSTM-256</cell><cell cols="2">98.12 97.30 M,Win=5</cell></row><row><cell>2-BiLSTM-256</cell><cell cols="2">98.14 97.45 M,Seq</cell></row><row><cell cols="3">2-BiLSTM-256-CRF 98.18 97.48 M,Seq</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Participants' results with respect to Tagging Accuracy (TA) at EVALITA 2007 and 2009.</figDesc><table><row><cell>EVALITA 2007</cell><cell></cell></row><row><cell>SYSTEM</cell><cell>TA</cell></row><row><cell>AnIta-BiLSTM-CRF</cell><cell>98.18</cell></row><row><cell>FBKirst Zanoli</cell><cell>98.04</cell></row><row><cell>UniTn Baroni</cell><cell>97.89</cell></row><row><cell>ILCcnrUniPi Lenci</cell><cell>97.65</cell></row><row><cell>EVALITA 2009</cell><cell></cell></row><row><cell>AnIta-BiLSTM-CRF</cell><cell>97.48</cell></row><row><cell>UniPi SemaWiki 2</cell><cell>97.03</cell></row><row><cell>UniPi SemaWiki 1</cell><cell>96.73</cell></row><row><cell>UniPi SemaWiki 4</cell><cell>96.67</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://code.google.com/archive/p/word2vec/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/fchollet/keras/tree/master/keras</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of the EVALITA 2009 Part-of-Speech Tagging Task</title>
		<author>
			<persName><forename type="first">Giuseppe</forename><surname>Attardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maria</forename><surname>Simi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of Workshop Evalita</title>
				<meeting>of Workshop Evalita</meeting>
		<imprint>
			<date type="published" when="2009">2009. 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Practical Recommendations for Gradient-Based Training of Deep Architectures</title>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Networks: Tricks of the Trade: Second Edition</title>
				<editor>
			<persName><forename type="first">Grégoire</forename><surname>Montavon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Geneviève</forename><forename type="middle">B</forename><surname>Orr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Klaus-Robert</forename><surname>Müller</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="437" to="478" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Sequential Labeling with Bidirectional LSTM-CNNs</title>
		<author>
			<persName><forename type="first">Jason</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><surname>Nichols</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. International Conf. of Japanese Association for NLP</title>
				<meeting>International Conf. of Japanese Association for NLP</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="937" to="940" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Natural language processing (almost) from scratch</title>
		<author>
			<persName><forename type="first">Ronan</forename><surname>Collobert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jason</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Léon</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Karlen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Koray</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pavel</forename><surname>Kuksa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2493" to="2537" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Framewise phoneme classification with bidirectional lstm and other neural network architectures</title>
		<author>
			<persName><forename type="first">Cicero</forename><surname>dos Santos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bianca</forename><surname>Zadrozny</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jürgen</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 31st International Conference on Machine Learning</title>
				<meeting>of the 31st International Conference on Machine Learning</meeting>
		<imprint>
			<publisher>JMLR</publisher>
			<date type="published" when="2005">2014. 2005</date>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="602" to="610" />
		</imprint>
	</monogr>
	<note>Learning character-level representations for part-of-speech tagging</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">Sepp</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jürgen</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Bidirectional LSTM-CRF Models for Sequence Tagging</title>
		<author>
			<persName><forename type="first">Zhiheng</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Yu</surname></persName>
		</author>
		<idno>1508.01991</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">ArXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Adam: a method for stochastic optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Ba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. International Conference on Learning Representations -ICLR</title>
				<meeting>International Conference on Learning Representations -ICLR</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1" to="13" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 18th International Conf. on Machine Learning</title>
				<meeting>18th International Conf. on Machine Learning</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="282" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Efficient Estimation of Word Representations in Vector Space</title>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Greg</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of Workshop at ICLR</title>
				<meeting>of Workshop at ICLR</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Early Stopping -But When?</title>
		<author>
			<persName><forename type="first">Lutz</forename><surname>Prechelt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Networks: Tricks of the Trade: Second Edition</title>
				<editor>
			<persName><forename type="first">Grégoire</forename><surname>Montavon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Geneviève</forename><forename type="middle">B</forename><surname>Orr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Klaus-Robert</forename><surname>Müller</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="53" to="67" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model</title>
		<author>
			<persName><forename type="first">Rema</forename><surname>Rossini Favretti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Tamburini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cristiana</forename><forename type="middle">De</forename><surname>Santis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">A Rainbow of Corpora: Corpus Linguistics and the Languages of the World</title>
				<editor>
			<persName><forename type="first">Andrew</forename><surname>Wilson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Paul</forename><surname>Rayson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Tony</forename><surname>Mcenery</surname></persName>
		</editor>
		<meeting><address><addrLine>Munich</addrLine></address></meeting>
		<imprint>
			<publisher>Lincom-Europa</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="27" to="38" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Semi-supervised condensed nearest neighbor for part-of-speech tagging</title>
		<author>
			<persName><forename type="first">Anders</forename><surname>Søgaard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Portland, Oregon, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="48" to="52" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Dropout: A simple way to prevent neural networks from overfitting</title>
		<author>
			<persName><forename type="first">Nitish</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruslan</forename><surname>Salakhutdinov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1929" to="1958" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">AnIta: a powerful morphological analyser for Italian</title>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Tamburini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matias</forename><surname>Melandri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 8th International Conference on Language Resources and Evaluation (LREC 2012)</title>
		<meeting>of the 8th International Conference on Language Resources and Evaluation (LREC 2012)<address><addrLine>Istanbul</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="941" to="947" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">EVALITA 2007: the Part-of-Speech Tagging Task</title>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Tamburini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Intelligenza Artificiale</title>
		<imprint>
			<biblScope unit="volume">IV</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="4" to="7" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">CORISTagger: a high-performance PoS tagger for Italian</title>
		<author>
			<persName><forename type="first">Fabio</forename><surname>Tamburini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Intelligenza Artificiale</title>
		<imprint>
			<date type="published" when="2007">2007b</date>
			<biblScope unit="volume">IV</biblScope>
			<biblScope unit="page" from="14" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding</title>
		<author>
			<persName><forename type="first">Peilu</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yao</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Soong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lei</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hai</forename><surname>Zhao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1511.00215</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
