(Better than) State-of-the-Art PoS-tagging for Italian Texts
Fabio Tamburini
FICLIT - University of Bologna, Italy
fabio.tamburini@unibo.it
Abstract

This paper presents some experiments for the construction of a high-performance PoS-tagger for Italian using deep neural network (DNN) techniques integrated with a powerful Italian morphological analyser. The results obtained by the proposed system on standard datasets taken from the EVALITA campaigns show large accuracy improvements when compared with previous systems from the literature.

1 Introduction

In recent years a large number of studies have tried to push the accuracy of the PoS-tagging task forward using new techniques, mainly from the deep learning domain (Collobert et al., 2011; Søgaard, 2011; dos Santos and Zadrozny, 2014; Huang et al., 2015; Wang et al., 2015; Chiu and Nichols, 2016).

All these studies are mainly devoted to showing how to find the best combination of new neural network structures and character/word embeddings for reaching the highest classification performance, and typically present solutions that do not make any use of specific language resources (e.g. morphological analysers, gazetteers, guessing procedures for unknown words, etc.). This is, in general, a very desirable feature because it allows for the production of tools not tied to any specific language, but in various evaluation campaigns, at least for highly-inflected languages such as Italian, the results showed quite clearly that this task would benefit from the use of specific and rich language resources (Tamburini, 2007; Attardi and Simi, 2009).

In this study, still work in progress, we set up a PoS-tagger for Italian able to reach the highest classification performance by using any available language resource and the most up-to-date DNNs. We used AnIta (Tamburini and Melandri, 2012), one of the most powerful morphological analysers for Italian, based on a wide lexicon (about 110,000 lemmas), to provide the PoS-tagger with a large set of useful information.

2 Input features

The set of input features for each token is formed by two different components: the word embedding and some morphological information.

2.1 Word Embeddings

All the embeddings used in our experiments were extracted from the CORIS corpus (Rossini Favretti et al., 2002), a 130-million-word synchronic reference corpus for Italian, by using the tool word2vec¹ (Mikolov et al., 2013). We added two special tokens to mark the beginning and the end of each sentence.

¹ https://code.google.com/archive/p/word2vec/

2.2 Morphological features

One of the most useful kinds of information for increasing the performance of PoS-taggers is the list of all possible tags for a single word-form. Having a restricted list of possibilities enables the tagger to reduce the search space and forces it to take reasonable decisions. The results obtained
in past PoS-tagger evaluations on Italian agree in suggesting that powerful morphological analysers based on large lexica are invaluable resources for increasing tagger accuracy. For these reasons, we extended the word embeddings, computed in a completely unsupervised way, by concatenating to them a vector containing the possible PoS-tags provided by the AnIta analyser. This tool is also able to identify, through the use of simple regular expressions, numbers, dates, URLs, emails, etc., and to assign them the proper tag(s).

2.3 Unknown words handling and Sentence padding

The source of most tagging errors is certainly the presence of so-called ‘unknown words’, word-forms for which the tagger did not receive any information during the training phase. A morphological analyser based on a large lexicon can certainly alleviate this problem by also providing information for word-forms not belonging to the training set, but there are large classes of tokens that cannot be successfully handled by the analyser, for example proper names, foreign words, etc.

In a previous work (Tamburini, 2007b) we showed that, using such a powerful morphological analyser, 95% of the word-forms not covered by it in real texts belong to the classes of proper names, adjectives and common nouns, and that a simple heuristic correctly assigns most of these cases. In this way AnIta always provides one or more PoS-tag hypotheses for each word-form, which can be transformed into a binary vector with 1s in correspondence of the possible PoS-tags and 0s otherwise. However, if the word-form did not have a computed embedding, the first part of the input features would not be defined. To solve this problem, instead of using the common solution of assigning a random vector to all unknown words, we averaged the embeddings of all the other words presenting exactly the same combination of possible PoS-tags.

It is also common practice to pad sentences, at the beginning and at the end, using random vectors; we used, instead, the real embeddings computed for the two special sentence-boundary tokens, added for this purpose, with the respective tags ‘BoS’ and ‘EoS’. Due to the internal structuring of the tensor manipulation library used (see section 3), we were also forced to add an out-of-sentence vector to pad sentences to their maximal length, with the corresponding tag ‘OoS’.

2.4 Data structuring

We experimented with two different ways of structuring the input features for processing:

• Win: this mode of organising input data is based on a sliding window that starts from the beginning of each sentence and concatenates word feature vectors into one single vector. Padding is inserted at sentence borders.
• Seq: each sentence is managed as one single sequence padded at the borders.

Each network tested in this study uses one of these two data structuring types.

3 (Deep) Learning Blocks

All the experiments presented in this paper have been performed using Keras², “a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano”, two widely used tensor manipulation libraries. Keras provides some basic neural network blocks as well as different learning procedures for the desired network configuration and simple tools for writing new blocks. In our experiments we used some of them, namely multilayer perceptrons (MLP) and Long Short-Term Memory (LSTM), and we wrote a new block to handle Conditional Random Fields (CRF).

² https://github.com/fchollet/keras/tree/master/keras

MLPs are simple feedforward neural networks with one or more fully-connected hidden layers. We obtained maximum performance using only one hidden layer.

LSTM networks (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) are a kind of recurrent neural network which has received a lot of attention in recent years due to their ability to produce good classification results for sequence problems. Their property of mitigating the vanishing (and exploding) gradient problem that affects standard recurrent neural networks made them the default choice for solving sequence classification problems inside the DNN framework. Usually these units are arranged to form a bidirectional chain (BiLSTM) for gathering information both from the past and from the future of the input data sequence, a very desirable property for this kind of classification problem. In all our experiments using BiLSTM we obtained maximum performance by stacking two layers of them, with
a dropout layer after each of them (Srivastava et al., 2014), and a final dense softmax layer, or a time-distributed-dense softmax layer, fed by the BiLSTM output.

Linear CRFs are among the simplest Probabilistic Graphical Models (PGM) and have been successfully used in NLP for sequence classification problems (Lafferty et al., 2001). We did some experiments stacking them after the softmax layer.

Figure 1 shows the most complex DNN structure used in our experiments.

Figure 1: The most complex DNN used in our experiments.

4 Experiments

All the experiments presented in this paper to test the effectiveness of the proposed system refer to two evaluation campaigns organised inside the EVALITA³ framework. In particular, specific tasks to test the performance of Italian PoS-taggers were organised in 2007 and 2009.

³ http://www.evalita.it/

4.1 The EVALITA 2007 evaluation

Two separate data sets were provided: the Development Set (DS), composed of 133,756 tokens, was used for system development and for the training phase, while a Test Set (TS), composed of 17,313 tokens, was used as a reference for system evaluation. Both contain various documents belonging mainly to journalistic and narrative genres, with small sections containing academic and legal/administrative prose. Each participant was allowed to use any available resource or could freely induce it from the training data.

The original PoS-tagging task involved two different tagsets, but our experiments used only the tags and the annotation named ‘EAGLES-like’. The evaluation metrics were based on a token-by-token comparison and only one tag was allowed for each token. The EVALITA metric considered in this study is the Tagging Accuracy, defined as the number of correct PoS-tag assignments divided by the total number of tokens in the TS. See (Tamburini, 2007) for further details.

4.2 The EVALITA 2009 evaluation

The DS consisted of 113,895 word forms (already divided into a training set of 108,874 tokens and a validation set of 5,021 tokens). The TS consisted of 5,066 word forms. The training set is formed by newspaper articles from ‘La Repubblica’, while the validation and test sets contain documents extracted from the Italian Wikipedia. This tests the degree of system adaptation to new domains.

The organisers evaluated the results using a coarse-grained (37 tags) and a morphed (336 tags) tagset inserted in a closed/open task framework, but in this study all the results refer to the open task (one can use external resources) on the coarse-grained tagset. The evaluation metric is the same as the one described in section 4.1. See (Attardi and Simi, 2009) for further details.

4.3 Hyper-Parameters

Considering the large number of hyper-parameters involved in the whole procedure, we did not test all the possible combinations; we used, instead, the most common set-up of parameters gathered from the literature. Table 1 outlines the whole set-up for the unmodified hyper-parameters.

word2vec Embed.            Feature extraction
Hyperpar.     Value        Hyperpar.     Value
type          SkipGr.      window        5
size          100          Learning Params.
(1/2) win.    5            batch (win)   1/4*NU
neg. sampl.   25           batch (seq)   1
sample        1e-4         Opt. Alg.     Adam
iter          15           Loss Func.    Categ.CE

Table 1: Unmodified hyper-parameters and algorithms used in our experiments. NU means the number of hidden or LSTM units per layer (the same for all layers). For Adam refer to (Kingma and Ba, 2015).
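As a rough illustration, the word2vec column of Table 1 can be mapped onto the command-line flags of the original word2vec tool. The sketch below only builds the argument list and does not run any training. The executable path and the file names `coris.txt` and `coris_vectors.bin` are placeholders assumed for the example, and reading ‘(1/2) win. 5’ as the tool's `-window` value is an interpretation rather than something stated in the text.

```python
from typing import Dict, List

# Values copied from the word2vec column of Table 1.
WORD2VEC_HYPERPARAMS: Dict[str, str] = {
    "cbow": "0",       # type SkipGr.: 0 selects the skip-gram model
    "size": "100",     # embedding dimensionality
    "window": "5",     # (1/2) win.: context words on each side (assumed mapping)
    "negative": "25",  # neg. sampl.: number of negative samples
    "sample": "1e-4",  # sub-sampling threshold for frequent words
    "iter": "15",      # number of training iterations
}


def build_word2vec_command(corpus_path: str, output_path: str,
                           executable: str = "./word2vec") -> List[str]:
    """Return the argument list for one word2vec training run."""
    command = [executable, "-train", corpus_path, "-output", output_path]
    for flag, value in WORD2VEC_HYPERPARAMS.items():
        command += ["-" + flag, value]
    return command


if __name__ == "__main__":
    # Placeholder file names: the layout of the CORIS files is not given here.
    print(" ".join(build_word2vec_command("coris.txt", "coris_vectors.bin")))
```

Once the tool and a tokenised corpus are available, the returned list could be passed to `subprocess.run`.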
4.4 The Early Stopping Drama

There are some interesting studies (Bengio, 2012; Prechelt, 2012) dealing with the problem of stopping the learning process at the right point; this issue is known as the ‘early stopping’ problem. Choosing the correct epoch at which to stop the learning process helps to avoid overfitting on the training set and usually produces systems that generalise better. Choosing the correct epoch, however, is not simple. The suggestion given in various studies on this topic is to consider a validation set and stop the learning process when the performance on this set no longer increases or even decreases, a clear hint of overfitting.

Figure 2: The early stopping procedure.
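The generic validation-based criterion just described can be sketched as follows. This is a minimal illustration of the common suggestion from the literature, not the specific procedure adopted in this study; the `patience` parameter (how many epochs without improvement are tolerated before stopping) is an assumption added for the example.

```python
from typing import Sequence


def stopping_epoch(val_accuracy: Sequence[float], patience: int = 3) -> int:
    """Return the index of the best epoch found before validation accuracy
    failed to improve for `patience` consecutive epochs."""
    if not val_accuracy:
        raise ValueError("at least one epoch is required")
    best_epoch = 0
    best_value = val_accuracy[0]
    for epoch in range(1, len(val_accuracy)):
        if val_accuracy[epoch] > best_value:
            # Strict improvement: remember this epoch as the new best.
            best_epoch, best_value = epoch, val_accuracy[epoch]
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop looking further.
            break
    return best_epoch


if __name__ == "__main__":
    # Hypothetical validation accuracies, one value per epoch.
    profile = [0.90, 0.93, 0.95, 0.94, 0.95, 0.94, 0.96]
    print(stopping_epoch(profile))  # prints 2
```

With a small `patience` the late improvement at the last epoch is never reached, which shows why oscillating accuracy profiles make the choice of the stopping point delicate.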
The usual way to set up an experiment following this suggestion involves splitting the gold standard into three different instance sets: the training set, for training, the validation set, to determine the stopping point, and the test set, to evaluate the system. However, we are testing our systems on real evaluation data that has already been split by the organisers into a development set and a test set. Thus, we can divide the development set into training and validation sets for optimising the hyper-parameters and defining the stopping epoch, but, for the final evaluation, we would like to train the final system on the complete development set, to adhere to the evaluation constraints and to benefit from using more training data.

Having two different training procedures for the optimisation and evaluation phases leads to a more complex procedure for determining the stopping epoch. Moreover, the typical accuracy profile for DNN systems is not smooth and oscillates heavily during training. To avoid any problem in determining the stopping point we smoothed all the profiles using a Bézier spline. The procedure we adopted to determine the stopping epoch is the following (see Fig. 2): (1) find the first maximum in the smoothed validation profile (point A); (2) find the corresponding accuracy value on the smoothed training profile (point B); (3) find the point in the smoothed development set profile having the same accuracy as B (point C); (4) select the epoch corresponding to point C as the stopping epoch (D).

4.5 Results

Table 2 outlines the systems’ accuracies for different configurations for both datasets. We can observe that using AnIta morphological information, as well as all the techniques described in section 2.3, improves the systems’ results by more than 1 percentage point. Considering the data structuring described in section 2.4, the management of an entire sentence as a complete sequence allows recurrent configurations to work with larger contexts, producing better results. Adding a CRF layer after the BiLSTM seems to slightly improve the performance, but not in a significant way.

SYSTEM              TA (E07)   TA (E09)   Notes
MLP-256             96.45      95.57      Win=5
MLP-256             97.75      96.84      M,Win=5
2-BiLSTM-256        98.12      97.30      M,Win=5
2-BiLSTM-256        98.14      97.45      M,Seq
2-BiLSTM-256-CRF    98.18      97.48      M,Seq

Table 2: Tagging accuracies (TA) for different configurations for both datasets (E07 = EVALITA 2007, E09 = EVALITA 2009). ‘M’ marks the use of AnIta morphological information.

Table 3 shows the performance of our best system, namely AnIta-BiLSTM-CRF, compared with the three best systems of the considered EVALITA campaigns. In both cases the proposed system ranked first, improving on the best participant by 0.14 points at EVALITA 2007 and by 0.45 points at EVALITA 2009.

5 Conclusions

The proposed system for PoS-tagging, integrating DNNs and a powerful morphological analyser, exhibited very good accuracy results when applied to standard Italian evaluation datasets from the EVALITA campaigns. The information from AnIta proved to be crucial for reaching such accuracy values, as did the stacked BiLSTM networks processing entire sentence sequences.
EVALITA 2007
SYSTEM                 TA
AnIta-BiLSTM-CRF       98.18
FBKirst Zanoli         98.04
UniTn Baroni           97.89
ILCcnrUniPi Lenci      97.65

EVALITA 2009
AnIta-BiLSTM-CRF       97.48
UniPi SemaWiki 2       97.03
UniPi SemaWiki 1       96.73
UniPi SemaWiki 4       96.67

Table 3: Participants’ results with respect to Tagging Accuracy (TA) at EVALITA 2007 and 2009.

We still have to test different DNN configurations and their integration with other kinds of PGMs, as well as to carry out more experiments with different hyper-parameters.

References

Giuseppe Attardi and Maria Simi. 2009. Overview of the EVALITA 2009 Part-of-Speech Tagging Task. In Proc. of Workshop Evalita 2009.

Yoshua Bengio. 2012. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade: Second Edition, pages 437–478. Springer Berlin Heidelberg, Berlin, Heidelberg.

Jason Chiu and Eric Nichols. 2016. Sequential Labeling with Bidirectional LSTM-CNNs. In Proc. International Conf. of Japanese Association for NLP, pages 937–940.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.

Cicero dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proc. of the 31st International Conference on Machine Learning, JMLR, volume 32. JMLR W&CP.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. ArXiv e-prints, 1508.01991.

D.P. Kingma and J.L. Ba. 2015. Adam: a method for stochastic optimization. In Proc. International Conference on Learning Representations - ICLR, pages 1–13.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proc. of Workshop at ICLR.

Lutz Prechelt. 2012. Early Stopping - But When? In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade: Second Edition, pages 53–67. Springer Berlin Heidelberg, Berlin, Heidelberg.

Rema Rossini Favretti, Fabio Tamburini, and Cristiana De Santis. 2002. CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model. In Andrew Wilson, Paul Rayson, and Tony McEnery, editors, A Rainbow of Corpora: Corpus Linguistics and the Languages of the World, pages 27–38. Lincom-Europa, Munich.

Anders Søgaard. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 48–52, Portland, Oregon, USA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Fabio Tamburini and Matias Melandri. 2012. AnIta: a powerful morphological analyser for Italian. In Proc. 8th International Conference on Language Resources and Evaluation - LREC 2012, pages 941–947, Istanbul.

Fabio Tamburini. 2007. EVALITA 2007: the Part-of-Speech Tagging Task. Intelligenza Artificiale, IV(2):4–7.

Fabio Tamburini. 2007b. CORISTagger: a high-performance PoS tagger for Italian. Intelligenza Artificiale, IV(2):14–15.

Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. 2015. A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding. ArXiv e-prints, 1511.00215.