(Better than) State-of-the-Art PoS-tagging for Italian Texts

                                        Fabio Tamburini
                               FICLIT - University of Bologna, Italy
                                fabio.tamburini@unibo.it


Abstract

English. This paper presents some experiments for the construction of a high-performance PoS-tagger for Italian using deep neural network (DNN) techniques integrated with a powerful Italian morphological analyser. The results obtained by the proposed system on standard datasets taken from the EVALITA campaigns show large accuracy improvements when compared with previous systems from the literature.

Italiano. Questo contributo presenta alcuni esperimenti per la costruzione di un PoS-tagger ad alte prestazioni per l’italiano utilizzando reti neurali ‘deep’ integrate con un potente analizzatore morfologico. I risultati ottenuti sui dataset delle campagne EVALITA da parte del sistema proposto mostrano incrementi di accuratezza piuttosto rilevanti in confronto ai precedenti sistemi in letteratura.

1   Introduction

In recent years a large number of works have tried to push the accuracy of the PoS-tagging task forward using new techniques, mainly from the deep learning domain (Collobert et al., 2011; Søgaard, 2011; dos Santos and Zadrozny, 2014; Huang et al., 2015; Wang et al., 2015; Chiu and Nichols, 2016).

All these studies are mainly devoted to showing how to find the best combination of new neural network structures and character/word embeddings for reaching the highest classification performance, and typically present solutions that do not make any use of specific language resources (e.g. morphological analysers, gazetteers, guessing procedures for unknown words, etc.). This is, in general, a very desirable feature because it allows for the production of tools not tied to any specific language, but in various evaluation campaigns, at least for highly-inflected languages such as Italian, the results showed quite clearly that this task would benefit from the use of specific and rich language resources (Tamburini, 2007; Attardi and Simi, 2009).

In this study, still a work in progress, we set up a PoS-tagger for Italian able to reach the highest classification performance by using any available language resource and the most up-to-date DNNs. We used AnIta (Tamburini and Melandri, 2012), one of the most powerful morphological analysers for Italian, based on a wide lexicon (about 110,000 lemmas), to provide the PoS-tagger with a large set of useful information.

2   Input features

The set of input features for each token is basically formed by two different components: the word embedding and some morphological information.

2.1   Word Embeddings

All the embeddings used in our experiments were extracted from the CORIS corpus (Rossini Favretti et al., 2002), a 130Mw synchronic reference corpus for Italian, by using the tool word2vec[1] (Mikolov et al., 2013). We added two special tokens to mark the sentence beginning ‘<s>’ and ending ‘</s>’.

[1] https://code.google.com/archive/p/word2vec/

2.2   Morphological features

One of the most useful kinds of information for increasing the performance of PoS-taggers is the list of all possible tags for a single word-form. Having a restricted list of possibilities enables the tagger to reduce the search space and forces it to take reasonable decisions. The results obtained
in past PoS-tagger evaluations on Italian agree in suggesting that powerful morphological analysers based on large lexica are invaluable resources for increasing tagger accuracy. For these reasons, we extended the word embeddings, computed in a completely unsupervised way, by concatenating to them a vector containing the possible PoS-tags provided by the AnIta analyser. This tool is also able to identify, through the use of simple regular expressions, numbers, dates, URLs, emails, etc., and assign them the proper tag(s).

2.3   Unknown words handling and Sentence padding

The source of most tagging errors is certainly the presence of the so-called ‘unknown words’, word-forms for which the tagger did not receive any information during the training phase. A morphological analyser based on a large lexicon can certainly alleviate this problem by providing information also for word-forms not belonging to the training set, but there are large classes of tokens that cannot be successfully handled by the analyser, for example proper names, foreign words, etc.

In a previous work (Tamburini, 2007b) we showed that, using such a powerful morphological analyser, 95% of the word-forms not covered by it in real texts belong to the classes of proper names, adjectives and common nouns, and a simple heuristic correctly assigns most of the cases. In this way AnIta always provides one or more PoS-tag hypotheses for each word-form, which can be transformed into a binary vector with 1s in correspondence of possible PoS-tags and 0s otherwise; but if the word-form did not have a computed embedding, the first part of the input features would not be defined. To solve this problem, instead of using the common solution of assigning a random vector to all unknown words, we averaged the embeddings of all the other words presenting exactly the same combination of possible PoS-tags.

It is also common practice to pad sentences, at the beginning and at the end, using random vectors; we, instead, used the real embeddings computed for the special tokens ‘<s>’ and ‘</s>’, added for this purpose, with the respective tags ‘BoS’ and ‘EoS’. Due to the internal structure of the tensor manipulation library used (see later), we were also forced to add an out-of-sentence vector to pad sentences to their maximal length, with the corresponding tag ‘OoS’.

2.4   Data structuring

We experimented with two different ways of structuring the input features for processing:

• Win: this mode of organising input data is based on a sliding window that starts from the beginning of each sentence and concatenates word feature vectors into one single vector. Padding is inserted at sentence borders.
• Seq: each sentence is managed as one single sequence padded at the borders.

Each network tested in this study uses one of these two data structuring types.

3   (Deep) Learning Blocks

All the experiments presented in this paper have been performed using Keras[2], “a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano”, two widely used tensor manipulation libraries. Keras provides some basic neural network blocks as well as different learning procedures for the desired network configuration and simple tools for writing new blocks. In our experiments we used some of them, namely multilayer perceptrons (MLP) and Long Short-Term Memory (LSTM), and we wrote a new block to handle Conditional Random Fields (CRF).

[2] https://github.com/fchollet/keras/tree/master/keras

MLPs are simple feedforward neural networks with one or more fully-connected hidden layers. We obtained maximum performance using only one hidden layer.

LSTM networks (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) are a kind of recurrent neural network which has received a lot of attention in recent years due to their ability to produce good classification results for sequence problems. Their property of mitigating the vanishing (and exploding) gradient problem that affects standard recurrent neural networks made them the default choice for solving sequence classification problems inside the DNN framework. Usually units of this kind are arranged to form a bidirectional chain (BiLSTM) for gathering information both from the past and from the future of the input data sequence, a very desirable property for this kind of classification problem. In all our experiments using BiLSTM we obtained maximum performance by stacking two layers of them, with
a dropout layer after each of them (Srivastava et al., 2014), and a final dense softmax layer, or a time-distributed dense softmax layer, fed by the BiLSTM output.

Linear CRFs are the simplest Probabilistic Graphical Models (PGM) and have been successfully used in NLP for sequence classification problems (Lafferty et al., 2001). We did some experiments stacking them after the softmax layer.

Figure 1 shows the most complex DNN structure used in our experiments.

Figure 1: The most complex DNN used in our experiments.

4   Experiments

All the experiments presented in this paper to test the effectiveness of the proposed system refer to two evaluation campaigns organised inside the EVALITA[3] framework. In particular, specific tasks to test the performance of Italian PoS-taggers were organised in 2007 and 2009.

[3] http://www.evalita.it/

4.1   The EVALITA 2007 evaluation

Two separate data sets were provided: the Development Set (DS), composed of 133,756 tokens, was used for system development and for the training phase, while a Test Set (TS), composed of 17,313 tokens, was used as a reference for system evaluation. Both contain various documents belonging mainly to journalistic and narrative genres, with small sections containing academic and legal/administrative prose. Each participant was allowed to use any available resource or could freely induce it from the training data.

The original PoS-tagging task involved two different tagsets, but our experiments used only the tags and the annotation named ‘EAGLES-like’.

The evaluation metrics were based on a token-by-token comparison and only one tag was allowed for each token. The EVALITA metric considered in this study is the Tagging Accuracy, defined as the number of correct PoS-tag assignments divided by the total number of tokens in the TS. See (Tamburini, 2007) for further details.

4.2   The EVALITA 2009 evaluation

The DS consisted of 113,895 word forms (already divided into a training set of 108,874 tokens and a validation set of 5,021 tokens). The TS consisted of 5,066 word forms. The training set is formed by newspaper articles from ‘La Repubblica’, while the validation and test sets contain documents extracted from the Italian Wikipedia. This tests the degree of system adaptation to new domains.

The organisers evaluated the results using a coarse-grained (37 tags) and a morphed (336 tags) tagset inserted in a closed/open task framework, but in this study all the results refer to the open task (one can use external resources) on the coarse-grained tagset. The evaluation metric is the same as described in section 4.1. See (Attardi and Simi, 2009) for further details.

4.3   Hyper-Parameters

Considering the large number of hyper-parameters involved in the whole procedure, we did not test all the possible combinations; we used, instead, the most common set-up of parameters gathered from the literature. Table 1 outlines the whole set-up for the unmodified hyper-parameters.

Group                 Hyperpar.      Value
word2vec Embed.       type           SkipGr.
                      size           100
                      (1/2) win.     5
                      neg. sampl.    25
                      sample         1e-4
                      iter           15
Feature extraction    window         5
Learning Params.      batch (win)    1/4*NU
                      batch (seq)    1
                      Opt. Alg.      Adam
                      Loss Func.     Categ.CE

Table 1: Unmodified hyper-parameters and algorithms used in our experiments. NU means the number of hidden or LSTM units per layer (the same for all layers). For Adam refer to (Kingma and Ba, 2015).
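
As a reading aid, the settings of Table 1 can be transcribed into a small configuration sketch. The dictionary keys, helper names and file paths below are hypothetical. Two further points are assumptions rather than details stated in the paper: that the ‘(1/2) win.’ entry maps to the -window flag of the word2vec command-line tool, and that 1/4*NU is rounded down to an integer.

```python
from typing import Dict, List, Union

# Hypothetical transcription of Table 1 (key names are illustrative only).
WORD2VEC_PARAMS: Dict[str, Union[str, int, float]] = {
    "type": "skip-gram",     # 'SkipGr.' in Table 1
    "size": 100,             # embedding dimensionality
    "half_window": 5,        # '(1/2) win.' in Table 1
    "negative_samples": 25,
    "sample": 1e-4,
    "iterations": 15,
}
FEATURE_WINDOW = 5           # 'window' under Feature extraction
OPTIMISER = "Adam"
LOSS = "categorical cross-entropy"


def batch_size(mode: str, num_units: int) -> int:
    """Batch size as listed in Table 1: NU/4 in Win mode, 1 in Seq mode.

    Integer division is an assumption; the table only states 1/4*NU.
    """
    if mode == "win":
        return max(1, num_units // 4)
    if mode == "seq":
        return 1
    raise ValueError(f"unknown data structuring mode: {mode!r}")


def word2vec_args(corpus_path: str, output_path: str) -> List[str]:
    """Build (without running) an argument list for the word2vec tool.

    The mapping from Table 1 entries to flags is an assumption.
    """
    p = WORD2VEC_PARAMS
    return [
        "word2vec",
        "-train", corpus_path,
        "-output", output_path,
        "-cbow", "0" if p["type"] == "skip-gram" else "1",
        "-size", str(p["size"]),
        "-window", str(p["half_window"]),
        "-negative", str(p["negative_samples"]),
        "-sample", str(p["sample"]),
        "-iter", str(p["iterations"]),
    ]


if __name__ == "__main__":
    print(batch_size("win", 256))
    print(" ".join(word2vec_args("corpus.txt", "vectors.txt")))
```

If the 256 in system names such as MLP-256 denotes NU, the Win-mode batch size under this reading would be 64.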
4.4   The Early Stopping Drama

There are some interesting studies (Bengio, 2012; Prechelt, 2012) dealing with the problem of stopping the learning process at the right point; this issue is known as the ‘early stopping’ problem. Choosing the correct epoch to stop the learning process helps to avoid overfitting on the training set and usually produces systems exhibiting better generalisation. However, choosing the correct epoch is not simple. The suggestion given in various studies on this topic is to consider a validation set and stop the learning process when the performance on this set stops increasing or even decreases, a clear hint of overfitting.

Figure 2: The early stopping procedure.
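
The generic validation-based suggestion described in the paragraph above can be illustrated with a minimal sketch. It is not the procedure adopted in this paper: the function name and the `patience` parameter (how many epochs without improvement are tolerated) are assumptions introduced only for illustration.

```python
from typing import Sequence


def early_stopping_epoch(val_acc: Sequence[float], patience: int = 3) -> int:
    """Return the 0-based epoch with the best validation accuracy seen
    before `patience` consecutive epochs pass without improvement."""
    if not val_acc:
        raise ValueError("val_acc must contain at least one epoch")
    best_epoch = 0
    best_acc = val_acc[0]
    for epoch, acc in enumerate(val_acc[1:], start=1):
        if acc > best_acc:
            # New best validation accuracy: remember where it occurred.
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop scanning.
            break
    return best_epoch


if __name__ == "__main__":
    profile = [0.90, 0.93, 0.95, 0.94, 0.95, 0.94, 0.96]
    print(early_stopping_epoch(profile, patience=3))
```

On an oscillating profile the chosen epoch depends strongly on `patience`, so an unsmoothed profile makes the stopping point hard to pin down.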
The usual way to set up an experiment following this suggestion involves splitting the gold standard into three different instance sets: the training set, for training; the validation set, to determine the stopping point; and the test set, to evaluate the system. However, we are testing our systems on real evaluation data that has already been split by the organisers into development and test sets. Thus, we can divide the development set into training/validation sets for optimising the hyper-parameters and defining the stopping epoch but, for the final evaluation, we would like to train the final system on the complete development set to adhere to the evaluation constraints and to benefit from using more training data.

Having two different training procedures for the optimisation and evaluation phases leads to a more complex procedure for determining the stopping epoch. Moreover, the typical accuracy profile for DNN systems is not smooth and oscillates heavily during training. To avoid any problem in determining the stopping point we smoothed all the profiles using a Bézier spline. The procedure we adopted to determine the stopping epoch is (see Fig. 2): (1) find the first maximum in the smoothed validation profile - A; (2) find the corresponding value of accuracy on the smoothed training profile - B; (3) find the point in the smoothed development set profile having the same accuracy as in B - C; (4) select the epoch corresponding to point C as the stopping epoch - D.

4.5   Results

Table 2 outlines the systems’ accuracies for different configurations for both datasets. We can observe that using AnIta morphological information, as well as all the techniques described in section 2.3, improves the systems’ results by more than 1% (for MLP-256, from 96.45 to 97.75 on EVALITA 2007 and from 95.57 to 96.84 on EVALITA 2009). Considering the data structuring described in section 2.4, the management of an entire sentence as a complete sequence allows recurrent configurations to work with larger contexts, producing better results. Adding a CRF layer after the BiLSTM seems to slightly improve the performance, but not in a significant way.

SYSTEM                TA (E07)   TA (E09)   Notes
MLP-256               96.45      95.57      Win=5
MLP-256               97.75      96.84      M,Win=5
2-BiLSTM-256          98.12      97.30      M,Win=5
2-BiLSTM-256          98.14      97.45      M,Seq
2-BiLSTM-256-CRF      98.18      97.48      M,Seq

Table 2: Tagging accuracies (TA) for different configurations for both datasets (E07: EVALITA 2007; E09: EVALITA 2009). ‘M’ marks the use of AnIta morphological information.

In Table 3 we can see the performance of our best system, namely AnIta-BiLSTM-CRF, compared with the three best systems of the considered EVALITA campaigns. In both cases the proposed system ranked first, improving on the best previous score by 0.14 points on EVALITA 2007 and by 0.45 points on EVALITA 2009.

5   Conclusions

The proposed system for PoS-tagging, integrating DNNs and a powerful morphological analyser, exhibited very good accuracy results when applied to standard Italian evaluation datasets from the EVALITA campaigns. The information from AnIta proved to be crucial for reaching such accuracy values, as did stacked BiLSTM networks processing entire sentence sequences.
We still have to test different DNN configurations and their integration with other kinds of PGMs, as well as run more experiments with different hyper-parameters.

EVALITA 2007
SYSTEM                TA
AnIta-BiLSTM-CRF      98.18
FBKirst Zanoli        98.04
UniTn Baroni          97.89
ILCcnrUniPi Lenci     97.65

EVALITA 2009
SYSTEM                TA
AnIta-BiLSTM-CRF      97.48
UniPi SemaWiki 2      97.03
UniPi SemaWiki 1      96.73
UniPi SemaWiki 4      96.67

Table 3: Participants’ results with respect to Tagging Accuracy (TA) at EVALITA 2007 and 2009.

References

Giuseppe Attardi and Maria Simi. 2009. Overview of the EVALITA 2009 Part-of-Speech Tagging Task. In Proc. of Workshop Evalita 2009.

Yoshua Bengio. 2012. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade: Second Edition, pages 437–478. Springer Berlin Heidelberg, Berlin, Heidelberg.

Jason Chiu and Eric Nichols. 2016. Sequential Labeling with Bidirectional LSTM-CNNs. In Proc. International Conf. of Japanese Association for NLP, pages 937–940.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.

Cicero dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proc. of the 31st International Conference on Machine Learning, JMLR, volume 32. JMLR W&CP.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. ArXiv e-prints, 1508.01991.

D.P. Kingma and J.L. Ba. 2015. Adam: a method for stochastic optimization. In Proc. International Conference on Learning Representations - ICLR, pages 1–13.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proc. of Workshop at ICLR.

Lutz Prechelt. 2012. Early Stopping - But When? In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade: Second Edition, pages 53–67. Springer Berlin Heidelberg, Berlin, Heidelberg.

Rema Rossini Favretti, Fabio Tamburini, and Cristiana De Santis. 2002. CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model. In Andrew Wilson, Paul Rayson, and Tony McEnery, editors, A Rainbow of Corpora: Corpus Linguistics and the Languages of the World, pages 27–38. Lincom-Europa, Munich.

Anders Søgaard. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 48–52, Portland, Oregon, USA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Fabio Tamburini and Matias Melandri. 2012. AnIta: a powerful morphological analyser for Italian. In Proc. 8th International Conference on Language Resources and Evaluation - LREC 2012, pages 941–947, Istanbul.

Fabio Tamburini. 2007. EVALITA 2007: the Part-of-Speech Tagging Task. Intelligenza Artificiale, IV(2):4–7.

Fabio Tamburini. 2007b. CORISTagger: a high-performance PoS tagger for Italian. Intelligenza Artificiale, IV(2):14–15.

Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. 2015. A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding. ArXiv e-prints, 1511.00215.