Character Embeddings PoS Tagger vs HMM Tagger for Tweets

Giuseppe Attardi, Maria Simi
Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo, 3
I-56127 Pisa, Italy
{attardi, simi}@di.unipi.it

Abstract

English. The paper describes our submissions to the task on PoS tagging for Italian Social Media Texts (PoSTWITA) at Evalita 2016. We compared two approaches: a traditional HMM trigram PoS tagger and a Deep Learning PoS tagger using both character-level and word-level embeddings. The character-level embeddings performed better, proving that they can provide a finer representation of words that allows coping with the idiosyncrasies and irregularities of the language in microposts.

Italiano. This paper describes our participation in the PoS tagging for Italian Social Media Texts (PoSTWITA) task at Evalita 2016. We compared two approaches: a traditional PoS tagger based on trigram HMMs and a Deep Learning PoS tagger that uses embeddings at both the character and the word level. The character embeddings achieved a better result, showing that they can provide a finer representation of words which makes it possible to handle the idiosyncrasies and irregularities of the language used in microposts.

1 Introduction

The PoS tagging challenge at Evalita 2016 was targeted to the analysis of Italian micropost language, in particular the language of Twitter posts. The organizers provided an annotated training corpus, obtained by annotating a collection of Italian tweets from the earlier Evalita 2014 SENTIPOLC corpus. The annotations follow the guidelines proposed by the Universal Dependencies (UD) project for Italian [1], in particular with respect to tokenization and tag set, with minor changes due to the specificity of the text genre. A few specific tags (EMO, URL, EMAIL, HASHTAG and MENTION) have in fact been added for typical morphological categories in social media texts, such as emoticons and emojis, web URLs, email addresses, hashtags and mentions.

[1] http://universaldependencies.org/it/pos/index.html

The challenge for PoS tagging of microposts consists in dealing with misspelled, colloquial or broken words, as well as in overcoming the lack of context and of proper uppercasing, which provide helpful hints when analysing more standard texts.

We conducted preparatory work that consisted in customizing some available lexical and training resources for the task: sections 2 and 3 describe this process.

We decided to address the research question of comparing the relative performance of two different approaches to PoS tagging: the traditional word-based approach, represented by a Hidden Markov Model PoS tagger, and a Deep Learning approach that exploits character-level embeddings (Ma and Hovy, 2016). Sections 4 and 5 describe the two approaches in detail.

2 Building a larger training resource

The gold training set provided for the task consists in a collection of 6,640 Italian tweets from the Evalita 2014 SENTIPOLC corpus (corresponding to 127,843 word tokens). Given the relatively small size of this resource, we extended it by leveraging existing resources. We used the corpus previously employed in the organization of the Evalita 2009 task on PoS Tagging (Attardi and Simi, 2009), consisting in articles from the newspaper "La Repubblica", some articles from the Italian Wikipedia, portions of the Universal Dependencies Italian corpus and a small collection of annotated Italian tweets. Table 1 provides details of the composition of the training resource.

  Resource          Number of tokens
  repubblica.pos             112,593
  extra.pos                      130
  quest.pos                    9,826
  isst_tanl.pos               80,794
  tut.pos                     97,558
  it-twitter.pos               1,018
  Evalita 2016               121,405
  Total                      423,324

  Table 1. Composition of the training set.

The tag set was converted to the Universal Dependencies schema, taking into account the variants introduced in the task (different tokenization of articulated prepositions and introduction of ADP_A).

During development, the gold dataset provided by the organizers was split into two parts: a subset of about 105,300 tokens was used for training, while the remaining ~22,500 tokens were used as a validation set.

3 Normalization of URLs, emoticons and emojis

In order to facilitate the tagging of the morphological categories specifically introduced for social media texts, we applied a pre-processing step that normalizes word forms. This was done by means of a set of rewriting rules based on regular expressions.

These rules are quite straightforward for URLs, hashtags, emails and mentions, while the identification of emoticons and emojis required a set of carefully handcrafted rules, because of their variety and higher degree of ambiguity.

4 The traditional approach: the TANL tagger

Linear statistical models, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), are often used for sequence labeling (PoS tagging and NER). In our first experiment we used the Tanl PoS tagger, based on a second order HMM.

The Tanl PoS tagger is derived from a rewriting in C++ of HunPos (Halácsy et al., 2007), an open source trigram tagger written in OCaml. The tagger estimates the probability of a sequence of labels t1…tT for a sequence of words w1…wT from the probabilities of trigrams:

  argmax over t1…tT of  ∏ P(wi | ti) P(ti | ti-1, ti-2)

The trigram probabilities are estimated by smoothing, through linear interpolation, the probabilities of unigrams, bigrams and trigrams:

  P(ti | ti-1, ti-2) = λ1 P̂(ti) + λ2 P̂(ti | ti-1) + λ3 P̂(ti | ti-1, ti-2)

where the P̂ are maximum likelihood estimates and λ1 + λ2 + λ3 = 1.

An approximate Viterbi algorithm is used for finding the sequence of tags with the highest probability; it exploits beam search to prune unlikely alternative paths.

The tagger uses a suffix guessing algorithm for dealing with unseen words. It computes the probability distribution of tags for each suffix, by building a trie from the suffixes, up to a maximum length (default 10), of words appearing less than n (default 10) times in the training corpus. Actually two suffix tries are built: one for words beginning with uppercase, one for lowercase words. A word at the beginning of a sentence is looked up in its lowercase variant. Special handling is provided for numbers and HTML entities.

The tagger can also be given a file with a list of possible tags and lemmas for each word, in order to initialize its lexicon. In our experiments we used a lexicon of 130 thousand Italian words.

5 Character-level Embeddings

Traditional techniques of statistical machine learning usually require, to perform best, task-specific selection and tuning of handcrafted features, as well as resources like lexicons or gazetteers, which are costly to develop.

Recently, end-to-end approaches based on Deep Learning architectures have proved to be equally effective, without the use of handcrafted features or any data pre-processing, exploiting word embeddings as the only features.

In order to deal with sequences, Collobert et al. (2011) proposed a Convolutional Neural Network (CNN), trained to maximize the overall sentence-level log-likelihood of tag sequences, which was able to achieve state-of-the-art accuracy on English PoS tagging. More recently, recurrent neural networks (RNN) have been proposed.
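As a concrete illustration of the interpolated trigram estimate given in Section 4, the following sketch computes P(ti | ti-1, ti-2) from n-gram counts. This is only a minimal sketch: the function and parameter names are ours, and the λ weights are arbitrary placeholders, whereas HunPos/Tanl estimate them from data (subject to λ1 + λ2 + λ3 = 1).

```python
from collections import Counter

# Sketch of the interpolated trigram estimate of Section 4.
# The lambda weights are illustrative, not the tagger's actual values.

def interpolated_prob(t, prev1, prev2, unigrams, bigrams, trigrams,
                      lambdas=(0.1, 0.3, 0.6)):
    """P(t | prev1, prev2) as a linear interpolation of ML estimates,
    where prev1 is the previous tag and prev2 the one before it."""
    l1, l2, l3 = lambdas
    total = sum(unigrams.values())
    p1 = unigrams[t] / total
    p2 = bigrams[(prev1, t)] / unigrams[prev1] if unigrams[prev1] else 0.0
    p3 = (trigrams[(prev2, prev1, t)] / bigrams[(prev2, prev1)]
          if bigrams[(prev2, prev1)] else 0.0)
    return l1 * p1 + l2 * p2 + l3 * p3

# Counts collected from a toy tag sequence.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
unigrams = Counter(tags)
bigrams = Counter(zip(tags, tags[1:]))
trigrams = Counter(zip(tags, tags[1:], tags[2:]))
```

Because `Counter` returns 0 for unseen n-grams, the lower-order estimates smooth trigrams that never occur in training, which is the point of the interpolation.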
The word embeddings exploited as features in these systems proved suitable to represent words in well-formed texts like the news articles used in the CoNLL PoS tagging benchmarks.

We conjectured that dealing with the noisy and malformed texts in microposts might require features at a finer level than words, i.e. character-level embeddings. Hence we devised an experiment to explore the effectiveness of combining both character-level and word-level embeddings in PoS tagging of tweets.

We based our experiments on the work by Ma and Hovy (2016), who propose an approach to sequence labeling using a bi-directional long short-term memory (BiLSTM) neural network, a variant of RNN. On top of the BiLSTM, a sequential CRF layer can be used to jointly decode the labels for the whole sentence.

The implementation of the BiLSTM network is done in Lasagne [2], a lightweight library for building and training neural networks in Theano [3].

[2] https://github.com/Lasagne
[3] https://github.com/Theano/Theano

For training the BiLSTM tagger we used word embeddings for tweets, created with the fastText utility [4] (Bojanowski et al., 2016) on a collection of 141 million Italian tweets retrieved over the period from May to September 2016 through the Twitter API. The selection of Italian tweets was achieved by using a query containing a list of the 200 most common Italian words.

[4] https://github.com/facebookresearch/fastText.git

The embeddings were created with dimension 100, using a window of 5 and retaining words with a minimum count of 100, for a total of 245 thousand words.

6 Results

The following table reports the top 9 official scores obtained by participant systems.

  Submission          Accuracy  Correct
  Team1                 0.9319     4435
  Team2                 0.9285     4419
  Team3_UNOFFICIAL      0.9279     4416
  Team4                 0.9270     4412
  Team3                 0.9245     4400
  Team5                 0.9224     4390
  Team5_UNOFFICIAL      0.9184     4371
  UNIPI                 0.9157     4358
  UNIPI_UNOFFICIAL      0.9153     4356

  Table 2. PoSTWITA top official results.

After the submission we performed another experiment with the BiLSTM tagger, increasing the dimension of the word embeddings from 100 to 200, and obtained an accuracy of 92.50% (4402/4759).

To further test the ability of the character-level embeddings to deal completely autonomously with the original writing of tweets, we performed a further experiment in which we supplied the original text of the tweets without normalization. This experiment achieved an accuracy of 91.87% (4372/4759), proving that the RNN character-level approach is indeed capable of learning by itself even unusual tokens, recognizing quite well also emoticons and emojis, without any need for preconceived linguistic knowledge encoded in an ad-hoc rule system.

7 Discussion

While the results of the two approaches, used in the official and unofficial runs, are strikingly close (a difference of only two errors), the two taggers differ significantly in the type of errors they make.

7.1 Error analysis

Table 3 reports a breakdown of the errors over PoS categories, for both systems, in order to appreciate the difference in behaviour. Note that a single PoS mismatch is counted twice, once for each PoS involved. Three cases of misspelled PoS in the gold test were corrected before this analysis.

  PoS        BiLSTM   HMM
  URL             5     2
  EMO            36     6
  DET            32    37
  AUX            27    19
  CONJ            5     2
  NOUN          132   155
  PUNCT           8     5
  MENTION         1     0
  NUM            16    14
  ADP_A           8     7
  ADV            44    51
  VERB_CLIT       4     3
  ADP            26    27
  SCONJ          15    26
  PROPN         136   150
  INTJ           44    34
  VERB          110    83
  X              34    31
  ADJ            67    86
  SYM             3     5
  PRON           42    56
  HASHTAG         1     1
  TOTAL         796   800

  Table 3. Breakdown of errors over PoS types.

As previously mentioned, social media specific tags are not the most difficult problem. To be fair, we noticed that the official BiLSTM run is plagued by a suspiciously high number of errors in identifying EMOs. However, by checking the steps of the experiment, we discovered that this poor performance was due to a mistake in the normalization step.

Confusion between NOUN and PROPN represents the largest source of errors. In the official run there are 66 such errors (35 PROPN tagged as NOUN, 33 NOUN tagged as PROPN), corresponding to nearly 17% of all the errors. The traditional unofficial run does even worse: 19% of its errors are due to this confusion.

Both taggers are weak in dealing with improper use of case (lowercase proper names and all-caps texts), which is very common in Twitter posts. This could be because the training set is still dominated by more regular texts, where case is a strong indication of proper names. In addition, the annotation style chosen for long titles, not fully compliant with UD, makes the task even more difficult. For example, the events "Settimana della moda femminile/Women's fashion week" and "Giornata mondiale vittime dell'amianto/World Day of the victims of asbestos" are annotated as sequences of PROPN in the gold test set, as opposed to using the normal grammatical conventions, as specified in the UD guidelines.

The traditional system is slightly more accurate in predicting the distinction between VERB (main verbs) and AUX (auxiliary and modal verbs): 19 errors against 26.

8 Conclusions

We explored using both a traditional HMM trigram PoS tagger and a Deep Learning PoS tagger that uses both character-level and word-level embeddings, in the analysis of Italian tweets. The latter tagger uses embeddings as its only features, and no lexicon or other linguistic resource. It performs surprisingly well, with an unofficial run that ranks among the top 5. This confirms our conjecture that character-level embeddings are able to cope with the idiosyncrasies and irregular writings of microposts.

Acknowledgments

We gratefully acknowledge the support by the University of Pisa through project PRA and by NVIDIA Corporation through the donation of a Tesla K40 GPU used in the experiments.

References

Giuseppe Attardi and Maria Simi. 2009. Overview of the EVALITA Part-of-Speech Tagging Task. Proceedings of Workshop Evalita 2009, Reggio Emilia.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. https://arxiv.org/abs/1607.04606

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research 12, 2461-2505.

Péter Halácsy, András Kornai and Csaba Oravecz. 2007. HunPos – an open source trigram tagger. Proceedings of the Demo and Poster Sessions of the 45th Annual Meeting of the ACL, pp. 209-212.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pp. 1064-1074, Berlin, Germany, August 2016.
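To make the combined representation discussed in Section 5 concrete, the toy sketch below builds a token vector by concatenating a word-level part with a character-level part. All dimensions and the mean-pooling of character vectors are illustrative assumptions of ours: the actual tagger uses 100/200-dimensional fastText word vectors, and Ma and Hovy (2016) derive the character-level part with a neural network rather than by averaging.

```python
import random

# Toy sketch of a word-level + character-level token representation.
# Dimensions and pooling are made up for illustration only.
WORD_DIM, CHAR_DIM = 4, 3
random.seed(0)
word_table = {}   # word -> embedding (left empty: every word is OOV here)
char_table = {}   # character -> embedding, created on first use

def char_vector(ch):
    """Return (and cache) a random vector standing in for a learned one."""
    if ch not in char_table:
        char_table[ch] = [random.uniform(-1, 1) for _ in range(CHAR_DIM)]
    return char_table[ch]

def token_representation(token):
    """Word vector (zeros if out-of-vocabulary) + mean of character vectors."""
    word_part = word_table.get(token.lower(), [0.0] * WORD_DIM)
    chars = [char_vector(c) for c in token]
    char_part = [sum(v[i] for v in chars) / len(chars)
                 for i in range(CHAR_DIM)]
    return word_part + char_part
```

Even a token never seen as a word, such as an emoticon, still receives a non-trivial character-level part, which is the intuition behind the robustness to unnormalized tweets observed in the experiments above.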