NLP–NITMZ: Part-of-Speech Tagging on Italian Social Media Text using Hidden Markov Model

Partha Pakray
Deptt. of Computer Science & Engg.
National Institute of Technology Mizoram, Aizawl, India
parthapakray@gmail.com

Goutam Majumder
Deptt. of Computer Science & Engg.
National Institute of Technology Mizoram, Aizawl, India
goutam.nita@gmail.com

Abstract

English. This paper describes our approach to Part-of-Speech tagging for Italian Social Media Texts (PoSTWITA), one of the tasks of the EVALITA 2016 campaign. EVALITA is an evaluation campaign in which participating teams submit systems that contribute to the development of Natural Language Processing (NLP) and Speech tools for the Italian language. Our team NLP–NITMZ participated in the PoS tagging challenge for Italian Social Media Texts. A total of 9 teams participated in this task; out of 4759 tags, Team1 correctly identified 4435 tags and obtained the 1st rank. Our team officially ranked 8th, correctly identifying 4091 tags with an accuracy of 85.96%.

Italiano. In this paper we describe our participation in the tagging task for Italian Social Media Texts (PoSTWITA), one of the tasks of the EVALITA 2016 campaign. Nine teams took part in this task; out of 4759 tags, the winning team correctly identified 4435 PoS tags. Our team placed eighth, with 4091 correctly annotated PoS tags and an accuracy of 85.96%.

1 Introduction

EVALITA is an evaluation campaign in which researchers contribute tools for Natural Language Processing (NLP) and Speech for the Italian language. Its main objective is to promote the development of language and speech technologies through a shared framework in which different systems and approaches can be evaluated. EVALITA 2016 is the 5th edition of the campaign, and the following six tasks are organized:

• ArtiPhon – Articulatory Phone Recognition

• FactA – Event Factuality Annotation

• NEEL–IT – Named Entity Recognition and Linking in Italian Tweets

• PoSTWITA – POS tagging for Italian Social Media Texts

• QA4FAQ – Question Answering for Frequently Asked Questions

• SENTIPOLC – SENTIment POLarity Classification

In addition, a new challenge, the IBM Watson Services Challenge, is organized for this edition by IBM Italy. Among these tasks, our team NLP–NITMZ participated in the 4th one, i.e. POS tagging for Italian Social Media Texts (PoSTWITA).

The main concern of PoSTWITA is Part-of-Speech (PoS) tagging for the automatic evaluation of social media texts, in particular micro-blogging texts such as tweets, which have many applications such as identifying trends and upcoming events in various fields. For these applications, NLP-based methods need to be adapted to obtain reliable processing of such text. In the literature, various attempts have already been made to develop such specialised tools for other languages (Derczynski et al., 2013; Neunerdt et al., 2013; Pakray et al., 2015; Majumder et al., 2016), but Italian lacks such resources, both regarding annotated corpora and specific PoS-tagging tools. For these reasons, EVALITA 2016 proposes the domain adaptation of PoS-taggers to Twitter texts.

For this task, we used supervised learning for PoS tagging; the details of the system implementation are given in Section 2. We discuss the performance of the system in Section 3. Finally, we conclude in Section 4.
2 Proposed Method

For this task, we used a supervised learning approach to build the model. We first formulate PoS tagging with a conditional model, and then, to simplify the model, we use a generative model based on Bayesian classification. This generative model is further simplified with two key assumptions to obtain a bigram Hidden Markov Model (HMM).

2.1 Conditional Model Approach

In machine learning, a supervised problem is defined over a set of training examples (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}), where each input x^{(i)} is paired with an output label y^{(i)}. In this task, our goal is to learn a function f : X \rightarrow Y, where X and Y refer to the sets of possible inputs and labels. For the PoS tagging problem, each input is a sequence of words x_1^{(i)}, \ldots, x_{n_i}^{(i)} and each label is a sequence of tags y_1^{(i)}, \ldots, y_{n_i}^{(i)}, where n_i is the length of the i-th training example. In our setting, each input x is an Italian sentence and each label y is the corresponding sequence of PoS tags. We use a conditional model to define the function f(x), based on the conditional probability p(y \mid x) for any (x, y) pair. We use the training examples to estimate the parameters of the model, and the output of the model for a given test example x is

    f(x) = \arg\max_{y \in Y} p(y \mid x)    (1)

Thus we take the most likely label y as the output of the trained model. If the model p(y \mid x) is close to the true conditional distribution of labels given inputs, the function f(x) can be considered optimal.

2.2 Generative Model

In this model, we use Bayes' rule to transform Eq. 1 into a set of other probabilities, called a generative model. Instead of estimating the conditional probability p(y \mid x) directly, a generative model applies Bayesian classification over (x, y) pairs. In this case, we break down the joint probability p(x, y) as follows:

    p(x, y) = p(y)\, p(x \mid y)    (2)

and then estimate the models p(y) and p(x \mid y) separately. We consider p(y) to be a prior probability distribution over labels y, and p(x \mid y) the probability of generating the input x given that the underlying label is y.

We use Bayes' rule to derive the conditional probability p(y \mid x) for any (x, y) pair:

    p(y \mid x) = \frac{p(y)\, p(x \mid y)}{p(x)}    (3)

where

    p(x) = \sum_{y \in Y} p(x, y) = \sum_{y \in Y} p(y)\, p(x \mid y)    (4)

Since p(x) does not depend on y, we can apply Bayes' rule directly to a new test example x, so the output of the model f(x) can be estimated as follows:

    f(x) = \arg\max_{y} p(y)\, p(x \mid y)    (5)

To simplify Eq. 5, we use a Hidden Markov Model (HMM) tagger with two simplifying assumptions. The first assumption is that the probability of a word appearing depends only on its own PoS tag:

    p(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} p(w_i \mid t_i)    (6)

where p(w_i \mid t_i) is the probability of word w_i given its tag t_i. The second assumption is that the probability of a tag appearing depends only on the previous tag, rather than on the entire tag sequence. This is known as the bigram assumption:

    p(t_1^n) \approx \prod_{i=1}^{n} p(t_i \mid t_{i-1})    (7)

Incorporating these two assumptions into Eq. 5, a bigram tagger estimates the most probable tag sequence as follows:

    \hat{t}_1^n = \arg\max_{t_1^n} p(t_1^n \mid w_1^n) \approx \arg\max_{t_1^n} \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})    (8)
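To make Eq. 8 concrete, the following is a minimal Python sketch of a bigram HMM tagger: the emission probabilities p(w_i | t_i) and transition probabilities p(t_i | t_{i-1}) are estimated by relative-frequency counts from tagged sentences, and Viterbi decoding recovers the arg max of Eq. 8. This is an illustrative sketch rather than our submitted system; the function names, the toy training data and the add-one smoothing used for unseen words and tag pairs are assumptions made for the example.

from collections import defaultdict

def train(tagged_sentences):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs
    emit = defaultdict(lambda: defaultdict(int))   # tag -> word -> count
    trans = defaultdict(lambda: defaultdict(int))  # previous tag -> tag -> count
    tag_count = defaultdict(int)                   # tag -> count
    for sentence in tagged_sentences:
        prev = "<s>"                               # sentence-start pseudo-tag
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            tag_count[tag] += 1
            prev = tag
    return emit, trans, tag_count

def viterbi(words, emit, trans, tag_count):
    # Returns the tag sequence maximising prod_i p(w_i|t_i) p(t_i|t_{i-1}), i.e. Eq. 8.
    tags = list(tag_count)
    vocab = {w for t in emit for w in emit[t]}
    V = len(vocab) + 1                             # +1 slot for unseen words

    def p_emit(word, tag):                         # p(w_i | t_i), add-one smoothed (assumption)
        return (emit[tag][word] + 1) / (tag_count[tag] + V)

    def p_trans(prev, tag):                        # p(t_i | t_{i-1}), add-one smoothed (assumption)
        return (trans[prev][tag] + 1) / (sum(trans[prev].values()) + len(tags))

    # best[i][t]: probability of the best tag sequence for words[:i+1] ending with tag t
    best = [{t: p_trans("<s>", t) * p_emit(words[0], t) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev_best = max(tags, key=lambda p: best[i - 1][p] * p_trans(p, t))
            best[i][t] = best[i - 1][prev_best] * p_trans(prev_best, t) * p_emit(words[i], t)
            back[i][t] = prev_best
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy usage with hypothetical training data
corpus = [[("il", "DET"), ("governo", "NOUN"), ("cade", "VERB")],
          [("il", "DET"), ("voto", "NOUN")]]
emit, trans, tag_count = train(corpus)
print(viterbi(["il", "governo"], emit, trans, tag_count))  # expected: ['DET', 'NOUN']

A tagger trained on the PoSTWITA development set would estimate the same two quantities from the annotated tweets instead of this toy corpus; only the size of the count tables changes.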
3 Experiment Results

3.1 Dataset

For the proposed task, the organizers re-used the tweets that were part of the EVALITA 2014 SENTIPOLC corpus. Its development and test sets were first annotated manually, for a global amount of 4,041 and 1,749 tweets respectively, and distributed as the new development set. Then a new manually annotated test set, composed of 600 and 700 tweets, was produced using texts from the same period of time. All the annotations were carried out by three different annotators. Furthermore, a tokenised version of the texts was also distributed, in order to avoid tokenisation discrepancies among participants and the problem of disappeared tweets.

3.2 Results

For this task, a total of 13 runs were submitted by 9 teams; 4 of these runs were unofficial. Table 1 lists all the results for this task.

    Rank  Team   Successful Tags  Accuracy (%)
    1     Team1  4435             93.19
    2     Team2  4419             92.86
    3     Team3  4416             92.79
    4     Team4  4412             92.70
    5     Team3  4400             92.46
    6     Team5  4390             92.25
    7     Team5  4371             91.85
    8     Team6  4358             91.57
    9     Team6  4356             91.53
    10    Team7  4183             87.89
    11    Team8  4091             85.96
    12    Team2  3892             81.78
    13    Team9  3617             76.00

Table 1: Tagging Accuracy of Participated Teams

Teams 2, 3, 5 and 6 each submitted one unofficial run in addition to the compulsory one; these unofficial submissions are ranked 12th, 3rd, 7th and 9th respectively, and are listed in Table 1 together with the other runs. Our team NLP–NITMZ appears as Team8 and is ranked 11th in this task.

3.3 Comparison with other submissions

In this competition, a total of 4759 words were given for tagging. These words were categorised into 22 PoS tags, and our team successfully tagged 4091 words, with 668 unsuccessful tags. The 1st ranked team successfully tagged 4435 words, while the last positioned team, i.e. Team9, correctly identified 3617 tags. In Table 2, we provide the tag-wise statistics of our system.

    Sl. No.  Tag        Successful Tags
    1        PRON       292
    2        AUX        82
    3        PROPN      283
    4        EMO        30
    5        SYM        8
    6        NUM        63
    7        ADJ        145
    8        SCONJ      37
    9        ADP        332
    10       URL        117
    11       DET        288
    12       HASHTAG    114
    13       ADV        281
    14       VERB_CLIT  10
    15       PUNCT      582
    16       VERB       443
    17       CONJ       122
    18       X          3
    19       INTJ       50
    20       MENTION    186
    21       ADP_A      144
    22       NOUN       479

Table 2: Tag-wise Statistics of the NLP–NITMZ Team

4 Conclusion

This PoS tagging task of the EVALITA 2016 campaign targets the Italian language, and our system ranked 11th in the task of POS tagging for Italian Social Media Texts. We would also like to mention that the authors are not native speakers of Italian. We built a supervised learning model based on the knowledge available in the training dataset.

Acknowledgements

The work presented here was carried out under research project Grant No. YSS/2015/000988, supported by the Department of Science & Technology (DST) and the Science and Engineering Research Board (SERB), Govt. of India. The authors also acknowledge the Department of Computer Science & Engineering of the National Institute of Technology Mizoram, India, for providing infrastructural facilities.

References

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data. In RANLP, pages 198–206.

Melanie Neunerdt, Bianka Trevisan, Michael Reyer, and Rudolf Mathar. 2013. Part-of-speech tagging for social media texts. In Language Processing and Knowledge in the Web, pages 139–150. Springer Berlin Heidelberg.

Partha Pakray, Arunagshu Pal, Goutam Majumder, and Alexander Gelbukh. 2015. Resource Building and Parts-of-Speech (POS) Tagging for the Mizo Language. In Fourteenth Mexican International Conference on Artificial Intelligence (MICAI), pages 3–7. IEEE, October.

Goutam Majumder, Partha Pakray, and Alexander Gelbukh. 2016. Literature Survey: Multiword Expressions (MWE) for Mizo Language. In 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), to be published as an issue of Lecture Notes in Computer Science, Springer. Konya, Turkey, April.