NLP–NITMZ: Part-of-Speech Tagging on Italian Social Media Text using Hidden Markov Model

Partha Pakray
Deptt. of Computer Science & Engg.
National Institute of Technology Mizoram, Aizawl, India
parthapakray@gmail.com

Goutam Majumder
Deptt. of Computer Science & Engg.
National Institute of Technology Mizoram, Aizawl, India
goutam.nita@gmail.com

Abstract

English. This paper describes our approach to Part-of-Speech tagging for Italian Social Media Texts (PoSTWITA), one of the tasks of the EVALITA 2016 campaign. EVALITA is an evaluation campaign in which participating teams submit systems that contribute to the development of Natural Language Processing (NLP) and Speech tools for the Italian language. Our team NLP–NITMZ participated in the PoS tagging challenge for Italian Social Media Texts. A total of 9 teams participated in this task; out of 4759 tags, Team1 correctly identified 4435 tags and obtained the 1st rank. Our team officially ranked 8th, correctly identifying 4091 tags with an accuracy of 85.96%.

Italiano. In this paper we describe our participation in the tagging task for Italian Social Media Texts (PoSTWITA), one of the tasks of the EVALITA 2016 campaign. Nine teams took part in this task; out of 4759 tags, the winning team correctly identified 4435 PoS tags. Our team placed eighth, with 4091 correctly annotated PoS tags and an accuracy of 85.96%.

1 Introduction

EVALITA is an evaluation campaign in which researchers contribute tools for Natural Language Processing (NLP) and Speech for the Italian language. Its main objective is to promote the development of language and speech technologies through a shared framework in which different systems and approaches can be evaluated. EVALITA 2016 is the 5th edition of the campaign, and the following six tasks are organized:

• ArtiPhon – Articulatory Phone Recognition

• FactA – Event Factuality Annotation

• NEEL–IT – Named Entity Recognition and Linking in Italian Tweets

• PoSTWITA – POS tagging for Italian Social Media Texts

• QA4FAQ – Question Answering for Frequently Asked Questions

• SENTIPOLC – SENTIment POLarity Classification

In addition, a new challenge, the IBM Watson Services Challenge, is organized for this edition by IBM Italy. Among these tasks, our team NLP–NITMZ participated in the 4th one, i.e. POS tagging for Italian Social Media Texts (PoSTWITA).

The main concern of PoSTWITA is Part-of-Speech (PoS) tagging for the automatic evaluation of social media texts, in particular micro-blogging texts such as tweets, which have many applications such as identifying trends and upcoming events in various fields. For these applications, NLP-based methods need to be adapted to obtain reliable processing of such text. In the literature, various attempts have already been made to develop such specialised tools for other languages (Derczynski et al., 2013; Neunerdt et al., 2013; Pakray et al., 2015; Majumder et al., 2016), but Italian lacks such resources, both regarding annotated corpora and specific PoS-tagging tools. For these reasons, EVALITA 2016 proposes the domain adaptation of PoS-taggers to Twitter texts.

For this task, we used supervised learning for PoS tagging; the details of the system implementation are given in Section 2. We discuss the performance of the system in Section 3. Finally, we conclude in Section 4.
2 Proposed Method

For this task, we used a supervised learning approach to build the model. We first formulate PoS tagging with a conditional model, and then, to simplify the model, we use a generative model based on Bayesian classification. This generative model is further simplified with two key assumptions to obtain a bigram Hidden Markov Model (HMM).

2.1 Conditional Model Approach

In machine learning, a supervised problem is defined over a set of training examples (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}), where each input x^{(i)} is paired with an output label y^{(i)}. In this task, our goal is to learn a function f : X \rightarrow Y, where X and Y refer to the sets of possible inputs and labels. For the PoS tagging problem, each input is a sequence of words x_1^{(i)}, \ldots, x_{n_i}^{(i)} and each label is a sequence of tags y_1^{(i)}, \ldots, y_{n_i}^{(i)}, where n_i is the length of the i-th training example. In our setting, each input x is an Italian sentence and each label y is the corresponding sequence of PoS tags. We use a conditional model to define the function f(x), based on the conditional probability p(y \mid x) for any (x, y) pair. We use the training examples to estimate the parameters of the model, and the output of the model for a given test example x is

    f(x) = \arg\max_{y \in Y} p(y \mid x)    (1)

Thus we take the most likely label y as the output of the trained model. If the model p(y \mid x) is close to the true conditional distribution of labels given inputs, the function f(x) can be considered optimal.

2.2 Generative Model

In this model, we use Bayes' rule to transform Eq. 1 into a set of other probabilities, called a generative model. Instead of estimating the conditional probability p(y \mid x) directly, a generative model applies Bayesian classification over (x, y) pairs. In this case, we break down the joint probability p(x, y) as follows:

    p(x, y) = p(y)\, p(x \mid y)    (2)

and then estimate the models p(y) and p(x \mid y) separately. We consider p(y) to be a prior probability distribution over labels y, and p(x \mid y) the probability of generating the input x given that the underlying label is y.

We use Bayes' rule to derive the conditional probability p(y \mid x) for any (x, y) pair:

    p(y \mid x) = \frac{p(y)\, p(x \mid y)}{p(x)}    (3)

where

    p(x) = \sum_{y \in Y} p(x, y) = \sum_{y \in Y} p(y)\, p(x \mid y)    (4)

Since p(x) does not depend on y, we can apply Bayes' rule directly to a new test example x, so the output of the model f(x) can be estimated as follows:

    f(x) = \arg\max_{y} p(y)\, p(x \mid y)    (5)

To simplify Eq. 5, we use a Hidden Markov Model (HMM) tagger with two simplifying assumptions. The first assumption is that the probability of a word appearing depends only on its own PoS tag:

    p(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} p(w_i \mid t_i)    (6)

where p(w_i \mid t_i) is the probability of word w_i given its tag t_i. The second assumption is that the probability of a tag appearing depends only on the previous tag, rather than on the entire tag sequence. This is known as the bigram assumption:

    p(t_1^n) \approx \prod_{i=1}^{n} p(t_i \mid t_{i-1})    (7)

Incorporating these two assumptions into Eq. 5, a bigram tagger estimates the most probable tag sequence as follows:

    \hat{t}_1^n = \arg\max_{t_1^n} p(t_1^n \mid w_1^n) \approx \arg\max_{t_1^n} \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})    (8)
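To make Eq. 8 concrete, the following is a minimal Python sketch of a bigram HMM tagger: the emission probabilities p(w_i | t_i) and transition probabilities p(t_i | t_{i-1}) are estimated by relative-frequency counts from tagged sentences, and Viterbi decoding recovers the arg max of Eq. 8. This is an illustrative sketch rather than our submitted system; the function names, the toy training data and the add-one smoothing used for unseen words and tag pairs are assumptions made for the example.

from collections import defaultdict

def train(tagged_sentences):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs
    emit = defaultdict(lambda: defaultdict(int))   # tag -> word -> count
    trans = defaultdict(lambda: defaultdict(int))  # previous tag -> tag -> count
    tag_count = defaultdict(int)                   # tag -> count
    for sentence in tagged_sentences:
        prev = "<s>"                               # sentence-start pseudo-tag
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            tag_count[tag] += 1
            prev = tag
    return emit, trans, tag_count

def viterbi(words, emit, trans, tag_count):
    # Returns the tag sequence maximising prod_i p(w_i|t_i) p(t_i|t_{i-1}), i.e. Eq. 8.
    tags = list(tag_count)
    vocab = {w for t in emit for w in emit[t]}
    V = len(vocab) + 1                             # +1 slot for unseen words

    def p_emit(word, tag):                         # p(w_i | t_i), add-one smoothed (assumption)
        return (emit[tag][word] + 1) / (tag_count[tag] + V)

    def p_trans(prev, tag):                        # p(t_i | t_{i-1}), add-one smoothed (assumption)
        return (trans[prev][tag] + 1) / (sum(trans[prev].values()) + len(tags))

    # best[i][t]: probability of the best tag sequence for words[:i+1] ending with tag t
    best = [{t: p_trans("<s>", t) * p_emit(words[0], t) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev_best = max(tags, key=lambda p: best[i - 1][p] * p_trans(p, t))
            best[i][t] = best[i - 1][prev_best] * p_trans(prev_best, t) * p_emit(words[i], t)
            back[i][t] = prev_best
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy usage with hypothetical training data
corpus = [[("il", "DET"), ("governo", "NOUN"), ("cade", "VERB")],
          [("il", "DET"), ("voto", "NOUN")]]
emit, trans, tag_count = train(corpus)
print(viterbi(["il", "governo"], emit, trans, tag_count))  # expected: ['DET', 'NOUN']

A tagger trained on the PoSTWITA development set would estimate the same two quantities from the annotated tweets instead of this toy corpus; only the size of the count tables changes.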
3 Experiment Results

3.1 Dataset

For the proposed task, the organizers re-used the tweets that were part of the EVALITA 2014 SENTIPOLC corpus. Its development and test sets were first annotated manually, for a global amount of 4,041 and 1,749 tweets respectively, and distributed as the new development set. Then a new manually annotated test set, composed of 600 and 700 tweets, was produced using texts from the same period of time. All the annotations were carried out by three different annotators. Furthermore, a tokenised version of the texts was also distributed, in order to avoid tokenisation discrepancies among participants and the problem of disappeared tweets.

3.2 Results

For this task, a total of 13 runs were submitted by 9 teams; 4 of these runs were unofficial. Table 1 lists all the results for this task.

    Rank  Team   Successful Tags  Accuracy (%)
    1     Team1  4435             93.19
    2     Team2  4419             92.86
    3     Team3  4416             92.79
    4     Team4  4412             92.70
    5     Team3  4400             92.46
    6     Team5  4390             92.25
    7     Team5  4371             91.85
    8     Team6  4358             91.57
    9     Team6  4356             91.53
    10    Team7  4183             87.89
    11    Team8  4091             85.96
    12    Team2  3892             81.78
    13    Team9  3617             76.00

Table 1: Tagging Accuracy of Participated Teams

Teams 2, 3, 5 and 6 each submitted one unofficial run in addition to the compulsory one; these unofficial submissions are ranked 12th, 3rd, 7th and 9th respectively, and are listed in Table 1 together with the other runs. Our team NLP–NITMZ appears as Team8 and is ranked 11th in this task.

3.3 Comparison with other submissions

In this competition, a total of 4759 words were given for tagging. These words were categorised into 22 PoS tags, and our team successfully tagged 4091 words, with 668 unsuccessful tags. The 1st ranked team successfully tagged 4435 words, while the last positioned team, i.e. Team9, correctly identified 3617 tags. In Table 2, we provide the tag-wise statistics of our system.

    Sl. No.  Tag        Successful Tags
    1        PRON       292
    2        AUX        82
    3        PROPN      283
    4        EMO        30
    5        SYM        8
    6        NUM        63
    7        ADJ        145
    8        SCONJ      37
    9        ADP        332
    10       URL        117
    11       DET        288
    12       HASHTAG    114
    13       ADV        281
    14       VERB_CLIT  10
    15       PUNCT      582
    16       VERB       443
    17       CONJ       122
    18       X          3
    19       INTJ       50
    20       MENTION    186
    21       ADP_A      144
    22       NOUN       479

Table 2: Tag-wise Statistics of the NLP–NITMZ Team

4 Conclusion

This PoS tagging task of the EVALITA 2016 campaign targets the Italian language, and our system ranked 11th in the task of POS tagging for Italian Social Media Texts. We would also like to mention that the authors are not native speakers of Italian. We built a supervised learning model based on the knowledge available in the training dataset.

Acknowledgements

The work presented here was carried out under research project Grant No. YSS/2015/000988, supported by the Department of Science & Technology (DST) and the Science and Engineering Research Board (SERB), Govt. of India. The authors also acknowledge the Department of Computer Science & Engineering of the National Institute of Technology Mizoram, India, for providing infrastructural facilities.

References

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data. In RANLP, pages 198–206.

Melanie Neunerdt, Bianka Trevisan, Michael Reyer, and Rudolf Mathar. 2013. Part-of-speech tagging for social media texts. In Language Processing and Knowledge in the Web, pages 139–150. Springer Berlin Heidelberg.

Partha Pakray, Arunagshu Pal, Goutam Majumder, and Alexander Gelbukh. 2015. Resource Building and Parts-of-Speech (POS) Tagging for the Mizo Language. In Fourteenth Mexican International Conference on Artificial Intelligence (MICAI), pages 3–7. IEEE, October.

Goutam Majumder, Partha Pakray, and Alexander Gelbukh. 2016. Literature Survey: Multiword Expressions (MWE) for Mizo Language. In 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), to be published as an issue of Lecture Notes in Computer Science, Springer. Konya, Turkey, April.