PoS Taggers in the Wild: A Case Study with Swiss Italian Student Essays

Daniele Puccinelli, Silvia Demartini, Aris Piatti, Sara Giulivi, Luca Cignetti, Simone Fornara
University of Applied Sciences and Arts of Southern Switzerland (SUPSI)
{daniele.puccinelli, silvia.demartini, aris.piatti, sara.giulivi, luca.cignetti, simone.fornara}@supsi.ch

Abstract

English. State-of-the-art Part-of-Speech taggers have been thoroughly evaluated on standard Italian. To understand how Part-of-Speech taggers that have been pre-trained on standard Italian fare with a wide array of language anomalies, we evaluate five Part-of-Speech taggers on a corpus of student essays written throughout the largest Italian-speaking area outside of Italy. Our preliminary results show that there is a significant gap between their performance on non-standard Italian and on standard Italian, and that the performance loss mainly comes from relatively subtle tagging errors within morphological categories as opposed to coarse errors across categories.

Italiano. The most representative state-of-the-art Part-of-Speech tagging tools have been thoroughly analyzed on standard Italian. To understand how tools pre-trained on standard Italian behave in the presence of a wide range of linguistic anomalies, we analyze the performance of five tools on a corpus of essays written by compulsory-school students in Italian-speaking Switzerland. Our preliminary results show that there is a considerable gap between performance on non-standard Italian and performance on standard Italian, and that the performance loss derives mainly from relatively subtle tagging errors within grammatical categories.

1 Introduction

The goal of this paper is to present the preliminary results of the evaluation of a set of state-of-the-art Part-of-Speech (PoS) taggers on the DFA-TIscrivo corpus of Italian-language (L1) K-12 student essays from schools in the Italian-speaking part of Switzerland. The DFA-TIscrivo corpus represents an example of non-standard Italian¹ because its contributors are young students with a poor command of the Italian language who live in the largest Italian-speaking area outside of Italy and are therefore prone to regionalisms as well as orthographic mistakes.

The key research question at this stage is how well state-of-the-art PoS taggers that were pre-trained on standard Italian cope with a specific flavor of non-standard Italian. It would of course be possible to retrain all these tools on texts with properties similar to those in our corpus, but at this stage in our work this is not possible due to the overly small size of the available annotated data. In turn, using pre-trained models gives us a twofold advantage: it allows us to obtain a performance baseline on non-standard Italian, and it makes it possible to directly compare our performance metrics to previously published results (obtained with the same models we use). While our work is still in progress and the results reported herein are preliminary in nature, we can already share several notable observations.

¹ http://www.treccani.it/enciclopedia/italiano-standard_(Enciclopedia-dell%27Italiano)/

2 Related Work and PoS taggers under test

There have been various recent efforts focused on social media within the scope of EVALITA 2016 (Bosco et al., 2016), whose goal was the domain adaptation of PoS taggers to Twitter texts. Notable contributions include (Cimino and Dell'Orletta, 2016), whose authors propose a PoS tagging architecture optimized to process Italian-language tweets.
While we do acknowledge the need for domain adaptation with non-standard texts, we ask a more basic question: if we perform no domain adaptation and simply deploy general-purpose PoS taggers in the wild, how do they fare? We use K-12 student essays as our flavor of non-standard Italian. Although such texts are beset with all sorts of anomalies, they can still be processed with general-purpose taggers, unlike far more unstructured and unconventional texts such as tweets. While similar studies have been conducted for other languages, such as German (Giesbrecht and Evert, 2009), to the best of our knowledge this is the first study of the accuracy of general-purpose PoS taggers in the wild for the Italian language. Our selection of state-of-the-art general-purpose PoS taggers is based on their popularity with the research community and the availability of ready-to-use software versions.

TreeTagger (1994). The popular TreeTagger (Schmid, 1994) tool uses decision trees to estimate transition probabilities based on context. Decision trees were extremely popular for PoS tagging in the 1990s, when more sophisticated machine learning tools such as neural networks were still too computationally demanding given the relatively limited resources available at the time. TreeTagger actively addresses the issues encountered by earlier probabilistic PoS taggers with rare words that have a very low (but non-zero) probability of occurrence. The use of decision trees enables TreeTagger to account for context that is not restricted to n-grams but also includes allowed/disallowed tag sequences.

UD-Pipe (2014). UD-Pipe (Straka et al., 2016) is a language-agnostic natural language processing (NLP) pipeline developed within Universal Dependencies, whose focus is the development of a treebank annotation scheme that works consistently across multiple languages. UD-Pipe's PoS tagger uses the Morphological Dictionary and Tagger MorphoDiTa (Straková et al., 2014), developed at Charles University in Prague, Czech Republic. MorphoDiTa uses the averaged perceptron PoS tagger described in (Spoustová et al., 2009) and based on (Collins, 2002).

Tint (2016). The Italian NLP Tool (Palmero Aprosio and Moretti, 2016) is an NLP pipeline for the Italian language based on Stanford CoreNLP (Manning et al., 2014). Tint's PoS tagger is based on the Stanford Log-linear Tagger (Toutanova et al., 2003), which leverages maximum entropy PoS tagging (Toutanova and Manning, 2000). Given a word and its context (other words in the sentence and their tags), maximum entropy PoS tagging assigns a probability to every tag in a predefined tagset, eventually enabling the estimation of the probability of a tag sequence given a word sequence. Out of all the possible distributions that satisfy a set of constraints, the one with maximum entropy is chosen, as it represents the most non-committal assignment of probabilities that meets the constraints (Ratnaparkhi, 1996).

Syntaxnet (2016). Various recent efforts focus on the application of recurrent neural networks to PoS tagging and dependency parsing (Ling et al., 2015), but it is shown in (Andor et al., 2016) that recurrence-free feed-forward networks can work at least as well as recurrent ones if they are globally normalized; this is the guiding principle behind PoS tagging in Syntaxnet (syn, 2016), a neural network NLP framework that is built on top of Google's popular TensorFlow machine learning framework (Abadi et al., 2016). Syntaxnet employs beam search, which serves to maintain multiple hypotheses, and global normalization with a conditional random field (CRF) objective, which avoids the label bias issues typically reported in locally normalized models. PoS tagging in Syntaxnet is heavily inspired by (Bohnet and Nivre, 2012) and relies on the close integration of PoS tagging and dependency parsing. A pre-trained English-language model whimsically called Parsey McParseface was released along with Syntaxnet in May 2016, and a pre-trained model for the Italian language was released in August 2016 as one of Parsey's Cousins.

DRAGNN (2017). In March 2017 Google released a Syntaxnet upgrade based on Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN) (Kong et al., 2017) along with the Parseysaurus set of pre-trained models (Alberti et al., 2017) that was developed for the CoNLL 2017 shared task. PoS tagging in DRAGNN (Kong et al., 2017) is based on (Zhang and Weiss, 2016), which closely integrates PoS tagging and parsing in a novel fashion: specifically, the continuous hidden-layer activations of the window-based tagger network are fed as input to the transition-based parser network. The tagger works token by token, extracting features from a window of tokens around the target token. It has a fairly standard structure with embedding, hidden, and softmax layers.
3 The DFA-TIscrivo corpus

The DFA-TIscrivo corpus has been prepared within the TIscrivo (2011-2014) and TIscrivo 2.0 (2014-2017) projects², both funded by the Swiss National Science Foundation. The goal of the projects is to paint an accurate picture of the writing skills of primary and lower secondary school students in Southern Switzerland, in order to describe the variety of language written at school and to propose new teaching practices to improve writing skills in compulsory education (Cignetti et al., 2016). Other studies with some similarities to the TIscrivo projects include projects focused on texts by L1 or L2 learners, such as ISACCO (Brunato and Dell'Orletta, 2015), CItA (Barbagli et al., 2015; Barbagli et al., 2016), and KoKo (Abel et al., 2016).

² http://dfa-blog.supsi.ch/DFA-TIscrivo/la-ricerca/

The DFA-TIscrivo corpus is a balanced corpus collected in 56 Italian-speaking primary and lower secondary schools in Southern Switzerland. It contains 1735 narrative-reflective essays (742 from primary school, 993 from secondary school), transcribed but not normalized, and accompanied by sociolinguistic metadata (age, gender, school and class, linguistic information). It amounts to about 390,000 tokens. Lexical data were initially lemmatized and PoS-tagged using TreeTagger (with the Italian parameters by Marco Baroni) and are being manually revised. Furthermore, we are manually annotating the main types of orthographic, morphological and lexical errors, multiword expressions, lexicon peculiar to the Italian used in Southern Switzerland, and foreign words. A key project goal is to build up a dictionary of the Italian language as it is written in Southern Switzerland (Cignetti and Demartini, 2016; Fornara et al., 2016) as an online resource useful both to scholars and to teachers.

4 Methodology and Performance Analysis

We run the five taggers on the corpus and compare their output to a manually tagged ground truth. We note that, at the time of writing, the analysis is restricted to a subset of the DFA-TIscrivo corpus that has been manually PoS-tagged and is limited to essays written by fifth graders. We use the ISST-TANL-PoS reference tagset³ based on Universal Dependencies.

³ http://www.italianlp.it/docs/ISST-TANL-POStagset.pdf

We begin by assessing the tagging accuracy of the five PoS taggers under test on the DFA-TIscrivo corpus. We compute the tagging accuracy as the ratio of correctly tagged parts of speech with respect to the aforementioned manually tagged ground truth. While the ground truth isolates multiword expressions, none of the tools are able to do that, so all multiword expressions are considered to be mistagged and every multiword expression counts as one single miss. Verbal enclitics are not considered, and the corresponding verbs are expected to be tagged simply as verbs. Our results are shown in Table 1; we see that UD-Pipe trails behind and falls below the 0.8 mark, while the other four taggers under test offer similar performance, with TreeTagger slightly ahead of the pack. All these taggers reportedly perform above the 95% mark on standard Italian.

            Accuracy
TreeTagger  0.84
UD-Pipe     0.79
Tint        0.83
Syntaxnet   0.83
DRAGNN      0.84

Table 1: Overall PoS tagging accuracy for each tool on the DFA-TIscrivo corpus.

Tables 2-6 contain the confusion matrices of the PoS taggers under test based on the ISST-TANL coarse-grained tags. Row i shows the ground truth for tag i, and column k shows the frequency with which it is tagged as k. To abstract away from how individual taggers handle prepositional articles, we merge the tags for prepositions (E) and articles (R) into a super-tag ER. We also merge the tags for adjectives (A) and determiners (D), because determiners may be viewed as a category of adjectives in Italian. We only show the tags that occur most often (which is why some rows/columns do not add up to one). We note that TreeTagger outperforms all other taggers on AD while lagging behind all of them on P (pronouns) and C (conjunctions), which it often tags as P or B (adverbs). TreeTagger also performs remarkably well with verbs (V).

      AD    B     C     ER    P     S     V
AD    0.95  0.03  0.01  0     0     0     0
B     0.02  0.88  0     0     0.02  0.04  0.04
C     0.01  0.07  0.76  0.01  0.15  0     0
ER    0.08  0     0     0.92  0     0     0
P     0.05  0     0.01  0     0.79  0.01  0
S     0.04  0     0     0     0     0.93  0.03
V     0     0     0     0     0     0.01  0.99

Table 2: TreeTagger confusion matrix.

      AD    B     C     ER    P     S     V
AD    0.75  0.01  0     0     0.03  0.05  0.02
B     0.01  0.88  0     0.05  0.01  0.02  0.03
C     0     0.08  0.90  0.01  0     0     0.01
ER    0.04  0     0     0.92  0     0.02  0.02
P     0.01  0.01  0.03  0.03  0.91  0.01  0
S     0.03  0.01  0     0     0     0.94  0.02
V     0.01  0     0     0     0     0.03  0.96

Table 5: Syntaxnet confusion matrix.
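The scoring policy and tag merging described in this section can be sketched as follows. The sketch combines the accuracy computation (with every multiword expression counted as one single miss) and the row-normalized confusion matrix over merged super-tags; the toy tokens, tags, and predictions are invented purely for illustration.

```python
from collections import Counter, defaultdict

# Super-tags used in Tables 2-6: E and R collapse into ER, A and D into AD.
MERGE = {"E": "ER", "R": "ER", "A": "AD", "D": "AD"}

def merged(tag):
    return MERGE.get(tag, tag)

def evaluate(gold, predicted):
    """gold: list of (span, tag); a span is a token or, for a multiword
    expression, a tuple of tokens. predicted: dict token -> predicted tag.
    Returns (accuracy, row-normalized confusion matrix over merged tags).
    Every multiword expression counts as one single miss, since none of
    the taggers under test detect them."""
    hits, total = 0, 0
    counts = defaultdict(Counter)
    for span, gold_tag in gold:
        total += 1
        if isinstance(span, tuple):          # multiword expression: one miss
            continue
        pred_tag = predicted.get(span)
        counts[merged(gold_tag)][merged(pred_tag)] += 1
        if pred_tag == gold_tag:
            hits += 1
    matrix = {g: {p: n / sum(row.values()) for p, n in row.items()}
              for g, row in counts.items()}
    return hits / total, matrix

# Invented toy data: one article, two nouns inside and outside an MWE,
# one multiword adverbial expression, one verb.
gold = [("il", "R"), ("cane", "S"), (("per", "esempio"), "B"), ("corre", "V")]
pred = {"il": "R", "cane": "S", "per": "E", "esempio": "S", "corre": "V"}
accuracy, matrix = evaluate(gold, pred)
print(accuracy)        # 0.75: three single-token hits, one MWE miss
print(matrix["ER"])    # {'ER': 1.0}: R collapses into the ER super-tag
```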
      AD    B     C     ER    P     S     V
AD    0.72  0.03  0     0     0.02  0.07  0.01
B     0.03  0.87  0     0.01  0     0.04  0.01
C     0     0.08  0.90  0.01  0.01  0     0
ER    0.06  0     0     0.92  0     0.01  0.01
P     0.01  0     0.02  0.02  0.93  0.01  0
S     0     0     0     0     0     0.97  0.02
V     0.01  0     0     0     0     0.04  0.95

Table 3: UD-Pipe confusion matrix.

      AD    B     C     ER    P     S     V
AD    0.75  0     0     0     0.15  0     0
B     0.06  0.88  0     0.01  0     0.03  0.02
C     0     0.09  0.89  0.01  0     0     0
ER    0     0     0     0.99  0     0     0
P     0.02  0     0     0.03  0.88  0.01  0
S     0.03  0     0     0     0     0.94  0.03
V     0.01  0     0     0     0     0.01  0.92

Table 4: Tint confusion matrix.

      AD    B     C     ER    P     S     V
AD    0.77  0.04  0     0     0     0.04  0.01
B     0.04  0.86  0     0.01  0     0.07  0.02
C     0     0.06  0.91  0.02  0     0.01  0
ER    0     0     0     1     0     0     0
P     0.01  0.01  0.02  0.02  0.93  0.01  0
S     0.02  0.01  0     0     0     0.96  0.01
V     0.02  0     0     0     0.01  0.03  0.94

Table 6: DRAGNN confusion matrix.

We have also studied the confusion matrices within the V category (not shown), noting that TreeTagger performs remarkably better than the others with respect to principal verbs (0.97 accuracy, while the others are right around the 0.9 mark) and modal verbs (0.94 versus 0.81 for UD-Pipe and Tint, and a disappointing 0.75 for both Syntaxnet and DRAGNN). All taggers perform equally poorly with auxiliary verbs (accuracy just above the 0.8 mark in all cases). Aside from Tint, which does not provide morphological information (at least in the version we used), all taggers do well with finite verbs (> 0.97, with UD-Pipe trailing behind at 0.95). While TreeTagger and UD-Pipe perform at the same level of accuracy for both finite and non-finite verbs, Syntaxnet and DRAGNN barely go beyond the 0.9 mark with the latter.

5 Conclusion

We have presented a comparative performance assessment of five state-of-the-art PoS taggers on the DFA-TIscrivo corpus of K-12 student essays, along with an analysis of the patterns that can be observed in the mistakes made by the individual taggers. As this is still a work in progress, the results in the paper are limited to a subset of the corpus containing fifth-grade essays. These results provide a valuable baseline that could likely be improved with domain adaptation. On the other hand, it is fair to ask whether the DFA-TIscrivo corpus is different enough from standard Italian to warrant domain adaptation, or whether we would encounter issues with overfitting. In the latter case, an alternative would be the rule-based combination of the output of the five taggers, informed by the knowledge of the observed error patterns.

Acknowledgments

The partial support of SNF through project TIscrivo 2.0 and of SUPSI through project Scripsit is gratefully acknowledged.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467.

Andrea Abel, Aivars Glaznieks, Lionel Nicolas, and Egon Stemle. 2016. An extended version of the KoKo German L1 learner corpus. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Dominique Brunato and Felice Dell'Orletta. 2015. ISACCO: a corpus for investigating spoken and written language development in Italian school-age children. In CLiC-it, page 62.

Luca Cignetti and Silvia Demartini. 2016. From data to tools: theoretical and applied problems in the compilation of LISSICS (the lexicon of written Italian in a school context in Italian Switzerland). RiCOGNIZIONI. Rivista di Lingue e Letterature straniere e Culture moderne, 3(6):35–49.
Chris Alberti, Daniel Andor, Ivan Bogatyy, Michael Collins, Dan Gillick, Lingpeng Kong, Terry Koo, Ji Ma, Mark Omernick, Slav Petrov, Chayut Thanapirom, Zora Tung, and David Weiss. 2017. SyntaxNet models for the CoNLL 2017 shared task. CoRR, abs/1703.04929.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.

Alessia Barbagli, Piero Lucisano, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2015. CItA: un corpus di produzioni scritte di apprendenti l'italiano L1 annotato con errori. In Proceedings of the Second Italian Conference on Computational Linguistics, CLiC-it, pages 31–35.

Alessia Barbagli, Pietro Lucisano, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2016. CItA: an L1 Italian learners corpus to study the development of writing competence. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), May.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1455–1465, Stroudsburg, PA, USA. Association for Computational Linguistics.

Cristina Bosco, Fabio Tamburini, Andrea Bolioli, and Alessandro Mazzei. 2016. Overview of the EVALITA 2016 Part of Speech on Twitter for Italian task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Luca Cignetti, Silvia Demartini, and Simone Fornara. 2016. Come TIscrivo? La scrittura a scuola tra teoria e didattica. Aracne.

Andrea Cimino and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian tweets. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Simone Fornara, Luca Cignetti, and Silvia Demartini. 2016. Il lessico di TIscrivo. Caratterizzazione del vocabolario e osservazioni in prospettiva didattica. In Sviluppo della competenza lessicale: acquisizione, apprendimento, insegnamento, pages 43–60. Aracne, Roma.

Eugenie Giesbrecht and Stefan Evert. 2009. Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as Corpus. In I. Alegria, I. Leturia, and S. Sharoff, editors, Proceedings of the 5th Web as Corpus Workshop (WAC5), San Sebastian, Spain.

Lingpeng Kong, Chris Alberti, Daniel Andor, Ivan Bogatyy, and David Weiss. 2017. DRAGNN: A transition-based framework for dynamically connected neural networks. CoRR, abs/1703.04474.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints, September.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing, pages 133–142.

Helmut Schmid. 1994. Part-of-speech tagging with neural networks. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, pages 172–176. Association for Computational Linguistics.

Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, and Miroslav Spousta. 2009. Semi-supervised training for the averaged perceptron POS tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763–771, Athens, Greece, March. Association for Computational Linguistics.

Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In LREC.

Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland, June. Association for Computational Linguistics.

2016. SyntaxNet. https://www.tensorflow.org/versions/r0.12/tutorials/syntaxnet. Accessed: 2017-07-13.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, EMNLP '00, pages 63–70. Association for Computational Linguistics.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics.

Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. arXiv preprint arXiv:1603.06598.