PoS Taggers in the Wild: A Case Study with Swiss Italian Student Essays

Daniele Puccinelli, Silvia Demartini, Aris Piatti, Sara Giulivi, Luca Cignetti, Simone Fornara
University of Applied Sciences and Arts of Southern Switzerland (SUPSI)
{daniele.puccinelli, silvia.demartini, aris.piatti, sara.giulivi, luca.cignetti, simone.fornara}@supsi.ch

Abstract

English. State-of-the-art Part-of-Speech taggers have been thoroughly evaluated on standard Italian. To understand how Part-of-Speech taggers that have been pre-trained on standard Italian fare with a wide array of language anomalies, we evaluate five Part-of-Speech taggers on a corpus of student essays written throughout the largest Italian-speaking area outside of Italy. Our preliminary results show that there is a significant gap between their performance on non-standard Italian and on standard Italian, and that the performance loss mainly comes from relatively subtle tagging errors within morphological categories as opposed to coarse errors across categories.

Italiano. The most representative state-of-the-art Part-of-Speech tagging tools have been thoroughly analyzed on standard Italian. To understand how tools pre-trained on standard Italian behave in the presence of a wide range of linguistic anomalies, we analyze the performance of five tools on a corpus of essays written by compulsory-school students in Italian-speaking Switzerland. Our preliminary results show that there is a considerable gap between performance on non-standard Italian and performance on standard Italian, and that the performance loss derives mainly from relatively subtle tagging errors within grammatical categories.

1 Introduction

The goal of this paper is to present the preliminary results of the evaluation of a set of state-of-the-art Part-of-Speech (PoS) taggers on the DFA-TIscrivo corpus of Italian-language (L1) K-12 student essays from schools in the Italian-speaking part of Switzerland. The DFA-TIscrivo corpus represents an example of non-standard Italian¹ because its contributors are young students with a poor command of the Italian language who live in the largest Italian-speaking area outside of Italy and are therefore prone to regionalisms as well as orthographic mistakes.

The key research question at this stage is how well state-of-the-art PoS taggers that were pre-trained on standard Italian cope with a specific flavor of non-standard Italian. It would of course be possible to retrain all these tools on texts with properties similar to those in our corpus, but at this stage in our work this is not possible due to the overly small size of the available annotated data. In turn, using pre-trained models gives us a twofold advantage: it allows us to obtain a performance baseline on non-standard Italian, and it makes it possible to directly compare our performance metrics to previously published results (obtained with the same models we use). While our work is still in progress and the results reported herein are preliminary in nature, we can already share several notable observations.

¹ http://www.treccani.it/enciclopedia/italiano-standard_(Enciclopedia-dell%27Italiano)/

2 Related Work and PoS taggers under test

There have been various recent efforts focused on social media within the scope of EVALITA 2016 (Bosco et al., 2016), whose goal was the domain adaptation of PoS taggers to Twitter texts. Notable contributions include (Cimino and Dell'Orletta, 2016), whose authors propose a PoS tagging architecture optimized to process Italian-language tweets.
While we do acknowledge the need for domain adaptation with non-standard texts, we ask a more basic question: if we perform no domain adaptation and simply deploy general-purpose PoS taggers in the wild, how do they fare? We use K-12 student essays as our flavor of non-standard Italian. Although such texts are beset with all sorts of anomalies, they can still be processed with general-purpose taggers, unlike far more unstructured and unconventional texts such as tweets. While similar studies have been conducted for other languages, such as German (Giesbrecht and Evert, 2009), to the best of our knowledge this is the first study of the accuracy of general-purpose PoS taggers in the wild for the Italian language. Our selection of state-of-the-art general-purpose PoS taggers is based on their popularity with the research community and the availability of ready-to-use software versions.

TreeTagger (1994). The popular TreeTagger (Schmid, 1994) tool uses decision trees to estimate transition probabilities based on context. Decision trees were extremely popular for PoS tagging in the 1990s, when more sophisticated machine learning tools such as neural networks were still too computationally demanding given the relatively limited resources available at the time. TreeTagger actively addresses the issues encountered by earlier probabilistic PoS taggers with rare words that have a very low (but non-zero) probability of occurrence. The use of decision trees enables TreeTagger to account for context that is not restricted to n-grams but also includes allowed/disallowed tag sequences.

UD-Pipe (2014). UD-Pipe (Straka et al., 2016) is a language-agnostic natural language processing (NLP) pipeline developed within Universal Dependencies, whose focus is the development of a treebank annotation scheme that works consistently across multiple languages. UD-Pipe's PoS tagger uses the Morphological Dictionary and Tagger MorphoDiTa (Straková et al., 2014), developed at Charles University in Prague, Czech Republic. MorphoDiTa uses the averaged perceptron PoS tagger described in (Spoustová et al., 2009) and based on (Collins, 2002).

Tint (2016). The Italian NLP Tool (Palmero Aprosio and Moretti, 2016) is an NLP pipeline for the Italian language based on Stanford CoreNLP (Manning et al., 2014). Tint's PoS tagger is based on the Stanford Log-linear Tagger (Toutanova et al., 2003), which leverages maximum entropy PoS tagging (Toutanova and Manning, 2000). Given a word and its context (other words in the sentence and their tags), maximum entropy PoS tagging assigns a probability to every tag in a predefined tagset, eventually enabling the estimation of the probability of a tag sequence given a word sequence. Out of all the possible distributions that satisfy a set of constraints, the one with maximum entropy is chosen, as it represents the most non-committal assignment of probabilities that meets the constraints (Ratnaparkhi, 1996).

Syntaxnet (2016). Various recent efforts focus on the application of recurrent neural networks to PoS tagging and dependency parsing (Ling et al., 2015), but it is shown in (Andor et al., 2016) that recurrence-free feed-forward networks can work at least as well as recurrent ones if they are globally normalized; this is the guiding principle behind PoS tagging in Syntaxnet (syn, 2016), a neural network NLP framework that is built on top of Google's popular TensorFlow machine learning framework (Abadi et al., 2016). Syntaxnet employs beam search, which serves to maintain multiple hypotheses, and global normalization with a conditional random field (CRF) objective, which avoids the label bias issues typically reported in locally normalized models. PoS tagging in Syntaxnet is heavily inspired by (Bohnet and Nivre, 2012) and relies on the close integration of PoS tagging and dependency parsing. A pre-trained English-language model whimsically called Parsey McParseface was released along with Syntaxnet in May 2016, and a pre-trained model for the Italian language was released in August 2016 as one of Parsey's Cousins.

DRAGNN (2017). In March 2017 Google released a Syntaxnet upgrade based on Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN) (Kong et al., 2017) along with the Parseysaurus set of pre-trained models (Alberti et al., 2017) that was developed for the CoNLL 2017 shared task. PoS tagging in DRAGNN (Kong et al., 2017) is based on (Zhang and Weiss, 2016), which closely integrates PoS tagging and parsing in a novel fashion: specifically, the continuous hidden-layer activations of the window-based tagger network are fed as input to the transition-based parser network. The tagger works token by token, extracting features from a window of tokens around the target token. It has a fairly standard structure with embedding, hidden, and softmax layers.
3 The DFA-TIscrivo corpus

The DFA-TIscrivo corpus has been prepared within the TIscrivo (2011-2014) and TIscrivo 2.0 (2014-2017) projects², both funded by the Swiss National Science Foundation. The goal of the projects is to paint an accurate picture of the writing skills of primary and lower secondary school students in Southern Switzerland, in order to describe the variety of language written at school and to propose new teaching practices to improve writing skills in compulsory education (Cignetti et al., 2016). Other studies with some similarities to the TIscrivo projects include projects focused on texts by L1 or L2 learners, such as ISACCO (Brunato and Dell'Orletta, 2015), CItA (Barbagli et al., 2015; Barbagli et al., 2016), and KoKo (Abel et al., 2016).

² http://dfa-blog.supsi.ch/DFA-TIscrivo/la-ricerca/

The DFA-TIscrivo corpus is a balanced corpus collected in 56 Italian-speaking primary and lower secondary schools in Southern Switzerland. It contains 1735 narrative-reflective essays (742 from primary school, 993 from secondary school), transcribed but not normalized, and accompanied by sociolinguistic metadata (age, gender, school and class, linguistic information). It amounts to about 390,000 tokens. Lexical data were initially lemmatized and PoS-tagged using TreeTagger (with the Italian parameters by Marco Baroni) and are being manually revised. Furthermore, we are manually annotating the main types of orthographic, morphological and lexical errors, multiword expressions, lexicon peculiar to the Italian used in Southern Switzerland, and foreign words. A key project goal is to build up a dictionary of the Italian language as it is written in Southern Switzerland (Cignetti and Demartini, 2016; Fornara et al., 2016) as an online resource useful both to scholars and to teachers.

4 Methodology and Performance Analysis

We run the five taggers on the corpus and compare their output to a manually tagged ground truth. We note that, at the time of writing, the analysis is restricted to a subset of the DFA-TIscrivo corpus that has been manually PoS-tagged and is limited to essays written by fifth graders. We use the ISST-TANL-PoS reference tagset³ based on Universal Dependencies.

³ http://www.italianlp.it/docs/ISST-TANL-POStagset.pdf

We begin by assessing the tagging accuracy of the five PoS taggers under test on the DFA-TIscrivo corpus. We compute the tagging accuracy as the ratio of correctly tagged parts of speech with respect to the aforementioned manually tagged ground truth. While the ground truth isolates multiword expressions, none of the tools are able to do that, so all multiword expressions are considered to be mistagged and every multiword expression counts as one single miss. Verbal enclitics are not considered, and the corresponding verbs are expected to be tagged simply as verbs. Our results are shown in Table 1; we see that UD-Pipe trails behind and falls below the 0.8 mark, while the other four taggers under test offer similar performance, with TreeTagger slightly ahead of the pack. All these taggers reportedly perform above the 95% mark on standard Italian.

            Accuracy
TreeTagger  0.84
UD-Pipe     0.79
Tint        0.83
Syntaxnet   0.83
DRAGNN      0.84

Table 1: Overall PoS tagging accuracy for each tool on the DFA-TIscrivo corpus.

Tables 2-6 contain the confusion matrices of the PoS taggers under test based on the ISST-TANL coarse-grained tags. Row i shows the ground truth for tag i, and column k shows the frequency with which it is tagged as k. To abstract away from how individual taggers handle prepositional articles, we merge the tags for prepositions (E) and articles (R) into a super-tag ER. We also merge the tags for adjectives (A) and determiners (D), because determiners may be viewed as a category of adjectives in Italian. We only show the tags that occur most often (which is why some rows/columns do not add up to one). We note that TreeTagger outperforms all other taggers on AD while lagging behind all of them on P (pronouns) and C (conjunctions), which it often tags as P or B (adverbs). TreeTagger also performs remarkably well with verbs (V).

      AD    B     C     ER    P     S     V
AD    0.95  0.03  0.01  0     0     0     0
B     0.02  0.88  0     0     0.02  0.04  0.04
C     0.01  0.07  0.76  0.01  0.15  0     0
ER    0.08  0     0     0.92  0     0     0
P     0.05  0     0.01  0     0.79  0.01  0
S     0.04  0     0     0     0     0.93  0.03
V     0     0     0     0     0     0.01  0.99

Table 2: TreeTagger confusion matrix.

      AD    B     C     ER    P     S     V
AD    0.75  0.01  0     0     0.03  0.05  0.02
B     0.01  0.88  0     0.05  0.01  0.02  0.03
C     0     0.08  0.90  0.01  0     0     0.01
ER    0.04  0     0     0.92  0     0.02  0.02
P     0.01  0.01  0.03  0.03  0.91  0.01  0
S     0.03  0.01  0     0     0     0.94  0.02
V     0.01  0     0     0     0     0.03  0.96

Table 5: Syntaxnet confusion matrix.
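The scoring policy and tag merging described in this section can be sketched as follows. The sketch combines the accuracy computation (with every multiword expression counted as one single miss) and the row-normalized confusion matrix over merged super-tags; the toy tokens, tags, and predictions are invented purely for illustration.

```python
from collections import Counter, defaultdict

# Super-tags used in Tables 2-6: E and R collapse into ER, A and D into AD.
MERGE = {"E": "ER", "R": "ER", "A": "AD", "D": "AD"}

def merged(tag):
    return MERGE.get(tag, tag)

def evaluate(gold, predicted):
    """gold: list of (span, tag); a span is a token or, for a multiword
    expression, a tuple of tokens. predicted: dict token -> predicted tag.
    Returns (accuracy, row-normalized confusion matrix over merged tags).
    Every multiword expression counts as one single miss, since none of
    the taggers under test detect them."""
    hits, total = 0, 0
    counts = defaultdict(Counter)
    for span, gold_tag in gold:
        total += 1
        if isinstance(span, tuple):          # multiword expression: one miss
            continue
        pred_tag = predicted.get(span)
        counts[merged(gold_tag)][merged(pred_tag)] += 1
        if pred_tag == gold_tag:
            hits += 1
    matrix = {g: {p: n / sum(row.values()) for p, n in row.items()}
              for g, row in counts.items()}
    return hits / total, matrix

# Invented toy data: one article, two nouns inside and outside an MWE,
# one multiword adverbial expression, one verb.
gold = [("il", "R"), ("cane", "S"), (("per", "esempio"), "B"), ("corre", "V")]
pred = {"il": "R", "cane": "S", "per": "E", "esempio": "S", "corre": "V"}
accuracy, matrix = evaluate(gold, pred)
print(accuracy)        # 0.75: three single-token hits, one MWE miss
print(matrix["ER"])    # {'ER': 1.0}: R collapses into the ER super-tag
```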
      AD    B     C     ER    P     S     V
AD    0.72  0.03  0     0     0.02  0.07  0.01
B     0.03  0.87  0     0.01  0     0.04  0.01
C     0     0.08  0.90  0.01  0.01  0     0
ER    0.06  0     0     0.92  0     0.01  0.01
P     0.01  0     0.02  0.02  0.93  0.01  0
S     0     0     0     0     0     0.97  0.02
V     0.01  0     0     0     0     0.04  0.95

Table 3: UD-Pipe confusion matrix.

      AD    B     C     ER    P     S     V
AD    0.75  0     0     0     0.15  0     0
B     0.06  0.88  0     0.01  0     0.03  0.02
C     0     0.09  0.89  0.01  0     0     0
ER    0     0     0     0.99  0     0     0
P     0.02  0     0     0.03  0.88  0.01  0
S     0.03  0     0     0     0     0.94  0.03
V     0.01  0     0     0     0     0.01  0.92

Table 4: Tint confusion matrix.

      AD    B     C     ER    P     S     V
AD    0.77  0.04  0     0     0     0.04  0.01
B     0.04  0.86  0     0.01  0     0.07  0.02
C     0     0.06  0.91  0.02  0     0.01  0
ER    0     0     0     1     0     0     0
P     0.01  0.01  0.02  0.02  0.93  0.01  0
S     0.02  0.01  0     0     0     0.96  0.01
V     0.02  0     0     0     0.01  0.03  0.94

Table 6: DRAGNN confusion matrix.

We have also studied the confusion matrices within the V category (not shown), noting that TreeTagger performs remarkably better than the others with respect to principal verbs (0.97 accuracy, while the others are right around the 0.9 mark) and modal verbs (0.94 versus 0.81 for UD-Pipe and Tint, and a disappointing 0.75 for both Syntaxnet and DRAGNN). All taggers perform equally poorly with auxiliary verbs (accuracy just above the 0.8 mark in all cases). Aside from Tint, which does not provide morphological information (at least in the version we used), all taggers do well with finite verbs (> 0.97, with UD-Pipe trailing behind at 0.95). While TreeTagger and UD-Pipe perform at the same level of accuracy for both finite and non-finite verbs, Syntaxnet and DRAGNN barely go beyond the 0.9 mark with the latter.

5 Conclusion

We have presented a comparative performance assessment of five state-of-the-art PoS taggers on the DFA-TIscrivo corpus of K-12 student essays, along with an analysis of the patterns that can be observed in the mistakes made by the individual taggers. As this is still a work in progress, the results in the paper are limited to a subset of the corpus containing fifth-grade essays. These results provide a valuable baseline that could likely be improved with domain adaptation. On the other hand, it is fair to ask whether the DFA-TIscrivo corpus is different enough from standard Italian to warrant domain adaptation, or whether we would encounter issues with overfitting. In the latter case, an alternative would be the rule-based combination of the output of the five taggers, informed by the knowledge of the observed error patterns.

Acknowledgments

The partial support of SNF through project TIscrivo 2.0 and of SUPSI through project Scripsit is gratefully acknowledged.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467.

Andrea Abel, Aivars Glaznieks, Lionel Nicolas, and Egon Stemle. 2016. An extended version of the KoKo German L1 learner corpus. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Dominique Brunato and Felice Dell'Orletta. 2015. ISACCO: a corpus for investigating spoken and written language development in Italian school-age children. In CLiC-it, page 62.

Luca Cignetti and Silvia Demartini. 2016. From data to tools: theoretical and applied problems in the compilation of LISSICS (the lexicon of written Italian in a school context in Italian Switzerland). RiCOGNIZIONI. Rivista di Lingue e Letterature straniere e Culture moderne, 3(6):35–49.
Chris Alberti, Daniel Andor, Ivan Bogatyy, Michael Collins, Dan Gillick, Lingpeng Kong, Terry Koo, Ji Ma, Mark Omernick, Slav Petrov, Chayut Thanapirom, Zora Tung, and David Weiss. 2017. SyntaxNet models for the CoNLL 2017 shared task. CoRR, abs/1703.04929.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.

Alessia Barbagli, Piero Lucisano, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2015. CItA: un corpus di produzioni scritte di apprendenti l'italiano L1 annotato con errori. In Proceedings of the Second Italian Conference on Computational Linguistics, CLiC-it, pages 31–35.

Alessia Barbagli, Pietro Lucisano, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2016. CItA: an L1 Italian learners corpus to study the development of writing competence. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), May.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1455–1465, Stroudsburg, PA, USA. Association for Computational Linguistics.

Cristina Bosco, Fabio Tamburini, Andrea Bolioli, and Alessandro Mazzei. 2016. Overview of the EVALITA 2016 Part of Speech on Twitter for Italian task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Luca Cignetti, Silvia Demartini, and Simone Fornara. 2016. Come TIscrivo? La scrittura a scuola tra teoria e didattica. Aracne.

Andrea Cimino and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian tweets. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Simone Fornara, Luca Cignetti, and Silvia Demartini. 2016. Il lessico di TIscrivo. Caratterizzazione del vocabolario e osservazioni in prospettiva didattica. In Sviluppo della competenza lessicale: acquisizione, apprendimento, insegnamento, pages 43–60. Aracne, Roma.

Eugenie Giesbrecht and Stefan Evert. 2009. Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as Corpus. In I. Alegria, I. Leturia, and S. Sharoff, editors, Proceedings of the 5th Web as Corpus Workshop (WAC5), San Sebastian, Spain.

Lingpeng Kong, Chris Alberti, Daniel Andor, Ivan Bogatyy, and David Weiss. 2017. DRAGNN: A transition-based framework for dynamically connected neural networks. CoRR, abs/1703.04474.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints, September.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing, pages 133–142.

Helmut Schmid. 1994. Part-of-speech tagging with neural networks. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, pages 172–176. Association for Computational Linguistics.

Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, and Miroslav Spousta. 2009. Semi-supervised training for the averaged perceptron POS tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763–771, Athens, Greece, March. Association for Computational Linguistics.

Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In LREC.

Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland, June. Association for Computational Linguistics.

2016. SyntaxNet. https://www.tensorflow.org/versions/r0.12/tutorials/syntaxnet. Accessed: 2017-07-13.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, EMNLP '00, pages 63–70. Association for Computational Linguistics.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics.

Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. arXiv preprint arXiv:1603.06598.