CoreNLP-it: A UD pipeline for Italian based on Stanford CoreNLP

Alessandro Bondielli¹, Lucia C. Passaro² and Alessandro Lenci²
¹ Dipartimento di Ingegneria dell'Informazione (DINFO), Università degli studi di Firenze
² CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica (FiLeLi), Università di Pisa
alessandro.bondielli@unifi.it, lucia.passaro@fileli.unipi.it, alessandro.lenci@unipi.it

Abstract

English. This paper describes a collection of modules for Italian language processing based on CoreNLP and Universal Dependencies (UD). The software will be freely available for download under the GNU General Public License (GNU GPL). Given the flexibility of the framework, it is easily adaptable to new languages provided with a UD Treebank.

Italiano. Questo lavoro descrive un insieme di strumenti di analisi linguistica per l'Italiano basati su CoreNLP e Universal Dependencies (UD). Il software sarà liberamente scaricabile sotto licenza GNU General Public License (GNU GPL). Data la sua flessibilità, il framework è facilmente adattabile ad altre lingue con una Treebank UD.

1 Introduction

The fast-growing research fields of Text Mining and Natural Language Processing (NLP) have seen important advancements in recent years. NLP tools that provide basic linguistic annotation of raw texts are a crucial building block for further research and applications. Most of these tools, like NLTK (Bird et al., 2009) and Stanford CoreNLP (Manning et al., 2014), have been developed for English and, most importantly, are freely available. For Italian, several tools have been developed over the years, such as TextPro (Pianta et al., 2008) and the Tanl Pipeline (Attardi et al., 2010), but unfortunately they are either outdated or not open source. An exception is represented by Tint (Aprosio and Moretti, 2016), a standalone, freely available and customizable software based on Stanford CoreNLP. The main drawback of this solution is that it is a resource highly tailored for Italian, in which some of the modules have been completely re-implemented with new classes and data structures compared to the CoreNLP ones. In addition, as with the other existing resources, it does not provide an output that is fully compatible with the Universal Dependencies (UD) framework (http://universaldependencies.org/), which is becoming the de facto standard for morpho-syntactic annotation in particular, and for text annotation in general.

In this paper, we present CoreNLP-it, a set of customizable classes for CoreNLP designed for Italian. Our system, despite being simpler than any of the above-mentioned toolkits in both scope and number of features, has the advantage of being easily integrated with the CoreNLP suite, since its development has been grounded on the principle that all data structures be natively supported by CoreNLP.

The key properties of CoreNLP-it are:

• UD based and compliant: The toolkit and models are based on UD and follow its guidelines for token and parsing representation. It can provide all annotations required in the UD framework, and produces a CoNLL-U formatted output at any level of annotation, as well as any other type of annotation provided in CoreNLP.

• Multi-word token representation: Multi-word tokens (e.g., enclitic constructions) are handled by providing separate tokens. Moreover, the CoNLL-U output can represent such information following the UD guidelines (see the example below).

• Hybrid tokenization: A fast and accurate hybrid tokenization and sentence splitting module replaces the original rule-based annotators for this task.

• Integration with CoreNLP: Given the way it is built (including the exclusive usage of CoreNLP classifiers and data structures), the add-on can be seamlessly integrated with the latest available version (3.9.1) of CoreNLP, and is expected to work with upcoming versions as well.

• Support for other languages: It provides out-of-the-box support for basic annotations in other languages provided with a UD Treebank.
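As an illustration of the multi-word token representation, the following CoNLL-U fragment sketches how an articulated preposition such as della (di + la) is rendered: a range line (2-3) preserves the original surface form, while the syntactic words below it carry the actual annotation. The fragment is ours and purely illustrative; morphological features are abridged and the ISDT xPoS values shown are indicative.

    # text = Parlo della vita.
    1    Parlo  parlare  VERB   V   Mood=Ind|Number=Sing|Person=1|Tense=Pres  0  root   _  _
    2-3  della  _        _      _   _                                         _  _      _  _
    2    di     di       ADP    E   _                                         4  case   _  _
    3    la     il       DET    RD  Definite=Def|Gender=Fem|Number=Sing       4  det    _  _
    4    vita   vita     NOUN   S   Gender=Fem|Number=Sing                    1  obl    _  _
    5    .      .        PUNCT  FS  _                                         1  punct  _  _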
This paper is organized as follows: in Section 2, we present the architecture of the toolkit, whereas its core components (annotators) are described in Section 3. The results on Italian are discussed in Section 3.5. Section 4 shows preliminary experiments on the adaptation of the software to two additional languages provided with a UD treebank, namely Spanish and French.

2 Architecture

CoreNLP-it has been built as an add-on to the Stanford CoreNLP toolkit (Manning et al., 2014). CoreNLP offers a set of linguistic tools to perform core linguistic analyses of texts in English and other languages, and produces an annotated output in various formats such as CoNLL (Nivre et al., 2007), XML, JSON, etc.

2.1 Stanford CoreNLP

The main architecture of CoreNLP consists of an annotation object as well as a sequence of annotators aimed at annotating texts at different levels of analysis. Starting from a raw text, each module adds a new annotation layer such as tokenization, PoS tagging, parsing, etc. The behavior of the single annotators can be controlled via standard Java properties. Annotators can analyze text with either rule-based or statistical models. While rule-based models are typically language dependent, statistical ones can be trained directly within the CoreNLP toolkit in order to improve the performance of the default models or to deal with different languages and domains.

2.2 CoreNLP-it

The main goal we pursued in developing CoreNLP-it was to keep the original CoreNLP structure and usage intact, while enabling it to deal with Italian texts in order to produce a UD-compliant and UD-complete output. More specifically, we aimed at building a system capable of providing all textual annotations required by the UD guidelines. Moreover, our system is also compatible with standard CoreNLP functions (e.g., Named Entity Recognition (NER) and Sentiment annotation). For these reasons, we implemented a series of custom annotators and statistical models for Italian. The custom annotators replace the corresponding CoreNLP annotators, leaving intact the annotation structure and output of the annotators they replace.

For simplicity, we used only one of the UD treebanks available for Italian, namely the UD adaptation of the ISDT Italian Treebank (Bosco et al., 2013). The resource was used to build most of the new models, as well as for training the standard statistical models (e.g., PoS tagging and Dependency Parsing) available in CoreNLP. More specifically, to obtain a UD-compliant output, we trained the Italian models on the training, dev, and test sets provided within the treebank.

The current version of CoreNLP-it can be easily integrated and configured into CoreNLP by adding the custom annotator classes and their respective models into the pipeline. Such classes and their properties can be added in a configuration file or called via the API interface. This procedure follows the standard CoreNLP documentation and guidelines for custom annotator classes. In addition, we provide a new class (resembling a CoreNLP one) for training the hybrid tokenization and sentence splitting model. The configuration of the classifier and the required dictionaries (cf. Section 3.1) can be specified in a separate property file.
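As a concrete illustration of this setup, the following sketch registers custom annotators through CoreNLP's standard customAnnotatorClass property mechanism and runs the resulting pipeline. The annotator names it_tok_sent and upos are those used in Section 3; the class names, the it_lemma annotator name, and the model file names are hypothetical placeholders, since the actual ones are not specified here.

    # corenlp-it.properties (class and model names are illustrative)
    annotators = it_tok_sent, pos, upos, it_lemma, depparse
    customAnnotatorClass.it_tok_sent = <package>.ItTokSentAnnotator
    customAnnotatorClass.upos        = <package>.UPosAnnotator
    customAnnotatorClass.it_lemma    = <package>.ItLemmaAnnotator
    pos.model      = models/italian-isdt.tagger
    depparse.model = models/italian-isdt-ud.model.txt.gz

The pipeline is then used exactly like a stock CoreNLP pipeline:

    import java.io.FileReader;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class PipelineDemo {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.load(new FileReader("corenlp-it.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Vado a portarvi il libro.");
        pipeline.annotate(doc);

        // Print form, lemma and xPoS for each token, one sentence per block.
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
          for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + "\t" + token.lemma() + "\t" + token.tag());
          }
          System.out.println();
        }
      }
    }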
3 Modules

The annotators described in the following sections are aimed at producing a UD-compliant and complete output. The following information is extracted from text: sentences, tokens, Universal PoS tags, language-specific PoS tags, lemmas, morphological features, and a dependency parse tree for each sentence.

In this section, we briefly describe each module of our linguistic pipeline, focusing on the annotators and models it implements.

3.1 Sentence Splitting and Tokenization

Sentence splitting and tokenization are handled by a single classifier, namely the annotator it_tok_sent. The process splits raw text into sentences, and each sentence into tokens. Crucially, the tokenization process can deal with both single and multi-word tokens, as specified by the CoNLL-U format.

Multi-word tokens such as verbs with clitic pronouns (e.g., portar-vi "carry to you") and articulated prepositions (prep + determiner, e.g., della, di+la "of the") are split into their respective components. The information about the original word and its position in the sentence is however retained within each token by exploiting the token span and original word annotations.

Tokenization is usually solved with rule-based systems able to identify word and sentence boundaries, for example by identifying white spaces and full stops. However, in order to avoid encoding such a set of rules, we implemented a model inspired by Evang et al. (2013). At its core, the process is driven by a hybrid model. First, it uses a character-based statistical model to recognize sentences, tokens, and clitic prepositions. Then, a rule-based dictionary is used to optimize the detection and splitting of multi-word tokens.

The classifier tags each character with respect to one of the following classes: i. S: start of a new sentence; ii. T: start of a new token; iii. I: inside of a token; iv. O: outside of a token; v. C: start of a clitic preposition inside a token (e.g., mandarvi). A sketch of how such a label sequence is decoded into sentences and tokens is given at the end of this section.

The classifier is a simple implementation of the maximum entropy Column Data Classifier available in Stanford CoreNLP. To train the model, we used the following feature set: i. window: a window of n characters before and after the target character; ii. the case of the character; iii. the class of the previous character.

In order to deal with multi-word tokens, the system allows for a full rule-based tagging of a parametric list of multi-word tokens, typically belonging to strictly language-dependent closed-class words. In the Italian implementation, such words are articulated prepositions (prep + determiner). The word list to be ignored is fed to the classifier during training.

Moreover, an additional set of rules can be applied after the classification step in order to deal with possibly misclassified items. In particular, the system simply checks each token against a dictionary of multi-words and splits them accordingly. In the case of Italian, we built a dictionary of clitic verbs (which are instead an open class) by bootstrapping the verbs in the treebank with all possible combinations of clitic pronouns. A final tagging phase merges the rule-based and statistical predictions.
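The following minimal sketch, which is ours and not taken from the actual implementation, shows how a per-character label sequence over the five classes above can be decoded into sentences and tokens. For instance, labeling mandarvi as T I I I I I C I yields the two tokens mandar and vi.

    import java.util.ArrayList;
    import java.util.List;

    /** Decodes per-character labels (S/T/I/O/C) into sentences of tokens. */
    public class CharLabelDecoder {

      public static List<List<String>> decode(String text, char[] labels) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> sentence = new ArrayList<>();
        StringBuilder token = new StringBuilder();

        for (int i = 0; i < text.length(); i++) {
          char c = text.charAt(i);
          switch (labels[i]) {
            case 'S': // start of a new sentence (and of a new token)
              if (token.length() > 0) { sentence.add(token.toString()); token.setLength(0); }
              if (!sentence.isEmpty()) { sentences.add(sentence); sentence = new ArrayList<>(); }
              token.append(c);
              break;
            case 'T': // start of a new token
            case 'C': // start of a clitic cluster inside a word: split here
              if (token.length() > 0) { sentence.add(token.toString()); token.setLength(0); }
              token.append(c);
              break;
            case 'I': // inside the current token
              token.append(c);
              break;
            case 'O': // outside any token (e.g., whitespace): close the token
              if (token.length() > 0) { sentence.add(token.toString()); token.setLength(0); }
              break;
          }
        }
        if (token.length() > 0) sentence.add(token.toString());
        if (!sentence.isEmpty()) sentences.add(sentence);
        return sentences;
      }

      public static void main(String[] args) {
        // "mandarvi" labeled T I I I I I C I -> [[mandar, vi]]
        System.out.println(decode("mandarvi", "TIIIIICI".toCharArray()));
      }
    }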
3.2 Part-of-Speech Tagging

The Maximum Entropy implementation of the Part-of-Speech tagger (Toutanova et al., 2003) provided in the Stanford CoreNLP toolkit has been used to predict language-dependent PoS tags (xPoS). In order to annotate Universal PoS (uPoS) tags, a separate annotator class, namely upos, has been implemented.

As far as the xPoS tagger is concerned, the Maximum Entropy model was trained on the UD-ISDT Treebank. uPoS tags are instead approached with a rule-based strategy. In particular, we built a mapping between xPoS and uPoS tags based on the UD-ISDT Treebank. The mapping is used within the annotator to assign the uPoS tag based on the predicted xPoS tag.

3.3 Lemmatization and Morphological Annotation

In order to annotate each token with its corresponding lemma and morphological features, we developed a rule-based custom annotator. The annotator exploits a parametric dictionary to assign lemmas based on the word form and PoS. In particular, the dictionary contains the lemma and UD morphological features for n (form, PoS) pairs. The form is used as the main access key to the dictionary, while the PoS is used to solve ambiguities, e.g., between amo as "I love" (verb) and as "fishing hook" (noun). Finally, in cases of PoS ambiguity, corpus frequency is used to select the target lemma.

The dictionary can be manually built or extracted from a UD treebank. In the latter case, the provided Vocabulary class has methods to extract and build a serialized model of the vocabulary. A sketch of the lookup logic is given below.
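The following sketch illustrates the lookup logic just described: the form acts as the primary key, the PoS narrows down the candidates, and corpus frequency breaks remaining ties. It is a simplified illustration under assumed data structures, not the actual Vocabulary implementation; all class and field names are hypothetical.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    class LemmaEntry {
      final String lemma, pos, feats; // lemma, PoS, UD morphological features
      final int freq;                 // corpus frequency, used to break ties
      LemmaEntry(String lemma, String pos, String feats, int freq) {
        this.lemma = lemma; this.pos = pos; this.feats = feats; this.freq = freq;
      }
    }

    class LemmaDictionary {
      private final Map<String, List<LemmaEntry>> byForm = new HashMap<>();

      void add(String form, LemmaEntry e) {
        byForm.computeIfAbsent(form.toLowerCase(), k -> new ArrayList<>()).add(e);
      }

      /** Looks up the lemma for a (form, PoS) pair; ties go to the most frequent entry. */
      Optional<LemmaEntry> lookup(String form, String pos) {
        return byForm.getOrDefault(form.toLowerCase(), List.of()).stream()
            .filter(e -> e.pos.equals(pos))             // disambiguate by PoS
            .max(Comparator.comparingInt(e -> e.freq)); // then by corpus frequency
      }
    }

For instance, with entries for amo as both a verb (lemma amare) and a noun (lemma amo), lookup("amo", "VERB") returns the verbal reading, while the noun reading is selected only when the tagger predicts a nominal PoS.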
3.4 Dependency Parsing

The Neural Network Dependency Parser implemented in Stanford CoreNLP (Chen and Manning, 2014) allows models to be trained for different languages. As for Italian, we used the FastText (Joulin et al., 2016) Italian 300-dimensional pre-trained embeddings described in Bojanowski et al. (2017). The dependency parser was trained with the default configuration provided in Stanford CoreNLP.
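For reference, training with the default configuration reduces to an invocation of the parser's standard training interface along the following lines. The flags follow CoreNLP's documented nndep training options; the file and model names are illustrative assumptions.

    java -cp "*" edu.stanford.nlp.parser.nndep.DependencyParser \
      -trainFile it_isdt-ud-train.conllu \
      -devFile   it_isdt-ud-dev.conllu \
      -embedFile fasttext-it-300.vec -embeddingSize 300 \
      -model     UD_Italian.model.txt.gz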
3.5 CoreNLP-it performances

Table 1 reports the global performances of the currently trained models. In particular, all our models were evaluated against the UD-ISDT Treebank test set.

With respect to tokenization, we measured the accuracy by considering the whole output of the tokenization process (i.e., the combination of the statistical classifier and the rule-based multi-word token detection). As for lemmatization, we tested the system by predicting the lemmas for the tokens in the UD-ISDT Italian test set. PoS tagging and dependency parsing were tested with the evaluation tools provided in CoreNLP.

    Task            Tokens/sec   Results
    Tok., S.Split.  17277.4      Accuracy: 99%
    xPoS Tag        7575.4       F1: 0.97
    Lemma           5553.1       Accuracy: 92%
    Dep. Parsing    1717.8       LAS: 86.15
                                 UAS: 88.57

Table 1: Evaluation of CoreNLP-it modules on the UD-ISDT Treebank test set.

We must point out that one of the main shortcomings of a more statistically oriented model for tokenization, with respect to a rule-based one, is that it may underperform on badly formatted or error-filled texts, which are not found in most treebanks. However, we believe that such an approach can nonetheless be very useful, in that it can be automatically scaled to different linguistic registers and text genres. Moreover, most typical errors could be avoided by means of data augmentation strategies and the use of more heterogeneous training data, such as the PoSTWITA-UD Treebank (Sanguinetti et al., 2018).

It is important to stress that the main focus of this work was to build a framework allowing for a fast and easy implementation of UD models based on Stanford CoreNLP from a software engineering point of view. The basic pre-trained models are intended as a proof of concept, and will require further parameter tuning to increase their performance.

4 Flexibility Towards Other Languages

One of the key goals that has driven the development of CoreNLP-it is keeping the core code implementation as language independent as possible. To obtain the required linguistic knowledge, the framework exploits statistical models or external resources. On the one hand, the use of large linguistic resources to perform some of the tasks can affect computational performance, but the system enables the construction of basic resources from the treebank used for training. On the other hand, this framework is very flexible, especially for tasks like tokenization and lemmatization. In particular, the system is able to produce a fully UD-compliant Stanford pipeline for languages for which a UD Treebank is available.

In order to validate this claim, we focused on two languages closely related to Italian, namely Spanish and French. We trained the respective models on the UD-adapted corpora ES-ANCORA (Taulé et al., 2008) and FR-GSD (Hernandez and Boudin, 2013). In these cases, to detect multi-word tokens we exploited the information available in these corpora. Such models are intended as an interesting UD baseline, because the linguistic information they employ is not yet as optimized as the one used by the Italian models.

Since the core of the adaptation of the Stanford pipeline to Universal Dependencies relies on the tokenization phase, we report here the results obtained for this task. The rest of the models (i.e., PoS tagging and parsing) can be trained simply by following the Stanford CoreNLP guidelines. Results obtained by the tokenization and lemmatization modules for French and Spanish are shown in Table 2.

    Task            Language   Accuracy (%)
    Tok., S.Split.  Spanish    99.9
                    French     99.7
    Lemma           Spanish    66
                    French     69

Table 2: Evaluation of CoreNLP-it modules on Spanish and French.

All statistical models show performances similar to the Italian ones. The main differences, as expected, concern the tasks most dependent on external resources (e.g., lemmatization). For example, we noticed a much lower recall for multi-word token identification, given the exclusive use of the examples found in the training set. The approach shows very promising results, especially for the tokenization and sentence splitting modules, which are central for all the subsequent levels of analysis based on UD. It is clear that for PoS tagging and parsing, further developments based on Stanford CoreNLP and language-specific resources are required to account for the specific features of each language.

5 Conclusion and Ongoing Work

In this paper, we presented CoreNLP-it, a set of add-on modules for the Stanford CoreNLP language toolkit. Our system provides basic language annotations such as sentence splitting, tokenization, PoS tagging, lemmatization and dependency parsing, and can produce a UD-compliant output. Our rule-based and statistical models achieve good performances on all tasks. In addition, since the framework has been implemented as an add-on to Stanford CoreNLP, it offers the possibility of adding other annotators, including for example the Stanford NER (Finkel et al., 2005). Moreover, first experiments on other languages have shown very good adaptation capabilities with very little effort.

In the near future, we plan to refine the core code by performing extensive tests to better deal with additional UD-supported languages and to optimize their performance. We also plan to release the tool as well as the basic trained models for Italian. Moreover, we intend to apply data augmentation strategies to refine our models and make them able to work properly also with ill-formed or substandard text input.

References

Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. CoRR.

Giuseppe Attardi, Stefano Dei Rossi, and Maria Simi. 2010. The Tanl pipeline. In LREC Workshop on WSPP, pages 15–21, Valletta, Malta.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Cristina Bosco, Simonetta Montemagni, and Maria Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69, Sofia, Bulgaria.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP 2014, pages 740–750, Doha, Qatar.

Kilian Evang, Valerio Basile, Grzegorz Chrupała, and Johan Bos. 2013. Elephant: Sequence labeling for word and sentence segmentation. In Proceedings of EMNLP 2013, pages 1422–1426, Seattle, Washington, USA. ACL.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL 2005, pages 363–370, Stroudsburg, PA, USA. ACL.

Nicolas Hernandez and Florian Boudin. 2013. Construction automatique d'un large corpus libre annoté morpho-syntaxiquement en français. In Actes de la conférence TALN-RECITAL 2013, pages 160–173, Sables d'Olonne, France.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. CoRR.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932, Prague, Czech Republic. ACL.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In Proceedings of LREC 2008, pages 2603–2607, Marrakech, Morocco. European Language Resources Association (ELRA).

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of LREC 2018.

Mariona Taulé, Maria Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of LREC 2008, pages 96–101, Marrakech, Morocco.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL 2003, pages 173–180, Stroudsburg, PA, USA. ACL.