CoreNLP-it: A UD pipeline for Italian based on Stanford CoreNLP

Alessandro Bondielli¹, Lucia C. Passaro² and Alessandro Lenci²
¹ Dipartimento di Ingegneria dell'Informazione (DINFO), Università degli studi di Firenze
² CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica (FiLeLi), Università di Pisa
alessandro.bondielli@unifi.it, lucia.passaro@fileli.unipi.it, alessandro.lenci@unipi.it

Abstract

English. This paper describes a collection of modules for Italian language processing based on CoreNLP and Universal Dependencies (UD). The software will be freely available for download under the GNU General Public License (GNU GPL). Given the flexibility of the framework, it is easily adaptable to new languages provided with a UD Treebank.

Italiano. Questo lavoro descrive un insieme di strumenti di analisi linguistica per l'Italiano basati su CoreNLP e Universal Dependencies (UD). Il software sarà liberamente scaricabile sotto licenza GNU General Public License (GNU GPL). Data la sua flessibilità, il framework è facilmente adattabile ad altre lingue con una Treebank UD.

1 Introduction

The fast-growing research fields of Text Mining and Natural Language Processing (NLP) have seen important advancements in recent years. NLP tools that provide basic linguistic annotation of raw texts are a crucial building block for further research and applications. Most of these tools, like NLTK (Bird et al., 2009) and Stanford CoreNLP (Manning et al., 2014), have been developed for English and, most importantly, are freely available. For Italian, several tools have been developed over the years, such as TextPro (Pianta et al., 2008) and the Tanl Pipeline (Attardi et al., 2010), but unfortunately they are either outdated or not open source. An exception is represented by Tint (Aprosio and Moretti, 2016), a standalone, freely available and customizable software based on Stanford CoreNLP. The main drawback of this solution is that it is a resource highly tailored for Italian, in which some of the modules have been completely re-implemented with new classes and data structures compared to the CoreNLP ones. In addition, as with the other existing resources, it does not provide an output that is fully compatible with the Universal Dependencies (UD) framework (http://universaldependencies.org/), which is becoming the de facto standard for morpho-syntactic annotation in particular, and for text annotation in general.

In this paper, we present CoreNLP-it, a set of customizable classes for CoreNLP designed for Italian. Our system, despite being simpler than any of the above-mentioned toolkits in both scope and number of features, has the advantage of being easily integrated with the CoreNLP suite, since its development has been grounded on the principle that all data structures be natively supported by CoreNLP.

The key properties of CoreNLP-it are:

• UD based and compliant: The toolkit and models are based on UD and follow its guidelines for token and parsing representation. It can provide all annotations required in the UD framework, and produces a CoNLL-U formatted output at any level of annotation, as well as any other type of annotation provided in CoreNLP.

• Multi-word token representation: Multi-word tokens (e.g., enclitic constructions) are handled by providing separate tokens. Moreover, the CoNLL-U output can represent such information following the UD guidelines (see the example below).

• Hybrid tokenization: A fast and accurate hybrid tokenization and sentence splitting module replaces the original rule-based annotators for this task.

• Integration with CoreNLP: Given the way it is built (including the exclusive usage of CoreNLP classifiers and data structures), the add-on can be seamlessly integrated with the latest available version (3.9.1) of CoreNLP, and is expected to work with upcoming versions as well.

• Support for other languages: It provides out-of-the-box support for basic annotations in other languages provided with a UD Treebank.
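As an illustration of the multi-word token representation, the following CoNLL-U fragment sketches how an articulated preposition such as della (di + la) is rendered: a range line (2-3) preserves the original surface form, while the syntactic words below it carry the actual annotation. The fragment is ours and purely illustrative; morphological features are abridged and the ISDT xPoS values shown are indicative.

    # text = Parlo della vita.
    1    Parlo  parlare  VERB   V   Mood=Ind|Number=Sing|Person=1|Tense=Pres  0  root   _  _
    2-3  della  _        _      _   _                                         _  _      _  _
    2    di     di       ADP    E   _                                         4  case   _  _
    3    la     il       DET    RD  Definite=Def|Gender=Fem|Number=Sing       4  det    _  _
    4    vita   vita     NOUN   S   Gender=Fem|Number=Sing                    1  obl    _  _
    5    .      .        PUNCT  FS  _                                         1  punct  _  _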
This paper is organized as follows: in Section 2, we present the architecture of the toolkit, whereas its core components (annotators) are described in Section 3. The results on Italian are discussed in Section 3.5. Section 4 shows preliminary experiments on the adaptation of the software to two additional languages provided with a UD treebank, namely Spanish and French.

2 Architecture

CoreNLP-it has been built as an add-on to the Stanford CoreNLP toolkit (Manning et al., 2014). CoreNLP offers a set of linguistic tools to perform core linguistic analyses of texts in English and other languages, and produces an annotated output in various formats such as CoNLL (Nivre et al., 2007), XML, JSON, etc.

2.1 Stanford CoreNLP

The main architecture of CoreNLP consists of an annotation object as well as a sequence of annotators aimed at annotating texts at different levels of analysis. Starting from a raw text, each module adds a new annotation layer such as tokenization, PoS tagging, parsing, etc. The behavior of the single annotators can be controlled via standard Java properties. Annotators can analyze text with either rule-based or statistical models. While rule-based models are typically language dependent, statistical ones can be trained directly within the CoreNLP toolkit in order to improve the performance of the default models or to deal with different languages and domains.

2.2 CoreNLP-it

The main goal we pursued in developing CoreNLP-it was to keep the original CoreNLP structure and usage intact, while enabling it to deal with Italian texts in order to produce a UD-compliant and UD-complete output. More specifically, we aimed at building a system capable of providing all textual annotations required by the UD guidelines. Moreover, our system is also compatible with standard CoreNLP functions (e.g., Named Entity Recognition (NER) and Sentiment annotation). For these reasons, we implemented a series of custom annotators and statistical models for Italian. The custom annotators replace the corresponding CoreNLP annotators, leaving intact the annotation structure and output of the annotators they replace.

For simplicity, we used only one of the UD treebanks available for Italian, namely the UD adaptation of the ISDT Italian Treebank (Bosco et al., 2013). The resource was used to build most of the new models, as well as for training the standard statistical models (e.g., PoS tagging and Dependency Parsing) available in CoreNLP. More specifically, to obtain a UD-compliant output, we trained the Italian models on the training, dev, and test sets provided within the treebank.

The current version of CoreNLP-it can be easily integrated and configured into CoreNLP by adding the custom annotator classes and their respective models into the pipeline. Such classes and their properties can be added in a configuration file or called via the API interface. This procedure follows the standard CoreNLP documentation and guidelines for custom annotator classes. In addition, we provide a new class (resembling a CoreNLP one) for training the hybrid tokenization and sentence splitting model. The configuration of the classifier and the required dictionaries (cf. Section 3.1) can be specified in a separate property file.
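As a concrete illustration of this setup, the following sketch registers custom annotators through CoreNLP's standard customAnnotatorClass property mechanism and runs the resulting pipeline. The annotator names it_tok_sent and upos are those used in Section 3; the class names, the it_lemma annotator name, and the model file names are hypothetical placeholders, since the actual ones are not specified here.

    # corenlp-it.properties (class and model names are illustrative)
    annotators = it_tok_sent, pos, upos, it_lemma, depparse
    customAnnotatorClass.it_tok_sent = <package>.ItTokSentAnnotator
    customAnnotatorClass.upos        = <package>.UPosAnnotator
    customAnnotatorClass.it_lemma    = <package>.ItLemmaAnnotator
    pos.model      = models/italian-isdt.tagger
    depparse.model = models/italian-isdt-ud.model.txt.gz

The pipeline is then used exactly like a stock CoreNLP pipeline:

    import java.io.FileReader;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class PipelineDemo {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.load(new FileReader("corenlp-it.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Vado a portarvi il libro.");
        pipeline.annotate(doc);

        // Print form, lemma and xPoS for each token, one sentence per block.
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
          for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + "\t" + token.lemma() + "\t" + token.tag());
          }
          System.out.println();
        }
      }
    }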
3 Modules

The annotators described in the following sections are aimed at producing a UD-compliant and complete output. The following information is extracted from text: sentences, tokens, Universal PoS tags, language-specific PoS tags, lemmas, morphological features, and a dependency parse tree for each sentence.

In this section, we briefly describe each module of our linguistic pipeline, focusing on the annotators and models it implements.

3.1 Sentence Splitting and Tokenization

Sentence splitting and tokenization are handled by a single classifier, namely the annotator it_tok_sent. The process splits raw text into sentences, and each sentence into tokens. Crucially, the tokenization process can deal with both single and multi-word tokens, as specified by the CoNLL-U format.

Multi-word tokens such as verbs with clitic pronouns (e.g., portar-vi "carry to you") and articulated prepositions (prep + determiner, e.g., della, di+la "of the") are split into their respective components. The information about the original word and its position in the sentence is however retained within each token by exploiting the token span and original word annotations.

Tokenization is usually solved with rule-based systems able to identify word and sentence boundaries, for example by identifying white spaces and full stops. However, in order to avoid encoding such a set of rules, we implemented a model inspired by Evang et al. (2013). At its core, the process is driven by a hybrid model. First, it uses a character-based statistical model to recognize sentences, tokens, and clitic prepositions. Then, a rule-based dictionary is used to optimize the detection and splitting of multi-word tokens.

The classifier tags each character with respect to one of the following classes: i. S: start of a new sentence; ii. T: start of a new token; iii. I: inside of a token; iv. O: outside of a token; v. C: start of a clitic preposition inside a token (e.g., mandarvi). A sketch of how such a label sequence is decoded into sentences and tokens is given at the end of this section.

The classifier is a simple implementation of the maximum entropy Column Data Classifier available in Stanford CoreNLP. To train the model, we used the following feature set: i. window: a window of n characters before and after the target character; ii. the case of the character; iii. the class of the previous character.

In order to deal with multi-word tokens, the system allows for a full rule-based tagging of a parametric list of multi-word tokens, typically belonging to strictly language-dependent closed-class words. In the Italian implementation, such words are articulated prepositions (prep + determiner). The word list to be ignored is fed to the classifier during training.

Moreover, an additional set of rules can be applied after the classification step in order to deal with possibly misclassified items. In particular, the system simply checks each token against a dictionary of multi-words and splits them accordingly. In the case of Italian, we built a dictionary of clitic verbs (which are instead an open class) by bootstrapping the verbs in the treebank with all possible combinations of clitic pronouns. A final tagging phase merges the rule-based and statistical predictions.
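The following minimal sketch, which is ours and not taken from the actual implementation, shows how a per-character label sequence over the five classes above can be decoded into sentences and tokens. For instance, labeling mandarvi as T I I I I I C I yields the two tokens mandar and vi.

    import java.util.ArrayList;
    import java.util.List;

    /** Decodes per-character labels (S/T/I/O/C) into sentences of tokens. */
    public class CharLabelDecoder {

      public static List<List<String>> decode(String text, char[] labels) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> sentence = new ArrayList<>();
        StringBuilder token = new StringBuilder();

        for (int i = 0; i < text.length(); i++) {
          char c = text.charAt(i);
          switch (labels[i]) {
            case 'S': // start of a new sentence (and of a new token)
              if (token.length() > 0) { sentence.add(token.toString()); token.setLength(0); }
              if (!sentence.isEmpty()) { sentences.add(sentence); sentence = new ArrayList<>(); }
              token.append(c);
              break;
            case 'T': // start of a new token
            case 'C': // start of a clitic cluster inside a word: split here
              if (token.length() > 0) { sentence.add(token.toString()); token.setLength(0); }
              token.append(c);
              break;
            case 'I': // inside the current token
              token.append(c);
              break;
            case 'O': // outside any token (e.g., whitespace): close the token
              if (token.length() > 0) { sentence.add(token.toString()); token.setLength(0); }
              break;
          }
        }
        if (token.length() > 0) sentence.add(token.toString());
        if (!sentence.isEmpty()) sentences.add(sentence);
        return sentences;
      }

      public static void main(String[] args) {
        // "mandarvi" labeled T I I I I I C I -> [[mandar, vi]]
        System.out.println(decode("mandarvi", "TIIIIICI".toCharArray()));
      }
    }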
3.2 Part-of-Speech Tagging

The Maximum Entropy implementation of the Part-of-Speech tagger (Toutanova et al., 2003) provided in the Stanford CoreNLP toolkit has been used to predict language-dependent PoS tags (xPoS). In order to annotate Universal PoS (uPoS) tags, a separate annotator class, namely upos, has been implemented.

As far as the xPoS tagger is concerned, the Maximum Entropy model was trained on the UD-ISDT Treebank. uPoS tags are instead approached with a rule-based strategy. In particular, we built a mapping between xPoS and uPoS tags based on the UD-ISDT Treebank. The mapping is used within the annotator to assign the uPoS tag based on the predicted xPoS tag.

3.3 Lemmatization and Morphological Annotation

In order to annotate each token with its corresponding lemma and morphological features, we developed a rule-based custom annotator. The annotator exploits a parametric dictionary to assign lemmas based on the word form and PoS. In particular, the dictionary contains the lemma and UD morphological features for n (form, PoS) pairs. The form is used as the main access key to the dictionary, while the PoS is used to solve ambiguities, e.g., between amo as "I love" (verb) and as "fishing hook" (noun). Finally, in cases of PoS ambiguity, corpus frequency is used to select the target lemma.

The dictionary can be manually built or extracted from a UD treebank. In the latter case, the provided Vocabulary class has methods to extract and build a serialized model of the vocabulary. A sketch of the lookup logic is given below.
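The following sketch illustrates the lookup logic just described: the form acts as the primary key, the PoS narrows down the candidates, and corpus frequency breaks remaining ties. It is a simplified illustration under assumed data structures, not the actual Vocabulary implementation; all class and field names are hypothetical.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    class LemmaEntry {
      final String lemma, pos, feats; // lemma, PoS, UD morphological features
      final int freq;                 // corpus frequency, used to break ties
      LemmaEntry(String lemma, String pos, String feats, int freq) {
        this.lemma = lemma; this.pos = pos; this.feats = feats; this.freq = freq;
      }
    }

    class LemmaDictionary {
      private final Map<String, List<LemmaEntry>> byForm = new HashMap<>();

      void add(String form, LemmaEntry e) {
        byForm.computeIfAbsent(form.toLowerCase(), k -> new ArrayList<>()).add(e);
      }

      /** Looks up the lemma for a (form, PoS) pair; ties go to the most frequent entry. */
      Optional<LemmaEntry> lookup(String form, String pos) {
        return byForm.getOrDefault(form.toLowerCase(), List.of()).stream()
            .filter(e -> e.pos.equals(pos))             // disambiguate by PoS
            .max(Comparator.comparingInt(e -> e.freq)); // then by corpus frequency
      }
    }

For instance, with entries for amo as both a verb (lemma amare) and a noun (lemma amo), lookup("amo", "VERB") returns the verbal reading, while the noun reading is selected only when the tagger predicts a nominal PoS.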
3.4 Dependency Parsing

The Neural Network Dependency Parser implemented in Stanford CoreNLP (Chen and Manning, 2014) allows models to be trained for different languages. As for Italian, we used the FastText (Joulin et al., 2016) Italian 300-dimensional pre-trained embeddings described in Bojanowski et al. (2017). The dependency parser was trained with the default configuration provided in Stanford CoreNLP.
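For reference, training with the default configuration reduces to an invocation of the parser's standard training interface along the following lines. The flags follow CoreNLP's documented nndep training options; the file and model names are illustrative assumptions.

    java -cp "*" edu.stanford.nlp.parser.nndep.DependencyParser \
      -trainFile it_isdt-ud-train.conllu \
      -devFile   it_isdt-ud-dev.conllu \
      -embedFile fasttext-it-300.vec -embeddingSize 300 \
      -model     UD_Italian.model.txt.gz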
3.5 CoreNLP-it performances

Table 1 reports the global performances of the currently trained models. In particular, all our models were evaluated against the UD-ISDT Treebank test set.

With respect to tokenization, we measured the accuracy by considering the whole output of the tokenization process (i.e., the combination of the statistical classifier and the rule-based multi-word token detection). As for lemmatization, we tested the system by predicting the lemmas for the tokens in the UD-ISDT Italian test set. PoS tagging and dependency parsing were tested with the evaluation tools provided in CoreNLP.

    Task            Tokens/sec   Results
    Tok., S.Split.  17277.4      Accuracy: 99%
    xPoS Tag        7575.4       F1: 0.97
    Lemma           5553.1       Accuracy: 92%
    Dep. Parsing    1717.8       LAS: 86.15
                                 UAS: 88.57

Table 1: Evaluation of CoreNLP-it modules on the UD-ISDT Treebank test set.

We must point out that one of the main shortcomings of a more statistically oriented model for tokenization, with respect to a rule-based one, is that it may underperform on badly formatted or error-filled texts, which are not found in most treebanks. However, we believe that such an approach can nonetheless be very useful, in that it can be automatically scaled to different linguistic registers and text genres. Moreover, most typical errors could be avoided by means of data augmentation strategies and the use of more heterogeneous training data, such as the PoSTWITA-UD Treebank (Sanguinetti et al., 2018).

It is important to stress that the main focus of this work was to build a framework allowing for a fast and easy implementation of UD models based on Stanford CoreNLP from a software engineering point of view. The basic pre-trained models are intended as a proof of concept, and will require further parameter tuning to increase their performance.

4 Flexibility Towards Other Languages

One of the key goals that has driven the development of CoreNLP-it is keeping the core code implementation as language independent as possible. To obtain the required linguistic knowledge, the framework exploits statistical models or external resources. On the one hand, the use of large linguistic resources to perform some of the tasks can affect computational performance, but the system enables the construction of basic resources from the treebank used for training. On the other hand, this framework is very flexible, especially for tasks like tokenization and lemmatization. In particular, the system is able to produce a fully UD-compliant Stanford pipeline for languages for which a UD Treebank is available.

In order to validate this claim, we focused on two languages closely related to Italian, namely Spanish and French. We trained the respective models on the UD-adapted corpora ES-ANCORA (Taulé et al., 2008) and FR-GSD (Hernandez and Boudin, 2013). In these cases, to detect multi-word tokens we exploited the information available in these corpora. Such models are intended as an interesting UD baseline, because the linguistic information they employ is not yet as optimized as the one used by the Italian models.

Since the core of the adaptation of the Stanford pipeline to Universal Dependencies relies on the tokenization phase, we report here the results obtained for this task. The rest of the models (i.e., PoS tagging and parsing) can be trained simply by following the Stanford CoreNLP guidelines. Results obtained by the tokenization and lemmatization modules for French and Spanish are shown in Table 2.

    Task            Language   Accuracy (%)
    Tok., S.Split.  Spanish    99.9
                    French     99.7
    Lemma           Spanish    66
                    French     69

Table 2: Evaluation of CoreNLP-it modules on Spanish and French.

All statistical models show performances similar to the Italian ones. The main differences, as expected, concern the tasks most dependent on external resources (e.g., lemmatization). For example, we noticed a much lower recall for multi-word token identification, given the exclusive use of the examples found in the training set. The approach shows very promising results, especially for the tokenization and sentence splitting modules, which are central for all the subsequent levels of analysis based on UD. It is clear that for PoS tagging and parsing, further developments based on Stanford CoreNLP and language-specific resources are required to account for the specific features of each language.

5 Conclusion and Ongoing Work

In this paper, we presented CoreNLP-it, a set of add-on modules for the Stanford CoreNLP language toolkit. Our system provides basic language annotations such as sentence splitting, tokenization, PoS tagging, lemmatization and dependency parsing, and can produce a UD-compliant output. Our rule-based and statistical models achieve good performances on all tasks. In addition, since the framework has been implemented as an add-on to Stanford CoreNLP, it offers the possibility of adding other annotators, including for example the Stanford NER (Finkel et al., 2005). Moreover, first experiments on other languages have shown very good adaptation capabilities with very little effort.

In the near future, we plan to refine the core code by performing extensive tests to better deal with additional UD-supported languages and to optimize their performance. We also plan to release the tool as well as the basic trained models for Italian. Moreover, we intend to apply data augmentation strategies to refine our models and make them able to work properly also with ill-formed or substandard text input.

References

Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. CoRR.

Giuseppe Attardi, Stefano Dei Rossi, and Maria Simi. 2010. The Tanl pipeline. In LREC Workshop on WSPP, pages 15–21, Valletta, Malta.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Cristina Bosco, Simonetta Montemagni, and Maria Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69, Sofia, Bulgaria.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP 2014, pages 740–750, Doha, Qatar.

Kilian Evang, Valerio Basile, Grzegorz Chrupała, and Johan Bos. 2013. Elephant: Sequence labeling for word and sentence segmentation. In Proceedings of EMNLP 2013, pages 1422–1426, Seattle, Washington, USA. ACL.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL 2005, pages 363–370, Stroudsburg, PA, USA. ACL.

Nicolas Hernandez and Florian Boudin. 2013. Construction automatique d'un large corpus libre annoté morpho-syntaxiquement en français. In Actes de la conférence TALN-RECITAL 2013, pages 160–173, Sables d'Olonne, France.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. CoRR.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932, Prague, Czech Republic. ACL.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In Proceedings of LREC 2008, pages 2603–2607, Marrakech, Morocco. European Language Resources Association (ELRA).

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of LREC 2018.

Mariona Taulé, Maria Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of LREC 2008, pages 96–101, Marrakech, Morocco.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL 2003, pages 173–180, Stroudsburg, PA, USA. ACL.