=Paper=
{{Paper
|id=Vol-2253/paper58
|storemode=property
|title=Tint 2.0: an All-inclusive Suite for NLP in Italian
|pdfUrl=https://ceur-ws.org/Vol-2253/paper58.pdf
|volume=Vol-2253
|authors=Alessio Palmero Aprosio,Giovanni Moretti
|dblpUrl=https://dblp.org/rec/conf/clic-it/AprosioM18
}}
==Tint 2.0: an All-inclusive Suite for NLP in Italian==
Alessio Palmero Aprosio and Giovanni Moretti
Fondazione Bruno Kessler, Trento, Italy
aprosio@fbk.eu, moretti@fbk.eu

Abstract

English. In this paper we present Tint 2.0, an open-source, fast and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes some improvements of the existing NLP modules, and a set of new text processing components for fine-grained linguistic analysis that were not available so far, including multi-word expression recognition, affix analysis, readability assessment and classification of complex verb tenses.

Italiano (English translation). In this article we present Tint 2.0, a collection of fast and customizable open-source modules for the automatic analysis of Italian texts, based on Stanford CoreNLP. The new version includes some improvements to the standard modules and the integration of entirely new components for linguistic analysis. These include, for example, the recognition of multi-word expressions ("polirematiche"), affix analysis, readability computation and the recognition of compound verb tenses.

1 Introduction

In recent years, Natural Language Processing (NLP) technologies have become fundamental for dealing with complex tasks that require text analysis, such as Question Answering, Topic Classification, Text Simplification, etc. Both research institutions and companies need accurate and reliable software for free and efficient linguistic analysis, allowing programmers to focus on the core of their business or research. While most of the open-source NLP tools freely available on the web, such as Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) and OpenNLP (https://opennlp.apache.org/), are designed for English and only sometimes adapted to other languages, there is a lack of this kind of resources for Italian.

In this paper we present a novel, extended release of Tint (Palmero Aprosio and Moretti, 2016), a suite of ready-to-use modules for Italian NLP. It is free to use, open source, and can be downloaded and used out-of-the-box (see Section 6). Compared to the previous version, the suite has been enriched with several modules for fine-grained linguistic analysis that were not available for Italian before.

2 Related work

There are plenty of linguistic pipelines available for download. Most of them (such as Stanford CoreNLP and OpenNLP) are language independent and, even if they are not available for Italian out-of-the-box, they can be trained on any language. Notable examples in this direction are UDPipe (Straka and Straková, 2017), a trainable pipeline which performs most of the common NLP tasks and is available in more than 50 languages, and FreeLing (Padró and Stanilovsky, 2012), a C++ library providing language analysis functionalities for a variety of languages. There are also some pipelines for Italian, such as TextPro (Emanuele Pianta and Zanoli, 2008), T2K (Dell'Orletta et al., 2014), and Tanl, but none of them is released as open source (and only TextPro can be downloaded and used for free for research purposes). Other single components are unfortunately available only upon request to the authors, for example the AnIta morphological analyser (Tamburini and Melandri, 2012).

In this respect, Tint represents an exception: not only does it include standard NLP modules, for example Named Entity Recognition and Lemmatization, but it also provides, within a single framework, additional components that are usually available as separate tools, such as the identification of multi-word expressions, the estimation of text complexity and the detection of text reuse.
Multi-word expression identification is a well-studied problem, but most of the existing tools are available or optimized only for English. One of them, jMWE (http://projects.csail.mit.edu/jmwe/), is written in Java and has a companion project (https://github.com/toliwa/CoreNLP-jMWE) that adds compatibility with CoreNLP (Kulkarni and Finlayson, 2011). The mwetoolkit (http://mwetoolkit.sourceforge.net/PHITE.php) is written in Python and uses a CRF classifier (Ramisch et al., 2010). The word2phrase module of word2vec attempts to learn phrases in a document of any language (Mikolov et al., 2013), but it is more a statistical tool for phrase extraction than a multi-word detector.

As for the assessment of text complexity, READ-IT (Dell'Orletta et al., 2011) is the only existing tool that gathers readability information for an Italian text. However, while the online demo can be used for free without registration, the tool is not available for offline use.

As for text reuse detection, i.e. detecting when an author quotes (or borrows from) an earlier or contemporary author, the task has become easier in recent years thanks to new algorithms and the high availability of texts (Mullen, 2016; Clough et al., 2002; Mihalcea et al., 2006). However, also in this case, no tools are available for Italian.

3 Tool description

The Tint pipeline is based on Stanford CoreNLP (Manning et al., 2014), an open-source framework written in Java that provides most of the common Natural Language Processing tasks out-of-the-box in various languages. The framework also provides an easy interface for extending the annotation to new tasks and/or languages. Differently from some similar tools, such as UIMA (Ferrucci and Lally, 2004) and GATE (Cunningham et al., 2002), CoreNLP is easy to use and requires only basic object-oriented programming skills to extend it. In Tint, we adopt this framework to: (i) port the most common NLP tasks to Italian; (ii) make the suite easily extendable, both for writing new modules and for replacing existing ones with more customized versions; and (iii) implement some new annotators as wrappers for external tools, for tasks such as entity linking, temporal expression identification and keyword extraction.
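Since Tint follows the Stanford CoreNLP paradigm, a pipeline is typically configured through a Properties object and queried through the usual annotation classes. The sketch below shows this pattern with the stock CoreNLP API; the annotator list is illustrative, and pointing the pipeline to Tint's Italian models (rather than CoreNLP's default English ones) would require extra configuration that is not detailed in the paper, so treat those details as assumptions.

<syntaxhighlight lang="java">
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class PipelineSketch {
    public static void main(String[] args) {
        // CoreNLP-style configuration. With stock CoreNLP these annotators load
        // English models; an Italian setup (as in Tint) would point them to
        // Italian models instead (assumption, not shown here).
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Il gatto dorme sul divano. Domani piove.");
        pipeline.annotate(document);

        // Read back sentences and, for each token, its word form, POS tag and lemma.
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.printf("%s\t%s\t%s%n",
                        token.word(),
                        token.get(CoreAnnotations.PartOfSpeechAnnotation.class),
                        token.get(CoreAnnotations.LemmaAnnotation.class));
            }
        }
    }
}
</syntaxhighlight>

New annotators added to the pipeline expose their output through the same annotation mechanism, which is what makes the CoreNLP paradigm easy to extend with further modules.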
4 Modules

In this section we present a set of Tint modules, briefly describing those that were already included in the first release (Palmero Aprosio and Moretti, 2016) and focusing in more detail on the novel, more recent ones. While the older modules perform traditional NLP tasks (e.g. morphological analysis), we have recently integrated components for a more fine-grained linguistic analysis of specific phenomena, such as affixation, the identification of multi-word expressions, anglicisms and the euphonic "d". These components are the outcome of a larger project involving FBK and the Institute for Educational Research of the Province of Trento (Sprugnoli et al., 2018), aimed at studying with NLP tools the evolution of Italian texts towards the so-called neo-standard Italian (Berruto, 2012).

4.1 Already existing modules

As described in Palmero Aprosio and Moretti (2016), the Tint pipeline provides a set of pre-installed modules for basic linguistic annotation: tokenization, part-of-speech (POS) tagging, morphological analysis, lemmatization, named entity recognition and classification (NERC), and dependency parsing.

Among these modules, two have been implemented from scratch and do not rely on the components available in Stanford CoreNLP: the tokenizer and the morphological analyser (see below). POS tagging, dependency parsing and NERC are performed using the existing modules in CoreNLP, trained on the Italian section of the Universal Dependencies (UD) dataset (http://universaldependencies.org/; Bosco et al., 2013) and on I-CAB (Magnini et al., 2006), respectively.

Additional modules include wrappers for temporal expression extraction and classification with HeidelTime (Strötgen and Gertz, 2013), keyword extraction with Keyphrase Digger (Moretti et al., 2015), and entity linking using DBpedia Spotlight (http://bit.ly/dbpspotlight; Daiber et al., 2013) and The Wiki Machine (http://bit.ly/thewikimachine; Giuliano et al., 2009).

Tokenizer: This module provides text segmentation into tokens and sentences. At first, the text is coarsely tokenized. Then, in a second step, tokens that need to be put together are merged using two customizable lists of Italian non-breaking abbreviations (such as "dott." or "S.p.A.") and regular expressions (for e-mail addresses, web URIs, numbers, dates). This second phase uses a trie (De La Briandais, 1959) to speed up the process; a sketch of this kind of lookup is given below.
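The merging step of the tokenizer has to recognise that a sequence such as "S", ".", "p", ".", "A", "." coming out of the coarse pass belongs to one abbreviation. The following snippet is a minimal, self-contained sketch of how a trie over the abbreviation list can drive such a merge; it is written for illustration only and does not reproduce Tint's actual classes or data structures.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy trie-based merger for non-breaking abbreviations (illustrative only). */
public class AbbreviationMerger {

    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isAbbreviation = false;
    }

    private final Node root = new Node();

    /** Insert an abbreviation such as "S.p.A." character by character. */
    public void add(String abbreviation) {
        Node node = root;
        for (char c : abbreviation.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isAbbreviation = true;
    }

    /** Merge consecutive tokens whose concatenation matches a known abbreviation. */
    public List<String> merge(List<String> tokens) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int bestEnd = -1;
            Node node = root;
            // Walk the trie over the concatenation of the following tokens,
            // remembering the longest match that ends on a token boundary.
            for (int j = i; j < tokens.size() && node != null; j++) {
                for (char c : tokens.get(j).toCharArray()) {
                    node = node.children.get(c);
                    if (node == null) {
                        break;
                    }
                }
                if (node != null && node.isAbbreviation) {
                    bestEnd = j;
                }
            }
            if (bestEnd >= i) {
                result.add(String.join("", tokens.subList(i, bestEnd + 1)));
                i = bestEnd + 1;
            } else {
                result.add(tokens.get(i));
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AbbreviationMerger merger = new AbbreviationMerger();
        merger.add("S.p.A.");
        merger.add("dott.");
        // A coarse first pass may have split "S.p.A." into six tokens.
        List<String> tokens = List.of("Acme", "S", ".", "p", ".", "A", ".", "vende", "auto");
        System.out.println(merger.merge(tokens)); // [Acme, S.p.A., vende, auto]
    }
}
</syntaxhighlight>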
Morphological analyser: This module provides the full list of morphological features for each annotated token. The current version has been trained using the Morph-it! lexicon (Zanchetta and Baroni, 2005), but it is possible to extend or retrain it with other Italian datasets. In order to guarantee fast performance, the model storage has been implemented with the MapDB Java library (http://www.mapdb.org), which provides an excellent variation of Cassandra's Sorted String Table. To extend the coverage of the results, especially for complex forms such as "porta-ce-ne" or "bi-direzionale", the module tries to decompose the token into prefix-root-infix-suffix and to recognise the root form.

See Section 5 for an extensive evaluation of the modules.

4.2 New modules

Affixes annotation: This module provides a token-level annotation of word derivatives, based on derIvaTario (http://derivatario.sns.it/; Talamo et al., 2016). The resource was built by segmenting about 11,000 derivatives into derivational cycles and annotating them with a wide array of features. The module uses this resource to segment a token into root and affixes; for example, visione is analysed as baseLemma=vedere, affix=zione and allomorph=ione.

Classification of verbal tenses: The part-of-speech tagger and the morphological analyser released with Tint can identify and classify verbs at the token level, but sometimes the modality, form and tense of a verb are the result of a sequence of tokens, as in compound tenses built on the participio passato, or in passive verb forms. For this reason, we include in Tint a new tense module that provides a more complete annotation of multi-token verbal forms. The module also supports the analysis of discontinuous expressions, such as ho sempre mangiato.

Text reuse: Detecting text reuse is useful when, given a document, we want to measure its overlap with a given corpus. This is needed in a number of applications, for example plagiarism detection, stylometry, authorship attribution, citation analysis, etc. Tint now includes a component for this task, i.e. identifying the parts of an input text that overlap with a given corpus. First of all, each sentence of the corpus is compared with the sentences of the processed text using the FuzzyWuzzy package (https://github.com/xdrop/fuzzywuzzy), a Java fuzzy string matching implementation: this allows the system not to miss expressions that are slightly different from the texts in the original corpus. In this phase only long spans of text can be considered, as the probability of an incorrect fuzzy match grows as the text length decreases. A second step checks whether the overlap involves the whole sentence and, if not, analyses the two texts and identifies the number of overlapping tokens. Finally, the Stanford CoreNLP quote annotator (https://stanfordnlp.github.io/CoreNLP/quote.html) is used to catch text reuse that appears between quotation marks, ignoring the length limitation of the fuzzy comparison.

Readability: In this module we compute some metrics that can be useful to assess the readability of a text, partially inspired by Dell'Orletta et al. (2011) and Tonelli et al. (2012). In particular, we include the following indices (a toy computation of some of them is sketched after the list):

• Number of content words, hyphens (computed with the iText Java library, https://github.com/itext/itextpdf), sentences with fewer than a fixed number of words, and the distribution of tokens by part-of-speech.
• Type-token ratio (TTR), i.e. the ratio between the number of different lemmas and the number of tokens; a high TTR indicates a high degree of lexical variation.
• Lexical density, i.e. the number of content words divided by the total number of words.
• Amount of coordinate and subordinate clauses, along with the ratio between them.
• Depth of the parse tree of each sentence: both average and maximum depth are calculated on the whole text.
• Gulpease formula (Lucisano and Piemontese, 1988) to measure readability at the document level.
• Text difficulty based on word lists from De Mauro's Dictionary of Basic Italian (http://bit.ly/nuovo-demauro).
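To make the surface-level indices concrete, the sketch below computes the type-token ratio, the lexical density and the Gulpease index from pre-tokenized input. The Gulpease formula used here is the standard one, 89 + (300 * sentences - 10 * letters) / words; how Tint counts letters or decides which words are content words is not spelled out in the paper, so the counting choices in this toy example are assumptions.

<syntaxhighlight lang="java">
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy surface-level readability indices (illustrative; not Tint's implementation). */
public class ReadabilitySketch {

    // Coarse UD-style tags treated as content words (an assumption for this sketch).
    private static final Set<String> CONTENT_POS = Set.of("NOUN", "VERB", "ADJ", "ADV");

    /** Type-token ratio: distinct lemmas over number of tokens. */
    static double typeTokenRatio(List<String> lemmas) {
        long distinct = lemmas.stream()
                .map(l -> l.toLowerCase(Locale.ITALIAN))
                .distinct()
                .count();
        return (double) distinct / lemmas.size();
    }

    /** Lexical density: content words over total words. */
    static double lexicalDensity(List<String> posTags) {
        long content = posTags.stream().filter(CONTENT_POS::contains).count();
        return (double) content / posTags.size();
    }

    /** Gulpease index: 89 + (300 * #sentences - 10 * #letters) / #words. */
    static double gulpease(int letters, int words, int sentences) {
        return 89.0 + (300.0 * sentences - 10.0 * letters) / words;
    }

    public static void main(String[] args) {
        List<String> lemmas = List.of("il", "gatto", "dormire", "su", "il", "divano");
        List<String> posTags = List.of("DET", "NOUN", "VERB", "ADP", "DET", "NOUN");
        int letters = lemmas.stream().mapToInt(String::length).sum(); // crude letter count

        System.out.printf("TTR = %.2f%n", typeTokenRatio(lemmas));
        System.out.printf("Lexical density = %.2f%n", lexicalDensity(posTags));
        System.out.printf("Gulpease = %.1f%n", gulpease(letters, lemmas.size(), 1));
    }
}
</syntaxhighlight>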
Multi-word expressions: A specific multi-token annotator has been implemented to recognize more than 13,450 multi-word expressions, the so-called "polirematiche" (Voghera, 2004), manually collected from various online resources. The list includes verbal, nominal, adjectival and prepositional expressions (e.g. lasciar perdere, società per azioni, nei confronti di, mezzo morto). The annotator can also identify discontinuous multi-words. For example, in the expression andare a genio (an Italian phrase that means "to like") an adverb can be inserted, as in andare troppo a genio. Similarly, such phrases can contain nouns and adjectives, e.g. lasciare Antonio a piedi, where lasciare a piedi is an Italian multi-word meaning "to leave (someone) stranded".

Anglicisms: A list of more than 2,500 anglicisms, collected from the web, is included in the latest release of Tint, and a dedicated annotator identifies them in the text, distinguishing between adapted ("chattare", "skillato") and non-adapted anglicisms ("spread", "leadership"). This module can then be used to track the use of borrowings from English in Italian texts, a phenomenon much debated in the media and among scholars (Fanfani, 1996; Furiassi, 2008).

Euphonic "d": For euphonic reasons, the preposition a and the conjunctions e and o usually become ad, ed, od when the following word begins with a, e, o respectively. While traditionally this rule was applied before any vowel, a more recent grammatical recommendation limits the euphonic "d" to cases in which it is followed by the same vowel, for example ed ecco vs. e ancora (see http://bit.ly/crusca-d-eufonica). Tint provides an annotator that identifies this phenomenon and classifies each instance as correct, if it follows the aforementioned rule, or as incorrect in all other cases (a toy version of this check is sketched at the end of this subsection).

Corpus statistics: A collection of CoreNLP annotators has been developed to extract statistics that can be used, for instance, to analyse traits of interest in texts. More specifically, the provided modules can mark and count words and sentences based on token, lemma, part-of-speech and word position in the sentence.
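The euphonic "d" rule described above lends itself to a simple token-level check. The sketch below classifies occurrences of ad, ed and od as correct or incorrect depending on the first letter of the following word; it is only an illustration of the rule, not Tint's actual annotator, and it deliberately ignores customary exceptions (such as the fixed expression ad esempio) that a real implementation would have to handle.

<syntaxhighlight lang="java">
import java.util.List;
import java.util.Locale;

/** Toy classifier for the euphonic "d" rule (illustrative only). */
public class EuphonicD {

    /**
     * Returns true if the euphonic form (ad/ed/od) follows the "same vowel"
     * rule, i.e. the next word starts with the same vowel as the base word.
     */
    static boolean isCorrect(String euphonicForm, String nextWord) {
        char baseVowel = euphonicForm.toLowerCase(Locale.ITALIAN).charAt(0); // 'a', 'e' or 'o'
        char nextInitial = nextWord.toLowerCase(Locale.ITALIAN).charAt(0);
        return baseVowel == nextInitial;
    }

    public static void main(String[] args) {
        List<String[]> bigrams = List.of(
                new String[]{"ed", "ecco"},    // correct: same vowel
                new String[]{"e", "ancora"},   // plain form, nothing to check
                new String[]{"ad", "esempio"}  // flagged incorrect by the strict rule
        );
        for (String[] pair : bigrams) {
            String form = pair[0];
            String next = pair[1];
            if (form.equalsIgnoreCase("ad") || form.equalsIgnoreCase("ed") || form.equalsIgnoreCase("od")) {
                System.out.printf("%s %s -> %s%n", form, next,
                        isCorrect(form, next) ? "correct" : "incorrect");
            }
        }
    }
}
</syntaxhighlight>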
5 Evaluation

Tint includes a rich set of tools, which are evaluated separately. In some cases an accuracy-based evaluation is not possible, because no gold standard is available or because the tool's output is not comparable with that of other tools. When possible, Tint is compared with existing pipelines that work on Italian: Tanl (Attardi et al., 2010), TextPro (Pianta et al., 2008) and TreeTagger (Schmid, 1994).

When measuring speed, we run each experiment 10 times and report the average execution time. Where available, multi-thread capabilities have been disabled. All experiments have been executed on a 2.3 GHz Intel Core i7 with 16 GB of memory. The Tanl API is not available as a downloadable package but only online through a REST API, therefore its speed may be influenced by the network connection. No evaluation is performed for the Tint annotators that act as wrappers for external tools (temporal expression tagging, entity linking, keyword extraction).

5.1 Tokenization and sentence splitting

For the task of tokenization and sentence splitting, Tint outperforms both TextPro and Tanl in speed (see Table 1).

Table 1: Tokenization and sentence splitting speed.
System        Speed (tok/sec)
Tint          80,000
Tanl API      30,000
TextPro 2.0   35,000

5.2 Part-of-speech tagging

The evaluation of part-of-speech tagging is performed against the test set included in the UD dataset, containing 10K tokens. As the tagset differs across tools, accuracy is calculated only on five coarse-grained types: nouns (N), verbs (V), adverbs (B), adjectives (A) and other (O). Table 2 shows the results.

Table 2: Evaluation of part-of-speech tagging.
System        Speed (tok/sec)   Accuracy
Tint          28,000            98%
Tanl API      20,000            n.a.
TextPro 2.0   20,000            96%
TreeTagger    190,000 (*)       92%
(*) The (considerable) speed of TreeTagger includes both lemmatization and part-of-speech tagging.

5.3 Lemmatization

Like part-of-speech tagging, lemmatization is evaluated both in terms of accuracy and of execution time on the UD test set. When the lemma is guessed starting from a morphological analysis (as in Tint and TextPro), the speed is calculated by including both tasks. Table 3 shows the results. All the tools reach the same accuracy of 96% (with minor differences that are not statistically significant).

Table 3: Evaluation of lemmatization.
System        Speed (tok/sec)   Accuracy
Tint          97,000            96%
TextPro 2.0   9,000             96%
TreeTagger    190,000 (*)       96%

5.4 Named Entity Recognition

For Named Entity Recognition, we evaluate and compare our system on the test set of the I-CAB dataset. We consider three classes: PER, ORG, LOC. To train Tint, we extracted lists of persons, locations and organizations by querying the Airpedia database (Palmero Aprosio et al., 2013) for Wikipedia pages classified as Person, Place and Organisation, respectively. Table 4 shows the results of the named entity recognition task.

Table 4: Evaluation of the NER.
System        Speed    P       R       F1
Tint          30,000   84.37   79.97   82.11
TextPro 2.0   4,000    81.78   80.78   81.28
Tanl API      16,000   72.89   52.50   61.04

5.5 Dependency parsing

The evaluation of the dependency parser is performed against Tanl (Attardi et al., 2013) and TextPro (Lavelli, 2013) with respect to the usual metrics, Labeled Attachment Score (LAS) and Unlabeled Attachment Score (UAS). Table 5 shows the results: the Tint evaluation has been performed on the UD test data, while LAS and UAS for TextPro and Tanl are taken directly from the Evalita 2011 proceedings (Magnini et al., 2013).

Table 5: Evaluation of the dependency parsing.
System        Speed    LAS     UAS
Tint          9,000    84.67   87.05
TextPro 2.0   1,300    87.30   91.47
Tanl (DeSR)   900      89.88   93.73

6 Tint distribution

The Tint pipeline is released as open source software under the GNU General Public License (GPL), version 3. It can be downloaded from the Tint website (http://tint.fbk.eu/) as a standalone package, or integrated into an existing application as a Maven dependency. The source code is available on GitHub (https://github.com/dhfbk/tint/). The tool is written following the Stanford CoreNLP paradigm, therefore third-party software can easily be integrated into the pipeline.

7 Conclusions and Future Work

In this paper we presented the new release of Tint, a simple, fast and accurate NLP pipeline for Italian based on Stanford CoreNLP. In the new version, we have fixed some bugs and improved some of the existing modules. We have also added a set of components for fine-grained linguistic analysis that were not available so far.

In the future, we plan to improve the suite and extend it with additional modules, also based on the feedback from users through the GitHub project page. We are currently working on new modules, in particular Word Sense Disambiguation (WSD) based on linguistic resources such as MultiWordNet (Pianta et al., 2002), and Semantic Role Labelling, by porting to Italian resources such as FrameNet (Baker et al., 1998), which are now available only in English. The Tint pipeline will also be integrated in PIKES (Corcoglioniti et al., 2016), a tool that extracts knowledge from English texts using NLP and outputs it in a queryable form (such as RDF triples), so as to extend it to Italian.

Acknowledgments

The research leading to this paper was partially supported by the EU Horizon 2020 Programme via the SIMPATICO Project (H2020-EURO-6-2015, n. 692819).

References
G. Attardi, S. Dei Rossi, and M. Simi. 2010. The Tanl Pipeline. In Proc. of the LREC Workshop on WSPP.

Giuseppe Attardi, Maria Simi, and Andrea Zanelli. 2013. Tuning DeSR for dependency parsing of Italian. In Evaluation of Natural Language and Speech Tools for Italian, pages 37–45. Springer.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, pages 86–90. Association for Computational Linguistics.

Gaetano Berruto. 2012. Sociolinguistica dell'italiano contemporaneo. Carocci.

Cristina Bosco, Simonetta Montemagni, and Maria Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank.

Paul Clough, Robert Gaizauskas, Scott S. L. Piao, and Yorick Wilks. 2002. METER: Measuring text reuse. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 152–159. Association for Computational Linguistics.

Francesco Corcoglioniti, Marco Rospocher, and Alessio Palmero Aprosio. 2016. A 2-phase frame-based knowledge extraction framework. In Proc. of the ACM Symposium on Applied Computing (SAC'16).

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. GATE: An architecture for development of robust HLT applications. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 168–175, Stroudsburg, PA, USA. Association for Computational Linguistics.

Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics).

Rene De La Briandais. 1959. File searching using variable length keys. In Papers Presented at the March 3-5, 1959, Western Joint Computer Conference, IRE-AIEE-ACM '59 (Western), pages 295–298, New York, NY, USA. ACM.

Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2011. READ-IT: Assessing readability of Italian texts with a view to text simplification. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, SLPAT '11, pages 73–83, Stroudsburg, PA, USA. Association for Computational Linguistics.

Felice Dell'Orletta, Giulia Venturi, Andrea Cimino, and Simonetta Montemagni. 2014. T2K^2: a system for automatically extracting and organizing knowledge from texts. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

Massimo Fanfani. 1996. Sugli anglicismi nell'italiano contemporaneo (XIV). Lingua nostra, 57(2):72–91.

David Ferrucci and Adam Lally. 2004. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348, September.

Cristiano Furiassi. 2008. Non-adapted Anglicisms in Italian: Attitudes, frequency counts, and lexicographic implications. Cambridge Scholars Publishing.

Christian Girardi, Emanuele Pianta, and Roberto Zanoli. 2008. The TextPro tool suite. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Claudio Giuliano, Alfio Massimiliano Gliozzo, and Carlo Strapparava. 2009. Kernel methods for minimally supervised WSD. Computational Linguistics, 35(4):513–528, December.

Nidhi Kulkarni and Mark Alan Finlayson. 2011. jMWE: A Java toolkit for detecting multi-word expressions. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, pages 122–124. Association for Computational Linguistics.

Alberto Lavelli. 2013. An ensemble model for the Evalita 2011 dependency parsing task. In Evaluation of Natural Language and Speech Tools for Italian, pages 30–36. Springer.

Pietro Lucisano and Maria Emanuela Piemontese. 1988. GULPEASE: una formula per la predizione della difficoltà dei testi in lingua italiana. Scuola e città, 3(31):110–124.
Bernardo Magnini, Emanuele Pianta, Christian Girardi, Matteo Negri, Lorenza Romano, Manuela Speranza, Valentina Bartalesi Lenzi, and Rachele Sprugnoli. 2006. I-CAB: the Italian Content Annotation Bank. In Proceedings of LREC, pages 963–968. Citeseer.

Bernardo Magnini, Francesco Cutugno, Mauro Falcone, and Emanuele Pianta. 2013. Evaluation of Natural Language and Speech Tools for Italian: International Workshop, EVALITA 2011, Rome, January 24-25, 2012, Revised Selected Papers, volume 7689. Springer.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55–60.

Rada Mihalcea, Courtney Corley, Carlo Strapparava, et al. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI, volume 6, pages 775–780.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Giovanni Moretti, Rachele Sprugnoli, and Sara Tonelli. 2015. Digging in the dirt: Extracting keyphrases from texts with KD. In CLiC-it, page 198.

Lincoln Mullen. 2016. textreuse: Detect Text Reuse and Document Similarity. R package version 0.1.4.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In LREC 2012.

A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints, September.

Alessio Palmero Aprosio, Claudio Giuliano, and Alberto Lavelli. 2013. Automatic expansion of DBpedia exploiting Wikipedia cross-language information. In Proceedings of the 10th Extended Semantic Web Conference.
Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. Developing an aligned multilingual database. In Proc. 1st International Conference on Global WordNet. Citeseer.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In LREC. Citeseer.

Carlos Ramisch, Aline Villavicencio, and Christian Boitet. 2010. Multiword expressions in the wild?: the mwetoolkit comes in handy. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pages 57–60. Association for Computational Linguistics.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees.

Rachele Sprugnoli, Sara Tonelli, Alessio Palmero Aprosio, and Giovanni Moretti. 2018. Analysing the evolution of students' writing skills and the impact of neo-standard Italian with the help of computational linguistics. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

Jannik Strötgen and Michael Gertz. 2013. Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2):269–298.

Luigi Talamo, Chiara Celata, and Pier Marco Bertinetto. 2016. DerIvaTario: An annotated lexicon of Italian derivatives. Word Structure, 9(1):72–102.

Fabio Tamburini and Matias Melandri. 2012. AnIta: a powerful morphological analyser for Italian. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Sara Tonelli, Ke Tran Manh, and Emanuele Pianta. 2012. Making readability indices readable. In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 40–48, Montréal, Canada, June. Association for Computational Linguistics.

Miriam Voghera. 2004. Polirematiche. In La formazione delle parole in italiano, pages 56–69.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).