=Paper= {{Paper |id=Vol-2253/paper58 |storemode=property |title=Tint 2.0: an All-inclusive Suite for NLP in Italian |pdfUrl=https://ceur-ws.org/Vol-2253/paper58.pdf |volume=Vol-2253 |authors=Alessio Palmero Aprosio,Giovanni Moretti |dblpUrl=https://dblp.org/rec/conf/clic-it/AprosioM18 }} ==Tint 2.0: an All-inclusive Suite for NLP in Italian== https://ceur-ws.org/Vol-2253/paper58.pdf
                  Tint 2.0: an All-inclusive Suite for NLP in Italian

               Alessio Palmero Aprosio                             Giovanni Moretti
               Fondazione Bruno Kessler                         Fondazione Bruno Kessler
                     Trento, Italy                                    Trento, Italy
                 aprosio@fbk.eu                                   moretti@fbk.eu



Abstract

English. In this paper we present Tint 2.0, an open-source, fast and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes some improvements of the existing NLP modules, and a set of new text processing components for fine-grained linguistic analysis that were not available so far, including multi-word expression recognition, affix analysis, readability and classification of complex verb tenses.

Italiano. In questo articolo presentiamo Tint 2.0, una collezione di moduli open-source veloci e personalizzabili per l’analisi automatica di testi in italiano basata su Stanford CoreNLP. La nuova versione comprende alcune migliorie relative ai moduli standard, e l’integrazione di componenti totalmente nuovi per l’analisi linguistica. Questi includono per esempio il riconoscimento di espressioni polirematiche, l’analisi degli affissi, il calcolo della leggibilità e il riconoscimento dei tempi verbali composti.

1   Introduction

In recent years, Natural Language Processing (NLP) technologies have become fundamental to deal with complex tasks requiring text analysis, such as Question Answering, Topic Classification, Text Simplification, etc. Both research institutions and companies require accurate and reliable software for free and efficient linguistic analysis, allowing programmers to focus on the core of their business or research. While most of the open-source NLP tools freely available on the web (such as Stanford CoreNLP¹ and OpenNLP²) are designed for English and sometimes adapted to other languages, there is a lack of this kind of resources for Italian.

In this paper, we present a novel, extended release of Tint (Palmero Aprosio and Moretti, 2016), a suite of ready-to-use modules for Italian NLP. It is free to use, open source, and can be downloaded and used out-of-the-box (see Section 6). Compared to the previous version, the suite has been enriched with several modules for fine-grained linguistic analysis that were not available for Italian before.

    ¹ http://stanfordnlp.github.io/CoreNLP/
    ² https://opennlp.apache.org/

2   Related work

There are plenty of linguistic pipelines available for download. Most of them (such as Stanford CoreNLP and OpenNLP) are language independent and, even if they are not available in Italian out-of-the-box, they could be trained for any language. Notable examples in this direction are UDPipe (Straka and Straková, 2017), a trainable pipeline which performs most of the common NLP tasks and is available in more than 50 languages, and Freeling (Padró and Stanilovsky, 2012), a C++ library providing language analysis functionalities for a variety of languages. There are also some pipelines for Italian, such as TextPro (Pianta et al., 2008), T2K (Dell’Orletta et al., 2014), and TaNL, but none of them are released as open source (and only TextPro can be downloaded and used for free for research purposes). Other single components are unfortunately available only upon request to the authors, for example the AnIta morphological analyser (Tamburini and Melandri, 2012).

In this respect, Tint represents an exception because it not only includes standard NLP modules, for example Named Entity Recognition and
Lemmatization, but it also provides within a single     entity linking, temporal expression identification,
framework additional components that are usually        keyword extraction.
available as separate tools, such as the identifica-
tion of multi-word expressions, the estimation of       4     Modules
text complexity and the detection of text reuse.        In this Section, we present a set of Tint modules,
   Multi-word expression identification is a well       briefly describing those that were already included
studied problem, but most of the tools are avail-       in the first release (Palmero Aprosio and Moretti,
able or optimized only for English. One of them,        2016) and focusing in more detail on novel,
jMWE,3 is written in Java and provides a paral-         more recent ones. While the old modules per-
lel project4 that adds compatibility to CoreNLP         form traditional NLP tasks (e.g. morphological
(Kulkarni and Finlayson, 2011). The mwetoolkit5         analysis), we have recently integrated components
is written in Python and uses a CRF classifier          for a more fine-grained linguistic analysis of spe-
(Ramisch et al., 2010). The word2phrase module          cific phenomena, such as affixation, the identifi-
of word2vec attempts to learn phrases in a docu-        cation of multi-word expressions, anglicisms and
ment of any language (Mikolov et al., 2013), but it     euphonic “d”. These are the outcome of a larger
is more a statistical tool for phrase extraction than   project involving FBK and the Institute for Educa-
for multi-word detection.                               tional Research of the Province of Trento (Sprug-
   As for the assessment of text complexity,            noli et al., 2018), aimed at studying with NLP
READ-IT (Dell’Orletta et al., 2011) is the only ex-     tools the evolution of Italian texts towards the so-
isting tool that gathers readability information for    called neo-standard Italian (Berruto, 2012).
an Italian text. However, while the online demo
can be used for free without registration, the tool     4.1        Already existing modules
is not available for offline use.                       As described in (Palmero Aprosio and Moretti,
   Text reuse detection, i.e. identifying when an       2016), the Tint pipeline provides a set of pre-
author quotes (or borrows from) an earlier or con-      installed modules for basic linguistic annotation:
temporary author, has become easier in recent years     tokenization, part-of-speech (POS) tagging, mor-
thanks to new algorithms and the wide availability of   phological analysis, lemmatization, named en-
texts (Mullen, 2016; Clough et al., 2002; Mihalcea      tity recognition and classification (NERC), depen-
et al., 2006). However, in this case too, no tools      dency parsing.
are available for Italian.                                 Among the modules, two have been imple-
                                                        mented from scratch and do not rely on the com-
3   Tool description
                                                        ponents available in Stanford CoreNLP: the to-
The Tint pipeline is based on Stanford CoreNLP          kenizer and the morphological analyser (see be-
(Manning et al., 2014), an open-source framework        low). POS tagging, dependency parsing and
written in Java, that provides most of the com-         NERC are performed using the existing modules
mon Natural Language Processing tasks out-of-           in CoreNLP, trained on the Universal Dependen-
the-box in various languages. The framework pro-        cies6 (UD) dataset in Italian (Bosco et al., 2013),
vides also an easy interface to extend the anno-        and I-CAB (Magnini et al., 2006) respectively.
tation to new tasks and/or languages. Differently          Additional modules include wrappers for tem-
from some similar tools, such as UIMA (Ferrucci         poral expression extraction and classification with
and Lally, 2004) and GATE (Cunningham et al.,           HeidelTime (Strötgen and Gertz, 2013), keyword
2002), CoreNLP is easy to use and requires only         extraction with Keyphrase Digger (Moretti et al.,
basic object-oriented programming skills to ex-         2015), and entity linking using DBpedia Spot-
tend it. In Tint, we adopt this framework to: (i)       light7 (Daiber et al., 2013) and The Wiki Machine8
port the most common NLP tasks to Italian; (ii)         (Giuliano et al., 2009).
make it easily extendable, both for writing new            Tokenizer: This module provides text segmen-
modules and replacing existing ones with more           tation into tokens and sentences. First, the text
customized ones; and (iii) implement some new           is coarsely tokenized. Then, in a second step, to-
annotators as wrappers for external tools, such as      kens that need to be put together are merged us-
    3 http://projects.csail.mit.edu/jmwe/                     6 http://universaldependencies.org/
    4 https://github.com/toliwa/CoreNLP-jMWE                  7 http://bit.ly/dbpspotlight
    5 http://mwetoolkit.sourceforge.net/PHITE.php             8 http://bit.ly/thewikimachine
ing two customizable lists of Italian non-breaking       stylometry, authorship attribution, citation analy-
abbreviations (such as “dott.” or “S.p.A.”) and          sis, etc. Tint includes now a component to deal
regular expressions (for e-mail addresses, web           with this task, i.e. identifying parts of an input
URIs, numbers, dates). This second phase uses a trie     with this task, i.e. identifying parts of an input
(De La Briandais, 1959) to speed up the process.         text that overlap with a given corpus. First of all,
   Morphological Analyser: The morphological             sentences in the processed text using the Fuzzy-
analyzer module provides the full list of morpho-        Wuzzy package11 , a Java fuzzy string matching
logical features for each annotated token. The cur-      implementation: this allows the system not to miss
rent version of the module has been trained us-          expressions that are slightly different with respect
ing the Morph-it lexicon (Zanchetta and Baroni,          to the texts in the original corpus. In this phase,
2005), but it is possible to extend or retrain it with   only long spans of text can be considered, as the
other Italian datasets. In order to grant fast perfor-   probability of an incorrect match on fuzzy com-
mance, the model storage has been implemented            parison grows as soon as the text length decreases.
with the MapDB Java library9 that provides an ex-        A second step checks whether the overlap involves
cellent variation of Cassandra’s Sorted String Ta-       the whole sentence and, if not, it analyzes the two
ble. To extend the coverage of the results, espe-        texts and identifies the number of overlapping to-
cially for the complex forms, such as “porta-ce-         kens. Finally, the Stanford CoreNLP quote anno-
ne” or “bi-direzionale”, the module tries to de-         tator12 is used to catch text reuse that is in between
compose the token into prefix-root-infix-suffix and      quotes, ignoring the length limitation of the fuzzy
tries to recognise the root form.                        comparison.
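
The first two steps of the text reuse comparison described above can be sketched as follows. This is only an illustration in Python, not Tint's Java code: the standard-library `difflib.SequenceMatcher` stands in for the FuzzyWuzzy package, and the function name `find_reuse`, the similarity threshold and the minimum span length are assumptions chosen for the example. The quote-based step is not covered.

```python
from difflib import SequenceMatcher


def find_reuse(text_sentences, corpus_sentences, threshold=0.85, min_tokens=8):
    """Return (sentence, corpus_sentence, ratio, shared_tokens) tuples.

    threshold and min_tokens are hypothetical defaults, not values from Tint.
    """
    matches = []
    for sent in text_sentences:
        sent_tokens = sent.lower().split()
        # Short spans are skipped: fuzzy comparison is unreliable on them.
        if len(sent_tokens) < min_tokens:
            continue
        for ref in corpus_sentences:
            # Step 1: fuzzy comparison of the two sentences as strings.
            ratio = SequenceMatcher(None, sent.lower(), ref.lower()).ratio()
            if ratio < threshold:
                continue
            # Step 2: count the tokens that the two sentences share, in order.
            token_matcher = SequenceMatcher(None, sent_tokens, ref.lower().split())
            shared = sum(block.size for block in token_matcher.get_matching_blocks())
            matches.append((sent, ref, ratio, shared))
    return matches


corpus = ["Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"]
text = ["Nel mezzo del cammin di nostra vita mi ritrovai in una selva oscura",
        "Ciao a tutti"]
for sentence, source, ratio, shared in find_reuse(text, corpus):
    print(f"{shared} shared tokens, similarity {ratio:.2f}: {sentence}")
```

The second sentence is ignored because it is shorter than the minimum span, which mirrors the length limitation mentioned for the fuzzy comparison.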
   See Section 5 for an extensive evaluation of the         Readability: In this module, we compute some
modules.                                                 metrics that can be useful to assess the readability
                                                         of a text, partially inspired by Dell’Orletta et al.
4.2         New modules                                  (2011) and Tonelli et al. (2012). In particular, we
Affixes annotation: This module provides a               include the following indices:
token-level annotation about word derivatives,
based on derIvaTario (Talamo et al., 2016).10 The          • Number of content words, hyphens (using
resource was built segmenting into derivational              iText Java Library13 ), sentences having less
cycles about 11,000 derivatives and annotating               than a fixed number of words, distribution of
them with a wide array of features. The mod-                 tokens based on part-of-speech.
ule uses this resource in input to segment a token         • Type-token ratio (TTR), i.e. the ratio between
into root and affixes, for example visione is anal-          the number of different lemmas and the num-
ysed as baseLemma=vedere, affix=zione and allo-              ber of tokens; high TTR indicates a high de-
morph=ione.                                                  gree of lexical variation.
   Classification of verbal tenses: The part-of-speech
tagger and morphological analyzer released with            • Lexical density, i.e. the number of content
Tint can identify and classify verbs at token level,         words divided by the total number of words.
but sometimes the modality, form and tense of a
verb are the result of a sequence of tokens, as in         • Amount of coordinate and subordinate
compound tenses such as participio passato, or               clauses, along with the ratio between them.
passive verb forms. For this reason, we include in
Tint a new tense module to provide a more com-             • Depth of the parse tree for each sentence:
plete annotation of multi-token verbal forms. The            both average and max depth are calculated on
module supports also the analysis of discontinuous           the whole text.
expressions, like for example ho sempre mangiato.
   Text reuse: Detecting text reuse is useful when,        • Gulpease formula (Lucisano and Piemontese,
in a document, we want to measure the overlap                1988) to measure the readability at document
with a given corpus. This is needed in a number of           level.
applications, for example for plagiarism detection,         11
                                                                 https://github.com/xdrop/fuzzywuzzy
                                                            12
                                                                 https://stanfordnlp.github.io/CoreNLP/quote.
       9
           http://www.mapdb.org                          html
      10                                                    13
           http://derivatario.sns.it/                            https://github.com/itext/itextpdf
  • Text difficulty based on word lists from De          5     Evaluation
    Mauro’s Dictionary of Basic Italian14 .
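
Three of the readability indices listed above reduce to simple ratios and can be sketched in a few lines. The Python functions below are an illustration, not Tint's implementation. The Gulpease index is written in its usual form, 89 + (300 × sentences − 10 × letters) / words. The function names and the set of part-of-speech tags counted as content words are assumptions made for the example.

```python
def gulpease(n_letters, n_words, n_sentences):
    """Gulpease index: 89 + (300 * sentences - 10 * letters) / words."""
    return 89 + (300 * n_sentences - 10 * n_letters) / n_words


def type_token_ratio(lemmas):
    """Number of different lemmas divided by the number of tokens."""
    return len(set(lemmas)) / len(lemmas)


# Assumption: content words are those tagged with these coarse UD tags.
CONTENT_TAGS = frozenset({"NOUN", "VERB", "ADJ", "ADV"})


def lexical_density(pos_tags, content_tags=CONTENT_TAGS):
    """Number of content words divided by the total number of words."""
    content = sum(1 for tag in pos_tags if tag in content_tags)
    return content / len(pos_tags)


# A text with 500 letters, 100 words and 5 sentences.
print(gulpease(500, 100, 5))  # 54.0

# "il gatto mangia il pesce", already lemmatized and tagged by hand.
print(type_token_ratio(["il", "gatto", "mangiare", "il", "pesce"]))
print(lexical_density(["DET", "NOUN", "VERB", "DET", "NOUN"]))
```

Higher Gulpease values indicate easier texts; on very short inputs the formula can return values above 100.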
                                                         Tint includes a rich set of tools, evaluated sepa-
                                                         rately. In some cases, an evaluation based on the
    Multi-word expressions: A specific multi-            accuracy is not possible, because of the lack of an
token annotator has been implemented to recog-           available gold standard or because the tool’s out-
nize more than 13,450 multi-word expressions, the        put is not comparable to that of other tools.
so-called ‘polirematiche’ (Voghera, 2004), manu-            When possible, Tint is compared with existing
ally collected from various online resources. The        pipelines that work with the Italian language: Tanl
list includes verbal, nominal, adjectival and prepo-     (Attardi et al., 2010), TextPro (Pianta et al., 2008)
sitional expressions (e.g. lasciar perdere, società     and TreeTagger (Schmid, 1994).
per azioni, nei confronti di, mezzo morto). This            In calculating speed, we run each experiment
annotator can identify also discontinuous multi-         10 times and consider the average execution time.
words. For example, in the expression andare a           When available, multi-thread capabilities have
genio (Italian phrase that means “to like”) an ad-       been disabled. All experiments have been exe-
verb can be included, as in andare troppo a genio.       cuted on a 2.3 GHz Intel Core i7 with 16 GB of
Similarly, in such phrases one can find nouns and        memory.
adjectives (e.g. lasciare Antonio a piedi, where            The Tanl API is not available as a download-
lasciare a piedi is an Italian multiword for leave       able package, but it’s only usable online through a
stranded).                                               REST API, therefore the speed may be influenced
    Anglicisms: A list of more than 2,500 angli-         by the network connection.
cisms, collected from the web, is included in the           No evaluation is performed for the Tint annota-
latest release of Tint, and a dedicated annotator iden-  tors that act as wrappers for external tools (tem-
tifies them in the text and distinguishes between        poral expression tagging, entity linking, keyword
adapted (“chattare”, “skillato”) and non-adapted         extraction).
anglicisms (“spread”, “leadership”). This module
                                                         5.1     Tokenization and sentence splitting
can then be used to track the use of borrowings
from English in Italian texts, a phenomenon much         For the task of tokenization and sentence splitting,
debated in the media and among scholars (Fanfani,        Tint outperforms in speed both TextPro and Tanl
1996; Furiassi, 2008).                                   (see Table 1).
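
The discontinuous matching described above for multi-word expressions (andare troppo a genio, lasciare Antonio a piedi) can be sketched as an in-order search that tolerates a few intervening tokens. The paper does not describe how Tint implements this, so the Python function below is only a sketch: matching on lemmas, the name `find_mwe` and the `max_gap` limit are all assumptions.

```python
def find_mwe(lemmas, mwe, max_gap=2):
    """Find occurrences of the lemma sequence `mwe` inside `lemmas`.

    Components must appear in order, with at most `max_gap` extra tokens
    between two consecutive components (a hypothetical limit).
    Returns a list of tuples with the token index of each component.
    """
    hits = []
    for start, lemma in enumerate(lemmas):
        if lemma != mwe[0]:
            continue
        positions = [start]
        for part in mwe[1:]:
            prev = positions[-1]
            # Look at the next token and at most max_gap tokens after it.
            window = range(prev + 1, min(prev + 2 + max_gap, len(lemmas)))
            nxt = next((i for i in window if lemmas[i] == part), None)
            if nxt is None:
                break
            positions.append(nxt)
        else:
            # Every component was found within the allowed distance.
            hits.append(tuple(positions))
    return hits


# "andare troppo a genio": the adverb sits inside the expression.
print(find_mwe(["andare", "troppo", "a", "genio"], ("andare", "a", "genio")))
# "lasciare Antonio a piedi", with "piedi" lemmatized as "piede".
print(find_mwe(["lasciare", "Antonio", "a", "piede"], ("lasciare", "a", "piede")))
```

Setting `max_gap=0` restricts the search to contiguous expressions only.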
    Euphonic “D”: For euphonic reasons, the                                System            Speed (tok/sec)
preposition a, and the conjunctions e and o usually                        Tint                      80,000
become ad, ed, od when the subsequent word be-                             Tanl API                  30,000
                                                                           TextPro 2.0               35,000
gins with a, e, o respectively. While traditionally
this rule was applied to every vowel, a more recent      Table 1:         Tokenization and sentence splitting
grammatical rule has established that the euphonic       speed.
‘d’ should be limited to cases in which it is fol-
lowed by the same vowel, for example ed ecco vs.
e ancora15 . Tint provides an annotator that identi-     5.2     Part-of-speech tagging
fies this phenomenon, and classifies each instance       The evaluation of the part-of-speech tagging is
as correct, if it follows the aforementioned rule, or    performed against the test set included in the UD
incorrect in all the other cases.                        dataset, containing 10K tokens. As the tagset used
    Corpus statistics: A collection of CoreNLP an-       is different for different tools, the accuracy is cal-
notators have been developed to extract statistics       culated only on five coarse-grained types: nouns
that can be used, for instance, to analyse traits of     (N), verbs (V), adverbs (B), adjectives (A) and
interest in texts. More specifically, the provided       other (O). Table 2 shows the results.
modules can mark and compute words and sen-
                                                         5.3     Lemmatization
tences based on token, lemma, part-of-speech and
word position in the sentence.                           Like part-of-speech tagging, lemmatization is
                                                         evaluated, both in terms of accuracy and execu-
   14 http://bit.ly/nuovo-demauro                          16 The (considerable) speed of TreeTagger includes both
   15 http://bit.ly/crusca-d-eufonica                         lemmatization and part-of-speech tagging.
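
The euphonic “d” rule described in Section 4.2 (ad, ed, od are correct only when the next word begins with the same vowel) is simple enough to sketch directly. The Python function below is an illustration rather than Tint's annotator: the name `classify_euphonic_d` and the output format are assumptions, and the check is deliberately naive (it ignores accented vowels and fixed expressions).

```python
# Euphonic forms mapped to the vowel that must open the following word.
EUPHONIC = {"ad": "a", "ed": "e", "od": "o"}


def classify_euphonic_d(tokens):
    """Label each euphonic 'd' as 'correct' or 'incorrect'.

    Returns (index, word, next_word, label) tuples. Simplification:
    only plain initial a/e/o are compared, so accented vowels are
    treated as a different letter.
    """
    results = []
    for i in range(len(tokens) - 1):
        word = tokens[i].lower()
        if word in EUPHONIC:
            next_word = tokens[i + 1].lower()
            same_vowel = next_word.startswith(EUPHONIC[word])
            label = "correct" if same_vowel else "incorrect"
            results.append((i, tokens[i], tokens[i + 1], label))
    return results


print(classify_euphonic_d(["ed", "ecco"]))    # same vowel: correct
print(classify_euphonic_d(["ed", "ancora"]))  # different vowel: incorrect
print(classify_euphonic_d(["e", "ancora"]))   # no euphonic d: nothing to label
```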
        System          Speed (tok/sec)     Accuracy                    System        Speed    LAS     UAS
        Tint                    28,000         98%                      Tint          9,000   84.67   87.05
        Tanl API                20,000           n.a.                   TextPro 2.0   1,300   87.30   91.47
        TextPro 2.0             20,000          96%                     Tanl (DeSR)     900   89.88   93.73
        TreeTagger           190,000¹⁶         92%
                                                              Table 5: Evaluation of the dependency parsing.
 Table 2: Evaluation of part-of-speech tagging.

                                                          6     Tint distribution
tion time, on the UD test set. When the lemma
is guessed starting from a morphological analysis         The Tint pipeline is released as open source
(such as in Tint and TextPro), the speed is calcu-        software under the GNU General Public License
lated by including both tasks. Table 3 shows the          (GPL), version 3. It can be downloaded from the Tint
results. All the tools reach the same accuracy of         website17 as a standalone package, or it can be in-
96% (with minor differences that are not statisti-        tegrated into an existing application as a Maven
cally significant).                                       dependency. The source code is available on
                                                          Github.18
        System          Speed (tok/sec)     Accuracy         The tool is written using the Stanford CoreNLP
        Tint                    97,000          96%       paradigm, therefore third-party software can be
        TextPro 2.0              9,000          96%       integrated easily into the pipeline.
        TreeTagger           190,000¹⁶         96%

       Table 3: Evaluation of lemmatization.              7     Conclusions and Future Works
                                                          In this paper, we presented the new release of Tint,
                                                          a simple, fast and accurate NLP pipeline for Ital-
5.4   Named Entity Recognition                            ian, based on Stanford CoreNLP. In the new ver-
For Named Entity Recognition, we evaluate our system      sion, we have fixed some bugs and improved some
on the test set of the I-CAB dataset and compare it       of the existing modules. We have also added a set
with the other tools. We consider three classes:          of components for fine-grained linguistic analysis
PER, ORG, LOC. In training Tint, we extracted             that were not available so far.
a list of persons, locations and organizations by            In the future, we plan to improve the suite and
querying the Airpedia database (Palmero Apro-             extend it with additional modules, also based on
sio et al., 2013) for Wikipedia pages classified as       the feedback from the users through the github
Person, Place and Organisation, respec-                   project page. We are currently working on new
tively. Table 4 shows the results of the named en-        modules, in particular Word Sense Disambigua-
tity recognition task.                                    tion (WSD) based on linguistic resources such as
                                                          MultiWordNet (Pianta et al., 2002) and Seman-
      System           Speed       P          R      F1   tic Role Labelling, by porting to Italian resources
      Tint            30,000   84.37      79.97   82.11   such as FrameNet (Baker et al., 1998), now avail-
      TextPro 2.0      4,000   81.78      80.78   81.28
      Tanl API        16,000   72.89      52.50   61.04
                                                          able only in English.
                                                             The Tint pipeline will also be integrated in
          Table 4: Evaluation of the NER.                 PIKES (Corcoglioniti et al., 2016), a tool that ex-
                                                          tracts knowledge from English texts using NLP
                                                          and outputs it in a queryable form (such as RDF
5.5   Dependency parsing                                  triples), so as to extend it to Italian.
The evaluation of the dependency parser is per-
                                                          Acknowledgments
formed against Tanl (Attardi et al., 2013) and
TextPro (Lavelli, 2013) w.r.t the usual metrics La-       The research leading to this paper was partially
beled Attachment Score (LAS) and Unlabeled At-            supported by the EU Horizon 2020 Programme via
tachment Score (UAS). Table 5 shows the results:          the SIMPATICO Project (H2020-EURO-6-2015,
the Tint evaluation has been performed on the UD          n. 692819).
test data; LAS and UAS for TextPro and Tanl are
taken directly from the Evalita 2011 proceedings          17 http://tint.fbk.eu/
(Magnini et al., 2013).                                   18 https://github.com/dhfbk/tint/
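
As a consistency check on Table 4, the F1 column can be recomputed from the reported precision (P) and recall (R), assuming the standard definition of F1 as their harmonic mean. The short Python snippet below uses only the values printed in the table; the function name is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs as reported in Table 4.
table4 = {
    "Tint": (84.37, 79.97),
    "TextPro 2.0": (81.78, 80.78),
    "Tanl API": (72.89, 52.50),
}

for system, (p, r) in table4.items():
    print(f"{system}: F1 = {f1(p, r):.2f}")
# Tint: F1 = 82.11
# TextPro 2.0: F1 = 81.28
# Tanl API: F1 = 61.04
```

The recomputed values agree with the F1 column of the table to two decimal places.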
References

G. Attardi, S. Dei Rossi, and M. Simi. 2010. The Tanl Pipeline. In Proc. of LREC Workshop on WSPP.

Giuseppe Attardi, Maria Simi, and Andrea Zanelli. 2013. Tuning DeSR for dependency parsing of Italian. In Evaluation of Natural Language and Speech Tools for Italian, pages 37–45. Springer.

Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1, pages 86–90. Association for Computational Linguistics.

Gaetano Berruto. 2012. Sociolinguistica dell’italiano contemporaneo. Carocci.

Cristina Bosco, Simonetta Montemagni, and Maria Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank.

Paul Clough, Robert Gaizauskas, Scott SL Piao, and Yorick Wilks. 2002. METER: Measuring text reuse. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 152–159. Association for Computational Linguistics.

Francesco Corcoglioniti, Marco Rospocher, and Alessio Palmero Aprosio. 2016. A 2-phase frame-based knowledge extraction framework. In Proc. of ACM Symposium on Applied Computing (SAC’16).

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. GATE: An architecture for development of robust HLT applications. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 168–175, Stroudsburg, PA, USA. Association for Computational Linguistics.

Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics).

Rene De La Briandais. 1959. File searching using variable length keys. In Papers Presented at the March 3-5, 1959, Western Joint Computer Conference, IRE-AIEE-ACM ’59 (Western), pages 295–298, New York, NY, USA. ACM.

Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi. 2011. READ-IT: Assessing readability of Italian texts with a view to text simplification. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, SLPAT ’11, pages 73–83, Stroudsburg, PA, USA. Association for Computational Linguistics.

Felice Dell’Orletta, Giulia Venturi, Andrea Cimino, and Simonetta Montemagni. 2014. T2K^2: a system for automatically extracting and organizing knowledge from texts. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association (ELRA).

Massimo Fanfani. 1996. Sugli anglicismi nell’italiano contemporaneo (xiv). Lingua nostra, 57(2):72–91.

David Ferrucci and Adam Lally. 2004. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng., 10(3-4):327–348, September.

Cristiano Furiassi. 2008. Non-adapted Anglicisms in Italian: Attitudes, frequency counts, and lexicographic implications. Cambridge Scholars Publishing.

Claudio Giuliano, Alfio Massimiliano Gliozzo, and Carlo Strapparava. 2009. Kernel methods for minimally supervised WSD. Comput. Linguist., 35(4):513–528, December.

Nidhi Kulkarni and Mark Alan Finlayson. 2011. jMWE: A Java toolkit for detecting multi-word expressions. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, pages 122–124. Association for Computational Linguistics.

Alberto Lavelli. 2013. An ensemble model for the Evalita 2011 dependency parsing task. In Evaluation of Natural Language and Speech Tools for Italian, pages 30–36. Springer.

Pietro Lucisano and Maria Emanuela Piemontese. 1988. GULPEASE: una formula per la predizione della difficoltà dei testi in lingua italiana. Scuola e città, 3(31):110–124.

Bernardo Magnini, Emanuele Pianta, Christian Girardi, Matteo Negri, Lorenza Romano, Manuela Speranza, Valentina Bartalesi Lenzi, and Rachele Sprugnoli. 2006. I-CAB: the Italian Content Annotation Bank. In Proceedings of LREC, pages 963–968. Citeseer.

Bernardo Magnini, Francesco Cutugno, Mauro Falcone, and Emanuele Pianta. 2013. Evaluation of Natural Language and Speech Tool for Italian: International Workshop, EVALITA 2011, Rome, January 24-25, 2012, Revised Selected Papers, volume 7689. Springer.
Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven
  Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language
  processing toolkit. In ACL (System Demonstrations), pages 55–60.

Rada Mihalcea, Courtney Corley, Carlo Strapparava, et al. 2006. Corpus-based
  and knowledge-based measures of text semantic similarity. In AAAI,
  volume 6, pages 775–780.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean.
  2013. Distributed representations of words and phrases and their
  compositionality. In C. J. C. Burges, L. Bottou, M. Welling,
  Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural
  Information Processing Systems 26, pages 3111–3119. Curran Associates,
  Inc.

Giovanni Moretti, Rachele Sprugnoli, and Sara Tonelli. 2015. Digging in the
  dirt: Extracting keyphrases from texts with KD. CLiC-it, page 198.

Lincoln Mullen. 2016. textreuse: Detect Text Reuse and Document Similarity.
  R package version 0.1.4.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider
  multilinguality. In LREC2012.

A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a
  collection of CoreNLP modules for Italian. ArXiv e-prints, September.

Alessio Palmero Aprosio, Claudio Giuliano, and Alberto Lavelli. 2013.
  Automatic expansion of DBpedia exploiting Wikipedia cross-language
  information. In Proceedings of the 10th Extended Semantic Web Conference.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. Developing
  an aligned multilingual database. In Proc. 1st Intl Conference on Global
  WordNet. Citeseer.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro
  tool suite. In LREC. Citeseer.

Carlos Ramisch, Aline Villavicencio, and Christian Boitet. 2010. Multiword
  expressions in the wild?: the mwetoolkit comes in handy. In Proceedings of
  the 23rd International Conference on Computational Linguistics:
  Demonstrations, pages 57–60. Association for Computational Linguistics.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision
  trees.

Rachele Sprugnoli, Sara Tonelli, Alessio Palmero Aprosio, and Giovanni
  Moretti. 2018. Analysing the evolution of students’ writing skills and the
  impact of neo-standard Italian with the help of computational linguistics.
  In Proceedings of the Sixth Italian Conference on Computational
  Linguistics (CLiC-it 2018), Torino, Italy.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing
  and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared
  Task: Multilingual Parsing from Raw Text to Universal Dependencies,
  pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

Jannik Strötgen and Michael Gertz. 2013. Multilingual and cross-domain
  temporal tagging. Language Resources and Evaluation, 47(2):269–298.

Luigi Talamo, Chiara Celata, and Pier Marco Bertinetto. 2016. DerIvaTario:
  An annotated lexicon of Italian derivatives. Word Structure, 9(1):72–102.

Fabio Tamburini and Matias Melandri. 2012. AnIta: a powerful morphological
  analyser for Italian. In Nicoletta Calzolari (Conference Chair), Khalid
  Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph
  Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors,
  Proceedings of the Eighth International Conference on Language Resources
  and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources
  Association (ELRA).

Sara Tonelli, Ke Tran Manh, and Emanuele Pianta. 2012. Making readability
  indices readable. In Proceedings of the First Workshop on Predicting and
  Improving Text Readability for target reader populations, pages 40–48,
  Montréal, Canada, June. Association for Computational Linguistics.

Miriam Voghera. 2004. Polirematiche. La formazione delle parole in italiano,
  pages 56–69.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! a free corpus-based
  morphological resource for the Italian language. Corpus Linguistics 2005,
  1(1).