<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tint 2.0: an All-inclusive Suite for NLP in Italian</article-title>
      </title-group>
      <contrib-group>
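        <contrib contrib-type="author">
          <string-name>Alessio Palmero Aprosio</string-name>
          <email>aprosio@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Moretti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          , Trento,
          <country country="IT">Italy</country>
        </aff>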
      </contrib-group>
      <abstract>
        <p>English. In this paper we present Tint 2.0, an open-source, fast and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes improvements to the existing NLP modules and a set of new text processing components for fine-grained linguistic analysis that were not available so far, including multi-word expression recognition, affix analysis, readability assessment and classification of complex verb tenses.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Italian (translated). In this article we present
Tint 2.0, a collection of fast and customizable
open-source modules for
the automatic analysis of Italian texts,
based on Stanford CoreNLP. The new
version includes some improvements to the
standard modules and the integration of
entirely new components for linguistic
analysis. These include, for example, the
recognition of multi-word expressions,
affix analysis, readability computation
and the recognition of compound verb
tenses.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>In recent years, Natural Language Processing
(NLP) technologies have become fundamental for
complex tasks that require text analysis,
such as Question Answering, Topic Classification
and Text Simplification. Both research institutions
and companies need accurate, reliable and freely
available software for efficient linguistic analysis,
so that programmers can focus on the core of their
business or research. However, most of the
open-source NLP tools freely available on the web (such
as Stanford CoreNLP<sup>1</sup> and OpenNLP<sup>2</sup>) are
designed for English and only sometimes adapted to other
languages, and resources of this kind are lacking
for Italian.</p>
      <p>
        In this paper, we present a novel, extended
release of Tint
        <xref ref-type="bibr" rid="ref22 ref31 ref7">(Palmero Aprosio and Moretti, 2016)</xref>
        ,
a suite of ready-to-use modules for Italian NLP. It
is free to use, open source, and can be downloaded
and used out-of-the-box (see Section 6).
Compared to the previous version, the suite has been
enriched with several modules for fine-grained
linguistic analysis that were not available for Italian
before.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2 Related work</title>
      <p>
        Many linguistic pipelines are available
for download. Most of them (such as Stanford
CoreNLP and OpenNLP) are language
independent and, even if they do not support
Italian out-of-the-box, they can in principle be
trained for any language. Notable examples in
this direction are UDPipe
        <xref ref-type="bibr" rid="ref29">(Straka and Straková, 2017)</xref>
        , a trainable pipeline which performs most
of the common NLP tasks and is available in
more than 50 languages, and FreeLing
        <xref ref-type="bibr" rid="ref15 ref21 ref32 ref33">(Padró and Stanilovsky, 2012)</xref>
        , a C++ library providing
language analysis functionalities for a variety of
languages. There are also some pipelines for
Italian, such as TextPro
        <xref ref-type="bibr" rid="ref25">(Pianta et al., 2008)</xref>
        , T2K
        <xref ref-type="bibr" rid="ref12">(Dell’Orletta et al., 2014)</xref>
        , and Tanl,
but none of them is released as open source (and
only TextPro can be downloaded and used for free
for research purposes). Other single components
are unfortunately available only upon request to
the authors, for example the AnIta morphological
analyser
        <xref ref-type="bibr" rid="ref15 ref21 ref32 ref33">(Tamburini and Melandri, 2012)</xref>
        .
      </p>
      <p>In this respect, Tint represents an exception
because it not only includes standard NLP
modules, for example Named Entity Recognition and
Lemmatization, but also provides within a single
framework additional components that are usually
available as separate tools, such as the
identification of multi-word expressions, the estimation of
text complexity and the detection of text reuse.</p>
      <p>1 http://stanfordnlp.github.io/CoreNLP/
2 https://opennlp.apache.org/</p>
      <p>
        Multi-word expression identification is a
well-studied problem, but most of the tools are
available or optimized only for English. One of them,
jMWE,<sup>3</sup> is written in Java and has a
companion project<sup>4</sup> that adds compatibility with CoreNLP
        <xref ref-type="bibr" rid="ref11 ref15">(Kulkarni and Finlayson, 2011)</xref>
        . The mwetoolkit<sup>5</sup>
is written in Python and uses a CRF classifier
        <xref ref-type="bibr" rid="ref26">(Ramisch et al., 2010)</xref>
        . The word2phrase module
of word2vec attempts to learn phrases in a
document of any language
        <xref ref-type="bibr" rid="ref18">(Mikolov et al., 2013)</xref>
        , but it
is more a statistical tool for phrase extraction than
a multi-word detection tool.
      </p>
      <p>
        As for the assessment of text complexity,
READ-IT
        <xref ref-type="bibr" rid="ref11 ref12">(Dell’Orletta et al., 2011)</xref>
        is the only
existing tool that gathers readability information for
an Italian text. However, while the online demo
can be used for free without registration, the tool
is not available for offline use.
      </p>
      <p>
        As for text reuse detection, i.e. identifying when an author
quotes (or borrows from) an earlier or
contemporary author, the task has become easier in recent years
thanks to new algorithms and the wide availability of
texts
        <xref ref-type="bibr" rid="ref17 ref20 ref6">(Mullen, 2016; Clough et al., 2002; Mihalcea et al., 2006)</xref>
        . However, in this case too, no tools
are available for Italian.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3 Tool description</title>
      <p>
        The Tint pipeline is based on Stanford CoreNLP
        <xref ref-type="bibr" rid="ref16">(Manning et al., 2014)</xref>
        , an open-source framework
written in Java that provides most of the
common Natural Language Processing tasks
out-of-the-box in various languages. The framework
also provides an easy interface to extend the
annotation to new tasks and/or languages. Unlike
some similar tools, such as UIMA (Ferrucci
and Lally, 2004) and GATE
        <xref ref-type="bibr" rid="ref8">(Cunningham et al., 2002)</xref>
        , CoreNLP is easy to use and requires only
basic object-oriented programming skills to
extend it. In Tint, we adopt this framework to: (i)
port the most common NLP tasks to Italian; (ii)
make the pipeline easily extendable, both by writing new
modules and by replacing existing ones with more
customized ones; and (iii) implement some new
annotators as wrappers for external tools, such as
entity linking, temporal expression identification and
keyword extraction.
      </p>
      <p>3 http://projects.csail.mit.edu/jmwe/
4 https://github.com/toliwa/CoreNLP-jMWE
5 http://mwetoolkit.sourceforge.net/PHITE.php</p>
    </sec>
    <sec id="sec-5">
      <title>4 Modules</title>
      <p>
        In this section, we present a set of Tint modules,
briefly describing those that were already included
in the first release
        <xref ref-type="bibr" rid="ref22 ref31 ref7">(Palmero Aprosio and Moretti, 2016)</xref>
        and focusing in more detail on the novel,
more recent ones. While the old modules
perform traditional NLP tasks (e.g. morphological
analysis), we have recently integrated components
for a more fine-grained linguistic analysis of
specific phenomena, such as affixation, the
identification of multi-word expressions, anglicisms and
euphonic “d”. These are the outcome of a larger
project involving FBK and the Institute for
Educational Research of the Province of Trento
        <xref ref-type="bibr" rid="ref28">(Sprugnoli et al., 2018)</xref>
        , aimed at using NLP
tools to study the evolution of Italian texts towards the
so-called neo-standard Italian
        <xref ref-type="bibr" rid="ref4">(Berruto, 2012)</xref>
        .
      </p>
      <sec id="sec-5-1">
        <title>4.1 Already existing modules</title>
        <p>
          As described in
          <xref ref-type="bibr" rid="ref22 ref31 ref7">(Palmero Aprosio and Moretti,
2016)</xref>
          , the Tint pipeline provides a set of
pre-installed modules for basic linguistic annotation:
tokenization, part-of-speech (POS) tagging,
morphological analysis, lemmatization, named
entity recognition and classification (NERC), and
dependency parsing.
        </p>
        <p>
          Among the modules, two have been
implemented from scratch and do not rely on the
components available in Stanford CoreNLP: the
tokenizer and the morphological analyser (see
below). POS tagging, dependency parsing and
NERC are performed using the existing modules
in CoreNLP: the first two are trained on the Italian Universal
Dependencies<sup>6</sup> (UD) dataset
          <xref ref-type="bibr" rid="ref5">(Bosco et al., 2013)</xref>
          ,
and NERC on I-CAB
          <xref ref-type="bibr" rid="ref14 ref17">(Magnini et al., 2006)</xref>
          .
        </p>
        <p>
          Additional modules include wrappers for
temporal expression extraction and classification with
HeidelTime
          <xref ref-type="bibr" rid="ref15 ref2 ref23 ref30 ref5 ref9">(Strötgen and Gertz, 2013)</xref>
          , keyword
extraction with Keyphrase Digger
          <xref ref-type="bibr" rid="ref19">(Moretti et al., 2015)</xref>
          , and entity linking using DBpedia
Spotlight<sup>7</sup>
          <xref ref-type="bibr" rid="ref9">(Daiber et al., 2013)</xref>
          and The Wiki Machine<sup>8</sup>
(Giuliano et al., 2009).
        </p>
        <p>
          Tokenizer: This module segments the text
into tokens and sentences. In a first step, the text
is coarsely tokenized. Then, in a second step,
tokens that need to be put together are merged
using two customizable lists of Italian non-breaking
abbreviations (such as “dott.” or “S.p.A.”) and
regular expressions (for e-mail addresses, web
URIs, numbers, dates). This second phase uses a trie
          <xref ref-type="bibr" rid="ref10">(De La Briandais, 1959)</xref>
          to speed up the process.
        </p>
        <p>6 http://universaldependencies.org/
7 http://bit.ly/dbpspotlight
8 http://bit.ly/thewikimachine</p>
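        <p>The two-step tokenization described above can be illustrated with a minimal Python sketch. It is not Tint's Java implementation: the coarse splitting rule, the nested-dictionary trie and the function names are assumptions made for illustration, and only the abbreviation list is covered (the regular-expression list is left out).</p>
        <preformat>
```python
import re


def coarse_tokenize(text):
    """First pass: split into runs of word characters and single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_trie(entries):
    """Build a nested-dict trie whose keys are coarse tokens."""
    root = {}
    for entry in entries:
        node = root
        for piece in coarse_tokenize(entry):
            node = node.setdefault(piece, {})
        node["$end"] = True  # marks the end of a complete entry
    return root


def merge_tokens(tokens, trie):
    """Second pass: merge the longest run of tokens that spells a trie entry."""
    merged, i = [], 0
    while i != len(tokens):
        node, j, best = trie, i, None
        while j != len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "$end" in node:
                best = j  # longest complete entry seen so far
        if best is None:
            merged.append(tokens[i])
            i += 1
        else:
            merged.append("".join(tokens[i:best]))
            i = best
    return merged


# Hypothetical abbreviation list, taken from the examples in the text.
ABBREVIATIONS = build_trie(["dott.", "S.p.A."])
```
        </preformat>
        <p>With this sketch, merge_tokens(coarse_tokenize("Il dott. Rossi lavora per una S.p.A. italiana"), ABBREVIATIONS) keeps “dott.” and “S.p.A.” as single tokens. Walking the trie one token at a time avoids testing every abbreviation against every position, which is the kind of speed-up a trie is typically used for.</p>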
        <p>
          Morphological Analyser: The morphological
analyser module provides the full list of
morphological features for each annotated token. The
current version of the module has been trained
using the Morph-it lexicon
          <xref ref-type="bibr" rid="ref35">(Zanchetta and Baroni, 2005)</xref>
          , but it is possible to extend or retrain it with
other Italian datasets. To ensure fast
performance, the model storage has been implemented
with the MapDB Java library<sup>9</sup>, which provides a
variation of Cassandra’s Sorted String
Table. To extend the coverage of the results,
especially for complex forms such as
“porta-cene” or “bi-direzionale”, the module tries to
decompose the token into prefix-root-infix-suffix and
to recognise the root form.
        </p>
        <p>See Section 5 for an extensive evaluation of the
modules.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2 New modules</title>
        <p>
          Affixes annotation: This module provides a
token-level annotation of derived words,
based on derIvaTario
          <xref ref-type="bibr" rid="ref31">(Talamo et al., 2016)</xref>
          .<sup>10</sup> The
resource was built by segmenting about 11,000 derivatives
into derivational cycles and annotating
them with a wide array of features. The
module uses this resource to segment a token
into root and affixes; for example, visione is
analysed as baseLemma=vedere, affix=zione and
allomorph=ione.
        </p>
      </sec>
      <sec id="sec-5-3">
        <p>Classification of verbal tenses: The part-of-speech
tagger and morphological analyzer released with
Tint can identify and classify verbs at token level,
but sometimes the modality, form and tense of a
verb result from a sequence of tokens, as in
compound tenses such as participio passato, or
passive verb forms. For this reason, we include in
Tint a new tense module to provide a more
complete annotation of multi-token verbal forms. The
module also supports the analysis of discontinuous
expressions, such as ho sempre mangiato.</p>
        <p>Text reuse: Detecting text reuse is useful when
we want to measure the overlap of a document
with a given corpus. This is needed in a number of
applications, for example plagiarism detection,
stylometry, authorship attribution and citation
analysis. Tint now includes a component for
this task, i.e. identifying parts of an input
text that overlap with a given corpus. First,
each sentence of the corpus is compared with the
sentences in the processed text using the
FuzzyWuzzy package<sup>11</sup>, a Java fuzzy string matching
implementation: this allows the system not to miss
expressions that differ slightly from
the texts in the original corpus. In this phase,
only long spans of text are considered, as the
probability of an incorrect fuzzy match
grows as the text length decreases.
A second step checks whether the overlap involves
the whole sentence and, if not, analyzes the two
texts and identifies the number of overlapping
tokens. Finally, the Stanford CoreNLP quote
annotator<sup>12</sup> is used to catch text reuse that appears between
quotes, ignoring the length limitation of the fuzzy
comparison.</p>
        <p>9 http://www.mapdb.org
10 http://derivatario.sns.it/</p>
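        <p>The first phase of the text reuse component (fuzzy comparison restricted to long spans) might look roughly like the following Python sketch. It uses difflib.SequenceMatcher from the Python standard library as a stand-in for the FuzzyWuzzy package used by Tint, and the two thresholds are hypothetical values chosen only for illustration.</p>
        <preformat>
```python
from difflib import SequenceMatcher

MIN_CHARS = 40    # hypothetical lower bound on the length of a span
MIN_RATIO = 0.85  # hypothetical similarity threshold


def find_reuse(text_sentences, corpus_sentences,
               min_chars=MIN_CHARS, min_ratio=MIN_RATIO):
    """Return (sentence, corpus sentence, similarity) for fuzzy matches."""
    matches = []
    for sent in text_sentences:
        if min_chars > len(sent):
            continue  # short spans give too many spurious fuzzy matches
        for source in corpus_sentences:
            matcher = SequenceMatcher(None, sent.lower(), source.lower())
            ratio = matcher.ratio()
            if ratio >= min_ratio:
                matches.append((sent, source, round(ratio, 2)))
    return matches
```
        </preformat>
        <p>Comparing every sentence pair is quadratic in the number of sentences; the sketch makes no attempt to reproduce the later steps (token-level overlap and quote detection).</p>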
        <p>Readability: In this module, we compute some
metrics that can be useful to assess the readability
of a text, partially inspired by Dell’Orletta et al.
(2011) and Tonelli et al. (2012). In particular, we
include the following indices:</p>
        <list list-type="bullet">
          <list-item><p>Number of content words, hyphens (using the
iText Java Library<sup>13</sup>), sentences having fewer
than a fixed number of words, and distribution of
tokens based on part-of-speech.</p></list-item>
          <list-item><p>Type-token ratio (TTR), i.e. the ratio between
the number of different lemmas and the
number of tokens; a high TTR indicates a high
degree of lexical variation.</p></list-item>
          <list-item><p>Lexical density, i.e. the number of content
words divided by the total number of words.</p></list-item>
          <list-item><p>Amount of coordinate and subordinate
clauses, along with the ratio between them.</p></list-item>
          <list-item><p>Depth of the parse tree for each sentence:
both average and maximum depth are calculated on
the whole text.</p></list-item>
          <list-item><p>
            Gulpease formula
            <xref ref-type="bibr" rid="ref13">(Lucisano and Piemontese, 1988)</xref>
            to measure the readability at document
level.
          </p></list-item>
          <list-item><p>Text difficulty based on word lists from De
Mauro’s Dictionary of Basic Italian<sup>14</sup>.</p></list-item>
        </list>
        <p>11 https://github.com/xdrop/fuzzywuzzy
12 https://stanfordnlp.github.io/CoreNLP/quote.html
13 https://github.com/itext/itextpdf</p>
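        <p>Three of the indices listed above can be computed as in the following Python sketch. The input format (sentences as lists of (form, lemma, POS) triples with Universal Dependencies tags), the set of content-word tags and the exclusion of punctuation from the word count are assumptions of this example, not a description of Tint's implementation. The Gulpease index follows the commonly published formulation 89 + (300 × sentences − 10 × letters) / words.</p>
        <preformat>
```python
# Assumed definition of "content word" in terms of UD part-of-speech tags.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def readability(sentences):
    """Compute TTR, lexical density and Gulpease for a tagged text.

    `sentences` is a list of sentences, each a list of
    (form, lemma, pos) triples; punctuation is tagged "PUNCT".
    """
    words = [tok for sent in sentences for tok in sent if tok[2] != "PUNCT"]
    n_words = len(words)
    n_letters = sum(sum(ch.isalpha() for ch in form) for form, _, _ in words)
    lemmas = {lemma.lower() for _, lemma, _ in words}
    n_content = sum(1 for _, _, pos in words if pos in CONTENT_POS)
    return {
        # different lemmas / tokens
        "ttr": len(lemmas) / n_words,
        # content words / all words
        "lexical_density": n_content / n_words,
        # higher values indicate easier texts
        "gulpease": 89 + (300 * len(sentences) - 10 * n_letters) / n_words,
    }
```
        </preformat>
        <p>Because Gulpease relies on letters rather than syllables, it needs no hyphenation step, which is one reason it is commonly used for Italian.</p>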
        <p>
          Multi-word expressions: A specific
multi-token annotator has been implemented to
recognize more than 13,450 multi-word expressions, the
so-called ‘polirematiche’
          <xref ref-type="bibr" rid="ref34">(Voghera, 2004)</xref>
          ,
manually collected from various online resources. The
list includes verbal, nominal, adjectival and
prepositional expressions (e.g. lasciar perdere, società
per azioni, nei confronti di, mezzo morto). This
annotator can also identify discontinuous
multi-word expressions. For example, the expression andare a
genio (an Italian phrase meaning “to like”) can include an
adverb, as in andare troppo a genio.
Similarly, nouns and adjectives can occur inside such phrases
(e.g. lasciare Antonio a piedi, where
lasciare a piedi is an Italian multi-word expression meaning “to leave
stranded”).
        </p>
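        <p>Matching a discontinuous multi-word expression can be sketched as a search over lemmas that tolerates a few intervening tokens. The Python function below is only an illustration under simplifying assumptions: the max_gap parameter is hypothetical, and the part of speech of the intervening tokens is ignored, whereas the text above suggests that only certain categories (adverbs, nouns, adjectives) may be inserted.</p>
        <preformat>
```python
def find_mwe(lemmas, expression, max_gap=2):
    """Return tuples of token indices where `expression` occurs.

    `lemmas` is the lemmatized sentence, `expression` the lemmas of the
    multi-word expression; up to `max_gap` extra tokens are allowed
    between two consecutive parts of the expression.
    """
    hits = []
    for start, lemma in enumerate(lemmas):
        if lemma != expression[0]:
            continue
        positions, cursor = [start], start + 1
        for part in expression[1:]:
            window = lemmas[cursor:cursor + max_gap + 1]
            if part not in window:
                positions = None  # this candidate match fails
                break
            cursor += window.index(part) + 1
            positions.append(cursor - 1)
        if positions:
            hits.append(tuple(positions))
    return hits
```
        </preformat>
        <p>For instance, find_mwe(["andare", "troppo", "a", "genio"], ["andare", "a", "genio"]) returns [(0, 2, 3)], while setting max_gap=0 only accepts the contiguous form.</p>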
        <p>Anglicisms: A list of more than 2,500
anglicisms, collected from the web, is included in the
latest release of Tint, and a dedicated annotator
identifies them in the text and distinguishes between
adapted (“chattare”, “skillato”) and non-adapted
anglicisms (“spread”, “leadership”). This module
can therefore be used to track the use of borrowings
from English in Italian texts, a phenomenon much
debated in the media and among scholars (Fanfani,
1996; Furiassi, 2008).</p>
        <p>Euphonic “D”: For euphonic reasons, the
preposition a and the conjunctions e and o usually
become ad, ed, od when the following word
begins with a, e, o respectively. While traditionally
the ‘d’ was added before every vowel, a more recent
grammatical rule has established that the euphonic
‘d’ should be limited to cases in which it is
followed by the same vowel, for example ed ecco vs.
e ancora<sup>15</sup>. Tint provides an annotator that
identifies this phenomenon and classifies each instance
as correct, if it follows the aforementioned rule, or
incorrect in all other cases.</p>
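        <p>The classification rule for the euphonic ‘d’ can be expressed in a few lines. The following Python sketch is an illustration, not Tint's annotator: it assumes a pre-tokenized sentence, and it strips accents before comparing vowels (so that ed è counts as the same vowel), which is a choice of this example rather than something stated above.</p>
        <preformat>
```python
import unicodedata

# Euphonic forms and the vowel they are expected to precede.
EUPHONIC = {"ad": "a", "ed": "e", "od": "o"}


def classify_euphonic_d(tokens):
    """Label each ad/ed/od as 'correct' or 'incorrect'.

    An instance is 'correct' when the next token begins with the same
    vowel as the euphonic form, 'incorrect' otherwise.
    """
    results = []
    for i, token in enumerate(tokens[:-1]):
        vowel = EUPHONIC.get(token.lower())
        if vowel is None:
            continue
        following = tokens[i + 1].lower()
        # NFD splits an accented letter into base letter + combining mark.
        first = unicodedata.normalize("NFD", following)[:1]
        label = "correct" if first == vowel else "incorrect"
        results.append((i, token, tokens[i + 1], label))
    return results
```
        </preformat>
        <p>On the tokens of “Ed ecco arrivare Marco ed Anna ad Atene”, the sketch labels Ed ecco and ad Atene as correct and ed Anna as incorrect.</p>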
        <p>Corpus statistics: A collection of CoreNLP
annotators has been developed to extract statistics
that can be used, for instance, to analyse traits of
interest in texts. More specifically, the provided
modules can mark and count words and
sentences based on token, lemma, part-of-speech and
word position in the sentence.</p>
        <p>14 http://bit.ly/nuovo-demauro
15 http://bit.ly/crusca-d-eufonica</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 Evaluation</title>
      <p>Tint includes a rich set of tools, which are evaluated
separately. In some cases, an accuracy-based
evaluation is not possible, because no gold standard
is available or because the output of the tool
is not comparable with that of other tools.</p>
      <p>
        When possible, Tint is compared with existing
pipelines that work with the Italian language: Tanl
        <xref ref-type="bibr" rid="ref1">(Attardi et al., 2010)</xref>
        , TextPro
        <xref ref-type="bibr" rid="ref25">(Pianta et al., 2008)</xref>
        and TreeTagger
        <xref ref-type="bibr" rid="ref27">(Schmid, 1994)</xref>
        .
      </p>
      <p>To measure speed, we ran each experiment
10 times and took the average execution time.
Multi-thread capabilities, when available, were
disabled. All experiments were
run on a 2.3 GHz Intel Core i7 with 16 GB of
memory.</p>
      <p>Tanl is not available as a
downloadable package and can only be used online through a
REST API; its speed may therefore be influenced
by the network connection.</p>
      <p>No evaluation is performed for the Tint
annotators that act as wrappers for external tools
(temporal expression tagging, entity linking, keyword
extraction).</p>
      <sec id="sec-6-1">
        <title>5.1 Tokenization and sentence splitting</title>
        <p>
          For the task of tokenization and sentence splitting,
Tint outperforms both TextPro and Tanl in speed
(see Table 1).
        </p>
        <p>
          The evaluation of part-of-speech tagging is
performed against the test set included in the UD
dataset, containing 10K tokens. As the tagset
differs across tools, the accuracy is
calculated only on five coarse-grained types: nouns
(N), verbs (V), adverbs (B), adjectives (A) and
other (O). Table 2 shows the results.
        </p>
        <p>
          Lemmatization is evaluated, also in terms of execution
time, on the UD test set. When the lemma
is guessed starting from a morphological analysis
(as in Tint and TextPro), the speed is
calculated by including both tasks. Table 3 shows the
results. All the tools reach the same accuracy of
96% (with minor differences that are not
statistically significant).
        </p>
        <p>
          For Named Entity Recognition, we evaluate and
compare our system using the test set available in
the I-CAB dataset. We consider three classes:
PER, ORG, LOC. In training Tint, we extracted
a list of persons, locations and organizations by
querying the Airpedia database
          <xref ref-type="bibr" rid="ref23">(Palmero Aprosio et al., 2013)</xref>
          for Wikipedia pages classified as
Person, Place and Organisation,
respectively. Table 4 shows the results of the named
entity recognition task.
        </p>
        <p>
          The dependency parser is
compared with Tanl
          <xref ref-type="bibr" rid="ref2">(Attardi et al., 2013)</xref>
          and
TextPro
          <xref ref-type="bibr" rid="ref23">(Lavelli, 2013)</xref>
          using the usual metrics,
Labeled Attachment Score (LAS) and Unlabeled
Attachment Score (UAS). Table 5 shows the results:
the Tint evaluation has been performed on the UD
test data, while LAS and UAS for TextPro and Tanl are
taken directly from the Evalita 2011 proceedings
          <xref ref-type="bibr" rid="ref15">(Magnini et al., 2013)</xref>
          .
        </p>
        <p>
          The Tint pipeline is released as open-source
software under the GNU General Public License
(GPL), version 3. It can be downloaded from the Tint
website<sup>17</sup> as a standalone package, or it can be
integrated into an existing application as a Maven
dependency. The source code is available on
GitHub.<sup>18</sup>
        </p>
        <p>The tool is written following the Stanford CoreNLP
paradigm, therefore third-party software can be
integrated easily into the pipeline.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusions and Future Work</title>
      <p>In this paper, we presented the new release of Tint,
a simple, fast and accurate NLP pipeline for
Italian, based on Stanford CoreNLP. In the new
version, we have fixed some bugs and improved some
of the existing modules. We have also added a set
of components for fine-grained linguistic analysis
that were not available so far.</p>
      <p>
        In the future, we plan to improve the suite and
extend it with additional modules, also based on
user feedback collected through the GitHub
project page. We are currently working on new
modules, in particular Word Sense
Disambiguation (WSD) based on linguistic resources such as
MultiWordNet
        <xref ref-type="bibr" rid="ref24">(Pianta et al., 2002)</xref>
        and
Semantic Role Labelling, by porting to Italian resources
such as FrameNet
        <xref ref-type="bibr" rid="ref3">(Baker et al., 1998)</xref>
        , now
available only in English.
      </p>
      <p>
        The Tint pipeline will also be integrated in
PIKES
        <xref ref-type="bibr" rid="ref7">(Corcoglioniti et al., 2016)</xref>
        , a tool that
extracts knowledge from English texts using NLP
and outputs it in a queryable form (such as RDF
triples), so as to extend it to Italian.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The research leading to this paper was partially
supported by the EU Horizon 2020 Programme via
the SIMPATICO Project (H2020-EURO-6-2015,
n. 692819).</p>
      <p>17 http://tint.fbk.eu/
18 https://github.com/dhfbk/tint/</p>
      <p>Felice Dell’Orletta, Giulia Venturi, Andrea Cimino, and
Simonetta Montemagni. 2014. T2K²: a
system for automatically extracting and organizing
knowledge from texts. In Proceedings of the Ninth
International Conference on Language Resources
and Evaluation (LREC-2014).</p>
      <p>Emanuele Pianta, Christian Girardi, and Roberto
Zanoli. 2008. The TextPro tool suite. In Proceedings
of the Sixth International Conference on Language
Resources and Evaluation (LREC’08), Marrakech,
Morocco. European Language Resources
Association (ELRA).</p>
      <p>Massimo Fanfani. 1996. Sugli anglicismi nell’italiano
contemporaneo (XIV). Lingua nostra, 57(2):72–91.</p>
      <p>David Ferrucci and Adam Lally. 2004. UIMA: An
architectural approach to unstructured information
processing in the corporate research environment.
Nat. Lang. Eng., 10(3-4):327–348, September.</p>
      <p>Cristiano Furiassi. 2008. Non-adapted Anglicisms
in Italian: Attitudes, frequency counts, and
lexicographic implications. Cambridge Scholars
Publishing.</p>
      <p>Claudio Giuliano, Alfio Massimiliano Gliozzo, and
Carlo Strapparava. 2009. Kernel methods for
minimally supervised WSD. Comput. Linguist.,
35(4):513–528, December.</p>
      <p>Nidhi Kulkarni and Mark Alan Finlayson. 2011.
jMWE: A Java toolkit for detecting multi-word
expressions. In Proceedings of the Workshop on
Multiword Expressions: from Parsing and Generation
to the Real World, pages 122–124. Association for
Computational Linguistics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Attardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Dei</given-names>
            <surname>Rossi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Simi</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The Tanl Pipeline</article-title>
          .
          <source>In Proc. of LREC Workshop on WSPP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Attardi</surname>
          </string-name>
          , Maria Simi, and
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Zanelli</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Tuning DeSR for dependency parsing of Italian</article-title>
          .
          <source>In Evaluation of Natural Language and Speech Tools for Italian</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>45</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Collin F.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Charles J.</given-names>
            <surname>Fillmore</surname>
          </string-name>
          , and John B. Lowe.
          <year>1998</year>
          .
          <article-title>The Berkeley FrameNet project</article-title>
          .
          <source>In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1</source>
          , pages
          <fpage>86</fpage>
          -
          <lpage>90</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Gaetano</given-names>
            <surname>Berruto</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Sociolinguistica dell'italiano contemporaneo</article-title>
          .
          <source>Carocci.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Simonetta Montemagni, and
          <string-name>
            <given-names>Maria</given-names>
            <surname>Simi</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Converting Italian treebanks: Towards an Italian Stanford dependency treebank</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Paul</given-names>
            <surname>Clough</surname>
          </string-name>
          , Robert Gaizauskas, Scott S. L. Piao, and Yorick Wilks.
          <year>2002</year>
          .
          <article-title>METER: Measuring text reuse</article-title>
          .
          <source>In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics</source>
          , pages
          <fpage>152</fpage>
          -
          <lpage>159</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Corcoglioniti</surname>
          </string-name>
          , Marco Rospocher, and Alessio Palmero Aprosio.
          <year>2016</year>
          .
          <article-title>A 2-phase frame-based knowledge extraction framework</article-title>
          .
          <source>In Proc. of ACM Symposium on Applied Computing (SAC'16).</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Hamish</given-names>
            <surname>Cunningham</surname>
          </string-name>
          , Diana Maynard, Kalina Bontcheva, and
          <string-name>
            <given-names>Valentin</given-names>
            <surname>Tablan</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>GATE: An architecture for development of robust HLT applications</article-title>
          .
          <source>In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02</source>
          , pages
          <fpage>168</fpage>
          -
          <lpage>175</lpage>
          , Stroudsburg, PA, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Joachim</given-names>
            <surname>Daiber</surname>
          </string-name>
          , Max Jakob, Chris Hokamp, and
          <string-name>
            <given-names>Pablo N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Improving efficiency and accuracy in multilingual entity extraction</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics).</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Rene</given-names>
            <surname>De La Briandais</surname>
          </string-name>
          .
          <year>1959</year>
          .
          <article-title>File searching using variable length keys</article-title>
          .
          <source>In Papers Presented at the March 3-5, 1959, Western Joint Computer Conference, IRE-AIEE-ACM '59 (Western)</source>
          , pages
          <fpage>295</fpage>
          -
          <lpage>298</lpage>
          , New York, NY, USA. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>READ-IT: Assessing readability of Italian texts with a view to text simplification</article-title>
          .
          <source>In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, SLPAT '11</source>
          , pages
          <fpage>73</fpage>
          -
          <lpage>83</lpage>
          , Stroudsburg, PA, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          , Giulia Venturi, Andrea Cimino, and
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>T2K^2: a</article-title>
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Lavelli</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>An ensemble model for the EVALITA 2011 dependency parsing task</article-title>
          .
          <source>In Evaluation of Natural Language and Speech Tools for Italian</source>
          , pages
          <fpage>30</fpage>
          -
          <lpage>36</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Pietro</given-names>
            <surname>Lucisano</surname>
          </string-name>
          and Maria Emanuela Piemontese.
          <year>1988</year>
          .
          <article-title>GULPEASE: una formula per la predizione della difficoltà dei testi in lingua italiana</article-title>
          .
          <source>Scuola e città</source>
          ,
          <volume>3</volume>
          (
          <issue>31</issue>
          ):
          <fpage>110</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          , Emanuele Pianta, Christian Girardi, Matteo Negri, Lorenza Romano, Manuela Speranza, Valentina Bartalesi Lenzi, and
          <string-name>
            <given-names>Rachele</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>I-CAB: the Italian Content Annotation Bank</article-title>
          .
          <source>In Proceedings of LREC</source>
          , pages
          <fpage>963</fpage>
          -
          <lpage>968</lpage>
          . Citeseer.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          , Francesco Cutugno, Mauro Falcone, and
          <string-name>
            <given-names>Emanuele</given-names>
            <surname>Pianta</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <source>Evaluation of Natural Language and Speech Tools for Italian: International Workshop, EVALITA 2011, Rome, January 24-25, 2012, Revised Selected Papers</source>
          , volume
          <volume>7689</volume>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and
          <string-name>
            <given-names>David</given-names>
            <surname>McClosky</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In ACL (System Demonstrations)</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          , Courtney Corley,
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Strapparava</surname>
          </string-name>
          , et al.
          <year>2006</year>
          .
          <article-title>Corpus-based and knowledge-based measures of text semantic similarity</article-title>
          .
          <source>In AAAI</source>
          , volume
          <volume>6</volume>
          , pages
          <fpage>775</fpage>
          -
          <lpage>780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          . In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          . Curran Associates, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Moretti</surname>
          </string-name>
          , Rachele Sprugnoli, and
          <string-name>
            <given-names>Sara</given-names>
            <surname>Tonelli</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Digging in the dirt: Extracting keyphrases from texts with KD</article-title>
          .
          <source>CLiC-it</source>
          , page
          <fpage>198</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Lincoln</given-names>
            <surname>Mullen</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>textreuse: Detect Text Reuse and Document Similarity</article-title>
          .
          <source>R package version 0.1.4.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Lluís</given-names>
            <surname>Padró</surname>
          </string-name>
          and
          <string-name>
            <given-names>Evgeny</given-names>
            <surname>Stanilovsky</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>
          .
          <source>In LREC2012.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>A. Palmero</given-names>
            <surname>Aprosio</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Moretti</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Italy goes to Stanford: a collection of CoreNLP modules for Italian</article-title>
          . ArXiv e-prints,
          <month>September</month>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Alessio</given-names>
            <surname>Palmero Aprosio</surname>
          </string-name>
          , Claudio Giuliano, and
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Lavelli</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Automatic expansion of DBpedia exploiting Wikipedia cross-language information</article-title>
          .
          <source>In Proceedings of the 10th Extended Semantic Web Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Emanuele</given-names>
            <surname>Pianta</surname>
          </string-name>
          , Luisa Bentivogli, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Girardi</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Developing an aligned multilingual database</article-title>
          .
          <source>In Proc. 1st Intl Conference on Global WordNet. Citeseer.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Emanuele</given-names>
            <surname>Pianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Girardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Zanoli</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The TextPro tool suite</article-title>
          .
          <source>In LREC. Citeseer.</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Ramisch</surname>
          </string-name>
          , Aline Villavicencio, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Boitet</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Multiword expressions in the wild?: the mwetoolkit comes in handy</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations</source>
          , pages
          <fpage>57</fpage>
          -
          <lpage>60</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Probabilistic part-of-speech tagging using decision trees.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Rachele</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , Sara Tonelli, Alessio Palmero Aprosio, and
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Moretti</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Analysing the evolution of students' writing skills and the impact of neo-standard Italian with the help of computational linguistics</article-title>
          .
          <source>In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2018)</source>
          , Torino, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Milan</given-names>
            <surname>Straka</surname>
          </string-name>
          and Jana Straková.
          <year>2017</year>
          .
          <article-title>Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe</article-title>
          .
          <source>In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</source>
          , pages
          <fpage>88</fpage>
          -
          <lpage>99</lpage>
          , Vancouver, Canada. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Jannik</given-names>
            <surname>Strötgen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gertz</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Multilingual and cross-domain temporal tagging</article-title>
          .
          <source>Language Resources and Evaluation</source>
          ,
          <volume>47</volume>
          (
          <issue>2</issue>
          ):
          <fpage>269</fpage>
          -
          <lpage>298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Luigi</given-names>
            <surname>Talamo</surname>
          </string-name>
          , Chiara Celata, and Pier Marco Bertinetto.
          <year>2016</year>
          .
          <article-title>DerIvaTario: An annotated lexicon of Italian derivatives</article-title>
          .
          <source>Word Structure</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>72</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Tamburini</surname>
          </string-name>
          and
          <string-name>
            <given-names>Matias</given-names>
            <surname>Melandri</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>AnIta: a powerful morphological analyser for Italian</article-title>
          .
          In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors,
          <source>Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          , Istanbul, Turkey. European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Sara</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          Ke Tran Manh, and Emanuele Pianta
          .
          <year>2012</year>
          .
          <article-title>Making readability indices readable</article-title>
          .
          <source>In Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations</source>
          , pages
          <fpage>40</fpage>
          -
          <lpage>48</lpage>
          , Montréal, Canada, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>Miriam</given-names>
            <surname>Voghera</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Polirematiche</article-title>
          .
          <source>La formazione delle parole in italiano</source>
          , pages
          <fpage>56</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>Eros</given-names>
            <surname>Zanchetta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Baroni</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Morph-it! A free corpus-based morphological resource for the Italian language</article-title>
          .
          <source>Corpus Linguistics 2005</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>