=Paper= {{Paper |id=Vol-3015/161 |storemode=property |title=Tint, the Swiss-Army Tool for Natural Language Processing in Italian |pdfUrl=https://ceur-ws.org/Vol-3015/paper161.pdf |volume=Vol-3015 |authors=Alessio Palmero Aprosio |dblpUrl=https://dblp.org/rec/conf/aiia/Aprosio21 }} ==Tint, the Swiss-Army Tool for Natural Language Processing in Italian== https://ceur-ws.org/Vol-3015/paper161.pdf
             Tint, the Swiss-Army Tool
     for Natural Language Processing in Italian

                  Alessio Palmero Aprosio[0000−0002−1484−0882]

                      Fondazione Bruno Kessler, Trento, Italy
                                 aprosio@fbk.eu



      Abstract. In this paper we present the latest version of Tint, an open-
      source, fast and extendable Natural Language Processing suite for Italian
      based on Stanford CoreNLP. The new release includes a set of text pro-
      cessing components for fine-grained linguistic analysis, from tokenization
      to relation extraction, including part-of-speech tagging, morphological
      analysis, lemmatization, multi-word expression recognition, dependency
      parsing, named-entity recognition, keyword extraction, and much more.
      Tint is written in Java and freely distributed under the GPL license.
      Although some modules do not reach state-of-the-art performance, Tint
      achieves good accuracy across all tasks and can easily be used out of
      the box.

      Keywords: Natural Language Processing · Artificial Intelligence · Text
      Analysis · Readability




1   Introduction
In this paper, we present Tint, a suite of ready-to-use modules for Italian NLP.
Tint is free to use, open source, and can be downloaded and used out-of-the-box
(see Section 5). Compared to the previous versions [26, 27], the suite has been
enriched with several modules for fine-grained linguistic analysis that were not
available for Italian before (for example, constituency parsing, relation extrac-
tion, and temporal expression extraction). In addition, other modules have been
retrained on new data (named-entity recognition and dependency parsing).


2   Related work
Most of the linguistic pipelines freely available for download (such as Stanford
CoreNLP1 and OpenNLP2) are language independent and, even if they are not
  Copyright © 2021 for this paper by its authors. Use permitted under Creative
  Commons License Attribution 4.0 International (CC BY 4.0).
1 http://stanfordnlp.github.io/CoreNLP/
2 https://opennlp.apache.org/
2       A. Palmero Aprosio

available in Italian out-of-the-box, they can be trained on virtually any lan-
guage. Notable examples in this direction are UDPipe [32], SpaCy [18], and
Stanza [29], trainable pipelines that perform most of the common NLP tasks
and are available in almost all the languages included in the Universal Dependen-
cies project [20], and Freeling [25], a C++ library providing language analysis
functionalities for a variety of languages. There are also some other pipelines
for Italian, such as TextPro [14], and T2K [13], but none of them are released
as open source (and TextPro is the only one that can be downloaded and used
for free for research purposes).


3     Modules

The Tint software is built on top of Stanford CoreNLP, a framework that helps
users derive linguistic annotations for text. Unlike similar frameworks, such
as UIMA [15] and GATE [9], CoreNLP requires only basic object-oriented
programming skills to extend. The centerpiece of CoreNLP is the pipeline, which
takes in raw text, runs a series of NLP annotators on the text, and produces
a final set of annotations. CoreNLP supports six languages (Arabic, Chinese,
English, French, German, and Spanish). In Tint, we use the CoreNLP paradigm
and bring it to Italian.
    Some of the modules have been implemented from scratch and do not rely
on the components available in Stanford CoreNLP (for example, tokenization,
morphological analysis, and so on). Other tasks, such as POS tagging and
dependency parsing, are performed using the existing CoreNLP modules, trained
on Italian datasets available to the community. Finally, additional modules
include wrappers for existing tools written in Java or available through a
web API, such as keyword extraction and entity linking.
    We mark with an asterisk (*) the modules that have never been described in a
previous work, and with a dagger (†) the ones that were retrained or significantly
renovated w.r.t. the past articles.


3.1   Tokenizer and sentence splitter

This module provides text segmentation in tokens and sentences. First, the
text is coarsely tokenized. Then, in a second step, tokens that need to be put
together are merged using two customizable lists of Italian non-breaking abbre-
viations (such as “dott.” or “S.p.A.”) and regular expressions (for e-mail ad-
dresses, web URIs, numbers, and emoticons). This second phase uses a trie [11]
to speed up the process.
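As an illustration, the merging pass can be sketched as follows. This is a toy re-implementation, not Tint's code: the abbreviation list, the e-mail regular expression, and the five-token lookahead window are simplified stand-ins for Tint's customizable resources.

```python
import re

# Toy stand-ins for Tint's customizable lists (the real ones are larger).
NON_BREAKING = {"dott.", "s.p.a.", "sig.", "prof."}
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

def merge_tokens(tokens):
    """Second tokenization pass: re-join coarse tokens that form a
    non-breaking abbreviation or match a regular expression."""
    merged, i = [], 0
    while i < len(tokens):
        # try the longest candidate first (greedy lookahead of 5 tokens)
        for j in range(min(len(tokens), i + 5), i, -1):
            candidate = "".join(tokens[i:j])
            if candidate.lower() in NON_BREAKING or EMAIL_RE.match(candidate):
                merged.append(candidate)
                i = j
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_tokens(["Il", "dott", ".", "Rossi"]))
# ['Il', 'dott.', 'Rossi']
```

In Tint the lookup is trie-based, so candidate prefixes are pruned in a single pass rather than re-joined and re-checked as in this sketch.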


3.2   Truecaser (*)

The truecase module recognizes the “true” case of tokens (i.e., how they would
be capitalized in well-edited text) when this information is lost, e.g., in all-
upper-case text.




Fig. 1. Tint modules chart. Squared boxes represent modules, and a yellow back-
ground indicates the use of machine learning algorithms. Green blocks indicate lin-
guistic resources. Word vectors are depicted in blue.


It is included in the CoreNLP original package,3 and relies on a discriminative
model using the CRF sequence tagger. The model shipped with Tint has been
trained on an Italian corpus (1.3 billion words) that includes texts from different
domains: legal, narrative, news, and so on [37].

3.3    Part-of-speech tagger
The part-of-speech annotation is provided through the Maximum Entropy im-
plementation [39] included in Stanford CoreNLP. The model is trained on the
Universal Dependencies (UD) dataset for Italian [6].

3.4    Token splitter (*)
The tokenizer (see Section 3.1) is able to split a text into words, but the basic
unit for a good morphological annotation is the syntactic word. This means
3 https://stanfordnlp.github.io/CoreNLP/truecase.html

that we systematically want to split off clitics, as in “dammelo” (verb “dà”,
plus pronouns “me”, and “lo”), and undo contractions, as in “alla”, that is “a”
(preposition) plus “la” (determiner). We refer to such cases as multiword tokens
because a single orthographic token corresponds to multiple (syntactic) words.
    The CoreNLP pipeline performs this task immediately after segmentation.
However, in Italian the tokenization alone is not enough to understand whether
a particular token needs to be split. For instance, depending on the context,
“delle” can be either a partitive article or a contraction of a preposition and
a determiner. Similarly, “porci” can be a noun or a verb plus clitic. We there-
fore wrote a new module for this purpose, using the information provided by
the POS module to disambiguate such cases.
    To ensure compatibility with previous versions of Tint, all the modules
that operate after part-of-speech tagging are configured to work either with
or without the splitter. Modules that require training a model (such as depen-
dency/constituency parsing and named-entity recognition) are trained in both
setups: the two models are included in the Tint distribution, and the desired
one can be activated in the configuration file.
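A minimal sketch of the POS-driven splitting logic follows. The expansion table and tags below are hypothetical and tiny; Tint derives the expansions from its lexical resources and the UD tagset, but the disambiguation principle is the same: split only when the contextual POS tag is compatible with the contracted reading.

```python
# Hypothetical expansion table; entries map a surface form to its
# syntactic words, each with the expected POS of the contracted reading.
CONTRACTIONS = {
    "alla":    [("a", "ADP"), ("la", "DET")],
    "delle":   [("di", "ADP"), ("le", "DET")],
    "dammelo": [("da'", "VERB"), ("me", "PRON"), ("lo", "PRON")],
}

def split_tokens(tagged):
    """Expand multiword tokens into syntactic words, keeping a token
    unsplit when its contextual POS rules the split out (e.g. the
    partitive reading of "delle", tagged DET)."""
    out = []
    for form, pos in tagged:
        parts = CONTRACTIONS.get(form.lower())
        # split only if the contextual POS matches the first syntactic word
        if parts and pos == parts[0][1]:
            out.extend(parts)
        else:
            out.append((form, pos))
    return out

print(split_tokens([("delle", "DET"), ("amiche", "NOUN")]))
# [('delle', 'DET'), ('amiche', 'NOUN')]  -- partitive: kept whole
print(split_tokens([("delle", "ADP"), ("case", "NOUN")]))
# [('di', 'ADP'), ('le', 'DET'), ('case', 'NOUN')]  -- contraction: split
```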


3.5   Morphological analyzer

The morphological analyzer module provides the full list of morphological fea-
tures for each annotated token. The current version of the module has been
trained using the Morph-it lexicon [41], but it is possible to extend or retrain
it with other Italian datasets. To extend the coverage of the results, especially
for complex forms such as “porta-ce-ne” or “bi-direzionale”, the module
decomposes the token into prefix-root-infix-suffix and tries to recognize the
root form.


3.6   Lemmatizer (†)

The lemmatization module is a rule-based system that combines the part-of-
speech output (Section 3.3) with the results of the morphological analyzer
(Section 3.5), so as to disambiguate the morphological features using the
grammatical annotation. To increase the accuracy of the results, the module
tries to detect the gender of noun lemmas by analyzing their preceding arti-
cles. For instance, for the correct lemmatization of “il latte/the milk”, the
module uses the singular article “il” to identify the correct gender/number of
the lemma and returns “latte/milk” (masculine, singular) instead of
“latta/metal sheet” (feminine, whose plural form is “latte”).
    In addition, we developed a morphological guesser that is activated whenever
a form cannot be linked to any lemma by the morphological analyzer (Sec-
tion 3.5). Starting from the form/lemma/POS triples in the Italian UD datasets,
we trained a statistical model using decision trees and probabilities derived
from the frequencies of suffixes in the UD data. For instance, given the non-
existent word “insalatando” tagged as a verb (probably meaning “eating salad”), the
guesser starts from the end of the form and, letter by letter, explores the tree of
possibilities until it reaches a result with reasonable accuracy.
    The guesser is active by default, but can be deactivated when needed. When
active, the guessed lemmas are tagged as such, so that the researcher (or the
tool calling Tint) can use this information.
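The suffix-based guessing strategy can be approximated as follows. The rewrite statistics here stand in for the decision tree, and the three training triples are illustrative only; the real model is trained on the full UD data.

```python
from collections import Counter, defaultdict

def train_guesser(triples, max_suffix=5):
    """For each (POS, form suffix), count the rewrite that maps the form
    ending to the lemma ending (how many characters to cut, what to add)."""
    stats = defaultdict(Counter)
    for form, lemma, pos in triples:
        # longest common prefix between form and lemma
        i = 0
        while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
            i += 1
        for k in range(1, min(max_suffix, len(form)) + 1):
            stats[(pos, form[-k:])][(len(form) - i, lemma[i:])] += 1
    return stats

def guess(stats, form, pos, max_suffix=5):
    """Walk from the longest known suffix down, apply the most frequent
    rewrite; fall back to the surface form when nothing matches."""
    for k in range(min(max_suffix, len(form)), 0, -1):
        rewrites = stats.get((pos, form[-k:]))
        if rewrites:
            cut, add = rewrites.most_common(1)[0][0]
            return form[:len(form) - cut] + add
    return form

train = [("mangiando", "mangiare", "VERB"),
         ("portando", "portare", "VERB"),
         ("cantando", "cantare", "VERB")]
stats = train_guesser(train)
print(guess(stats, "insalatando", "VERB"))
# prints: insalatare
```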


3.7     Verbal tenses classifier

The part-of-speech tagger and morphological analyzer released with Tint can
identify and classify verbs at token level, but sometimes the modality, form and
tense of a verb are the result of a sequence of tokens, as in compound tenses
such as the participio passato, or passive verb forms. For example, in Italian the
word siamo, taken as a single token, is the simple present form of the verb essere;
if we look at the surrounding words, we can have forms such as siamo andati
(present perfect of the verb andare, active) or siamo mangiati (simple present of
the verb mangiare, passive). For this reason, we include in Tint a tense module
that provides a more complete annotation of multi-token verbal forms. The
module also supports the analysis of discontinuous expressions, such as ho
sempre mangiato.
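A simplified sketch of this multi-token analysis follows. The tiny auxiliary, participle, and adverb tables are hypothetical hand-written stand-ins; Tint relies on the full morphological annotation instead.

```python
# Toy lexicon (hypothetical); Tint uses the morphological analyzer output.
AUX = {"ho": "avere", "hai": "avere", "siamo": "essere", "sono": "essere"}
PARTICIPLES = {"mangiato": "mangiare", "andati": "andare", "mangiati": "mangiare"}
ADVERBS = {"sempre", "già", "mai", "appena"}

def find_compound_tenses(tokens):
    """Return (span, main verb lemma, auxiliary lemma) for each compound
    form, skipping intervening adverbs so that discontinuous expressions
    such as "ho sempre mangiato" are captured as one verbal group."""
    results = []
    for i, tok in enumerate(tokens):
        if tok in AUX:
            j = i + 1
            while j < len(tokens) and tokens[j] in ADVERBS:
                j += 1  # the verbal group may be discontinuous
            if j < len(tokens) and tokens[j] in PARTICIPLES:
                results.append(((i, j), PARTICIPLES[tokens[j]], AUX[tok]))
    return results

print(find_compound_tenses(["ho", "sempre", "mangiato", "la", "pasta"]))
# [((0, 2), 'mangiare', 'avere')]
```

Distinguishing the active reading (siamo andati) from the passive one (siamo mangiati) additionally requires knowing the auxiliary selected by the main verb, which this sketch leaves out.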


3.8     Affixes annotator

This module provides a token-level annotation of word derivatives, based
on derIvaTario [35],4 a resource manually created to achieve high accuracy
and overcome errors coming from resources developed in a semi-automatic way
[42, 36]. The dataset was built by segmenting about 11,000 derivatives into
derivational cycles and annotating them with a wide array of features. The
module uses this resource to segment a token into root and affixes: for example,
visione is analysed as baseLemma=vedere, affix=zione and allomorph=ione.


3.9     Multi-word expressions extractor

A specific multi-token annotator has been implemented to recognize more than
13,450 multi-word expressions, the so-called ‘polirematiche’ [40], manually col-
lected from various online resources. The list includes verbal, nominal, adjectival
and prepositional expressions (e.g. lasciar perdere, società per azioni, nei con-
fronti di, mezzo morto). The annotator can also identify discontinuous multi-
word expressions. For example, the expression andare a genio (an Italian phrase
meaning “to like”) can include an adverb, as in andare troppo a genio. Similarly,
such phrases can contain nouns and adjectives (e.g. lasciare Antonio a piedi,
where lasciare a piedi is an Italian multiword expression for “leave stranded”).

4 http://derivatario.sns.it/

3.10   Named-entity recognizer (†)

The NER module recognizes persons, locations and organizations in the text. It
uses a CRF sequence tagger [16] included in Stanford CoreNLP and is trained
on KIND [23], a dataset containing around 340K words taken from Wikinews.
   To enhance the classification, Stanford NER also accepts gazetteers of names
labelled with the corresponding tag. We collected lists of persons, organizations
and locations from the Italian Wikipedia using the corresponding DBpedia [3]
classes: Person, Organisation, and Place. In addition, we collected the list
of streets from OpenStreetMap [22], limiting the extraction to Italian names.


3.11   Temporal expressions extractor and normalizer (*)

Since the first version of Tint, the temporal expression extraction task has been
provided as a wrapper around HeidelTime [33], a rule-based state-of-the-art
temporal tagger developed at Heidelberg University.
    The original English version of CoreNLP uses SUTime [7], a powerful library
for processing temporal expressions, built on top of TokensRegex, a framework
for defining regular expressions over text and tokens, and mapping matched text
to semantic objects. The current version of Tint uses SUTime with a new set
of rules written for Italian. It also normalizes the expressions according to the
TIMEX3 annotation standard. SUTime is generally run as a subcomponent of
the named-entity recognizer annotator (Section 3.10) and is active by default
(it can be disabled if not needed).
    Recognized temporal expressions can be resolved relative to the document
date. For instance, the expression “mercoledì scorso” (last Wednesday) will be
resolved to the Wednesday immediately preceding the document date, be it the
current date or any other date. The document date can be set when Tint is
launched; otherwise, the current date and time are used.
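A minimal sketch of this kind of resolution, assuming the expression has already been recognized; the actual normalization is handled by the SUTime rules, which cover far more expression types.

```python
import datetime

WEEKDAYS = {"lunedì": 0, "martedì": 1, "mercoledì": 2, "giovedì": 3,
            "venerdì": 4, "sabato": 5, "domenica": 6}

def resolve_last_weekday(expr, document_date):
    """Resolve expressions like "mercoledì scorso" to the most recent
    occurrence of that weekday strictly before the document date."""
    name, _ = expr.split()  # e.g. ("mercoledì", "scorso")
    target = WEEKDAYS[name.lower()]
    # days back: 1..7, so the result is always before the document date
    delta = (document_date.weekday() - target - 1) % 7 + 1
    return document_date - datetime.timedelta(days=delta)

doc_date = datetime.date(2021, 11, 26)  # a Friday
print(resolve_last_weekday("mercoledì scorso", doc_date))
# prints: 2021-11-24
```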


3.12   Constituency parser (*)

A constituency parser is a program that works out the grammatical structure
of sentences: for instance, which groups of words go together (as “phrases”)
and which words are the subject or object of a verb. In Tint, this task is
performed by the shift-reduce parser module [43] included in Stanford CoreNLP.
    Data used for training is taken from both the Turin University Treebank [5]
and the Parallel TUT [30], whose licences allow use for research purposes.
    These treebanks cannot be used as-is, because multiword tokens (see
Section 3.4) are denoted by doubling the token. In addition, part-of-speech tags
and some constituency labels need to be replaced to make the dataset compatible
with the CoreNLP parser. Conversion rules for both tagsets are included in the
Tint release, together with a script to apply the conversion to the dataset.

3.13   Dependency parser (†)

This module provides syntactic analysis of the text using a transition-based
parser (included in Stanford CoreNLP) that produces typed dependency parses
of natural language sentences [8]. The parser is powered by a neural network
accepting word embedding inputs: the model is trained on the UD dataset, and
the word embeddings are built on the corpus described in Section 3.2.


3.14   Relation extraction

New regulations on transparency and the recent privacy policies force the
public administration (PA) to make its documents available, but also to limit
the diffusion of personal data. The relation extraction module represents a first
approach to the extraction of sensitive data from PA documents, in terms of
named entities and semantic relations among them.
   For this task, we rely on the Relation Extraction module [34] included in
Stanford CoreNLP. For this module to work, a relation must connect two entities.
For instance, address links a LOC entity representing an address to the person
or company the address belongs to, while birthDate and birthLoc link the date
and location of birth, respectively.




        Fig. 2. A screenshot of the relation extraction module online demo.

    Some entities are extracted using the named-entity recognizer (Section 3.10)
and the temporal expression extractor (Section 3.11). To deal with all the re-
quested relations, some additional entity types are manually added and anno-
tated using the TokensRegex CoreNLP module (see Section 3.11). Additional
entities include, for example, NUMBER for numbers (such as VAT numbers),
CF for the Italian “codice fiscale” (tax code), ROLE for personal and organisa-
tion roles, and so on.
    To train the relation extraction module, we use the REDIT dataset, con-
taining documents taken from the PA domain and manually annotated with 19
relations [24].

3.15     Text reuse
Detecting text reuse is useful when, given a document, we want to measure its
overlap with a reference corpus. This is needed in a number of applications, such
as plagiarism detection, stylometry, authorship attribution, and citation analy-
sis. Tint includes a component for this task, i.e. identifying parts of an input
text that overlap with a given corpus. First, each sentence of the corpus is
compared with the sentences of the processed text using the FuzzyWuzzy
package5, a Java fuzzy string matching implementation: this allows the system
not to miss expressions that differ slightly from the texts in the original corpus.
In this phase, only long spans of text are considered, as the probability of an
incorrect fuzzy match grows as the text length decreases. A second step checks
whether the overlap involves the whole sentence and, if not, analyzes the two
texts and identifies the number of overlapping tokens. Finally, the Stanford
CoreNLP quote annotator6 is used to catch text reuse enclosed in quotes,
ignoring the length limitation of the fuzzy comparison.
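The first, fuzzy pass can be sketched as follows; Python's difflib stands in here for the FuzzyWuzzy Java package, and the threshold and minimum length are illustrative values, not Tint's configuration.

```python
import difflib

def find_reused_sentences(corpus_sents, doc_sents, threshold=0.9, min_len=40):
    """Fuzzy-compare each corpus sentence with each document sentence;
    short spans are skipped because fuzzy matching is unreliable on them."""
    matches = []
    for c in corpus_sents:
        for d in doc_sents:
            if len(c) < min_len or len(d) < min_len:
                continue  # too short for a reliable fuzzy comparison
            score = difflib.SequenceMatcher(None, c.lower(), d.lower()).ratio()
            if score >= threshold:
                matches.append((c, d, round(score, 2)))
    return matches

corpus = ["Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"]
doc = ["nel mezzo del cammin di nostra vita mi ritrovai in una selva oscura",
       "che la diritta via era smarrita"]
print(find_reused_sentences(corpus, doc))
# one match: the first document sentence, despite the "per"/"in" change
```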

3.16     Readability and corpus statistics
In this module, we compute some metrics that can be useful to assess the read-
ability of a text, partially inspired by [12] and [38]. In particular, we include the
following indices:

 – Number of content words, hyphens (using the iText Java Library7), sentences
   with fewer than a fixed number of words, and distribution of tokens based
   on part-of-speech.
    – Type-token ratio (TTR), i.e. the ratio between the number of different lem-
      mas and the number of tokens; high TTR indicates a high degree of lexical
      variation.
    – Lexical density, i.e. the number of content words divided by the total number
      of words.
5 https://github.com/xdrop/fuzzywuzzy
6 https://stanfordnlp.github.io/CoreNLP/quote.html
7 https://github.com/itext/itextpdf

 – Amount of coordinate and subordinate clauses, along with the ratio between
   them.
 – Depth of the parse tree for each sentence: both average and max depth are
   calculated on the whole text.
 – Gulpease formula [19] to measure the readability at document level.
 – Text difficulty based on word lists from De Mauro’s Dictionary of Basic
   Italian8 .
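Three of the indices above are simple enough to sketch directly; the content-word tag set is an assumption based on UD part-of-speech tags.

```python
def type_token_ratio(lemmas):
    """TTR as defined above: distinct lemmas over total tokens."""
    return len(set(lemmas)) / len(lemmas)

def lexical_density(pos_tags, content={"NOUN", "VERB", "ADJ", "ADV"}):
    """Content words over total words (content tag set is an assumption)."""
    return sum(1 for p in pos_tags if p in content) / len(pos_tags)

def gulpease(n_sentences, n_words, n_letters):
    """Gulpease index [19], tuned for Italian: higher means easier."""
    return 89 + (300 * n_sentences - 10 * n_letters) / n_words

# toy text "Il gatto dorme.": 1 sentence, 3 words, 12 letters
print(round(gulpease(1, 3, 12), 1))
# prints: 149.0
```

On texts of realistic length the Gulpease index falls between 0 and 100; the toy example exceeds 100 only because the text is three words long.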
    In addition, a set of extractors described in [31], mainly targeting the neo-
standard Italian used by high school students (anglicisms, euphonic “D”, and
much more), is available out-of-the-box in Tint.
    Finally, a collection of CoreNLP annotators has been developed to extract
statistics that can be used, for instance, to analyse traits of interest in texts. More
specifically, the provided modules can mark and count words and sentences
based on token, lemma, part-of-speech, and word position in the sentence.

3.17    Keywords extraction
Keyword extraction in Tint is performed by Keyphrase Digger9 [21], a rule-based
system for keyphrase extraction. It combines statistical measures with linguistic
information given by part-of-speech patterns to identify and extract weighted
keyphrases from texts.
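This combination of POS patterns and statistics can be illustrated as follows; the pattern list and the length-weighted score are illustrative stand-ins, not KD's actual rules.

```python
from collections import Counter

# Hypothetical patterns; KD combines richer POS patterns with
# frequency-based statistical measures.
PATTERNS = [("NOUN",), ("ADJ", "NOUN"), ("NOUN", "ADJ"), ("NOUN", "ADP", "NOUN")]

def extract_keyphrases(tagged_tokens, top=3):
    """Collect n-grams whose POS sequence matches a pattern and rank them
    by frequency weighted by phrase length (longer phrases score higher)."""
    counts = Counter()
    tokens = [t for t, _ in tagged_tokens]
    tags = [p for _, p in tagged_tokens]
    for n in sorted({len(p) for p in PATTERNS}):
        for i in range(len(tokens) - n + 1):
            if tuple(tags[i:i + n]) in PATTERNS:
                counts[" ".join(tokens[i:i + n])] += n  # weight by length
    return [kp for kp, _ in counts.most_common(top)]

sent = [("estrazione", "NOUN"), ("di", "ADP"), ("parole", "NOUN"),
        ("chiave", "NOUN"), ("automatica", "ADJ")]
print(extract_keyphrases(sent))
# ['estrazione di parole', 'chiave automatica', 'estrazione']
```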

3.18    Entity linking
The entity linking task consists in disambiguating a word (or a set of words)
and linking it to a knowledge base (KB). The biggest (and most used) available
KB is Wikipedia, and almost every linking tool relies on it. The Tint pipeline
provides a wrapper annotator that can connect to DBpedia Spotlight10 [10] and
The Wiki Machine11 [2]. Both tools are distributed as open source software
and can be used by the annotator either as external services or through a local
installation.


4    Evaluation
Tint is a complex system that relies on a variety of modules interacting with
each other. For this reason, some single tasks do not reach state-of-the-art
accuracy. Nevertheless, Tint performs at a reasonable level in all the tasks it
covers.
    An accurate evaluation of most of its modules (especially the ones that use
machine learning techniques), including a comparison with other Italian NLP
tools, is available in the previous papers [26, 27, 24].
8 http://bit.ly/nuovo-demauro
9 https://dh.fbk.eu/2015/12/kd-keyphrase-digger/
10 https://www.dbpedia-spotlight.org/
11 https://bitbucket.org/fbk/airpedia/wiki/Tutorial

5    Tint distribution

The Tint pipeline is released as open source software under the GNU General
Public License (GPL), version 3. It can be downloaded from the Tint website12
as a standalone package, or it can be integrated into an existing application as
a Maven dependency. The source code is available on GitHub.13
    The tool is written using the Stanford CoreNLP paradigm, therefore third-
party software can easily be integrated into the pipeline.




                 Fig. 3. A screenshot of the Tintful web interface.


   Along with Tint, one can also try Tintful14 [17], an NLP annotation software
that can be used both to manually annotate texts and to fix mistakes made
by NLP pipelines (and, in particular, by Tint). Using a paradigm similar to
wiki-like systems, a user who notices a wrong annotation can easily fix it and
submit the corrected entry back to the tool developers. The Tint online demo,
linked from the project website, uses Tintful as its graphical interface and is
configured to show most of the modules described in this paper. Therefore, the
annotations provided by the machine learning modules that need to be trained
on annotated data (named-entity recognizer, part-of-speech tagger, dependency
parser) can be edited by the occasional user. The resulting annotations will be
manually checked by linguists and added to the next training session. Figure 3
shows the web interface of Tintful.
12 http://tint.fbk.eu/
13 https://github.com/dhfbk/tint/
14 https://github.com/dhfbk/tintful

6   Conclusions and Future Work

In this paper, we presented the latest release of Tint, a simple, fast and accurate
NLP pipeline for Italian, based on Stanford CoreNLP. In the new version, we
fixed some bugs and improved some of the existing modules. We also added a
set of components for fine-grained linguistic analysis that were not available
before.
     In the future, we plan to improve the suite and extend it with additional
modules, in particular Word Sense Disambiguation (WSD), based on linguistic
resources such as MultiWordNet [28], and Semantic Role Labelling, by porting
to Italian resources such as FrameNet [4], currently available only for English.
     We also plan to increase the accuracy of the trained modules (such as the
part-of-speech tagger and the named-entity recognizer) using deep learning tech-
niques and including a pretrained language model at different granularities
(words, characters) in the process [1].


References

 1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence
    labeling. In: Proceedings of the 27th International Conference on Computational
    Linguistics. pp. 1638–1649. Association for Computational Linguistics, Santa Fe,
    New Mexico, USA (Aug 2018)
 2. Aprosio, A.P., Giuliano, C.: The wiki machine: an open source software for entity
    linking and enrichment. ArXiv e-prints (2016)
 3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: Dbpedia: A
    nucleus for a web of open data. In: Aberer, K., Choi, K.S., Noy, N., Allemang, D.,
    Lee, K.I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber,
    G., Cudré-Mauroux, P. (eds.) The Semantic Web. pp. 722–735. Springer Berlin
    Heidelberg, Berlin, Heidelberg (2007)
 4. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The berkeley framenet project. In: Proceed-
    ings of the 36th Annual Meeting of the Association for Computational Linguistics
    and 17th International Conference on Computational Linguistics-Volume 1. pp.
    86–90. Association for Computational Linguistics (1998)
 5. Bosco, C., Lombardo, V., Vassallo, D., Lesmo, L.: Building a treebank for Italian:
    a data-driven annotation schema. In: Proceedings of the Second International Con-
    ference on Language Resources and Evaluation (LREC’00). European Language
    Resources Association (ELRA), Athens, Greece (May 2000), http://www.lrec-
    conf.org/proceedings/lrec2000/pdf/220.pdf
 6. Bosco, C., Montemagni, S., Simi, M.: Converting Italian treebanks: Towards
    an Italian Stanford dependency treebank. In: Proceedings of the 7th Lin-
    guistic Annotation Workshop and Interoperability with Discourse. pp. 61–
    69. Association for Computational Linguistics, Sofia, Bulgaria (Aug 2013),
    https://aclanthology.org/W13-2308
 7. Chang, A.X., Manning, C.: SUTime: A library for recognizing and normalizing
    time expressions. In: Proceedings of the Eighth International Conference on Lan-
    guage Resources and Evaluation (LREC’12). pp. 3735–3740. European Language
    Resources Association (ELRA), Istanbul, Turkey (May 2012)

 8. Chen, D., Manning, C.D.: A fast and accurate dependency parser using neural
    networks. In: EMNLP. pp. 740–750 (2014)
 9. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: Gate: An
    architecture for development of robust hlt applications. In: Proceedings
    of the 40th Annual Meeting on Association for Computational Lin-
    guistics. pp. 168–175. ACL ’02, Association for Computational Linguis-
    tics, Stroudsburg, PA, USA (2002). https://doi.org/10.3115/1073083.1073112,
    http://dx.doi.org/10.3115/1073083.1073112
10. Daiber, J., Jakob, M., Hokamp, C., Mendes, P.N.: Improving efficiency and ac-
    curacy in multilingual entity extraction. In: Proceedings of the 9th International
    Conference on Semantic Systems (I-Semantics) (2013)
11. De La Briandais, R.: File searching using variable length keys. In:
    Papers Presented at the the March 3-5, 1959, Western Joint Com-
    puter Conference. pp. 295–298. IRE-AIEE-ACM ’59 (Western), ACM,
    New York, NY, USA (1959). https://doi.org/10.1145/1457838.1457895,
    http://doi.acm.org/10.1145/1457838.1457895
12. Dell’Orletta, F., Montemagni, S., Venturi, G.: Read-it: Assessing readability of
    italian texts with a view to text simplification. In: Proceedings of the Second
    Workshop on Speech and Language Processing for Assistive Technologies. pp. 73–
    83. SLPAT ’11, Association for Computational Linguistics, Stroudsburg, PA, USA
    (2011), http://dl.acm.org/citation.cfm?id=2140499.2140511
13. Dell’Orletta, F., Venturi, G., Cimino, A., Montemagni, S.: T2K^2: a system for au-
    tomatically extracting and organizing knowledge from texts. In: Proceedings of the
    Ninth International Conference on Language Resources and Evaluation (LREC-
    2014) (2014)
14. Pianta, E., Girardi, C., Zanoli, R.: The TextPro tool suite. In: Calzolari, N.,
    Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Tapias, D. (eds.)
    Proceedings of the Sixth International Conference on Language Resources and
    Evaluation (LREC’08). European Language Resources Association (ELRA), Mar-
    rakech, Morocco (2008)
15. Ferrucci, D., Lally, A.: Uima: An architectural approach to unstructured
    information processing in the corporate research environment. Nat. Lang.
    Eng. 10(3-4), 327–348 (Sep 2004). https://doi.org/10.1017/S1351324904003523,
    http://dx.doi.org/10.1017/S1351324904003523
16. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into
    information extraction systems by Gibbs sampling. In: Proceedings of the 43rd An-
    nual Meeting of the Association for Computational Linguistics (ACL’05). pp. 363–
    370. Association for Computational Linguistics, Ann Arbor, Michigan (Jun 2005).
    https://doi.org/10.3115/1219840.1219885, https://aclanthology.org/P05-1045
17. Frasnelli, V., Bocchi, L., Palmero Aprosio, A.: Erase and rewind: Man-
    ual correction of NLP output through a web interface. In: Proceedings of
    the 59th Annual Meeting of the Association for Computational Linguis-
    tics and the 11th International Joint Conference on Natural Language Pro-
    cessing: System Demonstrations. pp. 107–113. Association for Computational
    Linguistics, Online (Aug 2021). https://doi.org/10.18653/v1/2021.acl-demo.13,
    https://aclanthology.org/2021.acl-demo.13
18. Honnibal, M., Johnson, M.: An improved non-monotonic transition sys-
    tem for dependency parsing. In: Proceedings of the 2015 Conference on
    Empirical Methods in Natural Language Processing. pp. 1373–1378. Asso-
    ciation for Computational Linguistics, Lisbon, Portugal (September 2015),
    https://aclweb.org/anthology/D/D15/D15-1162
19. Lucisano, P., Piemontese, M.E.: GULPEASE: una formula per la predizione della
    difficoltà dei testi in lingua italiana. Scuola e città 3(31), 110–124 (1988)
20. de Marneffe, M.C., Manning, C.D., Nivre, J., Zeman, D.: Universal Dependencies.
    Computational Linguistics 47(2), 255–308 (07 2021)
21. Moretti, G., Sprugnoli, R., Tonelli, S.: Digging in the dirt: Extracting keyphrases
    from texts with kd. CLiC it p. 198 (2015)
22. OpenStreetMap contributors: Planet dump retrieved from https://planet.osm.org
    . https://www.openstreetmap.org (2017)
23. Paccosi, T., Palmero Aprosio, A.: KIND: an Italian Multi-Domain Dataset for
    Named Entity Recognition. In: arXiv preprint (2021)
24. Paccosi, T., Palmero Aprosio, A.: REDIT: a Tool and Dataset for Extraction of
    Personal Data in Documents of the Public Administration Domain. In: CLiC-it
    2021 Italian Conference on Computational Linguistics (2021)
25. Padró, L., Stanilovsky, E.: Freeling 3.0: Towards wider multilinguality. In:
    LREC2012 (2012)
26. Palmero Aprosio, A., Moretti, G.: Italy goes to Stanford: a collection of CoreNLP
    modules for Italian. ArXiv e-prints (Sep 2016)
27. Palmero Aprosio, A., Moretti, G.: Tint 2.0: an all-inclusive suite for nlp in italian.
    In: Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-
    it. vol. 10, p. 12 (2018)
28. Pianta, E., Bentivogli, L., Girardi, C.: Developing an aligned multilingual database.
    In: Proc. 1st Int’l Conference on Global WordNet. Citeseer (2002)
29. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: A Python natural
    language processing toolkit for many human languages. In: Proceedings of the
    58th Annual Meeting of the Association for Computational Linguistics: System
    Demonstrations (2020), https://nlp.stanford.edu/pubs/qi2020stanza.pdf
30. Sanguinetti, M., Bosco, C.: PartTUT: The Turin University Parallel Treebank, pp.
    51–69. Springer International Publishing, Cham (2015)
31. Sprugnoli, R., Tonelli, S., Aprosio, A.P., Moretti, G.: Analysing the evolution of
    students’ writing skills and the impact of neo-standard italian with the help of
    computational linguistics. In: Proceedings of the Sixth Italian Conference on Com-
    putational Linguistics (CLiC-it 2018). Torino, Italy (2018)
32. Straka, M., Straková, J.: Tokenizing, pos tagging, lemmatizing and parsing ud
    2.0 with udpipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual
    Parsing from Raw Text to Universal Dependencies. pp. 88–99. Association for
    Computational Linguistics, Vancouver, Canada (2017)
33. Strötgen, J., Armiti, A., Van Canh, T., Zell, J., Gertz, M.: Time for more languages:
    Temporal tagging of arabic, italian, spanish, and vietnamese. ACM Transactions
    on Asian Language Information Processing (TALIP) 13(1), 1–21 (2014)
34. Surdeanu, M., McClosky, D., Smith, M., Gusev, A., Manning, C.: Customiz-
    ing an information extraction system to a new domain. In: Proceedings of
    the ACL 2011 Workshop on Relational Models of Semantics. pp. 2–10. As-
    sociation for Computational Linguistics, Portland, Oregon, USA (Jun 2011),
    https://aclanthology.org/W11-0902
35. Talamo, L., Celata, C., Bertinetto, P.M.: DerIvaTario: An annotated lexicon of
    Italian derivatives. Word Structure 9(1), 72–102 (2016)
36. Tamburini, F., Melandri, M.: Anita: a powerful morphological analyser for ital-
    ian. In: Calzolari, N., Choukri, K., Declerck, T., Doğan, M.U., Maegaard, B.,
    Mariani, J., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Eight
    International Conference on Language Resources and Evaluation (LREC’12). Eu-
    ropean Language Resources Association (ELRA), Istanbul, Turkey (2012)
37. Tonelli, S., Palmero Aprosio, A., Mazzon, M.: The impact of phrases on italian
    lexical simplification. In: Proceedings of the Fourth Italian Conference on Compu-
    tational Linguistics (CLiC-it 2017). pp. 316–320 (2017)
38. Tonelli, S., Tran Manh, K., Pianta, E.: Making readability indices readable. In:
    Proceedings of the First Workshop on Predicting and Improving Text Readability
    for target reader populations. pp. 40–48. Association for Computational Linguis-
    tics, Montréal, Canada (June 2012), http://www.aclweb.org/anthology/W12-2206
39. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-
    of-speech tagging with a cyclic dependency network. In: Proceedings of
    the 2003 Conference of the North American Chapter of the Associa-
    tion for Computational Linguistics on Human Language Technology - Vol-
    ume 1. pp. 173–180. NAACL ’03, Association for Computational Linguis-
    tics, Stroudsburg, PA, USA (2003). https://doi.org/10.3115/1073445.1073478,
    http://dx.doi.org/10.3115/1073445.1073478
40. Voghera, M.: Polirematiche. La formazione delle parole in italiano pp. 56–69 (2004)
41. Zanchetta, E., Baroni, M.: Morph-it! A free corpus-based morphological resource
    for the Italian language. Corpus Linguistics 2005 1(1) (2005)
42. Zanchetta, E., Baroni, M.: Morph-it! a free corpus-based morphological resource
    for the italian language. In: Proceedings of corpus linguistics 2005. University of
    Birmingham UK (2005)
43. Zhu, M., Zhang, Y., Chen, W., Zhang, M., Zhu, J.: Fast and accurate shift-
    reduce constituent parsing. In: Proceedings of the 51st Annual Meeting of the
    Association for Computational Linguistics (Volume 1: Long Papers). pp. 434–
    443. Association for Computational Linguistics, Sofia, Bulgaria (Aug 2013),
    https://aclanthology.org/P13-1043