=Paper=
{{Paper
|id=Vol-2006/paper048
|storemode=property
|title=INFORMed PA: A NER for the Italian Public Administration Domain
|pdfUrl=https://ceur-ws.org/Vol-2006/paper048.pdf
|volume=Vol-2006
|authors=Lucia Passaro,Alessandro Lenci,Anna Gabbolini
|dblpUrl=https://dblp.org/rec/conf/clic-it/PassaroLG17
}}
==INFORMed PA: A NER for the Italian Public Administration Domain==
Lucia C. Passaro (1), Alessandro Lenci (1), Anna Gabbolini (2)
(1) Dipartimento di Filologia, Letteratura e Linguistica, University of Pisa (Italy)
(2) ETI3 | Evolution, Technology & Innovation
lucia.passaro@for.unipi.it
alessandro.lenci@unipi.it
anna.gabbolini@eti3.it
Abstract

English. In this paper, we illustrate the creation of a NER for the Public Administration (PA) domain. We discuss the creation of an annotated corpus with documents from the Italian Albo Pretorio Nazionale and provide results of the system evaluation.

Italiano. In questo lavoro mostriamo la creazione di un NER per il dominio della Pubblica Amministrazione (PA). Presentiamo la creazione del corpus formato da documenti dell’Albo Pretorio Nazionale e mostriamo i risultati della valutazione del sistema.

1 Introduction

In the Public Administration (PA) domain, the rapid adoption of new legislation on governance transparency has been forcing Italian municipalities to produce their acts in digital form and to make them available to both citizens and authorities. However, the acts delivered by PAs are typically in a free-text electronic format, which is not convenient for searching, decision support, and data analysis. Therefore, the development of NLP tools to extract high-quality structured information, including Named Entities (NEs) such as Persons and Organizations, is a key factor in enabling access to the wealth of information produced by PAs, and a crucial step in turning the keyword of “transparency” into reality. NLP tools can be exploited to mine the large document repositories produced by PAs daily, with the aim of identifying trends in their activity, suggesting possible synergies to increase their efficiency, and raising “red flags” about suspicious behaviors, especially in their relationships with private companies.

In this paper, we focus on Named Entity Recognition (NER) for PA. Several approaches have been proposed in the literature, including rule-based, machine learning-based and hybrid methods.

Hand-crafted rule-based NERs extract names using a large number of manually written rules. In general, these systems consist of a set of patterns based on grammatical (e.g., part of speech), syntactic (e.g., word precedence) and orthographic features (e.g., capitalization) in combination with dictionaries (Budi and Bressan, 2003; Appelt et al., 1993; Grishman, 1995). These approaches usually give good results, but require long development time by expert linguists. On the one hand, these systems have better results for restricted domains, being capable of detecting very complex entities; on the other hand, they lack portability and robustness and do not necessarily adapt well to new domains and languages.

Machine learning techniques, on the contrary, use a collection of annotated documents for training the classifiers. The development time therefore moves from the definition of rules to the preparation of annotated corpora (Bikel et al., 1997; Borthwick et al., 1998; McCallum and Li, 2003). The systems identify and classify nouns using machine learning algorithms such as Maximum Entropy (Berger et al., 1996), Support Vector Machines (Cortes and Vapnik, 1995) and Conditional Random Fields (Lafferty et al., 2001). More recently, deep learning architectures have also been proposed for Named Entity Recognition (Chiu and Nichols, 2015; Strubell et al., 2017).

Finally, hybrid NER systems combine rule-based and machine learning-based methods, building new methods from the strongest points of each (Srihari et al., 2000).

Existing general purpose Italian corpora annotated with NEs, such as I-CAB (Magnini et al., 2006), are not optimal for training a NER for the domain of PA because of the gap between bureaucratic language and standard Italian, and also because of the lack of important classes such as act and normative references, which are very useful in PA-oriented applications. To tackle these problems, we decided to create a new corpus from scratch starting from: (i) administrative documents belonging to the Italian Albo Pretorio; (ii) the CoLingLab NER, a general NER trained on I-CAB, from which we took the initial configuration of features. The corpus of PA documents, written in Italian “bureaucratese”, has the characteristics described in Brunato (2015):

1. Pseudo-technicisms or collateral technicisms (e.g., balneazione, fattispecie);
2. Abstract nouns with -zione/-mento suffixes (e.g., stipulazione, espletamento), deverbal nouns, usually with zero suffix (e.g., subentro, scorporo, utilizzo) and denominal verbs (e.g., relazionare, disdettare);
3. Archaic terms (e.g., allorché, suddetto) and latinisms (e.g., una tantum, pro capite);
4. Forestierisms (e.g., governance, front office);
5. Uncommon and formal terms (e.g., diniego for rifiuto);
6. Stereotyped phrases (e.g., entro e non oltre, in riferimento all’oggetto);
7. Abbreviations and acronyms.

For the creation of a NER for PA, we decided to exploit the existing architecture employed for the project SEMPLICE [1] and in particular we adopted a statistical method based on the Stanford NER (Finkel et al., 2005), a system implemented in Java and available for download under the GNU General Public License. This choice allowed us to easily compare the gain obtained by enriching the training corpus with PA documents and to speed up the development process. Moreover, using a Conditional Random Field (CRF) (Lafferty et al., 2001) as learning algorithm made it possible for us to compare the PA model with other domain-adapted NERs (Passaro and Lenci, 2014).

[1] The SEMantic instruments for PubLIc administrators and CitizEns (SEMPLICE; www.semplicepa.it) is a 2-year project funded by Regione Toscana in collaboration with IT companies to develop NLP-based tools for knowledge management, information extraction and opinion mining for local public administrations.

This paper is structured as follows: in section 2, we present the CoLingLab NER and we show its performance on a sample of PA documents; in section 3 we describe the adaptation of the system to PA texts and its performances (section 4.1). In section 5, we report on the annotation of relations that we performed on a sample of the corpus and finally discuss the results and ongoing work.

2 The CoLingLab NER

The standard Italian CoLingLab NER was trained on the Italian Content Annotation Bank (I-CAB; Magnini et al., 2006), a corpus of Italian news annotated with semantic information at different levels: Temporal Expressions, Named Entities, relations between entities. I-CAB is composed of 525 news documents taken from the local newspaper ‘L’Adige’ (time span: September-October 2004). The NEs annotated in the corpus are: Locations (LOC), Geo-Political Entities (GPE), Organizations (ORG) and Persons (PER).

As we said before, this model is unsatisfactory for the domain of Public Administration in two main respects. First, its classes are insufficient to deal with the type of information in the PA documents, which are full of references to other “linked” acts and legislative references; second, the language used in these documents is a peculiar and highly complex variant of standard Italian (cf. above). In addition, the performance of the model, attested at ~0.66 F1-score on a portion of I-CAB, decreases dramatically on the PA documents, reaching an F1-score of ~0.35. To measure such performances, in the test set we mapped ORG_PA (cf. below) to ORG, and in the training set we mapped GPE to LOC.

3 A NER for PA Documents

The adaptation of the CoLingLab NER to the PA domain included the extension of the standard NE classes (Rau, 1991; Grishman and Sundheim, 1996; Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) to other entity types particularly important in the context of municipalities. In particular, we added the class ACT, to mark other administrative documents (normally, PA texts refer to other documents related to the same procedure), the class LAW for the relevant legislation, and an additional class of organizations, ORG_PA, for municipal departments.

3.1 The PA Corpus

For the creation of the corpus, we used documents taken from the Albo Pretorio Nazionale with the
aim of capturing the variability of the texts produced by PA. Overall, the corpus includes 460 documents, for a total of 724,623 tokens, annotated with the following NEs: (i) ACT: documents belonging to the Albo Pretorio Nazionale, with their type (optional), number and date: Determina n. 4 del 12/02/2011; (ii) LAW: legislative references: art. 183 comma 7 del D.Lgs. n. 267/2000; (iii) LOC: locations and geo-political entities: Comune di Pisa; (iv) ORG_PA: organizations related to the Public Administration such as municipal Departments: Sezione Anagrafe; (v) ORG: organizations: Consip Spa; (vi) PER: physical persons.

The corpus has been linguistically annotated by means of a pipeline of general purpose NLP tools. In particular, it has been POS-tagged with the Part-Of-Speech tagger described in Dell’Orletta (2009) and dependency parsed with the DeSR parser (Attardi et al., 2009). Finally, complex terms like forze dell’ordine (security force) have been identified using the EXTra term extraction tool (Passaro and Lenci, 2016).

3.2 Annotation

NE annotation has been performed by means of an incremental process: first, 100 documents were annotated by 2 annotators (one of them a domain expert). In a second phase, we trained a CRF model on these documents and used it to automatically annotate new documents. Finally, we identified the most common errors of the classifier and two new annotators manually revised the output. This process has been repeated for each group of 100 documents until the whole corpus, which includes 460 distinct documents, was covered. The average length of the documents is 1,575.26 tokens and the total number of tokens is 724,623. Figure 1 shows the distribution of the different NE classes in the corpus.

NEs have been annotated on the CoNLL (Nivre et al., 2007) texts using the standard IOB method. In order to deal with acts, we decided to tag them with different “labels” to distinguish their sub-components: the type (marked with ACT_T), the number (marked with ACT_N), the date (marked with ACT_D), functional tokens (ACT_X) and unparsable tokens (marked with ACT_U). For example, the act Delibera di giunta comunale numero 53 del 23/10/2016 is annotated as follows: Delibera di giunta comunale (ACT_T) numero (ACT_X) 53 (ACT_N) del (ACT_X) 23/10/2016 (ACT_D), while the act DD/67/2012 is annotated as ACT_U. This method allows for a simpler normalization of normative references, which is crucial for document retrieval because of the high variability of law mentions in the PA texts.

The inter-annotator agreement between two annotators (attested at ~0.8) has been calculated using Cohen’s K index on a sample of 25 documents from 25 different municipalities, for a total of 26,190 tokens.

4 System Overview

To train the NER, no information from gazetteers was used. The model includes the following groups of features:

SEQUENCES: Next and previous words and a window of 6 words (3 preceding and 3 following the target word) and their classes;
N-GRAMS: Character-level features, i.e., substrings of the word with a maximum length of 6 letters;
ORTHOGRAPHY: “word shape” features such as spelling, capital letters, presence of non-alphabetical characters etc.;
LINGUISTIC FEATURES: The word position in the sentence (numeric attribute), the lemma, and the PoS tag (nominal attribute);
TERMS: We employed complex terms as features to train the model. Terms have been extracted with EXTra (Passaro and Lenci, 2016).
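As an illustration only, the sequence, n-gram and orthography feature groups listed above, together with the IOB labels and ACT sub-labels described in Section 3.2, can be sketched in Python as follows. This is a minimal sketch and not the Stanford NER implementation: the function names, the feature key format, the `<PAD>` marker, the `B-`/`I-` tag spelling and the particular word-shape encoding are all assumptions made for the example. Lemma, PoS and term features are left out because they depend on the output of the annotation pipeline.

```python
def spans_to_iob(n_tokens, spans):
    """Turn (start, end, label) token spans (end exclusive) into IOB tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for k in range(start + 1, end):
            tags[k] = "I-" + label
    return tags


def word_shape(token):
    """Coarse shape: X=uppercase, x=lowercase, d=digit, repeated codes collapsed."""
    shape = []
    for ch in token:
        if ch.isupper():
            code = "X"
        elif ch.islower():
            code = "x"
        elif ch.isdigit():
            code = "d"
        else:
            code = ch  # keep punctuation such as "/" or "." as is
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)


def char_ngrams(token, max_len=6):
    """All substrings of the token with at most max_len characters."""
    grams = set()
    for n in range(1, max_len + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def token_features(tokens, i, window=3):
    """Features for tokens[i]: position, shape, n-grams and a +/-3 word window."""
    form = tokens[i]
    feats = {"position": i, "shape": word_shape(form)}
    for gram in char_ngrams(form):
        feats["ngram=" + gram] = True
    for offset in range(-window, window + 1):
        if offset != 0:
            j = i + offset
            inside = 0 <= j < len(tokens)
            feats["w[%+d]" % offset] = tokens[j] if inside else "<PAD>"
    return feats


# The act used as an example in Section 3.2, with its sub-component labels.
tokens = "Delibera di giunta comunale numero 53 del 23/10/2016".split()
spans = [(0, 4, "ACT_T"), (4, 5, "ACT_X"), (5, 6, "ACT_N"),
         (6, 7, "ACT_X"), (7, 8, "ACT_D")]
tags = spans_to_iob(len(tokens), spans)

# One (features, label) pair per token, the usual input of a CRF trainer.
instances = [(token_features(tokens, i), tags[i]) for i in range(len(tokens))]
```

The classes of the neighbouring words mentioned under SEQUENCES are not encoded here as explicit features, since in a linear-chain CRF the dependency between adjacent labels is modelled by the transition factors.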
Figure 1: Distribution of the NEs in the corpus.

4.1 System’s Performances

We trained the CRF model based on the CoLingLab NER on the annotated PA corpus, and we tested its performances first with cross-validation and then on a sample of 25 new documents from 25 different municipalities. This choice stems from the fact that very often different municipalities tend to use different templates and different ways to refer to particular entities. This is particularly common in some NE classes such as ACT and ORG_PA, which vary a lot across municipalities. For example, some of the analyzed texts contain strings of the form YYYY/G/NNNNN to refer to the acts, where the number is actually a string encoding the date (year: YYYY), a code for the type (G) and the number of the act (NNNNN). Other municipalities instead adopt a less strictly codified pattern to indicate the act, such as Type of act, number N* of DD/MM/YYYY. Likewise, depending on the writing style (and conventions) of the municipalities, the various departments (i.e., ORG_PA) can include both strings like Corpo dei Vigili Urbani and codes like Tec-01/ICT. To evaluate the system performance with respect to the variation of the naming conventions adopted by different municipalities, we randomly selected 25 municipalities and one document for each of them, balanced for length.

Table 1 reports on the results obtained in cross validation and Table 2 shows the performance on the sample of 25 documents. Figure 2 also shows the confusion matrix for that sample.

Table 2: System results (on a sample of 25 texts)

Class     Precision  Recall  F1-Score
ACT       0.9747     0.8477  0.9068
LAW       0.9494     0.9615  0.9554
LOC       0.799      0.6913  0.7413
ORG       0.8017     0.7686  0.7848
ORG_PA    0.8706     0.7957  0.8315
PER       0.9142     0.8694  0.8912
MicroAVG  0.914      0.8355  0.873
MacroAVG  0.8849     0.8224  0.8518
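For readers who want to check how the summary rows relate to the per-class scores, the short Python sketch below recomputes the per-class F1 and the macro-averages from the precision and recall values reported in Table 2. The dictionary and function names are illustrative assumptions, and the macro-average is taken here as the unweighted mean over the six classes. The micro-average cannot be recomputed in the same way, because it requires the per-class counts of true positives, false positives and false negatives, which are not reported.

```python
# Per-class (precision, recall, reported F1) copied from Table 2.
TABLE_2 = {
    "ACT":    (0.9747, 0.8477, 0.9068),
    "LAW":    (0.9494, 0.9615, 0.9554),
    "LOC":    (0.799,  0.6913, 0.7413),
    "ORG":    (0.8017, 0.7686, 0.7848),
    "ORG_PA": (0.8706, 0.7957, 0.8315),
    "PER":    (0.9142, 0.8694, 0.8912),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over all classes."""
    n = len(scores)
    macro_p = sum(p for p, _, _ in scores.values()) / n
    macro_r = sum(r for _, r, _ in scores.values()) / n
    macro_f = sum(f1(p, r) for p, r, _ in scores.values()) / n
    return macro_p, macro_r, macro_f


macro_p, macro_r, macro_f = macro_average(TABLE_2)
print(f"MacroAVG  P={macro_p:.4f}  R={macro_r:.4f}  F1={macro_f:.4f}")
```

Up to rounding, the recomputed values agree with the MacroAVG row of Table 2, and the per-class F1 values agree with the F1-Score column.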
In order to investigate the contribution of non-linguistic features, we performed ablation experiments and we tested the results on the sample of 25 documents. The ΔF1-Score for such groups is as follows: SEQUENCES: 3%; N-GRAMS: 1%; ORTHOGRAPHY: 4%. In addition, we performed a further experiment by training the NER on a combination of I-CAB and the PA documents. In this case, we noticed a ΔF1-Score of 2% with respect to the original model.

Table 1: System results (10-fold cross validation)

Class     Precision  Recall  F1-Score
ACT       0.7876     0.8914  0.8356
LAW       0.827      0.8423  0.8343
LOC       0.702      0.7398  0.7196
ORG       0.7085     0.689   0.6977
ORG_PA    0.6158     0.7774  0.6855
PER       0.8373     0.8776  0.8567
MacroAVG  0.7464     0.8029  0.7716

Figure 2: Confusion Matrix (25 texts)

5 Towards a Relational Classifier for PA

For a subset of the corpus, we also annotated the semantic relations occurring between two entities in the domain of the PA, using the following scheme:

PARTOF: the relation of hyponymy, which can occur between: (i) two locations (e.g. a Municipality in a Province); (ii) two organizations (e.g. a participated company in a holding company); (iii) a person and an organization (e.g. a member of an organization). The implicit attribute for this relation is “work in”.
LOCATION: an entity placed into a particular location, occurring between: (i) an organization and a location (e.g. an organization located in a certain region). Possible attributes for this relation are “work in” and “placed in”; (ii) a person and a location (e.g. a person living in a particular area). Possible attributes are “work in”, “born in” and “placed in”.
ISRELATEDTO: an underspecified relation between any entity pair.

Preliminary experiments have been performed to examine the characteristics of an automatic classifier for extracting relations from administrative acts, and the performance seems to be very promising, despite the size of the training set, which includes in total 100 documents so far. The
extension of the annotated corpus and the training of the relational classifier are currently ongoing.

6 Discussion

The results show that the NER reaches satisfactory results for most of the classes, although lagging behind in the recognition of PA Organizations, which, among others, tend to have a higher formal variability, including for example both entities like Corpo dei Vigili Urbani and Tec-01/ICT. Moreover, in the recognition of Location names in the domain of the PA, the system is expected to detect entities with a non-standard level of detail, going from the name of the municipalities (e.g. Comune di Pisa) to very detailed addresses (e.g., via S. Maria n. 36, 56126 Pisa (PI) interno 15). A similar problem occurs in the recognition of very small organizations, whose name contains the name of the founder (e.g., Mario Rossi snc). In these cases, especially when snc is omitted, the system predicts the class PER instead of the correct class ORG. We are confident that adding lexicons and gazetteers will improve the identification of entities of this kind, but it could be interesting to investigate automatic normalization, disambiguation and entity linking approaches (Hoffart et al., 2011; Han et al., 2011).

7 Conclusions and Ongoing Work

Named entities play an important role in administrative acts, especially in those, like the documents in the Albo Pretorio, describing the main actions taken by Municipalities. This kind of information is very useful to fulfil the obligations related to supervisory monitoring, disclosure, periodic self-assessment, and review of the government decisions.

In this paper, we presented a NER for PA that shows a significant ability to identify the relevant entities, and in particular legislative references and connected acts. It is important to stress that the lexical and syntactic complexity of bureaucratic language represents a big challenge for NLP tools and methods. Such complexity derives from the technical lexis of other domain-specific languages with which PA deals daily, such as education, environment, ICT technologies, public health and so on. In the near future we plan to explore the possibility of re-engineering our system to take advantage of new algorithms for entity extraction such as neural networks, and in particular of character-level word embeddings. Moreover, we will focus on the development of classifiers for Relation Extraction and Entity Linking.

Acknowledgments

This research has been supported by the Project SEMantic instruments for PubLIc administrators and CitizEns (SEMPLICE), funded by Regione Toscana, and the Company ETI3 | Evolution, Technology & Innovation. Special acknowledgements go to Roberto Battistelli and Francesco Sandrelli (ETI3) for support, and to the students Roswita Candusso, Carmela Cinquesanti, Federica Semplici and Ludovica Vasile for manual annotation.

References

Douglas E. Appelt, Jerry R. Hobbs, John Bear, David Israel, and Mabry Tyson. 1993. FASTUS: A finite-state processor for information extraction from real-world text. In IJCAI, volume 93, pages 1172–1178.

Giuseppe Attardi, Felice Dell’Orletta, Maria Simi, and Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In EVALITA 2009 - Evaluation of NLP and Speech Tools for Italian 2009, LNCS, Reggio Emilia (Italy). Springer.

Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. 1997. Nymble: A high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 194–201, Washington, DC. Association for Computational Linguistics.

Andrew Borthwick, John Sterling, Eugene Agichtein, and Ralph Grishman. 1998. NYU: Description of the MENE named entity system as used in MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7).

Dominique Brunato. 2015. A study on linguistic complexity from a computational linguistics perspective. A corpus-based investigation of Italian bureaucratic texts. Ph.D. Thesis, University of Siena.

Indra Budi and Stéphane Bressan. 2003. Association rules mining for name entity recognition. In Web Information Systems Engineering, 2003. WISE 2003. Proceedings of the Fourth International Conference on, pages 325–328. IEEE.

Jason P.C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.
Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Mach. Learn., 20(3):273–297.

Felice Dell’Orletta. 2009. Ensemble system for part-of-speech tagging. In EVALITA 2009 - Evaluation of NLP and Speech Tools for Italian 2009, LNCS, Reggio Emilia (Italy). Springer.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 363–370, Ann Arbor, Michigan (USA). Association for Computational Linguistics.

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 466–471. Association for Computational Linguistics.

Ralph Grishman. 1995. The NYU system for MUC-6 or where’s the syntax? In Proceedings of the 6th Conference on Message Understanding, pages 167–175, Columbia, Maryland. Association for Computational Linguistics.

Xianpei Han, Le Sun, and Jun Zhao. 2011. Collective entity linking in web text: A graph-based method. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 765–774, Beijing (China).

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792, Edinburgh (United Kingdom). Association for Computational Linguistics.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA (USA). Morgan Kaufmann Publishers Inc.

Bernardo Magnini, Emanuele Pianta, Manuela Speranza, Valentina Bartalesi Lenzi, and Rachele Sprugnoli. 2006. Italian Content Annotation Bank (I-CAB): Named entities.

Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 188–191, Edmonton (Canada). Association for Computational Linguistics.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932, Prague (Czech Republic). Association for Computational Linguistics.

Lucia C. Passaro and Alessandro Lenci. 2014. “Il Piave mormorava...”: Recognizing locations and other named entities in Italian texts on the Great War. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 and of the Fourth International Workshop EVALITA 2014, pages 286–290, Pisa (Italy).

Lucia C. Passaro and Alessandro Lenci. 2016. Extracting terms with EXTra. In Proceedings of the EUROPHRAS 2015 – Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives, pages 188–196, Malaga (Spain).

Lisa F. Rau. 1991. Extracting company names from text. In Artificial Intelligence Applications, 1991. Proceedings., Seventh IEEE Conference on, volume 1, pages 29–32. IEEE.

Rohini Srihari, Cheng Niu, and Wei Li. 2000. A hybrid approach for named entity and sub-type tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC ’00, pages 247–254, Seattle, Washington (USA). Association for Computational Linguistics.

Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2660–2670.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning, volume 31, pages 1–4.