=Paper=
{{Paper
|id=Vol-1749/paper35
|storemode=property
|title=Building a Computational Lexicon by using SQL
|pdfUrl=https://ceur-ws.org/Vol-1749/paper35.pdf
|volume=Vol-1749
|authors=Alessandro Mazzei
|dblpUrl=https://dblp.org/rec/conf/clic-it/Mazzei16
}}
==Building a Computational Lexicon by using SQL==
Building a computational lexicon by using SQL
Alessandro Mazzei
Dipartimento di Informatica
Università degli Studi di Torino
Corso Svizzera 185, 10149 Torino
mazzei@di.unito.it
Abstract

English. This paper presents some issues concerning a computational lexicon employed in a generation system for Italian (Mazzei et al., 2016). The paper has three goals: (i) to describe the SQL resources produced during the construction of the lexicon; (ii) to describe the algorithm for building the lexicon; (iii) to present ongoing work on enhancing the lexicon by using the syntactic information extracted from a treebank.

Italiano (English translation). This work describes the construction of a computational lexicon for the automatic generation of Italian (Mazzei et al., 2016). The work has three goals: (i) to describe some SQL resources produced in support of the construction of the lexicon; (ii) to describe the algorithm for building the lexicon; (iii) to present work in progress on improving the lexicon by using the syntactic information extracted from a treebank.

1 Introduction

A number of free large multilingual resources covering Italian have been released, e.g. MultiWordnet, UniversalWordnet, BabelNet (Pianta et al., 2002; de Melo and Weikum, 2009; Navigli and Ponzetto, 2012). Moreover, several lexical corpora have been built specifically for Italian, as the detailed map of the Italian NLP resources produced within the PARLI project shows [1]. Unfortunately, most resources are designed to represent lexical semantics rather than morpho-syntactic relations among the words. As a consequence, these resources cannot be employed in statistical or rule-based natural language morpho-syntactic analyzers or generators.

[1] http://parli.di.unito.it/resources_en.html

A notable exception is the PAROLE-SIMPLE-CLIPS lexicon, a four-level (i.e. phonological, morphological, syntactic, semantic) general-purpose lexicon composed of 53,044 lemmata (Ruimy et al., 1998). Unfortunately, a strong limitation on the usage of PAROLE-SIMPLE-CLIPS is its licence, since it is not freely available for research or commercial use.

Rule-based natural language realization engines, i.e. systems that perform the linearisation and morphological inflection of a proto-syntactic input tree (Gatt and Reiter, 2009), need wide-coverage morpho-syntactic information as their knowledge base. In other words, to perform realization, which is the last step of natural language generation (Reiter and Dale, 2000), one needs two main kinds of linguistic knowledge: (i) the grammatical/syntactic knowledge that specifies the syntactic rules of the language, which is usually encoded into formal rules; (ii) the morphological and lexical knowledge, which is usually encoded into a computational lexicon. In porting the SimpleNLG system to Italian (henceforth SimpleNLG-IT) (Mazzei et al., 2016), we used the grammar of Patota (2006) as the linguistic reference for the syntax: we encoded the Italian syntactic inflections and word ordering by using IF-THEN-ELSE rules in Java. However, since Italian has a high number of irregularities in verb and adjective inflections, we also needed a specifically designed computational lexicon, one that has both good coverage and a detailed account of the morphological irregularities.

In order to build this specific lexicon, which we have called SimpleLEX-IT, we decided to merge three free resources for Italian, namely Morph-it! (Zanchetta and Baroni, 2005), the Vocabolario di base della lingua italiana (De Mauro, 1985) and, for some specific issues, Wikipedia. The differences among the three resources concern both the reasons for which the authors developed them and the methodology and approach adopted in their development: Morph-it! is an extensional corpus-based morphological lexicon; the Vocabolario di base is a hand-made list of basic words; Wikipedia provides a collection of encyclopedic entries about irregular verbs in Italian.

This paper is organized as follows: in Section 2 we describe the conversion of the three lexical resources into a relational database; in Section 3 we provide some details about the algorithm used to build SimpleLEX-IT; in Section 4 we describe work in progress to enrich the lexicon by using the syntactic information extracted from a treebank; finally, Section 5 closes the paper with conclusions.

2 Using a relational database for representing linguistic data

In order to merge different lexical resources we needed to convert them into a common computational representation. We used a relational database [2] (SQL henceforth) since all three resources are originally provided as text files, organized as tables or simple lists.

[2] We used the PostgreSQL database.

The first resource that we exploited for populating SimpleLEX-IT is Morph-it! (Zanchetta and Baroni, 2005). The dataset released in the Morph-it! project consists of a lexicon organized according to the inflected word forms, with associated lemmas and morphological features. The lexicon is provided by the authors as a text file where the values for each lexical entry are separated by a tab character. It is an alphabetically ordered list of form-lemma-features triples. An example of the annotation for the form corsi (ran) is:

 corsi correre-VER:ind past+1+s

where the features are the part of speech (PoS, VERb), the mood of the verb (indicative), the tense (past), the person (1), and the number (singular). The latest released version of Morph-it! (v.48, 2009-02-23) contains 505,074 different forms corresponding to 35,056 lemmas. It was built from a large newspaper corpus; nevertheless, it is not balanced, and a small number of very common Italian words are not included in the lexicon, e.g. sposa (bride), ovest (west) or aceto (vinegar). Morph-it! represents the Italian language extensionally by listing all the morphological inflections, i.e. adjective, verb and noun inflections are represented as a list rather than by using morphological rules. We converted Morph-it! into SQL by exploiting its original feature structure: we used one single attribute to represent one single feature [3]. We used one table to collect all the lemmata and seven tables, with a different number of attributes, to collect the various inflected forms:

• the table lemmata is formed by 3 attributes: a lemma, its PoS and its ID (integer). This table contains 34,725 records. A number of lemmata belonging to the original version of Morph-it! have been excluded in our conversion: proper nouns, emoticons and cardinals beginning with a digit (e.g. 15mila).
• the tables det demo table, pro demo table, pronou table are used to collect the inflected forms of demonstrative determiners (116 records, 4 attributes: ID word, form, ID lemma, number, gender), demonstrative pronouns (95 records, 5 attributes: ID word, form, ID lemma, number, gender), and personal pronouns (63 records, 7 attributes: ID word, form, ID lemma, person, number, gender, clitics).
• the tables adv table, adj table, nou table, ver table are used to collect the inflected forms of adverbs (1,594 records, 3 attributes: ID word, form, ID lemma), adjectives (72,367 records, 6 attributes: ID word, form, ID lemma, kind, number, gender), nouns (35,618 records, 5 attributes: ID word, form, ID lemma, number, gender) and verbs (392,139 records, 8 attributes: ID word, form, ID lemma, mode, time, person, number, gender) respectively.

[3] Morph-it! is provided with a script that allows for a naive conversion into SQL that uses one single table and one single attribute for all the features.

The second resource we exploited for populating SimpleLEX-IT is the “Vocabolario di base della lingua italiana” (VdB-IT henceforth), a collection of around 7,000 words created by the linguist Tullio De Mauro and his team [4] (De Mauro, 1985).

[4] The second edition of the vocabulary has been announced (Chiari and De Mauro, 2014) and it is going to be released (p.c.).

The development of this vocabulary has been mainly driven by the distinction between the
most frequent words (around 5,000) and the most familiar words (around 2,000). VdB-IT is therefore organized in the following three sections:

• the vocabolario fondamentale (fundamental vocabulary), which contains the 2,000 words with the highest frequency in a balanced corpus of Italian texts (composed of novels, movie and theater scripts, newspapers, basic scholastic books); amore (love), lavoro (work), pane (bread) are in this section.
• the vocabolario di alto uso (vocabulary of high usage), which includes another 2,937 words with high frequency, but lower than that of the vocabolario fondamentale; ala (wing), seta (silk), toro (bull) are in this section.
• the vocabolario di alta disponibilità (vocabulary of high availability), which is composed of 1,753 words that are not often used in written language but have a high frequency in spoken language, and are therefore perceived as especially familiar by native speakers; aglio (garlic), cascata (waterfall), passeggero (passenger) are in this section.

The list of lemmata of VdB-IT has been converted into SQL by using one single table, called lemmadema (6540 records), which has two attributes, i.e. an ID (integer) and the lemma.

The third resource that we used for the lexicon creation is Wikipedia. Our reference grammar (Patota, 2006) reports a partial list of the principal Italian irregular verbs, but we decided to use the larger list of verbs reported in Wikipedia [5] (VerIrr henceforth). Another linguistic distinction for Italian verbs reported in Wikipedia [6] (VerInc henceforth) has been exploited in the lexicon: the incoativi verbs are a subclass of the third conjugation that behaves in a special way in the present tense (e.g. capire). So, in order to produce the correct conjugation of these verbs in SimpleNLG-IT, they needed to be marked in the lexicon. Both these lists of verbs have been converted into SQL by using two distinct tables which have two attributes, i.e. an ID (integer) and the verb in the infinitive form. The two tables are verbiirregolari (858 records) and verbiincoativi (726 records).

[5] https://it.wikipedia.org/wiki/Verbi_irregolari_italiani
[6] https://it.wikipedia.org/wiki/Verbi_incoativi

A notable advantage of the SQL representation for linguistic resources is the possibility of extracting intrinsic information with simple queries. Indeed, we found that Morph-it! and VdB-IT share 4,086 nouns and 1,448 verbs, but there are 245 lemmas belonging to VdB-IT and not belonging to Morph-it!: most of these words are nouns, for instance lavapiatti, chimica, incinta, but we also found a systematic difference for verbs. Indeed, VdB-IT considers a number of verbs as proper reflexives, for instance avvantaggiarsi, sdraiarsi. In contrast, these verbs are treated as improper reflexives in Morph-it!, which annotates avvantaggiare and sdraiare as their lemmata.

3 Building SimpleLEX-IT 1.0

In this section we describe the algorithm used to build the computational lexicon SimpleLEX-IT, which is based on the three resources described in Section 2 and which has been used in SimpleNLG-IT.

A computational lexicon can be split into two major classes: open and closed. The closed classes, which are usually composed of function words (i.e. prepositions, determiners, conjunctions, pronouns, etc.), are those to which new words are very rarely added. In contrast, the open classes, which are usually composed of lexical words (i.e. nouns, verbs, adjectives, adverbs), accept the addition of new words. We adopted the same strategy as Vaudry and Lapalme (2013): we built the closed part of the Italian lexicon by hand and we built the open part automatically by using the available resources.

In order to build the open classes for SimpleLEX-IT we needed both a large coverage and a detailed account of morphological irregularities, also considering their high frequency in Italian. Moreover, in order to have good execution-time performance in the realiser (cf. De Oliveira and Sripada, 2014), a trade-off between the size of the lexicon and its usability for our task must be achieved. This consists in assuming a form of word classification where fundamental Italian words are distinguished from the less fundamental ones. In order to balance completeness and efficiency in SimpleLEX-IT, we put in the lexicon the open-class words belonging to the intersection of VdB-IT and Morph-it!.

Algorithm 1 reports the process used for the insertion and the annotation of the words belonging to the open classes in SimpleLEX-IT.

Note that in order to recognize proper reflexive verbs, we check if the infinitive form of the verb
has the suffix “rsi”, since Morph-it! contains this inflection as its normal form. Table 1 reports some statistics about the composition of SimpleLEX-IT. Most of the lexicon is composed of nouns (58%), followed by verbs (21%), adjectives (19%), and adverbs (2%).

 foreach adverb ∈ Morph-it! ∩ VdB-IT do
     Add the adverb in normal form into SimpleLEX-IT
 end
 foreach adjective ∈ Morph-it! ∩ VdB-IT do
     Add the adjective in normal form (masculine-singular) and in feminine-singular, masculine-plural, feminine-plural forms, into SimpleLEX-IT
 end
 foreach noun ∈ Morph-it! ∩ VdB-IT do
     Add the noun in normal form (singular), the plural form, and the gender into SimpleLEX-IT
 end
 foreach verb ∈ Morph-it! ∩ VdB-IT do
     if verb ∈ VerIrr then
         Add all the inflections for the indicativo presente, congiuntivo presente, futuro semplice, condizionale, imperfetto, participio passato, passato remoto into SimpleLEX-IT
     else
         if verb ∈ VerInc then
             Set active the incoativo feature in the entry
         end
         if the verb is properly reflexive (i.e. “...rsi”) then
             Set active the reflexive feature in the entry
         end
         Add the verb in normal form into SimpleLEX-IT
     end
 end

Algorithm 1: The algorithm for building the adverbs, adjectives, nouns and verbs in SimpleLEX-IT.

 PoS           Number    %
 Adverb           146    2
 Verb (irr.)      283    4
 Verb (reg.)     1168   17
 Adjective       1333   19
 Noun            4092   58
 Total           7022  100

Table 1: Number of adverbs, adjectives, nouns and verbs in SimpleLEX-IT.

4 Work in Progress: adding information from a treebank

The Universal Dependency Treebank (UDT) is a recent project that releases freely available treebanks for 33 languages (in this work, version 1.2) (Nivre et al., 2016). Each UDT is split into three sections, train, dev and test, which can be exploited in the evaluation of NLP/NLG systems.

We are working on the idea of adding more information to SimpleLEX-IT by using UDT-IT, i.e. the Italian section of UDT. A specific case that we are currently considering regards auxiliary verbs. The current version of SimpleNLG-IT does not manage auxiliary verbs: in order to produce some complex verb tenses, e.g. passato prossimo, the user needs to give the correct auxiliary as input to the realiser, i.e. essere (to be, e.g. Io sono nato a Napoli, "I was born in Naples") or avere (to have, e.g. Io ho amato la scuola, "I loved school"). Our reference grammar reports complex rules based on lexical semantics for choosing the correct auxiliary verb and, unfortunately, these rules have many exceptions. So, we can use UDT-IT to decide the correct auxiliary empirically in SimpleNLG-IT. Following this idea, we converted UDT-IT into SQL by exploiting its original feature structure. We used one table to collect information about the sentences, and one table to collect information about the words:

• the table sentence ud is formed by 4 attributes: an ID (integer), the original treebank (i.e. TUT, ISST, etc.), the original ID, and the section (i.e. DEV, TRAIN, TEST).
• the table words ud is used to collect all the words of UDT-IT. It uses 21 attributes: one attribute, id sentence, contains the ID of the sentence in the table sentence ud, and 20 attributes correspond to the features used in the UD annotation.

In order to find the correct auxiliary for a specific verb in UDT-IT, we need to exclude passive, reflexive and modal verb constructions in the query. We found 512 verbs of SimpleLEX-IT that are used in UDT-IT with an auxiliary. It is interesting to note that 60 verbs are used both with the auxiliary essere and with the auxiliary avere: this is grammatical for some verbs (e.g. vivere), but more often we found an annotation error in UDT-IT.

Finally, note that another possible use of UDT-IT regards the evaluation of the lexicon. In future work we plan to quantify the coverage of SimpleLEX-IT by using the TEST section of UDT-IT.

5 Conclusions

In this paper we have presented some issues concerning the computational lexicon SimpleLEX-IT. We described the algorithm used to build the lexicon, three SQL resources produced as side effects of the lexicon building, and work in progress on the extraction of syntactic information from UDT-IT.

All the resources described in this paper can be downloaded at https://github.com/alexmazzei/SimpleLEX-IT.
References

Isabella Chiari and Tullio De Mauro. 2014. The New Basic Vocabulary of Italian as a linguistic resource. In Roberto Basili, Alessandro Lenci, and Bernardo Magnini, editors, 1st Italian Conference on Computational Linguistics (CLiC-it), volume 1, pages 93–97. Pisa University Press, December.

Tullio De Mauro. 1985. Guida all’uso delle parole. Libri di base. Editori Riuniti.

Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), pages 513–522, New York, NY, USA. ACM.

Rodrigo De Oliveira and Somayajulu Sripada. 2014. Adapting SimpleNLG for Brazilian Portuguese realisation. In Proc. of INLG 2014.

Albert Gatt and Ehud Reiter. 2009. SimpleNLG: A Realisation Engine for Practical Applications. In Proc. of ENLG 2009, ENLG ’09.

Alessandro Mazzei, Cristina Battaglino, and Cristina Bosco. 2016. SimpleNLG-IT: adapting SimpleNLG to Italian. In Proc. of INLG 2016. To appear.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proc. of LREC’16, May. To appear.

Giuseppe Patota. 2006. Grammatica di riferimento dell’italiano contemporaneo. Guide linguistiche. Garzanti Linguistica.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, January.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.

Nilda Ruimy, Ornella Corazzari, Elisabetta Gola, Antonietta Spanu, Nicoletta Calzolari, and Antonio Zampolli. 1998. The European LE-PAROLE project: the Italian Syntactic Lexicon. In Proceedings of the First International Conference on Language Resources and Evaluation, pages 241–248.

Pierre-Luc Vaudry and Guy Lapalme. 2013. Adapting SimpleNLG for bilingual English-French realisation. In Proc. of ENLG 2013.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).