=Paper= {{Paper |id=Vol-1749/paper35 |storemode=property |title=Building a Computational Lexicon by using SQL |pdfUrl=https://ceur-ws.org/Vol-1749/paper35.pdf |volume=Vol-1749 |authors=Alessandro Mazzei |dblpUrl=https://dblp.org/rec/conf/clic-it/Mazzei16 }} ==Building a Computational Lexicon by using SQL== https://ceur-ws.org/Vol-1749/paper35.pdf
                   Building a computational lexicon by using SQL

                                         Alessandro Mazzei
                                     Dipartimento di Informatica
                                    Università degli Studi di Torino
                                   Corso Svizzera 185, 10149 Torino
                                      mazzei@di.unito.it


Abstract

    English. This paper presents some issues about a computational lexicon
    employed in a generation system for Italian (Mazzei et al., 2016). The
    paper has three goals: (i) to describe the SQL resources produced during
    the construction of the lexicon; (ii) to describe the algorithm for
    building the lexicon; (iii) to present an ongoing work for enhancing the
    lexicon by using the syntactic information extracted from a treebank.

    Italiano. Questo lavoro descrive la costruzione di un lessico
    computazionale per la generazione automatica dell’italiano (Mazzei et
    al., 2016). Il lavoro ha tre obiettivi: (i) descrivere alcune risorse SQL
    prodotte funzionalmente alla costruzione del lessico; (ii) descrivere
    l’algoritmo per la costruzione del lessico; (iii) presentare un lavoro in
    divenire per migliorare il lessico che usa l’informazione sintattica
    estratta da un treebank.

1   Introduction

A number of free large multilingual resources covering Italian have been
released, e.g. MultiWordnet, UniversalWordnet, BabelNet (Pianta et al., 2002;
de Melo and Weikum, 2009; Navigli and Ponzetto, 2012). Moreover, several
lexical corpora have been built specifically for Italian, as the detailed map
of the Italian NLP resources produced within the PARLI project shows¹.
Unfortunately, most resources are designed to represent lexical semantics
rather than morpho-syntactic relations among the words. As a consequence,
these resources cannot be employed in statistical or rule-based natural
language morpho-syntactic analyzers or generators.

  ¹ http://parli.di.unito.it/resources_en.html

   A notable exception is the PAROLE-SIMPLE-CLIPS lexicon, a four-level (i.e.
phonological, morphological, syntactic, semantic) general-purpose lexicon
composed of 53,044 lemmata (Ruimy et al., 1998). Unfortunately, a strong
limitation on the usage of PAROLE-SIMPLE-CLIPS is its licence, since it is
not freely available for research or commercial use.
   Rule-based natural language realization engines, i.e. systems performing
linearisation and morphological inflection of a proto-syntactic input tree
(Gatt and Reiter, 2009), need wide-coverage morpho-syntactic information as
their knowledge base. In other words, to perform realization, which is the
last step of natural language generation (Reiter and Dale, 2000), one needs
two main kinds of linguistic knowledge: (i) the grammatical/syntactic
knowledge that specifies the syntactic rules of the language, which is
usually encoded into formal rules; (ii) the morphological and lexical
knowledge, which is usually encoded into a computational lexicon. In the
porting of the SimpleNLG system to Italian (henceforth SimpleNLG-IT) (Mazzei
et al., 2016), we used the grammar by Patota (2006) as the linguistic
reference for the syntax: we encoded the Italian syntactic inflections and
word ordering by using IF-THEN-ELSE rules in Java. However, since Italian has
a high number of irregularities in verb and adjective inflections, we also
needed a specifically designed computational lexicon, one that has both good
coverage and a detailed account of the morphological irregularities.
   In order to build this specific lexicon, which we have called SimpleLEX-IT,
we decided to merge three free resources for Italian, namely Morph-it!
(Zanchetta and Baroni, 2005), the Vocabolario di base della lingua italiana
(De Mauro, 1985) and, for some specific issues, Wikipedia. The three
resources differ both in the reasons for which their authors developed them
and in the methodology applied in their development: Morph-it! is an
extensional corpus-based morphological lexicon; the Vocabolario di base is a
hand-made list of basic words; the Wikipedia material is a collection of
encyclopedic entries about irregular verbs in Italian.
   This paper is organized as follows: in Section 2 we describe the conversion
of the three lexical resources into a relational database; in Section 3 we
provide some details about the algorithm used to build SimpleLEX-IT; in
Section 4 we describe a work in progress to enrich the lexicon by using the
syntactic information extracted from a treebank; finally, Section 5 closes
the paper with conclusions.

2   Using a relational database for representing linguistic data

In order to merge different lexical resources we needed to convert them into
a common computational representation. We used a relational database² (SQL
henceforth) since all three resources are originally provided as text files,
organized as tables or simple lists.

  ² We used the PostgreSQL database.

   The first resource that we exploited for populating SimpleLEX-IT is
Morph-it! (Zanchetta and Baroni, 2005). The dataset released in the Morph-it!
project consists of a lexicon organized according to the inflected word
forms, with associated lemmas and morphological features. The lexicon is
provided by the authors as a text file where the values of the information
about each lexical entry are separated by tab characters. It is an
alphabetically ordered list of form-lemma-features triples. An example of the
annotation for the form corsi (ran) is:

    corsi correre-VER:ind past+1+s

where the features are the part of speech (PoS, VERb), the mood of the verb
(indicative), the tense (past), the person (1), and the number (singular).
The last released version of Morph-it! (v.48, 2009-02-23) contains 505,074
different forms corresponding to 35,056 lemmas. It was built starting from a
large newspaper corpus; nevertheless, it is not balanced, and a small number
of very common Italian words are not included in the lexicon, e.g. sposa
(bride), ovest (west) or aceto (vinegar). Morph-it! represents the Italian
language extensionally by listing all the morphological inflections, i.e. the
inflections of adjectives, verbs and nouns are represented as a list rather
than by using morphological rules. We converted Morph-it! into SQL by
exploiting its original feature structure: we used one single attribute to
represent one single feature³. We used one table to collect all the lemmata
and seven tables, with a different number of attributes, to collect the
various inflected forms:
• the table lemmata is formed by 3 attributes: a lemma, its PoS and its ID
  (integer). This table contains 34,725 records. A number of lemmata
  belonging to the original version of Morph-it! have been excluded in our
  conversion: proper nouns, emoticons and cardinals beginning with a digit
  (e.g. 15mila).
• the tables det_demo_table, pro_demo_table, pronou_table are used to collect
  the inflected forms of demonstrative determiners (116 records, 5
  attributes: ID_word, form, ID_lemma, number, gender), demonstrative
  pronouns (95 records, 5 attributes: ID_word, form, ID_lemma, number,
  gender), and personal pronouns (63 records, 7 attributes: ID_word, form,
  ID_lemma, person, number, gender, clitics).
• the tables adv_table, adj_table, nou_table, ver_table are used to collect
  the inflected forms of adverbs (1,594 records, 3 attributes: ID_word, form,
  ID_lemma), adjectives (72,367 records, 6 attributes: ID_word, form,
  ID_lemma, kind, number, gender), nouns (35,618 records, 5 attributes:
  ID_word, form, ID_lemma, number, gender) and verbs (392,139 records, 8
  attributes: ID_word, form, ID_lemma, mode, time, person, number, gender)
  respectively.

  ³ Morph-it! is provided with a script that allows for a naive conversion
    into SQL that uses one single table and one single attribute for all the
    features.

   The second resource we exploited for populating SimpleLEX-IT is the
“Vocabolario di base della lingua italiana” (VdB-IT henceforth), a collection
of around 7,000 words created by the linguist Tullio De Mauro and his team⁴
(De Mauro, 1985). The development of this vocabulary has been mainly driven
by the distinction between the most frequent words (around 5,000) and the
most familiar words (around 2,000). VdB-IT is therefore organized in the
following three sections:
• the vocabolario fondamentale (fundamental vocabulary), which contains 2,000
  words with the highest frequency in a balanced corpus of Italian texts
  (composed of novels, movie and theater scripts, newspapers, basic
  scholastic books); amore (love), lavoro (work), pane (bread) are in this
  section.
• the vocabolario di alto uso (vocabulary of high usage), which includes a
  further 2,937 words with high frequency, but lower than that of the
  vocabolario fondamentale; ala (wing), seta (silk), toro (bull) are in this
  section.
• the vocabolario di alta disponibilità (vocabulary of high availability),
  which is composed of 1,753 words not often used in written language but
  with a high frequency in spoken language, and which are indeed perceived as
  especially familiar by native speakers; aglio (garlic), cascata
  (waterfall), passeggero (passenger) are in this section.
The list of lemmata of VdB-IT has been converted into SQL by using one single
table, called lemmadema (6540 records), which has two attributes, i.e. an ID
(integer) and the lemma.

  ⁴ The second edition of the vocabulary has been announced (Chiari and De
    Mauro, 2014) and it is going to be released (p.c.).

   The third resource that we used for the lexicon creation is Wikipedia. Our
reference grammar (Patota, 2006) reports a partial list of the principal
Italian irregular verbs, but we decided to use the larger list of verbs
reported in Wikipedia⁵ (VerIrr henceforth). Another linguistic distinction
for Italian verbs reported in Wikipedia⁶ (VerInc henceforth) has been
exploited in the lexicon: the incoativi verbs are a subclass of the third
conjugation that have a special behavior in the present tense (e.g. capire).
So, in order to produce the correct conjugation of these verbs in
SimpleNLG-IT, they needed to be marked in the lexicon. Both these lists of
verbs have been converted into SQL by using two distinct tables which have
two attributes, i.e. an ID (integer) and the verb in the infinitive form. The
two tables are verbiirregolari (858 records) and verbiincoativi (726
records).

  ⁵ https://it.wikipedia.org/wiki/Verbi_irregolari_italiani
  ⁶ https://it.wikipedia.org/wiki/Verbi_incoativi

   A notable advantage of the SQL representation for linguistic resources is
the possibility to extract intrinsic information with simple queries. Indeed,
we found that Morph-it! and VdB-IT share 4,086 nouns and 1,448 verbs, but
there are 245 lemmas belonging to VdB-IT and not belonging to Morph-it!: most
of these words are nouns, for instance lavapiatti, chimica, incinta, but we
also found a systematic difference for verbs. Indeed, VdB-IT considers a
number of verbs as proper reflexive, for instance avvantaggiarsi, sdraiarsi.
In contrast, these verbs are treated as improper reflexive in Morph-it!,
which annotates avvantaggiare and sdraiare as their lemmata.

3   Building SimpleLEX-IT 1.0

In this section we describe the algorithm used to build the computational
lexicon SimpleLEX-IT, which is based on the three resources described in
Section 2 and which has been used in SimpleNLG-IT.
   A computational lexicon can be split into two major kinds of classes: open
and closed. The closed classes, which are usually composed of function words
(i.e. prepositions, determiners, conjunctions, pronouns, etc.), are those to
which new words are very rarely added. In contrast, the open classes, which
are usually composed of lexical words (i.e. nouns, verbs, adjectives,
adverbs), accept the addition of new words. We adopted the same strategy as
Vaudry and Lapalme (2013): we built by hand the closed part of the Italian
lexicon and we built automatically the open part by using the available
resources.
   In order to build the open classes for SimpleLEX-IT we needed both a large
coverage and a detailed account of morphological irregularities, also
considering their high frequency in Italian. Moreover, in order to have good
execution time in the realiser (cf. De Oliveira and Sripada, 2014), a
trade-off between the size of the lexicon and its usability for our task must
be achieved, which consists in assuming a form of word classification where
fundamental Italian words are distinguished from the less fundamental ones.
In order to balance completeness and efficiency in SimpleLEX-IT, we put in
the lexicon the open-class words belonging to the intersection of VdB-IT and
Morph-it!.
   We reported in Algorithm 1 the process used for the insertion and the
annotation of the words belonging to the open classes in SimpleLEX-IT. Note
that in order to recognize proper reflexive verbs, we check if the infinitive
form of the verb
has the postfix “rsi”, since Morph-it! contains this inflection as its normal
form. In Table 1 we reported some statistics about the composition of
SimpleLEX-IT. Most of the lexicon is composed of nouns (58%), followed by
verbs (21%), adjectives (19%), and adverbs (2%).

    foreach adverb ∈ Morph-it! ∩ VdB-IT do
        Add the adverb in normal form into SimpleLEX-IT
    end
    foreach adjective ∈ Morph-it! ∩ VdB-IT do
        Add the adjective in normal form (masculine-singular) and in
        feminine-singular, masculine-plural, feminine-plural forms, into
        SimpleLEX-IT
    end
    foreach noun ∈ Morph-it! ∩ VdB-IT do
        Add the noun in normal form (singular), the plural form, and the
        gender into SimpleLEX-IT
    end
    foreach verb ∈ Morph-it! ∩ VdB-IT do
        if verb ∈ VerIrr then
            Add all the inflections for the indicativo presente,
            congiuntivo presente, futuro semplice, condizionale,
            imperfetto, participio passato, passato remoto into
            SimpleLEX-IT
        else
            if verb ∈ VerInc then
                Set active the incoativo feature in the entry
            end
            if the verb is properly reflexive (i.e. "...rsi") then
                Set active the reflexive feature in the entry
            end
            Add the verb in normal form into SimpleLEX-IT
        end
    end

 Algorithm 1: The algorithm for building the adverbs, adjectives, nouns and
 verbs in SimpleLEX-IT

                    PoS            Number      %
                    Adverb            146      2
                    Verb (irr.)       283      4
                    Verb (reg.)      1168     17
                    Adjective        1333     19
                    Noun             4092     58
                    Total            7022    100

Table 1: Number of adverbs, adjectives, nouns and verbs in SimpleLEX-IT.

4   Work in Progress: adding information from a treebank

The Universal Dependency Treebank (UDT) is a recent project that releases
freely available treebanks for 33 languages (in this work, version 1.2)
(Nivre et al., 2016). Each UDT is split into three sections, train, dev and
test, which can be exploited in the evaluation of NLP/NLG systems.
   We are working on the idea of adding more information to SimpleLEX-IT by
using UDT-IT, i.e. the Italian section of UDT. A specific case that we are
currently considering regards auxiliary verbs. The current version of
SimpleNLG-IT does not manage auxiliary verbs: in order to produce some
complex verb tenses, e.g. passato prossimo, the user needs to give as input
to the realiser the correct auxiliary, i.e. essere (to be, e.g. Io sono nato
a Napoli) or avere (to have, e.g. Io ho amato la scuola). Our reference
grammar reports complex rules based on lexical semantics in order to choose
the correct auxiliary verb and, unfortunately, these rules have many
exceptions. So, we can use UDT-IT to empirically decide the correct auxiliary
in SimpleNLG-IT. Following this idea, we converted UDT-IT into SQL by
exploiting its original feature structure. We used one table to collect
information about the sentences, and one table to collect information about
the words:
• the table sentence_ud is formed by 4 attributes: an ID (integer), the
  original treebank (e.g. TUT, ISST, etc.), the original ID, the section
  (i.e. DEV, TRAIN, TEST).
• the table words_ud is used to collect all the words of the UDT-IT. It uses
  21 attributes: one attribute, id_sentence, contains the id of the sentence
  in the table sentence_ud, and 20 attributes correspond to the features used
  in the UD annotation.
In order to find the correct auxiliary for a specific verb in UDT-IT, we need
to exclude passive, reflexive and modal verb constructions in the query. We
found 512 verbs of SimpleLEX-IT that are used in UDT-IT with an auxiliary. It
is interesting to note that 60 verbs are used both with the auxiliary essere
and with the auxiliary avere: this is grammatical for some verbs (e.g.
vivere), but more often we found an annotation error in the UDT-IT.
   Finally, note that another possible use of UDT-IT regards the evaluation
of the lexicon. In future work we plan to quantify the coverage of
SimpleLEX-IT by using the TEST section of the UDT-IT.

5   Conclusions

In this paper we have presented some issues about the computational lexicon
SimpleLEX-IT. We described the algorithm used to build the lexicon, three SQL
resources produced as side effects of the lexicon building and a work in
progress about the extraction of syntactic information from UDT-IT.
   All the resources described in this paper can be downloaded at
https://github.com/alexmazzei/SimpleLEX-IT.
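
As an illustration of the lemma-level comparison described in Section 2 and of the verb branch of Algorithm 1 (Section 3), the following is a minimal sketch rather than the code actually used to build SimpleLEX-IT. The table names lemmata, lemmadema, verbiirregolari and verbiincoativi come from Section 2; the column names, the toy rows, the helper function names and the use of Python's sqlite3 module in place of PostgreSQL are assumptions made only for this example.

```python
import sqlite3

# Table names follow Section 2; column names are assumptions for this sketch.
SCHEMA = """
CREATE TABLE lemmata (id INTEGER PRIMARY KEY, lemma TEXT, pos TEXT);
CREATE TABLE lemmadema (id INTEGER PRIMARY KEY, lemma TEXT);
CREATE TABLE verbiirregolari (id INTEGER PRIMARY KEY, verb TEXT);
CREATE TABLE verbiincoativi (id INTEGER PRIMARY KEY, verb TEXT);
"""

# Toy data (hypothetical): a handful of rows standing in for the real resources.
TOY_MORPH_IT = [("correre", "VER"), ("capire", "VER"), ("amare", "VER"),
                ("pentirsi", "VER"), ("pane", "NOUN"), ("zappare", "VER")]
TOY_VDB_IT = ["correre", "capire", "amare", "pentirsi", "pane", "lavapiatti"]
TOY_VER_IRR = ["correre"]
TOY_VER_INC = ["capire"]


def build_db():
    """Create an in-memory database populated with the toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO lemmata (lemma, pos) VALUES (?, ?)",
                     TOY_MORPH_IT)
    conn.executemany("INSERT INTO lemmadema (lemma) VALUES (?)",
                     [(lemma,) for lemma in TOY_VDB_IT])
    conn.executemany("INSERT INTO verbiirregolari (verb) VALUES (?)",
                     [(verb,) for verb in TOY_VER_IRR])
    conn.executemany("INSERT INTO verbiincoativi (verb) VALUES (?)",
                     [(verb,) for verb in TOY_VER_INC])
    return conn


def missing_from_morph_it(conn):
    """Lemmas that are in VdB-IT but not in Morph-it!."""
    rows = conn.execute("""
        SELECT v.lemma FROM lemmadema v
        WHERE NOT EXISTS (SELECT 1 FROM lemmata m WHERE m.lemma = v.lemma)
        ORDER BY v.lemma
    """)
    return [row[0] for row in rows]


def verb_entries(conn):
    """Verb branch of Algorithm 1 over the intersection Morph-it! ∩ VdB-IT."""
    rows = conn.execute("""
        SELECT m.lemma,
               EXISTS (SELECT 1 FROM verbiirregolari i WHERE i.verb = m.lemma),
               EXISTS (SELECT 1 FROM verbiincoativi c WHERE c.verb = m.lemma)
        FROM lemmata m JOIN lemmadema v ON v.lemma = m.lemma
        WHERE m.pos = 'VER'
        ORDER BY m.lemma
    """)
    entries = {}
    for lemma, irregular, incoativo in rows:
        if irregular:
            # Algorithm 1 stores the full paradigm here; the sketch only flags it.
            entries[lemma] = {"irregular": True}
        else:
            entries[lemma] = {
                "irregular": False,
                "incoativo": bool(incoativo),
                # Proper reflexive verbs are recognised by the "rsi" postfix.
                "reflexive": lemma.endswith("rsi"),
            }
    return entries


if __name__ == "__main__":
    connection = build_db()
    print(missing_from_morph_it(connection))
    print(verb_entries(connection))
```

Keeping the set operations in SQL and only the per-entry annotation in the host language mirrors the observation in Section 2 that intrinsic information can be extracted with simple queries; on the real tables the same queries would run against the PostgreSQL conversion of the resources.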
References

Isabella Chiari and Tullio De Mauro. 2014. The New Basic Vocabulary of
  Italian as a linguistic resource. In Roberto Basili, Alessandro Lenci, and
  Bernardo Magnini, editors, 1st Italian Conference on Computational
  Linguistics (CLiC-it), volume 1, pages 93–97. Pisa University Press,
  December.

Tullio De Mauro. 1985. Guida all’uso delle parole. Libri di base. Editori
  Riuniti.

Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by
  learning from combined evidence. In Proceedings of the 18th ACM Conference
  on Information and Knowledge Management (CIKM 2009), pages 513–522, New
  York, NY, USA. ACM.

Rodrigo De Oliveira and Somayajulu Sripada. 2014. Adapting SimpleNLG for
  Brazilian Portuguese realisation. In Proc. of INLG 2014.

Albert Gatt and Ehud Reiter. 2009. SimpleNLG: A Realisation Engine for
  Practical Applications. In Proc. of ENLG 2009, ENLG ’09.

Alessandro Mazzei, Cristina Battaglino, and Cristina Bosco. 2016.
  SimpleNLG-IT: adapting SimpleNLG to Italian. In Proc. of INLG 2016. To
  appear.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic
  construction, evaluation and application of a wide-coverage multilingual
  semantic network. Artificial Intelligence, 193:217–250.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan
  Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo,
  Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal
  Dependencies v1: A Multilingual Treebank Collection. In Proc. of LREC’16,
  May. To appear.

Giuseppe Patota. 2006. Grammatica di riferimento dell’italiano
  contemporaneo. Guide linguistiche. Garzanti Linguistica.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002.
  MultiWordNet: developing an aligned multilingual database. In Proceedings
  of the First International Conference on Global WordNet, January.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation
  Systems. Cambridge University Press, New York, NY, USA.

Nilda Ruimy, Ornella Corazzari, Elisabetta Gola, Antonietta Spanu, Nicoletta
  Calzolari, and Antonio Zampolli. 1998. The European LE-PAROLE project: the
  Italian Syntactic Lexicon. In Proceedings of the First International
  Conference on Language Resources and Evaluation, pages 241–248.

Pierre-Luc Vaudry and Guy Lapalme. 2013. Adapting SimpleNLG for bilingual
  English-French realisation. In Proc. of ENLG 2013.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based
  morphological resource for the Italian language. Corpus Linguistics 2005,
  1(1).