The Role of a Computational Lexicon for Query Expansion in Full-Text Search

Emiliano Giovannetti, Davide Albanesi, Andrea Bellandi, Simone Marchi, Mafalda Papini, Flavia Sciolette
Istituto di Linguistica Computazionale, Via G. Moruzzi 1, 56124, Pisa
name.surname@ilc.cnr.it

Abstract

This work describes the first experiments conducted with a computational lexicon of Italian in the context of query expansion for full-text search. An application, composed of a graphical user interface and backend services to access the lexicon and the database containing the corpus to be queried, was developed. The text was morphologically analysed to improve the precision of the search process. Some examples of queries are given to show the potential of a text search approach supported by a complex and stratified lexical resource.

1    Introduction

The need for techniques going beyond mere “search by keyword” in the querying of textual resources dates back to the dawn of computational linguistics. Seminal works in the 60s on the development of the very first question answering (QA) systems already included linguistic resources as support datasets. To cite some “old school” examples, the “General Inquirer” QA system (Stone et al., 1962) used a thesaurus for “coding words as to concept membership”, while Simmons’ “Protosynthex” was equipped with a synonym dictionary (Simmons et al., 1963) to “expand the meaning of the question's words to any desired level”. One of the first works specifically focussed on the use of a lexical resource for NLP tasks was about COMPLEX (for “COMPutational LEXicon”), a resource developed at IBM (Klavans, 1988).

The support of linguistic resources has proved its potential in the field of information retrieval (IR) too, as highlighted in many of Bill Woods’ works, culminating in the introduction of his conceptual indexing technique and the conceptual taxonomy resource (Woods, 1997), later refined in an article entitled “Linguistic Knowledge can Improve Information Retrieval” (Woods et al., 2000). More recently, other researchers have stressed the importance of the availability of a “Lexical Knowledge Base” (another way to refer to a computational lexicon) in tasks such as Word Sense Disambiguation, since its use, in some contexts, can outperform supervised systems (Agirre et al., 2009).

The use of linguistic resources in QA in the earliest period of computational linguistics can be considered the precursor of “query expansion” (QE), the technique that Manning, Raghavan, and Schütze list among the “global methods” used in IR to tackle those situations in which “the same concept may be referred to using different words” (Manning et al., 2008).

Though QE may be obtained in different ways (among which query reformulations based on query log mining), we are here interested in

 Copyright ©️ 2021 for this paper by its authors. Use
permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
those applications that make use of lexical resources.

Most of the works, published from the 90s to the present day (proving that QE is still being investigated), exploit WordNet (Fellbaum, 1998), the de facto standard and most widespread ontological (or lexical, depending on the point of view) multilingual resource. Ellen Voorhees was one of the first, and used WordNet’s IS_A relations to improve text retrieval (Voorhees, 1993). Moving on directly to the most recent works, WordNet has been used with all its ontological features to expand queries in a semantic text search context in (Ngo et al., 2018), while in (Azad and Deepak, 2019) the authors combined WordNet and Wikipedia for QE, exploiting the former to expand individual terms and the latter to expand phrase terms.

The research work illustrated here places itself in the context of full-text search carried out using a lexical resource-driven QE technique. However, the focus of this research, unlike that of the cited works, is not on the specific QE technique and its evaluation, but on the resource we chose to exploit in place of WordNet, introduced in the next section, and on the frontend and backend technologies implemented to query the text, as described in detail in Section 3. The advantages derived from the adoption of a rich and highly structured computational lexicon will also be highlighted through some query examples shown in Section 4. The developed application can be freely accessed and used to query the corpus1.

2    The Context and the Resource

This work stems from the activities conducted by the Institute of Computational Linguistics of CNR (ILC-CNR) in the context of the Talmud Translation Project2. The need to provide a way to query the Italian translation of the Talmud3 on a linguistic basis was the initial spark that led to the idea of experimenting with the use of a computational lexicon for Italian. As a matter of fact, this resource (described below) represents a “linguistic mine” which has never been exploited for tasks of full-text search or information retrieval.

2.1    The Parole-Simple-Clips Lexicon

“PAROLE-SIMPLE-CLIPS” (PSC) is a computational lexicon of Italian, developed from 1996 to 2003 by ILC-CNR (Ruimy et al., 2002). Currently, the resource is stored as a MySQL database available on CLARIN4, and represents a unicum among the available linguistic resources for Italian, thanks to the richness and articulated structure of its data. Based on the Generative Lexicon theory (Pustejovsky, 1995), the schema on which the linguistic information is encoded is composed of four distinct but strictly interconnected layers of analysis: phonology, morphology, syntax, and semantics.

In these features lies the motivation of this work, since the available linguistic information may be combined in ways that go well beyond what resources such as WordNet allow in the context of text search support. Even considering semantics alone, the information in PSC is detailed with fine-grained features that are not described in WordNet’s network of synsets: PSC encodes the meaning of each lexical sense as an array of information, including “templates” (see below), semantic traits, semantic roles, and argument structures.

In this work, we document the first steps in the use of PSC for QE. At this stage we used: i) the Morphological Units, classified according to their POS, which represent the lemmas of the computational lexicon; ii) the Phonological Units, which represent the inflected forms of the lemmas; iii) the Semantic Units (SemUs), which describe the senses expressed by the words. Furthermore, we considered the following morphological and semantic information: i) morphological traits (e.g. gender, number); ii) relations between SemUs (at the moment limited to synonymy and hyponymy); iii) the association between SemUs and “templates”, representing sets of senses, labeled according to one of the types represented in the Simple Ontology (Lenci et al., 2000). The other parts

1 https://klab.ilc.cnr.it/talmudSearch/
2 https://www.talmud.it/
3 The corpus here queried is limited to eight tractates of the Babylonian Talmud: Rosh Hashanah, Berakhot, Ta'anit, Kiddushin, Chagigah, Beitza, Sukkah, and Megillah.
4 https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-88
of linguistic information will be the subject of future works, according to an incremental approach.

3    The Process and the Application

The whole search process involves a series of steps that can be summarized as follows (see Fig. 1 for a schematic functional architecture of the application):

    i) the user inserts a first set of data to formulate the desired query in the Graphical User Interface;
    ii) the interface requests, via Web API, the lexicon backend services, which return the linguistic data matching the initial query;
    iii) the user completes the query taking into account the linguistic data and starts the search;
    iv) the interface executes the query expansion and requests, via Web API, the text backend services, which collect, tag, and return the matching textual portions of the Talmud;
    v) the interface shows the results to the user.

    Figure 1. Functional architecture of the application.

First of all, to make the lexicon efficiently queryable, it needed to be transformed from relational data into linked data (Section 3.1). At the same time, a list of services to query both PSC and the database storing the Italian translation of the Talmud needed to be developed in order to answer the interface requests (Section 3.2). The interface itself was designed on the basis of the available linguistic information exposed from PSC and developed accordingly (Section 3.3). Finally, to improve the precision of the search process, the queried corpus was also POS-tagged (Section 3.4).

3.1    A First Conversion of PSC

The first phase of our work was to consider the relational database of PSC as the data source for the generation of a first Linked Data (LD) conversion. Two main reasons led to the need for a conversion of PSC: i) to ease the reuse of the lexicon itself, in virtue of the intrinsic nature of LD; ii) the possibility of performing automated reasoning on the data, if appropriately modeled taking into account ontological principles, for example to compute inferred closures, infer new knowledge on the basis of class taxonomies, property hierarchies, and so on. In accordance with the LD principles, we first had to look for existing vocabularies for the modeling of lexicons.

In the context of the Semantic Web, the de facto standard for representing lexical information is the lemon model (Cimiano et al., 2016). Its core module, called OntoLex, allows the representation of grammatical, basic morphological and semantic information by means of three main classes: Lexical Entry, Form (lemma and inflected forms), and Lexical Sense. Lemon relies on external vocabularies to define semantic relations between senses: in this conversion we modelled PSC’s synonymy and hyponymy with the LexInfo ontology5. Currently, the converted resource includes 72006 lexical entries (48735 nouns, 6522 verbs, and 11830 adjectives), 469726 inflected forms, and 57130 senses. Explicit lexico-semantic relations include 1803 meronyms, 4060 synonyms, and 44487 hyponyms. This initial conversion of PSC as Linked Data was purely instrumental to the linguistic querying of the Italian translation of the Babylonian Talmud6. Therefore, it was decided to convert only a selected subset of the linguistic data, to be exploited for the process of query expansion. At the time of writing this
5 https://lexinfo.net/
6 We remark that the conversion of PSC Simple is not the focus of this work, but it was necessary for performing linguistic search experiments on the Italian translation of the Talmud.
proposal, a complete conversion of PSC as LOD (Linked Open Data) is in progress. This complete conversion will also take full advantage of the already available works on the resource as documented in (Khan et al., 2018) and (Del Gratta et al., 2015).

3.2   Setting up the Backend

Once the computational lexicon was converted, the implementation of the querying system continued with the creation of the backend services needed to access both the lexicon and the database storing the text to be queried. Regarding the lexicon, a GraphDB7 repository, containing all the converted data, was set up. The access to the repository was implemented with a set of REST services that can be invoked from any web client8. The services have been based on the already available backend of LexO, a collaborative web tool for the creation and editing of lemon lexical resources (Bellandi, 2021). At the same time, a list of analogous services was made available to retrieve the textual portions of the corpus matching the expanded queries coming from the frontend of the system. The Italian translation of the Babylonian Talmud is currently stored as a MySQL database, where each segment of text appears both in its original and POS-tagged version (see Section 3.4).

3.3   The Graphical User Interface

The GUI (Fig. 2) set up to query the corpus was developed using Angular9, one of the most widespread frameworks for frontend Web development, which provides high levels of portability and scalability. In this first version of the search system, the interface was conceived as a sort of “hub” of the whole architecture: on one side it interacts with the user, and on the other it invokes the services exposed by GraphDB and the Talmud database. The interface is divided into two sections. In the left-hand column, the available tractates of the Talmud that can be queried are represented as a tree, allowing the user to specify the search context at different levels of granularity. The right-hand section contains the search parameters, where the user can choose between three types of search using the available tabs: Keyword, Form/Lemma, or Semantic Traits.

The first one is the classic keyword-based search. The second type, via the Form/Lemma tab, allows searching for a specific word form or for the set of inflected forms of a given lemma by specifying some morphological constraints. When a word is entered in the text field, the GUI invokes the lexicon backend services to retrieve the lemmas corresponding to the indicated parameters and displays them with their different senses. Users can then proceed with the search, or they can select one or more lemmas and apply morphological constraints to them by clicking on the three dots icon on their right. The selection of at least one of the senses enables the semantic extension search feature: a drop-down menu allows users to look for all the other senses in the lexicon appearing as hypernyms, hyponyms, or synonyms at a specified distance. The forms obtained with this extension are subject to the propagation of the morphological constraints applied to the lexical entry to which they are linked, whether explicit (entered from the interface) or implicit (in the case of a search by form). Finally, the “semantic traits” tab provides two template trees on which multiple selections are possible: the first click selects a template with all its descendants, the second deselects the descendants, and the third deselects the node itself. When the selection changes, the lexicon is queried to obtain the list of senses linked to the chosen templates. Users can then select the desired senses, which will be used to retrieve the forms of the related lemmas to be used in the QE.

All the entered data are used to compose the expanded query, which consists of all the inflected forms provided by the lexicon and matching the indicated morphological constraints, semantic extension, or templates.

The results coming from the backend services accessing the Talmud database are then displayed in a table on the right-hand side, on top of which a panel lists the forms retrieved from the lexicon and used for the QE.
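The composition of the expanded query described above can be illustrated with a minimal sketch. It is not the application's actual code: the dictionaries below are hand-written toy data (the lemma “insegnamento” and its three hyponyms are borrowed from Section 4, while the inflected forms are typed in by hand rather than extracted from PSC), senses are identified by their lemma for brevity, and the names `related_senses` and `expand_query` are hypothetical.

```python
from collections import deque

# Toy data, hand-written for illustration (not extracted from PSC).
# Senses are identified by their lemma because each word here has one sense.
RELATIONS = {
    "hyponym": {"insegnamento": ["erudizione", "istruzione", "catechesi"]},
    "synonym": {},
}
FORMS = {
    "insegnamento": [("insegnamento", {"number": "singular"}),
                     ("insegnamenti", {"number": "plural"})],
    "erudizione": [("erudizione", {"number": "singular"}),
                   ("erudizioni", {"number": "plural"})],
    "istruzione": [("istruzione", {"number": "singular"}),
                   ("istruzioni", {"number": "plural"})],
    "catechesi": [("catechesi", {"number": "singular"}),
                  ("catechesi", {"number": "plural"})],
}


def related_senses(start, relation, distance):
    """Breadth-first walk over one relation, up to `distance` steps away."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        sense, depth = queue.popleft()
        if depth == distance:
            continue  # do not expand beyond the requested distance
        for neighbour in RELATIONS[relation].get(sense, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


def expand_query(lemma, relation, distance, constraints):
    """Inflected forms of the lemma and of its related senses that satisfy
    every morphological constraint (the constraints are propagated to the
    forms reached through the semantic extension)."""
    forms = set()
    for sense in related_senses(lemma, relation, distance):
        for written, traits in FORMS.get(sense, []):
            if all(traits.get(k) == v for k, v in constraints.items()):
                forms.add(written)
    return sorted(forms)


print(expand_query("insegnamento", "hyponym", 1, {"number": "plural"}))
# ['catechesi', 'erudizioni', 'insegnamenti', 'istruzioni']
```

The resulting list of forms is what would be sent to the text backend; filtering by trait before collecting the forms is one simple way to obtain the propagation of morphological constraints through the semantic extension.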

7 Ontotext GraphDB is a highly efficient and robust graph database with RDF/OWL and SPARQL support (https://graphdb.ontotext.com/documentation/free/free/graphdb-free.html)
8 The source code of the REST services is available at https://github.com/andreabellandi/LexO-backend
9 https://angular.io/
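The three-click selection cycle on the template trees of the “semantic traits” tab (Section 3.3) can be read as a small state machine. The sketch below is only one possible reading of that behaviour, written in Python rather than in the Angular code of the actual interface; the class `TemplateSelection` is hypothetical, and the tree fragment is invented for the example (only the labels “Entity”, “Air animal”, and “Earth animal” appear in the paper, the intermediate node “Animal” is an assumption).

```python
from dataclasses import dataclass, field

# Hypothetical fragment of a template tree: "Animal" is an invented
# intermediate node, used only to give the example some depth.
CHILDREN = {
    "Entity": ["Animal"],
    "Animal": ["Air animal", "Earth animal"],
}


@dataclass
class TemplateSelection:
    children: dict
    selected: set = field(default_factory=set)
    clicks: dict = field(default_factory=dict)  # node -> clicks modulo 3

    def descendants(self, node):
        """All nodes below `node` in the tree."""
        found = []
        for child in self.children.get(node, []):
            found.append(child)
            found.extend(self.descendants(child))
        return found

    def click(self, node):
        state = (self.clicks.get(node, 0) + 1) % 3
        self.clicks[node] = state
        if state == 1:      # first click: node plus all its descendants
            self.selected.update([node, *self.descendants(node)])
        elif state == 2:    # second click: drop the descendants only
            self.selected.difference_update(self.descendants(node))
        else:               # third click: drop the node itself
            self.selected.discard(node)
        return self.selected


tree = TemplateSelection(CHILDREN)
print(sorted(tree.click("Animal")))  # ['Air animal', 'Animal', 'Earth animal']
print(sorted(tree.click("Animal")))  # ['Animal']
print(sorted(tree.click("Animal")))  # []
```

The sketch tracks clicks per node and does not try to reconcile clicks on a parent with earlier clicks on its children, which a real interface would have to handle.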
3.4    POS-Tagging of the Text

For the purpose of reducing the lexical ambiguity in cases where a searched word could match with homographs, the corpus was automatically analyzed and annotated with morphological information.
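To make the role of the annotation concrete, the following sketch shows how a part-of-speech tag and a set of morphological features can separate two homographs at search time. It is an illustration under stated assumptions, not the system's implementation: the two segments are invented and hand-tagged (they are not taken from the Talmud corpus), the features are written as UD-style `Name=Value` strings, and the helpers `parse_feats` and `search` are hypothetical names.

```python
# Hand-tagged toy segments (not taken from the Talmud corpus); each token
# is (form, UPOS tag, UD-style feature string).
SEGMENTS = [
    [("chiuse", "VERB", "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"),
     ("la", "DET", "Definite=Def|Gender=Fem|Number=Sing|PronType=Art"),
     ("porta", "NOUN", "Gender=Fem|Number=Sing")],
    [("egli", "PRON", "Gender=Masc|Number=Sing|Person=3|PronType=Prs"),
     ("porta", "VERB", "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"),
     ("il", "DET", "Definite=Def|Gender=Masc|Number=Sing|PronType=Art"),
     ("pane", "NOUN", "Gender=Masc|Number=Sing")],
]


def parse_feats(feats):
    """Turn 'Gender=Fem|Number=Sing' into {'Gender': 'Fem', 'Number': 'Sing'}."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def search(segments, form, upos=None, **constraints):
    """Indices of the segments containing `form` with the requested POS tag
    and morphological features (both optional)."""
    hits = []
    for index, tokens in enumerate(segments):
        for written, tag, feats in tokens:
            traits = parse_feats(feats)
            if (written.lower() == form
                    and (upos is None or tag == upos)
                    and all(traits.get(k) == v for k, v in constraints.items())):
                hits.append(index)
                break  # one match is enough to return the segment
    return hits


print(search(SEGMENTS, "porta"))                # [0, 1]  plain keyword search
print(search(SEGMENTS, "porta", upos="NOUN"))   # [0]     "door" only
print(search(SEGMENTS, "porta", upos="VERB", Tense="Pres"))  # [1]
```

A plain keyword search returns both segments, while the tagged search keeps only the reading the user asked for, which is the gain in precision the annotation is meant to provide.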


        Figure 2. The graphical user interface showing the example of lemma “insegnamento”.

In particular, we parsed all the sentences of the eight tractates of the Babylonian Talmud with Stanford's Stanza tools (Qi et al., 2020) using the pre-trained model based on the UD Italian ISDT treebank10. The tool was configured to use the processors for tokenization, multi-word token expansion, and Part-of-Speech tagging, which also includes the attribution of morphological traits. Each morphologically annotated textual segment was then stored in the MySQL database, so that only the forms matching the morphological constraints coming from the GUI are returned.

4     Examples of Queries

In this last section, we show a concrete application of the approach by introducing some query examples. Each query can also be tested by the reader by accessing the available application.

The first two examples show the search for words with specific morphological traits and the application of semantic extension. In these cases, the “Form/Lemma” type of search is selected. In the first example, the word “insegnamento” (teaching) is inserted as a lemma. The system finds it in the lexicon and shows it as a noun with one single sense. The user then adds a morphological constraint by setting the “number” trait to “plural”. Finally, the user extends the search to direct hyponyms (distance = 1) and submits the query.

This is a simple case of propagation of the morphological traits through semantics. The lexicon contains the following two key pieces of information: i) the fact that the sense of “insegnamento” has three hyponyms: “erudizione” (erudition), “istruzione” (instruction), and “catechesi” (catechesis); ii) all the inflected forms and the related morphological traits of the searched word and its three hyponyms. On the basis of these data, the system composes the final query, which searches for all the plural forms of the four lemmas as nouns. As a result, 103 textual segments are retrieved, containing the words “insegnamenti” (97 matches) and “istruzioni” (6 matches) (Fig. 2).

The second example involves the verb “permettere” (to permit/allow), searched as a lemma, with morphological constraints on the finite mood (“indicative”, “subjunctive”, “imperative”, “conditional”). In addition, the user selects just one of the two available senses of the verb (the one with the definition “dare a
10 https://universaldependencies.org/treebanks/it_isdt/index.html
qlcu la possibilità di fare qlco”, i.e. to give sb the chance to do smth) and then extends the search to its synonyms. In this case, the lexicon proposes two synonyms of the selected sense: the (single) senses of the words “concedere” and “consentire”. The resulting expanded query retrieves from the database a total of 405 matches, containing 334 strings of “permettere” (for 131 available forms of the lexicon), 44 strings of “concedere” (for 45 available forms), and 27 strings of “consentire” (for 41 forms).

The last type of search is structured as a more explorative querying of the corpus. In the semantic traits tab, the user can choose one or more among the noun/verb or adjectival templates (groups of senses), to look for all words related to a specific semantic field, such as objects, weather verbs, metalanguage, etc.

In this example, the user selects the template “Air animal”, which appears as a “leaf” of the sub-tree under the parent node “Entity”. Once the template is chosen, the system retrieves from the lexicon all the related senses and shows them in a window. It is then possible to select all the 165 available senses or just some of them. Finally, the user can run the search: the system composes the expanded query and retrieves 226 textual segments of the Talmud containing words (both as lemmas and inflected forms) with senses referring to the semantic field of “Air animal”: “uccello” (bird), “mosca” (fly), “cavallette” (grasshoppers), and so on.

Among future developments, a feature for a “grouped” selection of multiple templates will be added, which will allow searching for textual segments containing co-occurrences of words referring to the specified templates. For example, the grouped selection of the templates “Color” and “Earth animal” will retrieve segments containing multiword expressions such as “vacca rossa” (red cow), “gatta nera” (black she-cat), “oche bianche” (white geese), etc.

5    Conclusion

As shown in this paper, the availability of a rich and structured linguistic resource (such as the computational lexicon we have taken into account) seems to provide an edge over the standard query expansion techniques for full-text search based on WordNet. Now that a very first portion of the resource has been made available (though with a preliminary conversion) and the web application has been implemented, the road is cleared for the next steps.

The first critical issue that will need to be faced involves the limitedness of the resource, which covers most, but not all, of the lemmas, forms, and senses of standard contemporary Italian and lacks many domain-related terms or senses. To fill this gap the resource will have to be updated and enriched with more entries.

At the same time, as anticipated, a more in-depth and rigorous conversion of PSC will have to be carried out, a process that will probably take a lot of time and research effort and that, for the sake of this first experiment, would have been premature and unnecessary. As soon as the whole conversion is ready, the rest of the information encoded in the lexicon will be made available and integrated in the search process.

Though the benefits of the availability of a computational lexicon with respect to WordNet (or a similar resource) may seem obvious in a context of QE for full-text search, an empirical evaluation would be desirable. However, the setup of a benchmark conceived for this purpose appears anything but easy, mainly due to the lack of comparable works or evaluation campaigns focussing on the role of linguistic resources as support.

In conclusion, we believe these first experiments carried out by querying the Talmudic text appear promising, especially considering that only a small part of the lexicon has been used. In addition, the support in disambiguation provided by the POS tagging of the text suggests that a hybridization of a resource-driven QE technique with a deeper stochastic annotation of the corpus to be queried may constitute an interesting experimental field to be investigated.

Acknowledgments

This work was conducted in the context of the TALMUD project and the scientific cooperation between S.c.a r.l. PTTB and ILC-CNR.

References

Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2009. Knowledge-Based WSD on Specific Domains: Performing better than Generic Supervised WSD. In IJCAI'09: Proceedings of the 21st international joint conference on Artificial intelligence. 1501-1506.
Andrea Bellandi. 2021. LexO: An Open-source System for Managing OntoLex-Lemon Resources. Language Resources & Evaluation. https://doi.org/10.1007/s10579-021-09546-4

Ontology-Lexicon Community Group (W3C). Philipp Cimiano, John P. McCrae, and Paul Buitelaar (eds). 2016. Lexicon Model for Ontologies: Community Report. https://www.w3.org/2016/05/ontolex/#overview

Riccardo Del Gratta, Francesca Frontini, Fahad Khan, and Monica Monachini. 2015. Converting the PAROLE SIMPLE CLIPS Lexicon into RDF with lemon. Semantic web 6: 387-392.

Christiane Fellbaum. 1998. WordNet: An electronic lexical database. MA: MIT Press.

Hiteshwar Kumar Azad and Akshay Deepak. 2019. A new approach for query expansion using Wikipedia and WordNet. Information sciences 492: 147-163.

Alessandro Lenci, Nuria Bel, Federica Busa, Nicoletta Calzolari, Elisabetta Gola, Monica Monachini, Antoine Ogonowski, Ivonne Peters, Wim Peters, Nilda Ruimy, Marta Villegas, and Antonio Zampolli. 2000. SIMPLE: A General Framework for the Development of Multilingual Lexicons. International Journal of Lexicography 13(4): 249–263.

Fahad Khan, Andrea Bellandi, Francesca Frontini, and Monica Monachini. 2018. One Language to rule them all: Modelling Morphological Patterns in a Large Scale Italian Lexicon with SWRL. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. hal-01832652

Judith Klavans. 1988. COMPLEX: a computational lexicon for natural language systems. In COLING '88: Proceedings of the 12th conference on Computational Linguistics. https://doi.org/10.3115/991719.991802

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, Cambridge University Press.

Vuong M. Ngo, Tru H. Cao, and Tuan M. V. Le. 2018. WordNet-Based Information Retrieval Using Common Hypernyms and Combined Features. preprint arXiv:1807.05574.

James Pustejovsky. 1995. The Generative Lexicon. MA: MIT Press.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations.

Nilda Ruimy, Monica Monachini, Raffaella Distante, Elisabetta Guazzini, Stefano Molino, Marisa Ulivieri, Nicoletta Calzolari, and Antonio Zampolli. 2002. Clips, a multi-level Italian computational lexicon: A glimpse to data. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC02).

Robert F. Simmons, Sheldon Klein, and Keren McConlogue. 1963. Indexing and dependency logic for answering English questions. American Documentation 15(3): 196-204.

Philip J. Stone, Robert F. Bales, J. Zvi Namenwirth, and Daniel Ogilvie. 1962. The general inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science 7(4): 484–498.

Ellen M. Voorhees. 1993. Using WordNet to disambiguate word senses for text retrieval. In SIGIR '93: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. https://doi.org/10.1145/160688.160715

William A. Woods. 1997. Conceptual indexing: A better way to organize knowledge. Technical Report SMLI TR-97-61, Sun Microsystems Laboratories, Mountain View, CA, April. www.sun.com/research/techrep/1997/abstract61.html.

William A. Woods, Lawrence A. Bookman, Ann Houston, Robert J. Kuhns, Paul Martin, and Stephen Green. 2000. Linguistic knowledge can improve information retrieval. In ANLC '00: Proceedings of the sixth conference on Applied natural language processing. https://doi.org/10.3115/974147.974183