The Role of a Computational Lexicon for Query Expansion in Full-Text Search

Emiliano Giovannetti, Davide Albanesi, Andrea Bellandi, Simone Marchi, Mafalda Papini, Flavia Sciolette
Istituto di Linguistica Computazionale, Via G. Moruzzi 1, 56124, Pisa
name.surname@ilc.cnr.it

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This work describes the first experiments conducted with a computational lexicon of Italian in a context of query expansion for full-text search. An application, composed of a graphical user interface and backend services to access the lexicon and the database containing the corpus to be queried, was developed. The text was morphologically analysed to improve the precision of the search process. Some examples of queries are given to show the potential of a text search approach supported by a complex and stratified lexical resource.

1 Introduction

The need for techniques going beyond the mere “search by keyword” in the querying of textual resources dates back to the dawn of computational linguistics. Seminal works in the 60s on the development of the very first question answering (QA) systems already included linguistic resources as support datasets. To bring some “old school” examples, the “General Inquirer” QA system (Stone et al., 1962) used a thesaurus for “coding words as to concept membership”, while Simmons' “Protosynthex” was equipped with a synonym dictionary (Simmons et al., 1963) to “expand the meaning of the question's words to any desired level”. One of the first works specifically focussed on the use of a lexical resource for NLP tasks was about COMPLEX (for “COMPutational LEXicon”), a resource developed at IBM (Klavans, 1988).

The support of linguistic resources has proved its potential in the field of information retrieval (IR) too, as highlighted in many of Bill Woods' works, culminating in the introduction of his conceptual indexing technique and the conceptual taxonomy resource (Woods, 1997), later refined in an article entitled “Linguistic Knowledge can Improve Information Retrieval” (Woods et al., 2000). More recently, other researchers have stressed the importance of the availability of a “Lexical Knowledge Base” (another way to refer to a computational lexicon) in tasks such as Word Sense Disambiguation, since its use, in some contexts, can outperform supervised systems (Agirre et al., 2009).

The use of linguistic resources in QA in the earliest period of computational linguistics can be considered the precursor of “query expansion” (QE), the technique that Manning, Raghavan, and Schütze present among the methods used in IR to tackle those situations in which “the same concept may be referred to using different words” (Manning et al., 2008).

Though QE may be obtained in different ways (among which query reformulations based on query log mining), we are here interested in those applications that make use of lexical resources.

Most of the works, published from the 90s to the present day (proving that QE is still being investigated), exploit WordNet (Fellbaum, 1998), the de facto and most widespread ontological (or lexical, depending on the point of view) multilingual resource. Ellen Voorhees was one of the first and used WordNet's IS_A relations to improve text retrieval (Voorhees, 1993). Moving on directly to the most recent works, WordNet has been used with all its ontological features to expand queries in a semantic text search context in (Ngo et al., 2018), while in (Azad and Deepak, 2019) the authors combined WordNet and Wikipedia for QE, exploiting the first to expand individual terms and the second to expand phrase terms.

The research work illustrated here places itself in the context of full-text search carried out using a lexical resource-driven QE technique. However, the focus of this research, differently from that of the cited works, is not on the specific QE technique and its evaluation, but on the resource we chose to exploit in place of WordNet, introduced in the next section, and on the frontend and backend technologies implemented to query the text, as described in detail in Section 3. The advantages derived from the adoption of a rich and highly structured computational lexicon will also be illustrated through some query examples shown in Section 4. The developed application can be freely accessed and used to query the corpus [1].

[1] https://klab.ilc.cnr.it/talmudSearch/

2 The Context and the Resource

This work stems from the activities conducted by the Institute of Computational Linguistics of CNR (ILC-CNR) in the context of the Talmud Translation Project [2]. The need to provide a way to query the Italian translation of the Talmud [3] on a linguistic basis was the initial spark that led to the idea of experimenting with the use of a computational lexicon for Italian. As a matter of fact, this resource (described below) represents a “linguistic mine” which has never been exploited for tasks of full-text search or information retrieval.

[2] https://www.talmud.it/
[3] The corpus queried here is limited to eight tractates of the Babylonian Talmud: Rosh Hashanah, Berakhot, Ta'anit, Kiddushin, Chagigah, Beitza, Sukkah, and Megillah.

2.1 The Parole-Simple-Clips Lexicon

“PAROLE-SIMPLE-CLIPS” (PSC) is a computational lexicon of Italian, developed from 1996 to 2003 by ILC-CNR (Ruimy et al., 2002). Currently, the resource is stored as a MySQL database available on CLARIN [4], and is unique among the available linguistic resources for Italian, thanks to the richness and articulated structure of its data. Based on the Generative Lexicon theory (Pustejovsky, 1995), the schema on which the linguistic information is encoded is composed of four distinct, but strictly interconnected layers of analysis: phonology, morphology, syntax, and semantics.

In these features lies the motivation of this work, since the available linguistic information may be combined in ways that go well beyond what resources such as WordNet allow in the context of text search support. Even considering semantics alone, the information in PSC is detailed with fine-grained features that are not described in WordNet's network of synsets: PSC encodes the meaning of each lexical sense as an array of information, including “templates” (see below), semantic traits, semantic roles, and argument structures.

In this work, we document the first steps in the use of PSC for QE. At this stage we used:
i) the Morphological Units, classified according to their POS, which represent the lemmas of the computational lexicon;
ii) the Phonological Units, which represent the inflected forms of the lemmas;
iii) the Semantic Units (SemUs), which describe the senses expressed by the words.

Furthermore, we considered the following morphological and semantic information:
i) morphological traits (e.g. gender, number);
ii) relations between SemUs (at the moment limited to synonymy and hyponymy);
iii) the association between SemUs and “templates”, representing sets of senses, labeled according to one of the types represented in the Simple Ontology (Lenci et al., 2000).

The other parts of linguistic information will be the subject of future works, according to an incremental approach.

[4] https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-88

3 The Process and the Application

The whole search process involves a series of steps that can be summarized as follows (see Fig. 1 for a schematic functional architecture of the application):
i) the user inserts a first set of data to formulate the desired query in the Graphical User Interface;
ii) the interface requests, via Web API, the lexicon backend services, which return the linguistic data matching the initial query;
iii) the user completes the query taking into account the linguistic data and starts the search;
iv) the interface executes the query expansion and requests, via Web API, the text backend services, which collect, tag, and return the matching textual portions of the Talmud;
v) the interface shows the results to the user.

Figure 1. Functional architecture of the application.

First of all, to make the lexicon efficiently queryable, it needed to be transformed from relational data into linked data (Section 3.1). At the same time, a list of services to query both PSC and the database storing the Italian translation of the Talmud needed to be developed in order to answer the interface requests (Section 3.2). The interface itself was designed on the basis of the available linguistic information exposed from PSC and developed accordingly (Section 3.3). Finally, to improve the precision of the search process, the queried corpus was also POS-tagged (Section 3.4).

3.1 A First Conversion of PSC

The first phase of our work was to consider the relational database of PSC as the data source for the generation of a first Linked Data (LD) conversion. Two main reasons led to the need for a conversion of PSC: i) to ease the reuse of the lexicon itself, by virtue of the intrinsic nature of LD; ii) the possibility of performing automated reasoning on data, if appropriately modeled taking into account ontological principles, for example to compute inferred closures, infer new knowledge on the basis of class taxonomies, property hierarchies, and so on.

In accordance with the LD principles, we first had to look for existing vocabularies for the modeling of lexicons. In the context of the Semantic Web, the de facto standard for representing lexical information is the lemon model (Cimiano et al., 2016). Its core module, called OntoLex, makes it possible to represent grammatical, basic morphological and semantic information by means of three main classes: Lexical Entry, Form (lemma and inflected forms), and Lexical Sense. Lemon relies on external vocabularies to define semantic relations between senses: in this conversion we modelled PSC's synonymy and hyponymy with the LexInfo ontology [5]. Currently, the converted resource includes 72006 lexical entries (48735 nouns, 6522 verbs, and 11830 adjectives), 469726 inflected forms, and 57130 senses. Explicit lexico-semantic relations include 1803 meronyms, 4060 synonyms, and 44487 hyponyms.

This initial conversion of PSC as Linked Data was purely functional to the linguistic querying of the Italian translation of the Babylonian Talmud [6]. Therefore, it was decided to convert only a selected part of the linguistic data, to be exploited for the process of query expansion. At the time of writing this proposal, a complete conversion of PSC as LOD (Linked Open Data) is in progress. This complete conversion will also take full advantage of the already available works on the resource as documented in (Khan et al., 2018) and (Del Gratta et al., 2015).

[5] https://lexinfo.net/
[6] We remark that the conversion of PSC Simple is not the focus of this work, but it was necessary for performing linguistic search experiments on the Italian translation of the Talmud.

3.2 Setting up the Backend

Once the computational lexicon was converted, the implementation of the querying system continued with the creation of the backend services needed to access both the lexicon and the database storing the text to be queried. Regarding the lexicon, a GraphDB [7] repository, containing all the converted data, was set up. The access to the repository was implemented with a set of REST services that can be invoked from any web client [8]. The services have been based on the already available backend of LexO, a collaborative web tool for the creation and editing of lemon lexical resources (Bellandi, 2021). At the same time, a list of analogous services was made available to retrieve the textual portions of the corpus matching the expanded queries coming from the frontend of the system. The Italian translation of the Babylonian Talmud is currently stored as a MySQL database, where each segment of text appears both in its original and POS-tagged version (see Section 3.4).

[7] Ontotext GraphDB is a highly efficient and robust graph database with RDF/OWL and SPARQL support (https://graphdb.ontotext.com/documentation/free/free/graphdb-free.html).
[8] The source code of the REST services is available at https://github.com/andreabellandi/LexO-backend

3.3 The Graphical User Interface

The GUI (Fig. 2) set up to query the corpus was developed using Angular [9], one of the most widespread frameworks for frontend Web development, which provides high levels of portability and scalability. In this first version of the search system, the interface was conceived as a sort of “hub” of the whole architecture: on the one side it interacts with the user, and on the other side it invokes the services exposed by GraphDB and the Talmud database. The interface is divided into two sections. In the left-hand column, the available tractates of the Talmud that can be queried are represented as a tree, allowing the user to specify the search context at different levels of granularity. The right-hand section contains the search parameters, where the user can choose between three types of search using the available tabs: Keyword, Form/Lemma, or Semantic Traits.

The first one is the classic keyword-based search. The second type, via the Form/Lemma tab, makes it possible to search for a specific word form or for the set of inflected forms of a given lemma by specifying some morphological constraints. When a word is entered in the text field, the GUI invokes the lexicon backend services to retrieve the lemmas corresponding to the indicated parameters and displays them with their different senses. Users can then proceed with the search, or they can select one or more lemmas and apply morphological constraints to them by clicking on the three dots icon on their right. The selection of at least one of the senses enables the semantic extension search feature: a drop-down menu allows users to look for all the other senses in the lexicon appearing as hypernyms, hyponyms, or synonyms at a specified distance. The forms obtained with this extension are subject to the propagation of the morphological constraints applied to the lexical entry to which they are linked, whether explicit (entered from the interface) or implicit (in the case of a search by form). Finally, the “semantic traits” tab provides two template trees on which multiple selections are possible: the first click selects a template with all its descendants, the second deselects the descendants, and the third deselects the node itself. When the selection changes, the lexicon is queried to obtain the list of senses linked to the chosen templates. Users can then select the desired senses, which will be used to retrieve the forms of the corresponding lemmas to be used in the QE.

All the entered data are used to compose the expanded query, which consists of all the inflected forms provided by the lexicon and matching the indicated morphological constraints, semantic extension, or templates.

The results coming from the backend services accessing the Talmud database are then displayed in a table on the right-hand side, above which a panel lists the forms retrieved from the lexicon and used for the QE.

Figure 2. The graphical user interface showing the example of lemma “insegnamento”.

[9] https://angular.io/

3.4 POS-Tagging of the Text

To reduce lexical ambiguity in cases where a searched word could match homographs, the corpus was automatically analyzed and annotated with morphological information.

In particular, we parsed all the sentences of the eight tractates of the Babylonian Talmud with Stanford's Stanza tools (Qi et al., 2020) using the pre-trained model based on the UD Italian ISDT treebank [10]. The tool was configured to use the processors for tokenization, multi-word token expansion, and Part-of-Speech tagging, which also includes the attribution of morphological traits. Each morphologically annotated textual segment was then stored in the MySQL database, so that only the forms matching the morphological constraints coming from the GUI are returned.

[10] https://universaldependencies.org/treebanks/it_isdt/index.html

4 Examples of Queries

In this section, we show a concrete application of the approach by introducing some query examples. Each query can also be tested by the reader by accessing the available application.

The first two examples show the search for words with specific morphological traits and the application of semantic extension. In these cases, the “Form/Lemma” type of search is selected. In the first example, the word “insegnamento” (teaching) is inserted as a lemma. The system finds it in the lexicon and shows it as a noun with one single sense. The user then adds a morphological constraint by setting the “number” trait to “plural”. Finally, the user extends the search to direct hyponyms (distance = 1) and submits the query.

This is a simple case of propagation of the morphological traits through semantics. The lexicon contains the following two key pieces of information: i) the fact that the sense of “insegnamento” has three hyponyms: “erudizione” (erudition), “istruzione” (instruction), and “catechesi” (catechesis); ii) all the inflected forms and the corresponding morphological traits of the searched word and its three hyponyms. On the basis of these data, the system composes the final query, which searches for all the plural forms of the four lemmas as nouns. As a result, 103 textual segments are retrieved, containing the words “insegnamenti” (97 matches) and “istruzioni” (6 matches) (Fig. 2).

The second example involves the verb “permettere” (to permit/allow), searched as a lemma, with morphological constraints on the finite mood (“indicative”, “subjunctive”, “imperative”, “conditional”). In addition, the user selects just one of the two available senses of the verb (the one with the definition “dare a qlcu la possibilità di fare qlco”, i.e. to give sb the chance to do smth) and then extends the search to its synonyms. In this case, the lexicon proposes two synonyms of the selected sense: the (single) senses of the words “concedere” and “consentire”. The resulting expanded query retrieves from the database a total of 405 matches, containing 334 strings of “permettere” (for 131 available forms of the lexicon), 44 strings of “concedere” (for 45 available forms) and 27 strings of “consentire” (for 41 forms).

The last type of search is structured as a more explorative querying of the corpus. In the semantic traits tab, the user can choose one or more noun/verb or adjectival templates (groups of senses), to look for all words related to a specific semantic field, such as objects, weather verbs, metalanguage, etc.

In this example, the user selects the template “Air animal”, which appears as a “leaf” of the sub-tree under the parent node “Entity”. Once the template is chosen, the system retrieves from the lexicon all the corresponding senses and shows them in a window. It is then possible to select all the available 165 senses or just some of them. Finally, the user can run the search: the system composes the expanded query and retrieves 226 textual segments of the Talmud containing words (both as lemmas and inflected forms) with senses referring to the semantic field of “Air animal”: “uccello” (bird), “mosca” (fly), “cavallette” (grasshoppers), and so on.

Among future developments, a feature for a “grouped” selection of multiple templates will be added, which will make it possible to search for textual segments containing co-occurrences of words referring to the specified templates. For example, the grouped selection of the templates “Color” and “Earth animal” will retrieve segments containing multiword expressions such as “vacca rossa” (red cow), “gatta nera” (black she-cat), “oche bianche” (white geese), etc.

5 Conclusion

As shown in this paper, the availability of a rich and structured linguistic resource (such as the computational lexicon we have taken into account) seems to provide an edge over the standard query expansion techniques for full-text search based on WordNet. Now that a very first portion of the resource has been made available (though with a preliminary conversion) and the web application has been implemented, the road is cleared for the next steps.

The first critical issue that will need to be faced involves the limitedness of the resource, which covers most, but not all, of the lemmas, forms, and senses of standard contemporary Italian and lacks many domain-related terms or senses. To fill this gap the resource will have to be updated and enriched with more entries.

At the same time, as anticipated, a more in-depth and rigorous conversion of PSC will have to be carried out, a process that will probably take a lot of time and research effort and that, for the sake of this first experiment, would have been premature and unnecessary. As soon as the whole conversion is ready, the rest of the information encoded in the lexicon will be made available and integrated in the search process.

Though the benefits of the availability of a computational lexicon with respect to WordNet (or a similar resource) may seem obvious in a context of QE for full-text search, an empirical evaluation would be desirable. However, setting up a benchmark conceived for this purpose appears anything but easy, mainly due to the lack of comparable works or evaluation campaigns focussing on the role of linguistic resources as support.

In conclusion, we believe these first experiments carried out by querying the talmudic text appear promising, especially considering that only a small part of the lexicon has been used. In addition, the support in disambiguation provided by the POS tagging of the text suggests that a hybridization of a resource-driven QE technique with a deeper stochastic annotation of the corpus to be queried may constitute an interesting experimental field to be investigated.

Acknowledgments

This work was conducted in the context of the TALMUD project and the scientific cooperation between S.c.a r.l. PTTB and ILC-CNR.

References

Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2009. Knowledge-Based WSD on Specific Domains: Performing better than Generic Supervised WSD. In IJCAI'09: Proceedings of the 21st international joint conference on Artificial intelligence. 1501-1506.

Hiteshwar Kumar Azad and Akshay Deepak. 2019. A new approach for query expansion using Wikipedia and WordNet. Information Sciences 492: 147-163.

Andrea Bellandi. 2021. LexO: An Open-source System for Managing OntoLex-Lemon Resources. Language Resources & Evaluation. https://doi.org/10.1007/s10579-021-09546-4

Philipp Cimiano, John P. McCrae, and Paul Buitelaar (eds). 2016. Lexicon Model for Ontologies: Community Report. Ontology-Lexicon Community Group (W3C). https://www.w3.org/2016/05/ontolex/#overview

Riccardo Del Gratta, Francesca Frontini, Fahad Khan, and Monica Monachini. 2015. Converting the PAROLE SIMPLE CLIPS Lexicon into RDF with lemon. Semantic Web 6: 387-392.

Christiane Fellbaum. 1998. WordNet: An electronic lexical database. MA: MIT Press.

Fahad Khan, Andrea Bellandi, Francesca Frontini, and Monica Monachini. 2018. One Language to rule them all: Modelling Morphological Patterns in a Large Scale Italian Lexicon with SWRL. In Proceedings of the 11th International Conference on Language Resources and Evaluation - LREC2018, 2018, Miyazaki, Japan. hal-01832652

Judith Klavans. 1988. COMPLEX: a computational lexicon for natural language systems. In COLING '88: Proceedings of the 12th conference on Computational Linguistics. https://doi.org/10.3115/991719.991802

Alessandro Lenci, Nuria Bel, Federica Busa, Nicoletta Calzolari, Elisabetta Gola, Monica Monachini, Antoine Ogonowski, Ivonne Peters, Wim Peters, Nilda Ruimy, Marta Villegas, and Antonio Zampolli. 2000. SIMPLE: A General Framework for the Development of Multilingual Lexicons. International Journal of Lexicography 13(4): 249-263.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Vuong M. Ngo, Tru H. Cao, and Tuan M. V. Le. 2018. WordNet-Based Information Retrieval Using Common Hypernyms and Combined Features. preprint arXiv:1807.05574.

James Pustejovsky. 1995. The Generative Lexicon. MA: MIT Press.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations.

Nilda Ruimy, Monica Monachini, Raffaella Distante, Elisabetta Guazzini, Stefano Molino, Marisa Ulivieri, Nicoletta Calzolari, and Antonio Zampolli. 2002. Clips, a multi-level italian computational lexicon: A glimpse to data. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC02).

Robert F. Simmons, Sheldon Klein, and Keren McConlogue. 1963. Indexing and dependency logic for answering English questions. American Documentation 15(3): 196-204.

Philip J. Stone, Robert F. Bales, J. Zvi Namenwirth, and Daniel Ogilvie. 1962. The general inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science 7(4): 484-498.

Ellen M. Voorhees. 1993. Using WordNet to disambiguate word senses for text retrieval. In SIGIR '93: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. https://doi.org/10.1145/160688.160715

William A. Woods. 1997. Conceptual indexing: A better way to organize knowledge. Technical Report SMLI TR-97-61, Sun Microsystems Laboratories, Mountain View, CA, April. www.sun.com/research/techrep/1997/abstract61.html.

William A. Woods, Lawrence A. Bookman, Ann Houston, Robert J. Kuhns, Paul Martin, and Stephen Green. 2000. Linguistic knowledge can improve information retrieval. In ANLC '00: Proceedings of the sixth conference on Applied natural language processing. https://doi.org/10.3115/974147.974183
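
As a supplementary illustration of the expansion step described in Sections 3.3 and 4, the following minimal Python sketch replays the first query example (lemma “insegnamento”, number = plural, hyponyms at distance 1) on a toy in-memory lexicon. The dictionaries HYPONYMS and FORMS, the function names, and the three sample segments are assumptions introduced here for illustration only: they do not reproduce the PSC data model, the REST services, or the actual corpus, and the sketch simplifies the lexicon by treating each lemma as having a single sense.

```python
from collections import deque

# Hypothetical toy data: sense relations collapsed onto lemmas.
HYPONYMS = {"insegnamento": ["erudizione", "istruzione", "catechesi"]}

# Hypothetical toy data: lemma -> (inflected form, morphological traits).
FORMS = {
    "insegnamento": [("insegnamento", {"pos": "NOUN", "number": "singular"}),
                     ("insegnamenti", {"pos": "NOUN", "number": "plural"})],
    "erudizione": [("erudizione", {"pos": "NOUN", "number": "singular"}),
                   ("erudizioni", {"pos": "NOUN", "number": "plural"})],
    "istruzione": [("istruzione", {"pos": "NOUN", "number": "singular"}),
                   ("istruzioni", {"pos": "NOUN", "number": "plural"})],
    "catechesi": [("catechesi", {"pos": "NOUN", "number": "singular"}),
                  ("catechesi", {"pos": "NOUN", "number": "plural"})],
}


def related_lemmas(lemma, relation, distance):
    """Breadth-first walk over `relation`, at most `distance` hops away."""
    seen = {lemma}
    frontier = deque([(lemma, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == distance:
            continue  # do not expand beyond the requested distance
        for nxt in relation.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def expand_query(lemma, relation, distance, constraints):
    """Collect the forms of the lemma and of its related lemmas,
    propagating the same morphological constraints to all of them."""
    forms = set()
    for lem in related_lemmas(lemma, relation, distance):
        for form, traits in FORMS.get(lem, []):
            if all(traits.get(k) == v for k, v in constraints.items()):
                forms.add(form)
    return forms


def search(segments, forms, pos):
    """Return the ids of POS-tagged segments containing a matching token."""
    return [seg_id for seg_id, tokens in segments
            if any(tok.lower() in forms and tag == pos for tok, tag in tokens)]


# Invented POS-tagged segments (not taken from the corpus).
SEGMENTS = [
    (1, [("Gli", "DET"), ("insegnamenti", "NOUN"), ("dei", "ADP"), ("maestri", "NOUN")]),
    (2, [("Segui", "VERB"), ("le", "DET"), ("istruzioni", "NOUN")]),
    (3, [("Un", "DET"), ("insegnamento", "NOUN")]),
]

plural_forms = expand_query("insegnamento", HYPONYMS, 1,
                            {"pos": "NOUN", "number": "plural"})
print(sorted(plural_forms))  # ['catechesi', 'erudizioni', 'insegnamenti', 'istruzioni']
print(search(SEGMENTS, plural_forms, "NOUN"))  # [1, 2]
```

In this sketch the “plural” constraint set on the searched lemma is applied unchanged to the forms of its hyponyms, which mirrors the propagation of morphological traits through semantics described in the first example; the check on the POS tag of each token is a simplified stand-in for the disambiguation role played by the POS-tagged version of the corpus.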