Automatic Construction of the Knowledge Base of an Onomasiological Dictionary

Gerardo Sierra*, Laura Hernández
Language Engineering Group, Engineering Institute
Universidad Nacional Autónoma de México, Ciudad Universitaria, México
* GSierraM@iingen.unam.mx

ABSTRACT

For almost 14 years the Language Engineering Group has worked on a wide variety of Natural Language Processing (NLP) problems, one of the earliest being the creation and operation of onomasiological dictionaries. During that time we have focused on improving the dictionary search engine, but recently our aim has been a methodology for creating specialized onomasiological dictionaries in a semi-automatic way.
Automating the creation of onomasiological dictionaries necessarily implies automating the processes used to populate the dictionary's knowledge base. Due to the nature of these dictionaries, the definitions that must be included in the knowledge base are both normative and colloquial.
In this paper we present a proposal for semi-automatically populating the knowledge base of these dictionaries.

1 INTRODUCTION

An onomasiological dictionary works in the opposite direction from a "regular" or semasiological dictionary. The users of an onomasiological dictionary already know the definition of a term, but they do not know or have forgotten the name for that concept (this last problem is commonly known as having a word on the tip of the tongue) (Zock and Rapp, 2011).
Onomasiological dictionaries have been classified into visual dictionaries, reverse dictionaries, thesauri and synonym dictionaries. These dictionaries were created in order to solve the tip-of-the-tongue problem, but people still have difficulty using them because they require the user to know either the precise words that describe the term or its classification (e.g. when using a reverse dictionary to find the word potato, you might have to know that a potato is a tuber, and that tubers are a kind of plant). With visual dictionaries there is also the problem that not every concept has a visual image to represent it. For these reasons it has been suggested that free-text searching, also known as natural language searching, is a viable option for solving this problem (Lancaster, 1972), since it allows users to describe their idea of the concept in the way they would explain it to another human.
Onomasiological dictionaries that resolve inputs written in natural language improve the user experience, but they pose some major challenges for developers (Dutoit and Nugues, 2002; Bilac et al., 2004). The most demanding one probably arises from the different ways in which a person can express the same concept, and from the fact that user definitions might not match the formal definitions found in conventional dictionaries.
In short, natural language onomasiological dictionaries need a rich knowledge base which includes not only formal, but also informal definitions. Knowledge bases can be obtained from ontologies, as in the projects Genoma KB (Cabré et al., 2004) and ONTODIC (Alcina, 2009). However, given the main goal of onomasiological dictionaries, for this work we decided to extract their knowledge bases from definitions written in texts. These definitions, in turn, can be used not only to populate the knowledge base, but also to create ontologies (Sierra et al., 2008).

2 DEBO

DEBO is the first onomasiological dictionary developed in the Language Engineering Group, and it works with user queries given in natural language. DEBO is a specialized dictionary that was originally built as a dictionary of Natural Disasters, but its structure and search engine have since been extended to other areas such as Linguistics, Metrology, Veterinary Medicine, and Sexuality.

2.1 The search method

The dictionary works with a search engine developed by Sierra (1999) and later improved by Hernández (2011). This engine comprises:
• A number of terms of an area of specialization, which are the ones that the dictionary can retrieve as a possible answer to the user's queries.
• A knowledge base that includes a variety of both normative and colloquial definitions.
• A set of keywords extracted from the definitions and associated with the terms.
• A stop list that contains a catalog of "empty words", such as prepositions, articles and conjunctions.
• A set of groups of words called paradigms, which are groups of words with similar meaning either in the area of specialization or in regular speech.

The search method follows 5 steps:
(1) The system receives the user's query as input.
(2) The system analyzes the input and extracts its keywords by filtering it with the aid of the stop list.
(3) The system searches among the paradigms for the ones to which each keyword of the input corresponds.
(4) The system searches for terms whose definitions share at least one paradigm with the input.
(5) The system retrieves the terms ordered by the number of paradigms that each term has in common with the input. In case of a tie, the system ranks each term by comparing the order in which the paradigms appear in the definition with their order in the input. The terms are divided into "very probable", "probable" and "not too probable" columns.

Fig. 1. Diagram of the search method.

For example, suppose someone enters "someone who hates gays" as the input of the dictionary, and the knowledge base contains the definition "homophobic: a person who despises homosexuals" as well as the following three paradigms.

Paradigm 1    Paradigm 2    Paradigm 3
someone       hate          gay
person        loathe        homosexual
people        contemn       lesbian
individual    despise       queer
dude          abhor         dyke

Fig. 2. Example of paradigms in the knowledge base of an onomasiological dictionary of sexuality.

The user's definition is related to paradigms 1, 2 and 3, while one of the definitions of the term homophobic is related to the same paradigms in exactly the same order, which means that the term homophobic will be at the top of the output for this query.

2.2 The search engine performance

Hernández (2011) created the Onomasiological Dictionary of Sexuality for Mexican Spanish (DOS-MX), which used this search method. The knowledge base of this dictionary consisted of 975 definitions, both colloquial and normative, for 332 terms. All the definitions were found and retrieved manually from the Internet.
This dictionary had an added difficulty, since it also had to handle the double-meaning words and phrases that are very commonly used in Mexico when talking about sex. In order to cope with this additional component, the dictionary's paradigms were extended to include double-meaning words and even pejorative terms (see Paradigm 3 in Fig. 2), taking into account not only formal synonyms but also colloquial equivalents. In total there were over 33,000 different words organized in over 25,000 paradigms.

Fig. 3. Example of an output of the Onomasiological Dictionary of Sexuality for Mexican Spanish (DOS-MX).

An experiment was carried out to test the precision of DOS-MX. Students wrote definitions of sexuality terms and gave them to another student, who did not know which terms were being described and tried to guess them.
The precision of the dictionary was 71% when tested with natural language entries from users who were not involved in the development of the dictionary. This compares well with other non-English onomasiological dictionaries, such as the one by El-Kahlout and Oflazer (2004), which has a precision of 66% in similar tests. However, there is still considerable room for improvement.

2.3 A new improvement proposal

After the experience of the sexuality dictionary, it was concluded that paradigms alone are not enough to cover all the ways in which a person can describe a concept. It was clear that many more different definitions are needed in order to have a wide variety of expressions and ideas for every concept.
But increasing the number of definitions will also tend to increase the number of options from which the dictionary has to choose, which is why the definitions and terms also need to be organized into categories that facilitate the selection of the correct terms.
The main problem, then, is to find a way to obtain a large number of definitions for the terms and to classify them. This should be automated, because doing it manually would take too long and consume too many resources.

3 ECODE

ECODE is a program that was developed in the Language Engineering Group with the objective of automatically detecting definitional contexts in specialized texts (Alarcón et al., 2008).
According to Alarcón et al. (2007), a definitional context is a textual fragment in which the definition of a term occurs. It is structured by a term and its definition, which are connected by means of syntactic or typographic patterns.
In Spanish these patterns can be punctuation marks, such as commas, colons and parentheses; verbs like definir (to define) or significar (to mean); discourse markers such as en otras palabras (in other words) or o sea (that is); and even pragmatic patterns like en este contexto (in this context) or en términos generales (in general terms). For example:

Desde el punto de vista de la sexología, se puede definir una relación sexual como el acto en el que dos personas mantienen contacto físico con el objeto de dar y/o recibir placer sexual, o con fines reproductivos.
(From a sexology point of view, sexual intercourse can be defined as the act in which two people have physical contact with the objective of giving and/or getting sexual pleasure, or with reproductive purposes.)

The following features can be obtained from this example:
Term: "relación sexual" (sexual intercourse).
Definition: "acto en el que dos personas mantienen contacto físico con el objeto de dar y/o recibir placer sexual, o con fines reproductivos" (act in which two people have physical contact with the objective of giving and/or getting sexual pleasure, or with reproductive purposes).
Connecting verbal pattern: "se puede definir […] como" (can be defined as).
Pragmatic pattern as context modifier: "Desde el punto de vista de la sexología" (From a sexology point of view).

In order to automatically detect the features or components of a definitional context, Alarcón et al. (2007) propose fifteen definitional verbal patterns divided into simple and compound ones (see Table 1).

Simple verbal definitional patterns    Compound verbal definitional patterns
• concebir (to conceive)               • consistir de (to consist of)
• definir (to define)                  • consistir en (to consist in)
• entender (to understand)             • constar de (to comprise)
• identificar (to identify)            • denominar también (also denominated)
• significar (to signify)              • llamar también (also called)
                                       • servir para (to serve for)
                                       • usar como (to use as)
                                       • usar para (to use for)
                                       • utilizar como (to utilise as)
                                       • utilizar para (to utilise for)

Table 1. Definitional verbal patterns used by ECODE.

The program processes the specialized texts and searches for definitional contexts. However, not every definitional verbal pattern that is found truly corresponds to a definition: some expressions use the same patterns for purposes other than giving a definition. For this reason, Alarcón and Sierra (2006) analyzed the use of these patterns in non-definitional contexts and found some sequences of words that often occur near a definitional verbal pattern. Those sequences were found in specific positions. For instance, negation words like no (not) or tampoco (neither) were found in the first position before or after the definitional verbal pattern; adverbs like tan (so), as well as sequences like no más de (not more than), were found between the definitional verb and the nexus como (as); finally, syntactic sequences like adjective + verb were found in the first position after the definitional verb.
Once the system has eliminated non-definitional contexts, it proceeds to identify the features that form the definitions. For this, it uses a decision tree based on regular expressions, which allows the system to identify and tag the position of every feature. These regular expressions are:

Term = BORDER (Determinant) + Noun + Adjective. {0,2} .* BORDER
Pragmatic Pattern = BORDER (sign) (Preposition | Adverb) .* (sign) BORDER
Definition = BORDER Determinant + Noun .* BORDER

The whole process of definitional context detection is summarized in the following diagram.

Fig. 4. ECODE architecture (taken from Alarcón, 2006).

4 DESCRIBE

ECODE was originally developed to extract definitions from specialized texts. However, the same definitional verbal patterns that are used in formal documents are also used in informal ones.
With this in mind, the Language Engineering Group has been working on the development of DESCRIBE, an extension of ECODE which extracts definitions from texts written in colloquial language.
This adaptation consists of a module that automatically retrieves Internet search results about a particular term, downloads the contents of the web pages that match the search, and analyzes them looking for new definitions.
This tool removes all repeated definitions, and it looks not only at formal websites, but also at open forums, personal webpages, blogs and chats, which provides a rich variety of definitions.
In the end, DESCRIBE retrieves a list of definition candidates that still have to be cleaned up, since some of the candidates might not really be definitions.

5 DEFINITION CLASSIFICATION

In order to give the dictionary search engine another feature that helps it identify and rank the output terms correctly, we propose classifying the definitions, so that the engine matches not only the words and the order in which they appear, but also the type of the definition given by the user against the types of the definitions in the knowledge base.
There are four kinds of definitions based on the Aristotelian definitional model: analytic, extensional, synonymic and functional (Sierra et al., 2008). According to the LinguaLinks Library, the first one refers to "a description of the range of reference of a lexical unit" that allows readers to distinguish the term from similar words; the second kind refers to those definitions that list the objects that fall under the definition or its parts; the third kind uses synonyms or generic terms to describe the term; and finally, the fourth kind describes the term by providing its uses.
In order to automatically assign a category to each definition obtained through DESCRIBE, the verbal patterns have been divided according to the kind of definition in which they usually appear.

Definition type    Verbal definitional patterns
Analytic           • concebir (to conceive)
                   • definir (to define)
                   • entender (to understand)
                   • identificar (to identify)
                   • significar (to signify)
Extensional        • consistir de (to consist of)
                   • consistir en (to consist in)
                   • constar de (to comprise)
Synonymic          • denominar también (also denominated)
                   • llamar también (also called)
Functional         • servir para (to serve for)
                   • usar como (to use as)
                   • usar para (to use for)
                   • utilizar como (to utilise as)
                   • utilizar para (to utilise for)

Table 2. Definition types and their definitional verbal patterns.

This definition classification is the first step in ontology creation since, for instance, analytic definitions allow us to obtain hyponymy and hypernymy relations, while meronymy and holonymy relations can be recovered from extensional definitions (Soler and Alcina, 2008).

6 DEFINITION CANDIDATES' DEPURATION SYSTEM

Like most systems in Natural Language Processing, DESCRIBE is not perfect: sometimes the definition candidates turn out to be wrong, or a definition is placed in the wrong category.
The Language Engineering Group has developed a tool to support the manual review of the validity of the definition candidates and of the correctness of their categorization. This tool presents a series of definition candidates to dictionary developers. Every candidate is shown together with the category in which DESCRIBE placed it.
The system allows the developers to easily accept or reject a candidate as a definition, and it also allows them to change the category into which the definition was originally placed.
This system helps in the task of polishing the definitions that will be part of the knowledge base of the dictionary, but it also keeps a record of the definition candidates that have been rejected. This record is intended to be used as a corpus that will serve as training data for a machine learning system, which will in turn be used to improve the precision of ECODE and, in consequence, of DESCRIBE itself.

Fig. 5. Example of the use of the Definition Candidates' Depuration System.

CONCLUSIONS

The definitions included in the knowledge base of specialized onomasiological dictionaries must cover both formal and informal concepts, and they must also cover as many ways of expressing them as possible in order to provide more accurate answers to their users.
It is also convenient to classify the definitions in the knowledge base and the ones given by the user according to their type, so as to provide the search engine with more features for comparing and matching the user definitions with its own, hence improving its precision. Definition classification is the first step in the creation of ontologies.
In this paper we presented a methodology to automatically obtain definition candidates to fill the knowledge base of onomasiological dictionaries and to classify these definitions according to the Aristotelian definitional model. The source of these definitions is the Internet, which gives us access to a very wide variety of speakers and, for that reason, of ways of expressing concepts. This methodology has been used and tested in the creation of onomasiological dictionaries of Sexuality and Linguistics, among others, and it can be applied to other subject areas, such as Biomedicine, Epidemiology, Veterinary Medicine, Law, etc.
We also presented a tool which will make possible the creation of a corpus with both good and bad definition candidates marked as such. The purpose of creating this corpus is to obtain training data for a machine learning system aimed at improving the automatic detection of definitional contexts.

ACKNOWLEDGEMENTS

We would like to acknowledge DGAPA for the sponsorship of the project "Análisis estilométrico para la detección de similitud textual". We also thank the CONACYT Thematic Network "Tecnologías de la Información y la Comunicación".

REFERENCES

Alarcón, R. (2006). Extracción automática de contextos definitorios en corpus especializados. Propuesta para el desarrollo de un ECCODE (extractor de candidatos a contextos definitorios). Instituto Universitario de Lingüística Aplicada, Universidad Pompeu Fabra, Barcelona (Doctoral thesis).
Alarcón, R., Bach, C., and Sierra, G. (2008). Extracción de contextos definitorios en corpus especializados: Hacia la elaboración de una herramienta de ayuda terminográfica. Revista Española de Lingüística 37, 247-278.
Alarcón, R., and Sierra, G. (2006). Reglas léxico-metalingüísticas para la extracción automática de contextos definitorios. Avances en la Ciencia de la Computación, VII Encuentro Nacional de Ciencias de la Computación, 242-247.
Alarcón, R., Sierra, G., and Bach, C. (2007). Developing a Definitional Knowledge Extraction System. Proc. 3rd Language and Technology Conference (L&TC'07), Adam Mickiewicz University, Poznan, Poland.
Alcina, A. (2009). Metodología y tecnologías para la elaboración de diccionarios terminológicos onomasiológicos. Terminología y sociedad del conocimiento. Bern: Peter Lang, 33-58.
Bilac, S., Watanabe, W., Hashimoto, T., Tokunaga, T., and Tanaka, H. (2004). Dictionary search based on the target word description. Proceedings of the Tenth Annual Meeting of The Association for Natural Language Processing (NLP2004), 556-559.
Cabré, M. T., Bach, C., Estopà, R., Feliu, J., Martínez, G., and Vivaldi, J. (2004). The GENOMA-KB project: towards the integration of concepts, terms, textual corpora and entities. 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, European Languages Resources Association, 87-90.
Dutoit, D., and Nugues, P. (2002). A Lexical Database and an Algorithm to Find Words from Definitions. Proceedings of the 15th European Conference on Artificial Intelligence, Lyon, 450-454.
El-Kahlout, I., and Oflazer, K. (2004). Use of Wordnet for Retrieving Words from Their Meanings. 2nd Global WordNet Conference, Brno, Czech Republic.
Hernández, L. (2011). Creación semi-automática de la base de datos y mejora del motor de búsqueda de un diccionario onomasiológico. Universidad Autónoma de México (Master thesis).
Lancaster, F. (1972). Vocabulary control for information retrieval. Washington: Information Resources Press.
Sierra, G. (1999). Design of a concept-oriented tool for terminology. UMIST, Manchester (Doctoral thesis).
Sierra, G., Alarcón, R., Aguilar, C., and Bach, C. (2008). Definitional verbal patterns for semantic relation extraction. Terminology 14(1), 74-98.
Soler, V., and Alcina, A. (2008). Patrones léxicos para la extracción de conceptos vinculados por la relación parte-todo en español. Terminology 14(1).
Zock, M., and Rapp, R. (2011). Introduction to this special issue on Cognitive Aspects of Natural Language Processing. Journal of Cognitive Science 12(3).
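
As an illustration of the five-step search method described in Section 2.1, the following Python sketch filters a query with a stop list, maps its keywords to paradigms, and ranks terms by the number of shared paradigms. It is only a minimal sketch under stated assumptions, not the DEBO implementation: the toy stop list, the inflected forms added to the paradigms, the "misanthrope" and "earthquake" entries, and the tie-break (counting positions where the paradigm sequences coincide) are all hypothetical choices made for the example.

```python
# Hypothetical toy resources, loosely based on Fig. 2. Inflected forms are
# listed explicitly because the paper does not say how inflection is handled.
STOP_LIST = {"a", "an", "the", "who", "of", "to", "and", "or"}
PARADIGMS = {
    1: {"someone", "person", "people", "individual", "dude"},
    2: {"hate", "hates", "loathe", "loathes", "despise", "despises", "abhor", "abhors"},
    3: {"gay", "gays", "homosexual", "homosexuals", "lesbian", "lesbians"},
}
# Hypothetical knowledge base: term -> list of definitions.
KNOWLEDGE_BASE = {
    "homophobic": ["a person who despises homosexuals"],
    "misanthrope": ["someone who hates people"],
    "earthquake": ["a sudden shaking of the ground"],
}

WORD_TO_PARADIGM = {word: pid for pid, words in PARADIGMS.items() for word in words}


def keywords(text):
    """Step 2: lowercase the words, strip punctuation, drop stop-list words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_LIST]


def paradigm_sequence(text):
    """Step 3: map each keyword to its paradigm, keeping the word order."""
    return [WORD_TO_PARADIGM[w] for w in keywords(text) if w in WORD_TO_PARADIGM]


def search(query, knowledge_base):
    """Steps 4 and 5: keep terms sharing a paradigm with the query and rank them."""
    query_seq = paradigm_sequence(query)
    query_set = set(query_seq)
    scored = []
    for term, definitions in knowledge_base.items():
        best = None
        for definition in definitions:
            def_seq = paradigm_sequence(definition)
            shared = len(query_set & set(def_seq))
            if shared == 0:
                continue
            # Assumed tie-break: positions where both sequences coincide.
            same_position = sum(1 for a, b in zip(query_seq, def_seq) if a == b)
            score = (shared, same_position)
            if best is None or score > best:
                best = score
        if best is not None:
            scored.append((term, best))
    scored.sort(key=lambda item: item[1], reverse=True)
    return [term for term, _ in scored]


print(search("someone who hates gays", KNOWLEDGE_BASE))
# prints ['homophobic', 'misanthrope']
```

With the example query of Section 2.1, the term homophobic shares paradigms 1, 2 and 3 with the input and is ranked first, while the hypothetical earthquake entry shares none and is not returned. The split into "very probable", "probable" and "not too probable" columns is left out because the paper does not give the thresholds.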
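
The assignment of a definition type from the verbal pattern (Table 2) could be sketched along the following lines. This is a hedged illustration and not the DESCRIBE code: the regular expressions are crude stems written for this example so that a few conjugated forms match, the function name `classify_definition` is hypothetical, and returning the first matching type is an assumption. A real system would rely on lemmatization and on the non-definitional filters described in Section 3.

```python
import re

# Definition types of Table 2, each with crude, hypothetical stem patterns
# standing in for the conjugated forms of the Spanish verbs.
TYPE_PATTERNS = {
    "Analytic": [
        r"\bconc[ei]b\w*",           # concebir
        r"\bdefin\w*",               # definir
        r"\bent(?:ie|e)nd\w*",       # entender
        r"\bidentific\w*",           # identificar
        r"\bsignific\w*",            # significar
    ],
    "Extensional": [
        r"\bconsist\w*\s+(?:de|en)",             # consistir de / consistir en
        r"\bconst(?:a|an|ar|aba|aban)\s+de",     # constar de
    ],
    "Synonymic": [
        r"\bdenomin\w*\s+también",   # denominar también
        r"\bllam\w*\s+también",      # llamar también
    ],
    "Functional": [
        r"\bs[ie]rv\w*\s+para",             # servir para
        r"\bus[ae]\w*\s+(?:como|para)",     # usar como / usar para
        r"\butiliz\w*\s+(?:como|para)",     # utilizar como / utilizar para
    ],
}

COMPILED = {
    def_type: [re.compile(p, re.IGNORECASE) for p in patterns]
    for def_type, patterns in TYPE_PATTERNS.items()
}


def classify_definition(candidate):
    """Return the first definition type whose pattern occurs in the candidate,
    or None when no definitional verbal pattern is found."""
    for def_type, patterns in COMPILED.items():
        if any(p.search(candidate) for p in patterns):
            return def_type
    return None


examples = [
    "Desde el punto de vista de la sexología, se puede definir una relación "
    "sexual como el acto en el que dos personas mantienen contacto físico.",
    "El agua consiste en hidrógeno y oxígeno.",
    "El sismo, llamado también terremoto, es un movimiento de la tierra.",
    "El termómetro se usa para medir la temperatura.",
]
for sentence in examples:
    print(classify_definition(sentence), "<-", sentence)
```

Only the first example sentence comes from the paper; the other three are invented to exercise the remaining categories. Stem matching of this kind will overgenerate (for instance, it would also match unrelated words beginning with "defin"), which is one reason the candidates still need the manual review described in Section 6.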