Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)

Extraction of Definitional Contexts from Restricted Domains by Measuring Synthetic Judgements and Word Relevance

César Aguilar
Pontificia Universidad Católica de Chile
Santiago de Chile
caguuilara@uc.cl

Olga Acosta
Pontificia Universidad Católica de Chile
Santiago de Chile
olgalimx@gmail.com

Abstract

In this article we present ongoing work on extracting conceptual information from specialized-domain texts. Concepts are ways of dividing the world into classes, and they are the fundamental pieces for constructing ontologies. In this sense, ontology learning is the (semi-)automatic support for constructing an ontology. Ontology learning requires input data, and these data are the basic source from which to learn the relevant concepts of a domain, their definitions, and the relations holding between them. With this need in mind, we propose a methodology that takes into account the level of synthetic judgements and the word relevance in a sentence in order to filter and rank sentences. To be good candidates, sentences with high relevance and a low level of synthetic judgements should also contain at least one predicative verb characteristic of analytical definitions.

1 Introduction

Concepts are among the most fundamental pieces of cognition: humans use concepts daily to interact with others and with the world. According to Smith (1988), concepts mirror the way we divide the world into classes, and much of what we learn, communicate, and reason about involves relations among these classes. Additionally, Rosch (1978) argues that concepts promote cognitive economy, because human beings attempt to gain as much information as possible about their environment while minimizing cognitive effort and resources.

Currently, automated methods and approaches have been developed in response to the accelerated growth of digital information on the Web and other media, as well as the urgent need to obtain relevant information quickly and efficiently from these huge text sources. For instance, Maedche and Staab (2004) define ontology learning as a number of complementary disciplines that feed on different types of unstructured and semi-structured data in order to support a semi-automatic ontology engineering process. In line with this, Cimiano (2006) describes various sub-processes for constructing an ontology from texts, among which concept extraction is an important phase. Ontology learning therefore needs input data from which to learn the relevant concepts of a given domain.

Following these ideas, in this paper we sketch a methodology for recognizing candidates for analytical definitional contexts, based on the work developed by Sierra et al. (2008). We organize our work as follows: in section 2 we present general information about analytical definitions and the automated extraction of conceptual information. In section 3 we describe the function of adjectives as modifiers of a noun, the distinction between descriptive and relational adjectives, and the relation between descriptive adjectives and synthetic judgements in attributive form. In section 4 we summarize the proposed methodology. In section 5 we show some preliminary results. Finally, in section 6 we present future work.

2 Conceptual information

We consider as conceptual information the information expressed by specialized definitions, particularly analytical definitions constituted by a Genus Term and a Differentia, following the criteria formulated by Smith (2004). In fact, this author considers that the information expressed by these kinds of definitions is relevant for creating ontologies based on lexical relations, specifically hyponymy/hypernymy and meronymy/holonymy relations. Smith argues that these relations are, from a philosophical point of view, basic and universal.

2.1 Analytical Definitions

An analytical definition is a formula for describing a concept, denoted by a linguistic tag, in terms of a superordinate concept (Genus Term) and a differentia that distinguishes the defined concept from others with the same Genus Term. For example, the following definition describes the concept lightning conductor using one of the most common verbs for introducing a definition (to be). In this case, the genus is the concept device, while the differentia describes the function of the lightning conductor:

[Lightning conductor Term] is a [device Genus Term] [that allows to protect the electrical systems against surges of atmospheric origin Differentia].

2.2 Definitional contexts

Sierra et al. (2008) proposed a pattern-based method for extracting terms and definitions in Spanish. This relevant information is expressed in textual fragments called definitional contexts (DCs), which are constituted by a term, a definition, and linguistic or metalinguistic forms such as verb phrases, typographical markers and/or pragmatic patterns. For example:

The primary energy, in general terms, is defined as an energetic resource that has not been affected by any transformation, with the exception of its extraction.

Here we can see a DC sequence formed by the term primary energy, the definition an energetic resource that…, and the verb pattern is defined as, together with other characteristic units such as the pragmatic pattern in general terms and the typographical marker (bold font), which in this case emphasizes the presence of the term.

To extract DCs, the authors employ verb patterns that operate as connectors between terms and definitions. Syntactically, such patterns are predicative phrases (PrP) configured around a verb that operates as the head of the PrP (e.g., to be, to characterize, to conceive, to consider, to describe, to define, to understand, to know, to refer, to denominate, to call, to name).

3 Adjectives

According to Demonte (1999), adjectives are syntactic units that modify the meaning of a noun and associate it with one or several attributes. Two kinds of adjectives assign properties to nouns: descriptive and relational adjectives. On the one hand, descriptive adjectives refer to constitutive features of the modified noun. These features are exhibited or characterized by means of a single physical property: color, form, character, predisposition, sound, and so on, as in la silla verde (the green chair). On the other hand, relational adjectives assign a set of properties, i.e., all the characteristics jointly defining nouns such as sea: puerto marítimo (maritime port). In terminology, relational adjectives are an important element for building specialized terms: inguinal hernia, venereal disease, psychological disorder and others are considered terms in medicine. In contrast, rare hernia, serious disease and critical disorder read as descriptive judgements closely tied to a specific context.

3.1 Syntactical Identification of Non-Relevant Adjectives

In line with the above, if we consider the internal structure of adjectives, two kinds can be identified: permanent and episodic adjectives (Demonte, 1999). Permanent adjectives represent stable situations, that is, permanent properties characterizing individuals. These adjectives are located outside any spatial or temporal restriction (e.g., psicópata, psychopath). Episodic adjectives, on the other hand, refer to transient situations or properties that imply change and have time-space limitations. Almost all descriptive adjectives derived from participles belong to this latter class, as do all adjectival participles (e.g., harto, jaded; limpio, clean). Spanish is one of the few languages that represent this difference in the meaning of adjectives in their syntax; in many languages the difference is only recognizable through interpretation. In Spanish, individual properties can be predicated with the verb ser, and episodic properties with the verb estar.

Another linguistic heuristic for identifying descriptive adjectives is that only these adjectives accept degree adverbs and can be part of comparative constructions, for example, muy alto (Eng.: very high). Finally, only descriptive adjectives can precede a noun, because in Spanish relational adjectives are always postposed, e.g., la antigua casa (Eng.: the old house).

3.2 Synthetic Judgements and Descriptive Adjectives

According to Kant (2013), analytic sentences are those whose truth seems to be knowable from the meanings of the constituent words alone (e.g., gynecologists are doctors), unlike the more usual synthetic ones (e.g., gynecologists are rich), whose truth is knowable only by knowing both the meaning of the words and something about the world.

We believe that synthetic judgements in attributive position (e.g., rich gynecologists) are common in non-relevant sentences in specialized domains. This kind of judgement can be recognized from the descriptive adjectives obtained with the linguistic heuristics mentioned in section 3.1.

4 Methodology

We present here our methodology for extracting conceptual information from a medical domain corpus. The input data consist of a corpus POS-tagged with FreeLing (Carreras et al., 2004).

4.1 Sentence Segmentation

The heuristics used to segment our corpus into sentences assume that a sentence ends with a period, contains at least one main verb, and has more than 10 words. The length threshold follows from the shortest plausible DC: a single-word term, the longest predicative verb pattern (is defined as), a possible article preceding the genus, the genus term and, in this case, some arbitrary limit of words for the differentia.

4.2 Filtering out Sentences by Predicative Verbs

The sentences obtained in the previous step are filtered using the predicative verbs mentioned in section 2.2: a sentence that contains at least one predicative verb is a good DC candidate. In the case of to be, the sentence is discarded if the verb is its first word.

4.3 Chunking

We used the natural language library NLTK (Bird, Klein and Loper, 2009), written in Python, to implement a chunker that extracts descriptive adjectives with the heuristics described in section 3.1.

In this work, we propose quantifying the synthetic judgements in candidate sentences as a further filter for non-relevant sentences. We assume that synthetic judgements are descriptive adjectives in attributive position (e.g., rare syndrome). Thus, the higher the number of synthetic judgements in a sentence, the more likely the sentence is to be non-relevant. We use the set of descriptive adjectives obtained by the heuristics as the mechanism for quantifying syntheticity.

Acosta, Aguilar and Sierra (2013) point out that relational adjectives have a higher probability of being part of terms. The heuristics considered in this experiment are:

[...]

Here RG, AQ and VAE, as tagged by FreeLing, correspond to adverbs, adjectives and the verb estar, respectively. The tags [...] correspond to determiners, pronouns, punctuation signs and prepositions. The expression [...] is a restriction to reduce noise, since without it elements wrongly tagged by FreeLing as adjectives are extracted.

4.4 Weighting Words

We evaluated the relevance of single words with a corpus comparison approach, applying the relative frequency ratio (Manning and Schütze, 1999) between two different corpora as in (1). Given that the syntactic pattern of the most common terms in Spanish is [...] (Vivaldi, 2004), we take into account only nouns and adjectives in both corpora:

[...] (1)

In (1), the quantities for the domain corpus are the absolute occurrence frequency of w_i and the size of the domain corpus. Similarly, the quantities for the reference corpus are the absolute occurrence frequency of w_i and the size of the reference corpus. The measure in (1) is only calculated for those w_i whose relative frequency in the domain corpus is greater than in the reference corpus. Otherwise, w_i can be added to a list of non-relevant words used to quantify non-relevance in sentences. Words occurring only in the domain corpus, on the other hand, are weighted as in (2). We assume that the reference corpus is large enough to filter out non-relevant words; hence words occurring only in the domain corpus have a higher probability of being relevant, so that a word's frequency can reflect its importance:

[...] (2)

4.5 Relevance of Sentences

Sentences are ranked by adding up the individual weights of the words they contain. Formally, if a sentence s has a length of n words, w_1 w_2 … w_n, where n > 10, then the rank of the candidate s is the sum of the weights of all the individual words w_i ∈ W, where W is the set of relevant words weighted as described in section 4.4. In contrast, if w_i ∉ W, its weight is zero.

5 Preliminary Results

The first results show that descriptive adjectives automatically extracted by the heuristics are a good filter for quantifying syntheticity: non-relevant fragments can be removed by setting thresholds on the number of descriptive adjectives in a sentence. At the same time, the word weighting manages to sort sentences according to their relevance for the domain. Additionally, given that only sentences with predicative verbs are considered, a subset of the best-ranked sentences are analytical DCs.

The results improve significantly if the list used to remove non-relevant sentences by thresholds also includes the words whose relative frequency in the reference corpus is greater than or equal to their relative frequency in the domain corpus (here, nouns and adjectives are included). Given their higher occurrence in the reference corpus than in the domain corpus, we assume these are non-relevant words.

6 Future Work

In a future phase of this experiment, we will implement a syntactic phase in order to remove more non-relevant sentences. For instance, sentences with the verb to be are the most common ones and produce a great deal of noise in the results. Given this, we consider that a syntactic phase capable of ensuring the occurrence of specific syntactic structures will be an important advance towards better filtering.

In addition, we will continue collecting more material to enlarge the science and technology sections of our reference corpus, in order to improve the word weighting and the calculation of sentence relevance.

Acknowledgments

This paper has been supported by the National Commission for Scientific and Technological Research (CONICYT) of Chile, Project Numbers: 3140332 and 11130565.

References

Olga Acosta, Gerardo Sierra and César Aguilar. 2011. Extraction of Definitional Contexts using Lexical Relations. International Journal of Computer Applications, 34(6): 46-53.

Steven Bird, Ewan Klein and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly, Sebastopol, Cal.

Xavier Carreras, Isaac Chao, Lluís Padró, and Muntsa Padró. 2004. FreeLing: An Open-Source Suite of Language Analyzers. In Proceedings of the 4th International Conference on Language Resources and Evaluation LREC 2004, ed. by Maria Teresa Lino et al., pp. 239-242. ELRA Publications, Lisbon, Portugal.

Philipp Cimiano. 2006. Ontology Learning and Population from Text. Springer, Berlin.

Violeta Demonte. 1999. El adjetivo. Clases y usos. La posición del adjetivo en el sintagma nominal. In Gramática descriptiva de la lengua española, ed. by Ignacio Bosque and Violeta Demonte. Vol. 1, Ch. 3, pp. 129-215. Espasa-Calpe, Madrid.

Immanuel Kant. 2013. Crítica de la razón pura, edited and translated to Spanish by Pedro Ribas. Taurus, Madrid.

Alexander Maedche and Steffen Staab. 2004. Ontology Learning. In Handbook on Ontologies, ed. by Steffen Staab and Rudi Studer, pp. 173-190. Springer, Berlin.

Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Mass.

Eleanor Rosch. 1978. Principles of categorization. In Cognition and Categorization, ed. by Eleanor Rosch and Barbara Lloyd, pp. 27-48. Lawrence Erlbaum Associates, Hillsdale, NJ.

Gerardo Sierra, Rodrigo Alarcón, César Aguilar and Carme Bach. 2008. Definitional verbal patterns for semantic relation extraction. Terminology, 14(1): 74-98.

Barry Smith. 2004. Beyond concepts: ontology as reality representation. In Formal Ontologies in Information Systems, ed. by Achille Varzi and Laure Vieu, pp. 73-84. IOS Press, Amsterdam.

Edward Smith. 1988. Concepts and Thought. In Psychology of human thought, ed. by Robert J. Sternberg, pp. 19-49. Cambridge University Press, Cambridge, UK.

Jorge Vivaldi. 2004. Extracción de candidatos a términos mediante la combinación de estrategias heterogéneas. Ph. D. Dissertation. IULA-UPF, Barcelona.
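
As an illustration of the word weighting of section 4.4 and the sentence ranking of section 4.5, the following is a minimal sketch in Python and not the authors' implementation. It rests on several assumptions: the measure in (1) is taken to be the plain relative frequency ratio between the domain and the reference corpus, the weight in (2) for domain-only words is taken to be their absolute domain frequency, and both corpora are assumed to be lists of tokens already restricted to nouns and adjectives. The function names word_weights, sentence_score and rank_sentences are hypothetical.

```python
from collections import Counter


def word_weights(domain_tokens, reference_tokens):
    """Weight domain words against a reference corpus (sketch of section 4.4).

    ASSUMPTIONS (not confirmed by the paper): equation (1) is the plain
    relative frequency ratio, and equation (2) is the raw domain frequency.
    """
    f_dom = Counter(domain_tokens)
    f_ref = Counter(reference_tokens)
    n_dom = len(domain_tokens)
    n_ref = len(reference_tokens)

    weights = {}          # relevant words -> weight
    non_relevant = set()  # words at least as frequent (relatively) in reference
    for word, freq in f_dom.items():
        if word not in f_ref:
            # Word occurs only in the domain corpus: assumed form of (2).
            weights[word] = float(freq)
            continue
        rel_dom = freq / n_dom
        rel_ref = f_ref[word] / n_ref
        if rel_dom > rel_ref:
            # Assumed form of (1): ratio of the two relative frequencies.
            weights[word] = rel_dom / rel_ref
        else:
            non_relevant.add(word)
    return weights, non_relevant


def sentence_score(sentence_tokens, weights):
    """Sum the weights of the words in a sentence; unknown words count zero."""
    return sum(weights.get(token, 0.0) for token in sentence_tokens)


def rank_sentences(sentences, weights, min_len=10):
    """Keep sentences longer than min_len words and sort them by score."""
    candidates = [s for s in sentences if len(s) > min_len]
    return sorted(candidates,
                  key=lambda s: sentence_score(s, weights),
                  reverse=True)
```

In this sketch the set of non-relevant words is returned alongside the weights, so that it could also serve the threshold-based filtering mentioned in section 5; how those thresholds are set is left open here.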